Spelman-College / spelman-dashboard

Apache License 2.0

Map data set variable permutations to dcids #53

Open cernst122 opened 1 year ago

cernst122 commented 1 year ago

Each data set from our final list supports a number of overlapping filters. For example, table 1 supports race, sex, and degree level. Therefore the user may query:

Total_{EducationalAttainment}_{Gender}_{Race} data where any variable may be null. Some examples include

We need to associate all possible filter combos with the corresponding DataCommons dcid. Most of the dcids we care about for the dashboard are not yet present in DataCommons; we're adding them as part of this project. The new dcids can be found in the processed_csv tab of a completed sheet (example).
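As a rough sketch of what "all possible filter combos" means, the snippet below enumerates every selection for one table, treating None as "filter not applied". The filter names and values here are placeholders for illustration only; the real lists would come from each data set.

```python
from itertools import product

# Hypothetical filter values for illustration; not the project's real lists.
FILTERS = {
    "educationalAttainment": ["BachelorsDegree", "DoctorateDegree"],
    "gender": ["Female", "Male"],
    "race": ["Asian", "BlackOrAfricanAmericanAlone"],
}


def filter_combinations(filters):
    """Yield every selection; a value of None means the filter is unset."""
    names = list(filters)
    # Each filter contributes its values plus None (the "null" choice).
    options = [[None, *filters[name]] for name in names]
    for combo in product(*options):
        yield dict(zip(names, combo))


combos = list(filter_combinations(FILTERS))
print(len(combos))  # (2 + 1) ** 3 selections, including the all-None one
```

Each resulting dict is one selection that needs a dcid associated with it.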

Typically our new variables follow deterministic patterns using English words associated with the variables, e.g. Count_Person_ScienceAndEngineeringRelatedMajor_EducationalAttainmentDoctorateDegree_HispanicOrLatino_Male_Tenured. However, other variables already in DataCommons sometimes use opaque identifiers such as dc/t7403chwvspm (Bachelor's Degree or Higher, Female, Black or African American Alone). For this reason, we may need to map filter selections to hardcoded dcids rather than always generating dcids dynamically from the filter selections.
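One possible shape for that mapping, as a sketch only: check a table of hardcoded dcids first and fall back to a generator for the pattern-based ones. The function names, the filter keys, and the enum-style value strings are assumptions made for this example; only the dcid dc/t7403chwvspm and its description come from the text above.

```python
def selection_key(selection):
    """Order-independent key for a selection, ignoring unset (None) filters."""
    return frozenset((k, v) for k, v in selection.items() if v is not None)


# Selections whose dcid cannot be derived from a naming pattern.
# The value strings below are hypothetical labels for the filter choices.
HARDCODED_DCIDS = {
    selection_key({
        "educationalAttainment": "BachelorsDegreeOrHigher",
        "gender": "Female",
        "race": "BlackOrAfricanAmericanAlone",
    }): "dc/t7403chwvspm",
}


def resolve_dcid(selection, generate):
    """Return the hardcoded dcid if one exists, else call generate(selection)."""
    key = selection_key(selection)
    if key in HARDCODED_DCIDS:
        return HARDCODED_DCIDS[key]
    return generate(selection)
```

Passing the generator in as an argument keeps the lookup independent of whatever naming rule ends up being used for the new variables.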

DataCommons enforces alphabetical order when constructing dcids (e.g. bachelorsDegreeMajor < educationalAttainment < ethnicity < gender < tenureStatus in the example above).
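To illustrate the ordering rule, here is a minimal sketch that rebuilds the example dcid by sorting on property name. It assumes the Count_Person prefix and that each value is already the exact string used in the dcid; build_dcid is a hypothetical helper, not part of any DataCommons library.

```python
def build_dcid(prefix, constraints):
    """Join constraint values in alphabetical order of their property names."""
    parts = [
        constraints[prop]
        for prop in sorted(constraints)      # alphabetical by property name
        if constraints[prop] is not None     # skip unset filters
    ]
    return "_".join([prefix, *parts])


# Properties are deliberately listed out of order here.
example = {
    "gender": "Male",
    "tenureStatus": "Tenured",
    "ethnicity": "HispanicOrLatino",
    "educationalAttainment": "EducationalAttainmentDoctorateDegree",
    "bachelorsDegreeMajor": "ScienceAndEngineeringRelatedMajor",
}
print(build_dcid("Count_Person", example))
```

Note that sorted() compares strings by code point, so this only matches alphabetical order while all property names start with a lowercase letter, as they do in the example.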

isaac-looker commented 1 year ago

For the current list of 'Confirmed Uploads' in #21, it appears that all of the DCIDs are essentially composed of arrays of mutually exclusive categorical variables, i.e., <prefix>_<category A>_<category B>_ ..., where each category can have 0 or more variables.

I'm not seeing DCIDs for intersections of multiple categories. Does this mean we're summing multiple DCIDs? For instance, if we want to see all genders graphed, I'm assuming we'll make a call for each and sum the counts for the visualization. Is this accurate?

cernst122 commented 1 year ago

That's right, we won't see intersections across e.g. GenderEnums because they are mutually exclusive. DC team recommends the series API call on entity country/USA; we'll make separate calls for each separate permutation of variables as you said.

More info on the limitations of the series API is in #14. I don't think you should run into the bug referenced there because we'll be working entirely with our new dcids.
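The per-permutation calls and summing discussed above could look roughly like the sketch below. The fetch_series argument stands in for whatever wrapper we write around the series API; it is assumed to take an entity and a dcid and return a {date: value} dict. That interface is an assumption for this example, not the documented API.

```python
from collections import defaultdict


def sum_series(dcids, fetch_series, entity="country/USA"):
    """Make one series call per dcid and add the values up date by date."""
    totals = defaultdict(float)
    for dcid in dcids:
        # fetch_series is a hypothetical wrapper: (entity, dcid) -> {date: value}
        for date, value in fetch_series(entity, dcid).items():
            totals[date] += value
    return dict(totals)
```

A date missing from one of the series is summed over the remaining series only, so the dashboard may want to flag dates where not every dcid reported a value.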