datacommonsorg / mixer

Mixer provides the translator engine and API interface to access Data Commons graph
https://datacommons.org
Apache License 2.0

[BUG] Two series in an import can only differ by `dcAggregate/` #862

Open pradh opened 2 years ago

pradh commented 2 years ago

When two series in an import differ only by `dcAggregate/`, it seems Mixer might pick only one of them, because the metadata hash does not include `is_dc_aggregate`.

This happens with the Census PEP imports because they stitch together multiple historical CSVs into one import, and for some year ranges the data isn't available and needs to be aggregated (`dcAggregate/`). Currently, they set `dcAggregate/` only for those aggregated years.
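The suspected failure mode can be illustrated with a minimal sketch. It is not Mixer's actual code: the `SeriesMetadata` fields, the `metadata_hash` and `index_by_hash` helpers, the measurement method name, and the observation values are all hypothetical placeholders. It assumes only what is described above, namely that series are keyed by a hash of their metadata and that the hash leaves out `is_dc_aggregate`.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SeriesMetadata:
    # Hypothetical fields for illustration, not Mixer's real schema.
    import_name: str
    measurement_method: str  # assumed to be stored without the dcAggregate/ prefix
    is_dc_aggregate: bool


def metadata_hash(meta: SeriesMetadata, include_aggregate: bool) -> str:
    """Hash the metadata fields, optionally including is_dc_aggregate."""
    parts = [meta.import_name, meta.measurement_method]
    if include_aggregate:
        parts.append(str(meta.is_dc_aggregate))
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()[:8]


def index_by_hash(series, include_aggregate: bool) -> dict:
    """Key each series by its metadata hash; a later series with the same hash overwrites an earlier one."""
    indexed = {}
    for meta, values in series:
        indexed[metadata_hash(meta, include_aggregate)] = values
    return indexed


# Two series from the same import that differ only in the aggregate flag.
# Years and values are placeholders, not real Census PEP data.
raw = (SeriesMetadata("USCensusPEP_By_Sex_Race", "ExampleMethod", False), {"2010": 1.0})
agg = (SeriesMetadata("USCensusPEP_By_Sex_Race", "ExampleMethod", True), {"1990": 2.0})

collapsed = index_by_hash([raw, agg], include_aggregate=False)
separated = index_by_hash([raw, agg], include_aggregate=True)
```

Here `collapsed` holds a single entry because both series hash to the same key, while `separated` keeps both. Whether the right fix is to widen the hash or to merge the two series under one key is the open question in this issue.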

Validation:

`./check_bt d/3/country/USA^Count_Person_Male frequent | ./cache_parse` returns two series with import name USCensusPEP_By_Sex_Race.

`curl -X POST 'https://autopush.api.datacommons.org/stat/all' -d '{ "places": ["country/USA"], "stat_vars": ["Count_Person_Male"]}' | jq` returns only one series.

shifucun commented 2 years ago

A few general questions and thoughts on aggregated data:

1) When do we want to expose "is_aggregate" for an observation?
2) If we claim an observation is aggregated, should that be in the import_name or in the measurement_method?
3) If one import contains both aggregated and non-aggregated data, should we present them uniformly, or is it necessary to differentiate them for users?

Right now, we handle aggregated data as a separate import (series). In some cases this is not user friendly, e.g., city-level data (raw) and county-level data (aggregated) are presented as two distinct series, which from the user's perspective is unnecessary.

In the case of this bug, the aggregation mechanism is even more subtle, and I doubt we should expose that complexity in the final data presentation.

A non-intrusive way would be to add a metadata property to an import indicating which place types and variables are aggregated. If users do need to figure out the subtlety, they can look it up in this metadata.
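One possible shape for such an import-level property is sketched below. Everything in it is hypothetical: the `ImportAggregationInfo` class, its field names, the `may_be_aggregated` helper, and the example values are illustrations, not an existing Data Commons schema. It also assumes that a (place type, variable) pair counts as possibly aggregated only when both are listed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImportAggregationInfo:
    # Hypothetical import-level metadata; not an existing Data Commons schema.
    import_name: str
    aggregated_place_types: frozenset = frozenset()
    aggregated_variables: frozenset = frozenset()


def may_be_aggregated(info: ImportAggregationInfo, place_type: str, variable: str) -> bool:
    """Return True when both the place type and the variable are listed as aggregated."""
    return (
        place_type in info.aggregated_place_types
        and variable in info.aggregated_variables
    )


# Example values chosen for illustration only.
pep_info = ImportAggregationInfo(
    import_name="USCensusPEP_By_Sex_Race",
    aggregated_place_types=frozenset({"Country"}),
    aggregated_variables=frozenset({"Count_Person_Male"}),
)

country_flag = may_be_aggregated(pep_info, "Country", "Count_Person_Male")
city_flag = may_be_aggregated(pep_info, "City", "Count_Person_Male")
```

With this layout the series itself stays uniform in the API response, and a user who cares about provenance checks the import's metadata. Note that it records aggregation per place type and variable, not per year, so it would not by itself capture the year-range case from the PEP imports.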