PolicyEngine / openfisca-us-data

Python package to standardise loading input datasets to OpenFisca-US.

.fillna(0) used in RawCPS, should probably be used in CPS #59

Closed: baogorek closed this issue 2 years ago

baogorek commented 2 years ago

As I'm working on my own raw class functionality, I'm looking for conventions to follow. However, I'm wondering whether to follow the convention on lines 47, 49 and 51 of openfisca_us_data/datasets/cps/raw_cps.py, which fill in missing data with 0s like so:

storage["person"] = person = pd.read_csv(f).fillna(0)

I assume a processing operation like this would be better suited to the CPS method rather than RawCPS. If not, just let me know the reasoning behind putting it here as I'm making similar decisions for the CE survey.

Edit: Ah, I see that the functions below take sums and probably need those 0s filled in. I guess I'm still struggling with what processing belongs in Raw.
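For illustration, here is a minimal sketch of one way blanks can interfere with sums in pandas. It does not reproduce the actual functions in raw_cps.py; the column names `wages` and `self_employment` and the inline CSV are hypothetical stand-ins. Element-wise addition of columns propagates NaN, whereas `Series.sum()` skips NaN by default, so whether the fill is needed depends on which kind of sum is taken.

```python
import io

import pandas as pd

# Hypothetical extract: two income columns, each with one blank cell.
csv_text = "wages,self_employment\n1000,\n,250\n300,40\n"

raw = pd.read_csv(io.StringIO(csv_text))               # blanks become NaN
filled = pd.read_csv(io.StringIO(csv_text)).fillna(0)  # blanks become 0

# Element-wise addition propagates NaN: any row with a blank loses its total.
total_raw = raw["wages"] + raw["self_employment"]
total_filled = filled["wages"] + filled["self_employment"]

# Column-wise Series.sum() skips NaN by default, so it agrees either way.
column_sum_raw = raw["wages"].sum()
column_sum_filled = filled["wages"].sum()

print(total_raw.tolist())
print(total_filled.tolist())
```

In this sketch, two of the three row totals are NaN without the fill, while the column totals are unaffected.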

MaxGhenis commented 2 years ago

Raw does minimal processing, basically just separating out the entities and assigning keys. I think we have fillna as part of it for assigning keys. BTW, usually these survey datasets have special values for actual null values, e.g. 999999, so setting nulls to zero doesn't lose information, but if you see exceptions then we can change the approach.
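As a rough sketch of the point about special values: a sentinel code such as 999999 is an ordinary number in the file, so `fillna(0)` leaves it untouched and only replaces truly blank cells. The column names and the inline CSV below are hypothetical, and the sentinel is just the example value mentioned above; real codes vary by survey and variable. Counting blanks before filling is one way to spot the exceptions, since a filled blank is indistinguishable from a genuine 0 afterwards.

```python
import io

import pandas as pd

SENTINEL = 999999  # example code from the comment above; an assumption, not a CPS constant

# Hypothetical person file: one real value, one sentinel, one blank.
csv_text = "person_id,income\n1,52000\n2,999999\n3,\n"

raw = pd.read_csv(io.StringIO(csv_text))

# Count blanks per column before filling, to see whether any exist at all.
blank_counts = raw.isna().sum()

person = raw.fillna(0)

# The sentinel survives the fill; only the blank cell became 0.
is_sentinel = person["income"] == SENTINEL
is_zero = person["income"] == 0

print(blank_counts.to_dict())
print(person["income"].tolist())
```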

Here are some other examples from openfisca-uk-data: