amitp06 / COVID-19

Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE
https://systems.jhu.edu/research/public-health/ncov/

Imputation of Missings? #2

Closed aksomers closed 3 years ago

aksomers commented 3 years ago

These notes are in the code (haven't pushed yet).

# Note - NAs appear to be related to Google mobility data
# Google says treat the "gaps" as true unknowns and don't assume it means places weren't busy 
# https://support.google.com/covid19-mobility/answer/9825414?hl=en&ref_topic=9822927
# Changes are measured in percent changes from baseline
# Consider an imputation based on that day of the week from surrounding weeks?
# or treat the NAs as a separate category (or categories)?
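A minimal sketch of the day-of-week idea in the notes above, assuming dplyr is available. The data frame, the column name `retail_change`, and the choice to average the same weekday from the week before and the week after are all assumptions for illustration, not what the project code does. The `was_imputed` flag keeps the gaps identifiable, since Google asks that they be treated as true unknowns.

```r
library(dplyr)

# Hypothetical toy data: one county, four weeks of daily values
# (percent change from baseline), with two gaps.
mobility <- data.frame(
  fips = "01001",
  date = as.Date("2020-06-01") + 0:27,
  retail_change = as.numeric(1:28)
)
mobility$retail_change[c(10, 19)] <- NA

imputed <- mobility %>%
  # Group by county and weekday so lag/lead step one week at a time.
  group_by(fips, weekday = weekdays(date)) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(
    prev_week = dplyr::lag(retail_change),
    next_week = dplyr::lead(retail_change),
    was_imputed = is.na(retail_change),
    # Average whichever neighbouring weeks are available.
    # If both neighbours are missing, the result is NaN.
    retail_change = if_else(
      was_imputed,
      rowMeans(cbind(prev_week, next_week), na.rm = TRUE),
      retail_change
    )
  ) %>%
  ungroup() %>%
  arrange(fips, date) %>%
  select(-prev_week, -next_week, -weekday)
```

The alternative in the notes (treating NAs as their own category) would skip the fill step and keep only something like the `was_imputed` indicator.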

The other NAs are in the FIPS codes. It looks like Massachusetts combined two counties (Dukes and Nantucket), which is causing problems, and some other missing FIPS codes are related to correctional facilities. Kansas City, MO is a problem. Bear River, UT is a problem. I think we can fix these one-offs later, just good to note?

Since you joined on FIPS, these missing FIPS codes have got to be causing join problems somewhere, right? Unless there's only one missing per state? Probably something else to check.
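One way to run that check is to count the missing FIPS codes per state before joining. This is only a sketch: the `cases` data frame and its column names are hypothetical stand-ins for the real case data.

```r
library(dplyr)

# Hypothetical stand-in for the case data; values are illustrative only.
cases <- data.frame(
  state = c("Massachusetts", "Massachusetts", "Missouri",
            "Utah", "Utah", "Utah"),
  county = c("Dukes and Nantucket", "Suffolk", "Kansas City",
             "Bear River", NA, "Salt Lake"),
  fips = c(NA, "25025", NA, NA, NA, "49035")
)

# Number of rows with a missing FIPS code in each state.
missing_by_state <- cases %>%
  filter(is.na(fips)) %>%
  count(state, name = "n_missing_fips") %>%
  arrange(desc(n_missing_fips))
```

Any state with more than one missing FIPS is a candidate for spurious NA-to-NA matches in a join keyed on FIPS.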

amitp06 commented 3 years ago

There was some unexpected behavior in the dplyr join: it matched NA with NA on FIPS, so I don't think the results of that join are valid. I'll message you in chat about how we'll handle this case.

aksomers commented 3 years ago

Sounds good.

amitp06 commented 3 years ago

Just to document: most of the rows with NA FIPS that I saw didn't have an associated county string either, so there is no good place to attach that data. The solution I committed for now is to add the argument na_matches = "never" to the joins. We will just have to be careful not to over-generalize our results, since we are missing some smaller counties.
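For reference, a small sketch of the difference this argument makes, using made-up data frames (`cases`, `mobility`, and their columns are hypothetical). By default, dplyr joins treat NA keys as equal to each other; `na_matches = "never"` turns that off.

```r
library(dplyr)

# Hypothetical toy tables with NA keys on both sides.
cases <- data.frame(fips = c("25025", NA, NA), confirmed = c(100, 5, 7))
mobility <- data.frame(fips = c("25025", NA), change = c(-20, -35))

# Default (na_matches = "na"): both NA rows in `cases` pick up the
# NA row from `mobility`, which is not a valid match.
default_join <- left_join(cases, mobility, by = "fips")

# With na_matches = "never", NA keys match nothing, so those rows
# are kept but get NA for the mobility column.
strict_join <- left_join(cases, mobility, by = "fips", na_matches = "never")
```

In a left join the rows with NA FIPS are retained either way; only the values attached to them change.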