CHAD-Analytics / CHAD

COVID-19 Health Assessment Dashboard
MIT License

IHME Georgia Duplicates #54

Open levilott opened 4 years ago

levilott commented 4 years ago

FYSA: The new IHME Reference_hospitalization_all_locs file has two entries for every date for Georgia. Not sure if this is a mistake on their part or if they went back and added the country of Georgia. I'll keep folks updated if I figure out an effective way to distinguish between them. One of the two subsets has mostly "0" values in its predictions.
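For anyone who wants to check this on their own copy, here is a rough sketch of how the duplicates and the mostly-zero subset could be spotted. The toy data frame stands in for the real file, and the `date` and `allbed_mean` column names are assumptions for illustration; only `location_name` is taken from the code in this thread.

```r
# Toy stand-in for the IHME file: two "Georgia" rows per date, one "Florida" row.
# `date` and `allbed_mean` are assumed column names, not confirmed from the file.
IHME_data <- data.frame(
  location_name = rep(c("Georgia", "Georgia", "Florida"), times = 3),
  date = rep(as.Date("2020-04-01") + 0:2, each = 3),
  allbed_mean = c(0, 120, 300, 0, 135, 320, 0, 150, 340),
  stringsAsFactors = FALSE
)

# Step 1: find locations that have more than one row for the same date.
is_dup <- duplicated(IHME_data[, c("location_name", "date")])
duplicated_locations <- unique(IHME_data$location_name[is_dup])
print(duplicated_locations)

# Step 2: within Georgia, label each row as the 1st or 2nd occurrence of its date.
georgia <- IHME_data[IHME_data$location_name == "Georgia", ]
georgia$occurrence <- ave(seq_len(nrow(georgia)),
                          as.character(georgia$date),
                          FUN = seq_along)

# Step 3: share of zero predictions for each occurrence.
zero_share <- sapply(split(georgia$allbed_mean == 0, georgia$occurrence), mean)
print(zero_share)
```

If one occurrence has a zero share close to 1 and the other does not, that would be consistent with one subset being a different location that shares the name.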

Also I had to update the file name when pulling the new Reference_hospitalization_all_locs:

IHME_Path <- paste(IHME_Path, "/Reference_hospitalization_all_locs.csv", sep = "")
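Since the file name has already changed once, a small guard might save some debugging next time. This is only a sketch: `ihme_csv_path` is a hypothetical helper, and the temporary directory below just stands in for the unzipped IHME folder.

```r
# Hypothetical helper: build the path to the IHME CSV and stop with a clear
# message (listing what is actually in the folder) if the file is missing.
ihme_csv_path <- function(ihme_dir,
                          file_name = "Reference_hospitalization_all_locs.csv") {
  path <- file.path(ihme_dir, file_name)
  if (!file.exists(path)) {
    stop("IHME file not found: ", path,
         " (files present: ", paste(list.files(ihme_dir), collapse = ", "), ")")
  }
  path
}

# Usage with a throwaway directory standing in for the unzipped IHME folder.
demo_dir <- tempfile("ihme_")
dir.create(demo_dir)
write.csv(data.frame(location_name = "Georgia"),
          file.path(demo_dir, "Reference_hospitalization_all_locs.csv"),
          row.names = FALSE)

IHME_Path <- ihme_csv_path(demo_dir)
IHME_data <- read.csv(IHME_Path, stringsAsFactors = FALSE)
```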

Let me know if there's anything I can do to support!

galarcon0308 commented 4 years ago

Thanks a lot! I suspected they messed up the country and state. I meant to look through our code too, because I suspect the mistake could be on my end also. I already made a similar mistake once.

levilott commented 4 years ago

Definitely don't think it's a mistake on your end. I think they've just updated this file recently to include two "Georgias". In the csv pulled directly from their zipfile, I could filter to Georgia and always see two entries for each date. The Georgia that appears first stays at "0" for most of the variables across time, so at this point I'm operating under the assumption that these rows represent the country of Georgia.

This is working for me right now. I know this code is probably super messy and uses my personal variable names, but I think the idea behind it should work for your team as well. Happy to jump on a Teams meeting tomorrow if it would help for me to elaborate.

########## Short version with just the Georgia-relevant bits ##########

# Need to correct the Georgia rows: keep Georgia (US state), drop Georgia (country).
# Basic idea:
#   1. Pull out the Georgia rows and store them in a new Georgia-specific data frame.
#   2. Remove the odd rows / keep the even rows in this new data frame.
#   3. Drop all Georgia rows from the original data frame.
#   4. Append the even rows kept in the Georgia-specific data frame to the original data frame.

library(dplyr) # needed for arrange()

IHME_Georgia_data <- IHME_data[IHME_data$location_name == "Georgia", ] # create Georgia-specific data frame

Drop_Rows <- seq(1, nrow(IHME_Georgia_data), 2) # country values appear to be the odd rows; drop these rows
IHME_Georgia_data <- IHME_Georgia_data[-Drop_Rows, ]

IHME_data <- IHME_data[!(IHME_data$location_name == "Georgia"), ] # drop all Georgia rows from the original data frame
IHME_data <- rbind(IHME_data, IHME_Georgia_data) # add back just the Georgia state rows from the Georgia-specific data frame

IHME_data <- arrange(IHME_data, location_name) # alphabetize
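One possible variation on the same idea, in case the row order ever changes: instead of dropping odd rows across the whole Georgia subset, keep the second row within each date. This is a sketch only. `keep_nth_per_date` is a hypothetical helper, the `date` and `allbed_mean` columns are assumed names, and it still relies on the country row appearing first for each date, as observed above.

```r
# Hypothetical helper: for one location name, keep only the `keep`-th row
# per date and leave every other location untouched.
keep_nth_per_date <- function(df, name, keep = 2) {
  is_target <- df$location_name == name
  target <- df[is_target, ]
  # Number the rows 1, 2, ... within each date, in their original order.
  occurrence <- ave(seq_len(nrow(target)),
                    as.character(target$date),
                    FUN = seq_along)
  out <- rbind(df[!is_target, ], target[occurrence == keep, ])
  out <- out[order(out$location_name, out$date), ]  # alphabetize, then by date
  rownames(out) <- NULL
  out
}

# Toy data: the first "Georgia" row for each date is the mostly-zero one.
IHME_data <- data.frame(
  location_name = c("Georgia", "Georgia", "Florida",
                    "Georgia", "Georgia", "Florida"),
  date = as.Date(c("2020-04-01", "2020-04-01", "2020-04-01",
                   "2020-04-02", "2020-04-02", "2020-04-02")),
  allbed_mean = c(0, 120, 300, 0, 135, 320),
  stringsAsFactors = FALSE
)

fixed <- keep_nth_per_date(IHME_data, "Georgia")
print(fixed)
```

Grouping by date means a missing or extra row on one date would not shift which rows get dropped on the other dates, which a global odd/even split cannot guarantee.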