CSSEGISandData / COVID-19

Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE
https://systems.jhu.edu/research/public-health/ncov/

Inconsistent spacing in US time series Combined_Key column #2440

Open caleb-lindgren opened 4 years ago

caleb-lindgren commented 4 years ago

Hey, a big thanks to @CSSEGISandData for all the amazing work you're doing! I found a small issue in the US time series tables (confirmed cases and deaths), and I wanted to report it just so you don't have to find it yourself later. But I don't think it's a huge rush.

Here's what I found: For some counties, the spacing in the Combined_Key column of the two US time series files (confirmed cases and deaths) differs from the spacing in the csse_covid_19_data/UID_ISO_FIPS_LookUp_Table.csv file. In all tables, the standard format for US counties in the Combined_Key column is "county, state, US". All counties in the location lookup table follow this format. However, in the two US time series tables, the following counties are missing the space after each comma; in other words, they follow the format "county,state,US". It's a very small difference, but it causes problems when you join the location lookup table to one of the time series tables using Combined_Key as the join key. These are the counties:

Anchorage,Alaska,US
Fairbanks North Star,Alaska,US
Kenai Peninsula,Alaska,US
Matanuska-Susitna,Alaska,US
District of Columbia,District of Columbia,US
DeSoto,Florida,US
McDuffie,Georgia,US
LaSalle,Illinois,US
Carroll,Indiana,US
LaSalle,Louisiana,US
Fillmore,Minnesota,US
Lac qui Parle,Minnesota,US
Jasper,Missouri,US
Alexander,North Carolina,US
McKean,Pennsylvania,US
DeKalb,Tennessee,US
Washington,Utah,US
Weber,Utah,US
Manassas,Virginia,US
Dukes and Nantucket,Massachusetts,US
Kansas City,Missouri,US
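For anyone who wants to check their own copy of the data, here is a rough sketch of how a list like this could be produced with pandas. The two small frames are stand-ins for the lookup table and one time series table, and the helper name `keys_missing_from_lookup` is made up for illustration; with the real files you would load them with `pd.read_csv` instead.

```python
import pandas as pd

# Stand-in data (assumption): two rows each instead of the real CSV contents.
lookup = pd.DataFrame(
    {"Combined_Key": ["Anchorage, Alaska, US", "Autauga, Alabama, US"]}
)
series = pd.DataFrame(
    {"Combined_Key": ["Anchorage,Alaska,US", "Autauga, Alabama, US"]}
)


def keys_missing_from_lookup(series_df, lookup_df, key="Combined_Key"):
    """Return the time series keys that have no exact match in the lookup table."""
    known = set(lookup_df[key])
    return sorted(k for k in series_df[key] if k not in known)


# Keys that would silently drop out of an exact-match join.
mismatched = keys_missing_from_lookup(series, lookup)
print(mismatched)
```

Any key this returns is one that an exact join on Combined_Key would fail to match.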

Thanks for all your work, and I hope this is helpful in the future!

lojic commented 4 years ago

@caleb-lindgren not sure if this will help you, but instead of using that field, I use a 2-tuple consisting of (state, county) as the key to identify the same row for deaths & confirmed.
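A minimal sketch of that (state, county) approach in pandas, in case it helps. It assumes the state and county columns are named `Province_State` and `Admin2`; the frames and the `confirmed`/`deaths` value columns are made-up stand-ins, and `join_on_state_county` is just an illustrative helper name.

```python
import pandas as pd

# Stand-in frames (assumption): the real tables have many more columns and rows.
confirmed = pd.DataFrame(
    {
        "Admin2": ["Anchorage", "DeSoto"],
        "Province_State": ["Alaska", "Florida"],
        "Combined_Key": ["Anchorage,Alaska,US", "DeSoto,Florida,US"],
        "confirmed": [10, 5],
    }
)
deaths = pd.DataFrame(
    {
        "Admin2": ["Anchorage", "DeSoto"],
        "Province_State": ["Alaska", "Florida"],
        "Combined_Key": ["Anchorage,Alaska,US", "DeSoto,Florida,US"],
        "deaths": [1, 0],
    }
)


def join_on_state_county(left, right):
    """Join two US tables on (state, county) instead of Combined_Key."""
    keys = ["Province_State", "Admin2"]
    # Drop the right-hand Combined_Key so the result does not carry two copies.
    right = right.drop(columns=["Combined_Key"])
    # validate raises if a (state, county) pair is duplicated on either side.
    return left.merge(right, on=keys, how="inner", validate="one_to_one")


merged = join_on_state_county(confirmed, deaths)
print(merged)
```

Because the key ignores Combined_Key entirely, the spacing difference never comes into play, though this only works for tables that have both columns.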

caleb-lindgren commented 4 years ago

@lojic Thanks for the tip! I'm using the Combined_Key column because I wanted a column that would work whether I'm using the US table or the global table, and unfortunately the global table doesn't have the Admin2 column like the US table does. But what I ended up doing is just getting rid of all the spaces in the Combined_Key column whenever I load the table, e.g.

df["Combined_Key"] = df["Combined_Key"].str.replace(" ", "")

if you're using Python and pandas. Hopefully either your solution or my solution works for anyone else who runs into problems with this.

kevinp2 commented 4 years ago

I can confirm this, and for exactly the names listed above.

bobromeo commented 4 years ago

Still happening for: District of Columbia, District of Columbia, US
And: Northwest Territories, Canada

kevinp2 commented 4 years ago

> Still happening for: District of Columbia, District of Columbia, US
> And: Northwest Territories, Canada

I'm not using Python, but @caleb-lindgren's suggestion above worked for me. I created a new data field that I named Location Key and did a replace of ", " with just "," and that was a sufficient workaround for my purposes.
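For Python users, a rough equivalent of this "Location Key" workaround might look like the sketch below. It removes whitespace only around the commas, so spaces inside names such as "Fairbanks North Star" are kept. The function name `location_key`, the `Location_Key` column, and the `lookup_value` and `cases` columns are all made up for illustration, and the frames are stand-ins for the real tables.

```python
import re

import pandas as pd


def location_key(combined_key: str) -> str:
    """Strip whitespace around commas: 'a, b, US' and 'a,b,US' both give 'a,b,US'."""
    return re.sub(r"\s*,\s*", ",", combined_key.strip())


# Stand-in frames (assumption): hypothetical columns and values.
lookup = pd.DataFrame(
    {
        "Combined_Key": ["Fairbanks North Star, Alaska, US", "Autauga, Alabama, US"],
        "lookup_value": [1, 2],
    }
)
series = pd.DataFrame(
    {
        "Combined_Key": ["Fairbanks North Star,Alaska,US", "Autauga, Alabama, US"],
        "cases": [3, 4],
    }
)

# Build the normalized key on both sides, then join on it.
for frame in (lookup, series):
    frame["Location_Key"] = frame["Combined_Key"].map(location_key)

joined = series.merge(
    lookup[["Location_Key", "lookup_value"]], on="Location_Key", how="left"
)
print(joined)
```

The original Combined_Key column is left untouched, so the normalized key is only used for joining.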