CSSEGISandData / COVID-19

Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE
https://systems.jhu.edu/research/public-health/ncov/
29.14k stars 18.46k forks source link

A Lookup Table with UID, ISO and FIPS Uploaded #1791

Open CSSEGISandData opened 4 years ago

CSSEGISandData commented 4 years ago

We newly uploaded a lookup table including UID, ISO alpha 2, ISO alpha 3, ISO 3-digit, FIPS, and location names used in our Dashboard and GitHub. The UID is a combination of ISO numeric code, FIPS, and some artificial numbers. This table will be updated along with more locations joining in. Thanks!

Update 3/31: Add UID for US states and out of state entries. Adjust UID for unassigned locations in the US.

clyde7 commented 4 years ago

Thank you so much!

chrisdane commented 4 years ago

the longitude column is called Long_?

greg-minshall commented 4 years ago

very nice -- thanks!

gkrag commented 4 years ago

Padding values in FIPS code US territories looks like are flipped between daily case counts and lookup table files. For e.g. below are the FIPS codes in lookup table, 00060 00066 00069 00078 and these are the values in case counts file, 66000 60000 69000 78000

CSSEGISandData commented 4 years ago

@gkrag Good catch and yes, they are the same. We first defined them as 66000 and then changed to the 00060 format.

tautme commented 4 years ago

What is the FIPS code version you are using? FIPS 6-4? INCITS 631?

oltdaniel commented 4 years ago

@CSSEGISandData You have uploaded the CSV table and I think this is enough, so close this one and link it better in the README. 😉

nbuonin commented 4 years ago

I think the padding for FIPS codes with a leading zero is off for the US time series files. In the lookup table these counties do have the leading zero, but in the time series files they don't. They are also represented as floats regardless of the leading zero.

For example Autauga County has a FIPS of 01001 in the lookup table: https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/UID_ISO_FIPS_LookUp_Table.csv#L417

But it has a FIPS of 1001.0 in this time series file: https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv#L7

payamazadi commented 4 years ago

can i ask why there is a separate table for this?.. is it because of folks who have hardcoded their systems against the existing column definitions? this just points to the need of creating an API layer for users..

matthiasgeihs commented 4 years ago

Did you think about adding population values?

greg-minshall commented 4 years ago

@payamazadi having skinnier daily (changing) tables makes for less data download, etc. the UID/etc. table is presumably pretty static.

@matthiasgeihs the static nature of the UID table, and its "authoritariness/arbitrariness" maybe differentiates it from things like population size, which a) varies over time, and b) varies depending on your chosen authority.

my two cents. cheers!

tautme commented 4 years ago

Population data and more at data.census.gov

texadactyl commented 4 years ago

@CSSEGISandData That CSV went immediately into my database. Thanks!

jessehamner commented 4 years ago

the longitude column is called Long_?

If, for instance, one tries to use a variable named long in an *SQL, it won't be allowed, because long is a data type and is reserved. So Long_ or lon avoids the issue.

texadactyl commented 4 years ago

Translate, reorder, and/or filter the labels. Good tool for Python programs: Pandas read_csv() and DataFrames . Javascript, Java, C/C++, etc. all have equivalents.

Free unsolicited advice: Don't get tied down to CSV column headers. Make the CSV fit your design through translation.

MatthiasBlume commented 4 years ago

The population of "New York City, New York, US" is off by about 2.7 million. The value should include all five boroughs since that is how you report cases in New York City and per Exception Type 2 in https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/README.md. The number currently in the file is 5,803,210. Per a 2016 U.S. Census Bureau report, the value is 8,537,673. I suspect that someone transposed the first two digits in your CSV file, so I am proposing the change to 8,503,210.

Lucas-Czarnecki commented 4 years ago

It appears that a recent commit has changed how the Lookup Table encodes FIPS. Recent changes omit leading zeros for the following states (those with a FIPS between 01 and 09):

Alabama
Alaska
Arizona
Arkansas
California
Colorado
Connecticut

Consequently, it appears that all counties for these states are also affected (e.g., Autauga Alabama is 1001, but should be 01001). I am wonder whether this is also causing problems in JHU's workflow; for example, see issue #2183.

leboblack commented 4 years ago

Thank you

texadactyl commented 4 years ago

@Lucas-Czarnecki True for the last 2 reports (April 13th and 14th). Technically, you have a point. But, why do you think it is causing trouble? Extracting the FIPS values with or without being filled out to 5 digits is straight-forward in any programming language (E.g. Python, R).

This file is a feed to further processing. The only time an end user would care about the value formatting is in presentation (report or plot).

Lucas-Czarnecki commented 4 years ago

Hey @texadactyl, It's hard to say given how JHU updates and modifies its daily reports (excel, R, all of the above... who knows). How I found out about this error (and it is an error), was using the Lookup table to map FIPS codes to older daily reports.

I don't want to speculate about all the ways that folks might end up using the Lookup Table and the FIPS codes, but updating this csv with accurate information would be ideal.

texadactyl commented 4 years ago

@Lucas-Czarnecki The FIPS county code values are accurate (I've seen no value errors). Their format in the CSV handoff file is not suitable for reporting - agreed. My advice is to process/scrub/filter/extract using your own programs (E.g. Python Pandas read_csv) any way you like; store the results in your own database. Don't expect JHU to address format-related issues too quickly as they are up to their ears.

Lucas-Czarnecki commented 4 years ago

@texadactyl Fair enough. Yes, that's what I did. Agreed, I'd say this is at the bottom of JHU's priorities.

texadactyl commented 4 years ago

@Lucas-Czarnecki Hopefully, I wasn't lecture-ish. I've been programming since punched cards & paper tape and do not mind addressing any data challenge that presents itself.

Lucas-Czarnecki commented 4 years ago

@texadactyl Not at all. You came off very nice. I always appreciate the info!

jessehamner commented 4 years ago

Agreed this sort of issue should be near the bottom of the priorities list. Here's what I do:

For the third CSV format type (starting 2020-Mar-22), in R, one can ensure the leading zero gets included in FIPS codes with, for instance:

covid3$newfips <- sprintf("%05.0f", as.integer(covid3$FIPS))

and to separate the 5-character FIPS code into state and county: covid3$stfips <- substr(covid3$newfips, 1,2) covid3$cofips <- substr(covid3$newfips, 3,5)

Lucas-Czarnecki commented 4 years ago

Thanks @jessehamner. I'm taking a slightly different approach. I modify the Lookup Table and then map the data to each csv. The reason is to avoid issues like this one #2116

I also work in R. Here is my code (which requires the tidyverse). It has the added benefit of avoiding having to create new columns. Capture

Replaces blank cells with missing values. Lookup_Table <- Lookup_Table %>% mutate_all(na_if,"")

Conditionally pads leading zeros (by state and county). Lookup_Table$FIPS <- ifelse(is.na(Lookup_Table$Admin2), formatC(as.numeric(Lookup_Table$FIPS),width=2,format='f',digits=0,flag='0'), formatC(as.numeric(Lookup_Table$FIPS),width=5,format='f',digits=0,flag='0'))

The only kink I can find is having to fix missing values. y$FIPS[y$FIPS == " NA" | y$FIPS == "NA"] <- NA

EDIT: You can find my cleaned Lookup Table HERE

Pinkmei commented 4 years ago

We newly uploaded a lookup table including UID, ISO alpha 2, ISO alpha 3, ISO 3-digit, FIPS, and location names used in our Dashboard and GitHub. The UID is a combination of ISO numeric code, FIPS, and some artificial numbers. This table will be updated along with more locations joining in. Thanks! Update 3/31: Add UID for US states and out of state entries. Adjust UID for unassigned locations in the US.

Thank you for updating. Do you have df_Counties2020.csv file, which might be in US/Data_tables/JHU_USCountymap folder? I incidentally deleted from my computer when updating the analysis. Now couldn't open my analysis without this file. Is there any possibility to help? Thank you very much! Best Regards and blessings