CSSEGISandData / COVID-19

Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE
https://systems.jhu.edu/research/public-health/ncov/

“Daily Reports” Column Description/Data Dictionary Incorrect #2597

Closed troymartinhughes closed 3 years ago

troymartinhughes commented 4 years ago

I’ve identified four separate issues here with this page (https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data):

  1. The “Field description” region lists fields that do not exist (nor have they ever existed) in the CSV files (e.g., US Testing Rate, US Hospitalization Rate)
  2. The “Field description” entry for Incidence_Rate has an incorrect definition (“Admin2 + Province_State + Country_Region”); that definition belongs to the Combined_Key variable
  3. The Combined_Key variable is now missing from the “Field description” section
  4. Finally, for anyone new to these data, there is no reference to the four historical changes that have occurred in these columns over time, most recently on May 21 when two new columns (i.e., Incidence_Rate, Case-Fatality_Ratio) were added

Thank you to whoever is able to make these changes, and especially to JHU for their continued support!

For reference, the following four distinct column configurations and naming conventions occur in the CSV files in this folder (https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports):

  1. Columns included in CSV files dated 01-22-2020 to 03-01-2020: Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
  2. Columns included in CSV files dated 03-01-2020 to 03-22-2020: Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered,Latitude,Longitude
  3. Columns included in CSV files dated 03-22-2020 to 05-29-2020: FIPS,Admin2,Province_State,Country_Region,LastUpdate,Lat,Long,Confirmed,Deaths,Recovered,Active,Combined_Key
  4. Columns included in CSV files dated 05-29-2020 to 05-30-2020: FIPS,Admin2,Province_State,Country_Region,LastUpdate,Lat,Long,Confirmed,Deaths,Recovered,Active,Combined_Key,Incidence_Rate,Case-Fatality_Ratio
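For anyone reading files across these date ranges, one way to cope is to rename the older headers to the later ones at load time. The following is only a rough sketch: the `RENAMES` mapping and the `read_report` helper are hypothetical names, the header spellings are copied from the lists in this post and should be checked against the actual files, and the sample row is made up.

```python
import csv
import io

# Older header name -> later header name, copied from the lists in this thread.
# Assumption: check these against the real file headers before relying on them.
RENAMES = {
    "Province/State": "Province_State",
    "Country/Region": "Country_Region",
    "Last Update": "LastUpdate",
    "Latitude": "Lat",
    "Longitude": "Long",
}


def read_report(text):
    """Parse one daily-report CSV and key every row by the later names."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {RENAMES.get(name, name): value for name, value in row.items()}
        for row in reader
    ]


# Made-up sample in the earliest layout (not real data).
sample = (
    "Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered\n"
    "Example Province,Example Country,2020-01-22 00:00,10,1,2\n"
)
rows = read_report(sample)
print(sorted(rows[0]))
```

Headers that are not in the mapping pass through unchanged, so the same function can be applied to files in any of the four layouts.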

Column Mapping

Finally, assuming these changes are made at some point, I've memorialized the current "Field description" section below:

For reference, on 05-31-2020, the posted metadata state the following:

  1. Field description
     • FIPS: US only. Federal Information Processing Standards code that uniquely identifies counties within the USA.
     • Admin2: County name. US only.
     • Province_State: Province, state or dependency name.
     • CountryRegion: Country, region or sovereignty name. The names of locations included on the Website correspond with the official designations used by the U.S. Department of State.
     • Last Update: MM/DD/YYYY HH:mm:ss (24 hour format, in UTC).
     • Lat and Long: Dot locations on the dashboard. All points (except for Australia) shown on the map are based on geographic centroids, and are not representative of a specific address, building or any location at a spatial scale finer than a province/state. Australian dots are located at the centroid of the largest city in each state.
     • Confirmed: Confirmed cases include presumptive positive cases and probable cases, in accordance with CDC guidelines as of April 14.
     • Deaths: Death totals in the US include confirmed and probable, in accordance with CDC guidelines as of April 14.
     • Recovered: Recovered cases outside China are estimates based on local media reports, and state and local reporting when available, and therefore may be substantially lower than the true number. US state-level recovered cases are from COVID Tracking Project.
     • Active: Active cases = total confirmed - total recovered - total deaths.
     • Incidence_Rate: Admin2 + Province_State + Country_Region.
     • Case-Fatality Ratio (%): = confirmed cases per 100,000 persons.
     • US Testing Rate: = total test results per 100,000 persons. The "total test results" is equal to "Total test results (Positive + Negative)" from COVID Tracking Project.
     • US Hospitalization Rate (%): = Total number hospitalized / Number confirmed cases. The "Total number hospitalized" is the "Hospitalized – Cumulative" count from COVID Tracking Project. The "hospitalization rate" and "hospitalized - Cumulative" data is only presented for those states which provide cumulative hospital data.

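
Of the definitions quoted above, the Active formula is the one that can be checked directly against the other columns in a row. A minimal sketch of such a check follows; `expected_active` and `active_matches` are hypothetical helper names, the rows are made up, and it assumes every count parses as an integer, which may not hold for blank cells in the real files.

```python
def expected_active(row):
    """Active cases = total confirmed - total recovered - total deaths."""
    return int(row["Confirmed"]) - int(row["Recovered"]) - int(row["Deaths"])


def active_matches(row):
    """Return True when the reported Active value agrees with the formula."""
    return int(row["Active"]) == expected_active(row)


# Made-up rows (not real data): the second one is deliberately inconsistent.
good = {"Confirmed": "100", "Deaths": "5", "Recovered": "20", "Active": "75"}
bad = {"Confirmed": "100", "Deaths": "5", "Recovered": "20", "Active": "80"}
print(active_matches(good), active_matches(bad))
```
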
GitRobGit commented 4 years ago

Thank you for (1) providing this information and (2) validating that I am not crazy or missing a former data format change notice!

If columns are going to be added, my thoughts are that (a) the dictionary is updated, (b) people are notified, ideally with time to make code changes, and (c) former files are updated with "not yet implemented" type values OR the old format is maintained alongside the new format going forward. Yet I know things are moving fast, and I do appreciate that JHU is providing this information.
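
To illustrate option (c), here is roughly what backfilling could look like on the consumer side until the files themselves are updated. This is only a sketch: `LATEST_COLUMNS` is copied from the most recent layout quoted earlier in this thread, `backfill` is a hypothetical helper, and the placeholder value and sample row are made up.

```python
# Latest column list as quoted earlier in this thread (assumption: verify
# against the current files).
LATEST_COLUMNS = [
    "FIPS", "Admin2", "Province_State", "Country_Region", "LastUpdate",
    "Lat", "Long", "Confirmed", "Deaths", "Recovered", "Active",
    "Combined_Key", "Incidence_Rate", "Case-Fatality_Ratio",
]


def backfill(row, placeholder=""):
    """Return the row in the latest layout, filling absent columns."""
    return {column: row.get(column, placeholder) for column in LATEST_COLUMNS}


# Made-up older-layout row (not real data) that lacks the newest columns.
old_row = {"Province_State": "Example State", "Confirmed": "10", "Deaths": "1"}
filled = backfill(old_row, placeholder="NA")
print(filled["Incidence_Rate"], filled["Confirmed"])
```

With this approach downstream code sees one stable set of columns, and the placeholder makes it obvious which values were never reported.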

troymartinhughes commented 4 years ago

You're not crazy; I trust that JHU is doing the best they can to keep the data updated, but some of the documentation is lacking or has been outpaced by changes in the data.

I run daily quality control reports (100% automated), which both identify and track many of these issues; today's are attached here:

JHU_COVID-19_Daily_Reports_US_Data_Quality_Report_20200602.pdf
JHU_COVID-19_Daily_Reports_US_Longitudinal_Data_Quality_Report_20200602.pdf

Lucas-Czarnecki commented 4 years ago

Hi folks. You're not crazy. These have been issues for a long time (JHU are amazing and probably too busy to address them all). Based on your comments, you may be interested in a cleaned repo of the JHU data, which I maintain HERE

CSSEGISandData commented 3 years ago

Hello @troymartinhughes! Thanks very much for pointing out those errors. We have updated the description. Please let us know if anything else needs to be adjusted. Sorry for the late response.