CSSEGISandData / COVID-19

Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE
https://systems.jhu.edu/research/public-health/ncov/

Mixed files: csse_covid_19_daily_reports/03-23-2020.csv contains US data by county, previous days' data is completely different. #1441

Open brazirl opened 4 years ago

brazirl commented 4 years ago

Mixed files: https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_daily_reports/03-23-2020.csv contains US data by county. Other files in this directory contain other countries' data, but not US data by county.

simonnuss commented 4 years ago

Agreed, the latest update uses a completely different schema. This is a critical error.

FredSchuller commented 4 years ago

I also agree, the number of columns must stay the same for all files. And the size of the daily file becomes huge when there's one line per US county.

jonisapp commented 4 years ago

@JordanMarr I can understand why people are upset, because these data are very important today and it's as if the dataset was meant to be as inconsistent as possible... It's a lot of daily work. Maybe a good solution would be to ask the maintainers to take it into account...

JordanMarr commented 4 years ago

@JordanMarr I can understand why people are upset, because these data are very important today and it's as if the dataset was meant to be as inconsistent as possible... It's a lot of daily work. Maybe a good solution would be to ask the maintainers to take it into account...

I also understand why people are upset. I was upset after the time I spent troubleshooting my own report this morning.

My "is that tone really helpful" comment was meant to be more of a "remember to breathe" reminder.
A mistake is a mistake. :)

jonisapp commented 4 years ago

@JordanMarr I agree!

thoughtafter commented 4 years ago

I wrote import code for these files today which handles the different formats. The people I'm working with want the US county-level data. I think a better strategy would have been to leave all existing fields the same and add new fields for new data. If there is a reason for the change in field names and order, one option would be to maintain two sets of files. It wouldn't be hard to keep the data in the new format and convert it to the old one (see #1458). Likewise, it wouldn't be difficult to convert the older format to the new one, so people could use either set depending on their needs. Still, I get the feeling that there are not enough people currently on hand to respond to all the needs.
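For anyone who needs a stopgap, here is a minimal sketch of the kind of import code described above. It is not the code from this comment or from #1458. The column names are assumptions about the two layouts (older files using headers such as `Province/State`, the 03-23-2020 file using `Province_State` plus `Admin2` for the county), so check them against the actual CSV headers. The helper names `read_daily_report` and `to_old_layout` are hypothetical.

```python
import csv
import io

# ASSUMPTION: header renames between the older and newer daily-report layouts.
# Verify against the real files before relying on this mapping.
OLD_TO_NEW = {
    "Province/State": "Province_State",
    "Country/Region": "Country_Region",
    "Last Update": "Last_Update",
    "Latitude": "Lat",
    "Longitude": "Long_",
}

# Fields kept after normalisation; Admin2 (county) is empty for older files.
COMMON_FIELDS = [
    "Admin2", "Province_State", "Country_Region",
    "Last_Update", "Confirmed", "Deaths", "Recovered",
]


def read_daily_report(text):
    """Parse one daily report (either layout) into dicts keyed by COMMON_FIELDS."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        renamed = {}
        for key, value in raw.items():
            if key is None:  # surplus cells beyond the header row
                continue
            name = key.lstrip("\ufeff").strip()  # drop a possible BOM
            renamed[OLD_TO_NEW.get(name, name)] = (value or "").strip()
        rows.append({field: renamed.get(field, "") for field in COMMON_FIELDS})
    return rows


def to_old_layout(rows):
    """Collapse county rows into one row per state/country, summing the counts."""
    totals = {}
    for row in rows:
        key = (row["Province_State"], row["Country_Region"])
        agg = totals.setdefault(key, {
            "Province/State": key[0],
            "Country/Region": key[1],
            "Confirmed": 0, "Deaths": 0, "Recovered": 0,
        })
        for field in ("Confirmed", "Deaths", "Recovered"):
            agg[field] += int(row[field] or 0)  # blank cells count as 0
    return list(totals.values())
```

Reading a file would then look like `read_daily_report(open(path, newline="").read())`. Passing the result through `to_old_layout` gives state-level totals, for consumers that expect one row per province or state.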