CSSEGISandData / COVID-19

Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE
https://systems.jhu.edu/research/public-health/ncov/

QUIT CHANGING THE DATA FORMAT #2146

Open stephenobrochta opened 4 years ago

stephenobrochta commented 4 years ago

How many hacks do we need to code in??? This is ridiculous.

% for the first 61 days because they changed their data format on day 62
files = dir('*.csv');
for i = 1:60

...

% From day 62 after they changed the format
for i = 61:length(files)

Now from i = 82, 04-12-2020.csv has yet another format.
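
One way to avoid adding another branch for every layout change (sketched here in Python rather than the MATLAB above) is to look columns up by header name instead of position. The column names in the sketch are the ones that have appeared in this repo's daily reports so far; it only sums Confirmed per country as a demonstration.

import csv
import glob

# The country column has been renamed over time; extend this tuple if
# another rename appears.
COUNTRY_COLUMNS = ("Country_Region", "Country/Region")

def confirmed_by_country(path):
    # Sum the Confirmed column per country, whatever the file layout is.
    totals = {}
    # encoding="utf-8-sig": some early daily reports start with a BOM.
    with open(path, newline="", encoding="utf-8-sig") as f:
        reader = csv.DictReader(f)
        country_col = next(c for c in COUNTRY_COLUMNS if c in reader.fieldnames)
        for row in reader:
            count = int(float(row["Confirmed"] or 0))
            totals[row[country_col]] = totals.get(row[country_col], 0) + count
    return totals

# Every file is read the same way, regardless of which format era it is from.
for path in sorted(glob.glob("*.csv")):
    print(path, sum(confirmed_by_country(path).values()))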

From now on, I'm getting my data from the EU. https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide

dhimasSkylar commented 4 years ago

Great! New source.

THIS SOURCE IS STUPID.

Lucas-Czarnecki commented 4 years ago

Dear @CSSEGISandData, a lot of us really appreciate the hard work you put into making these data public each and every day. But please consider improving how you communicate with the GitHub community. Letting us know what changes you are planning to make, and why, would go a long way.

CSSEGISandData commented 4 years ago

My apologies. There was an error in the push this evening associated with the dashboard updates. We're working on the fix now and it will be corrected as soon as possible.

CSSEGISandData commented 4 years ago

Thank you for your patience. The daily update should now have the correct format:

FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key

rkellyuc commented 4 years ago

Unfortunately, commit e53209b reverts to the %m/%d/%y %H:%M date format. This is an oddball format that appears in only a few of the daily updates. Two tips:

  1. Stop using Excel to process data
  2. Set up data quality tests that have to pass before publishing (a minimal sketch follows below)

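Something along the lines of tip 2 could be as small as the sketch below (Python; the expected header and timestamp format are the ones quoted earlier in this thread, not an official spec):

import csv
import sys
from datetime import datetime

# Header as announced above; the check fails if a daily report drifts from it.
EXPECTED_HEADER = ["FIPS", "Admin2", "Province_State", "Country_Region",
                   "Last_Update", "Lat", "Long_", "Confirmed", "Deaths",
                   "Recovered", "Active", "Combined_Key"]

def check(path):
    errors = []
    with open(path, newline="", encoding="utf-8-sig") as f:
        reader = csv.reader(f)
        header = next(reader)
        if header != EXPECTED_HEADER:
            return [f"{path}: unexpected header {header}"]
        col = header.index("Last_Update")
        for lineno, row in enumerate(reader, start=2):
            value = row[col] if len(row) > col else ""
            try:
                datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
            except ValueError:
                errors.append(f"{path}:{lineno}: bad Last_Update {value!r}")
    return errors

if __name__ == "__main__":
    # Exit non-zero if anything fails, so a publish script can refuse to push.
    problems = [e for p in sys.argv[1:] for e in check(p)]
    print("\n".join(problems))
    sys.exit(1 if problems else 0)
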
CSSEGISandData commented 4 years ago

Richard,

The error was mine, due to a manual correction of today's file. Tomorrow's push will not include this error. I believe both the zero-padding and date-format issues should be resolved. If I have missed anything, please let me know. My apologies again.

cipriancraciun commented 4 years ago

For those interested, I've derived and augmented the JHU, ECDC, and NY Times datasets in a compatible format (where I patch the code to work around these inconsistencies). I've described this in #1281, and it's also available at https://github.com/cipriancraciun/covid19-datasets

Now regarding the ECDC, although their format is OK and consistent, their data does seem to differ quite a lot from JHU in a few places... (I can't say which is "right", just that there are differences.)
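
For a quick spot check of those differences, something like the pandas sketch below works. It assumes the ECDC geographic-distribution CSV has been saved locally as ecdc.csv (the dateRep / cases / countriesAndTerritories column names reflect how that file looked at the time; verify them against your download) and a JHU daily report as jhu_04-12-2020.csv. ECDC publishes daily new cases while the JHU daily reports are cumulative, hence the sum on the ECDC side.

import pandas as pd

# Cumulative confirmed cases for one country on one date, from both sources.
ecdc = pd.read_csv("ecdc.csv")
ecdc["dateRep"] = pd.to_datetime(ecdc["dateRep"], dayfirst=True)
ecdc_total = ecdc[(ecdc["countriesAndTerritories"] == "Italy")
                  & (ecdc["dateRep"] <= "2020-04-12")]["cases"].sum()

jhu = pd.read_csv("jhu_04-12-2020.csv")
jhu_total = jhu[jhu["Country_Region"] == "Italy"]["Confirmed"].sum()

print("ECDC cumulative:", ecdc_total, "JHU cumulative:", jhu_total)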

amalic commented 4 years ago

Thank you for your patience. The daily update should now have the correct format:

FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key

Thanks!

Maybe you could do a plausibility check in the future before you upload the data?

MelbourneDeveloper commented 4 years ago

@stephenobrochta I admit that the changing formats make it challenging to process the data. I've gotten around that problem by creating a reliable process to import the data into an SQLite database. That makes it possible to query the database in a consistent way. This is the project: https://github.com/MelbourneDeveloper/COVID-19-DB
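
The project itself isn't reproduced here, but the general approach (one SQLite table keyed by report date, with the renamed columns resolved at import time) might look roughly like the sketch below. The table and column names are invented for the example, not taken from COVID-19-DB.

import csv
import glob
import os
import sqlite3

conn = sqlite3.connect("covid.db")
conn.execute("""CREATE TABLE IF NOT EXISTS daily_report (
    report_date TEXT, country TEXT, province TEXT,
    confirmed INTEGER, deaths INTEGER, recovered INTEGER)""")

def pick(row, *names):
    # Return the first non-empty value among the column names tried,
    # tolerating the renames discussed in this thread.
    for name in names:
        if row.get(name):
            return row[name]
    return None

for path in sorted(glob.glob("*.csv")):
    # File names are MM-DD-YYYY.csv; store the date as ISO so it sorts.
    m, d, y = os.path.basename(path)[:-4].split("-")
    report_date = f"{y}-{m}-{d}"
    with open(path, newline="", encoding="utf-8-sig") as f:
        for row in csv.DictReader(f):
            conn.execute(
                "INSERT INTO daily_report VALUES (?, ?, ?, ?, ?, ?)",
                (report_date,
                 pick(row, "Country_Region", "Country/Region"),
                 pick(row, "Province_State", "Province/State"),
                 pick(row, "Confirmed"), pick(row, "Deaths"), pick(row, "Recovered")))
conn.commit()

Queries then look the same regardless of which file a row came from, e.g. SELECT country, SUM(confirmed) FROM daily_report WHERE report_date = '2020-04-12' GROUP BY country;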

MelbourneDeveloper commented 4 years ago

@rkellyuc my project processes all the CSV files into an SQLite database. I am building data checking into the process so that at the end of every database generation, I am left with markdown files that log all the anomalies. I'm hoping to get more people to help with this process. Here is an issue that was generated from the data checking.
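
What those anomaly logs contain isn't shown in this thread; purely as an illustration, building on the hypothetical daily_report table sketched above, a check for cumulative counts that decrease between consecutive reports could write a markdown log like this:

import sqlite3

conn = sqlite3.connect("covid.db")
# ISO report_date sorts chronologically, so consecutive rows per country
# correspond to consecutive daily reports.
rows = conn.execute("""
    SELECT report_date, country, SUM(confirmed) AS confirmed
    FROM daily_report GROUP BY report_date, country
    ORDER BY country, report_date""").fetchall()

with open("anomalies.md", "w", encoding="utf-8") as out:
    out.write("# Data anomalies\n\n")
    previous = {}
    for report_date, country, confirmed in rows:
        confirmed = confirmed or 0
        if country in previous and confirmed < previous[country]:
            out.write(f"- {country}: confirmed fell from {previous[country]} "
                      f"to {confirmed} on {report_date}\n")
        previous[country] = confirmed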

ghost commented 2 years ago

Data from Jan 23, 2020 to April 6, 2020 still has MM/DD/YY dates while the rest of the data has YYYY-MM-DD.

Reproducing:

$ cd COVID-19/csse_covid_19_data/csse_covid_19_daily_reports

$ grep --no-filename -P "/20 " *.csv | head -n 3
Anhui,Mainland China,1/23/20 17:00,9,,
Beijing,Mainland China,1/23/20 17:00,22,,
Chongqing,Mainland China,1/23/20 17:00,9,,

$ grep --no-filename -P "/20 " *.csv | tail -n 3
,,W.P. Putrajaya,Malaysia,4/6/20 23:22,2.9264,101.6964,41,1,12,28,"W.P. Putrajaya, Malaysia"
,,Unknown,Malaysia,4/6/20 23:22,,,0,0,0,0,"Unknown, Malaysia"
,,,Tonga,4/6/20 23:22,-21.179,-175.1982,0,0,0,0,Tonga

$ cat *.csv | grep -c "/20 "
22519

$ grep --no-filename '2020-' *.csv | head -n 3
,,Diamond Princess,Canada,2020-12-21 13:27:30,,,0,1,0,0,"Diamond Princess, Canada",,
,,Grand Princess,Canada,2020-12-21 13:27:30,,,13,0,13,0,"Grand Princess, Canada",,0.0
80001,Out of AL,Alabama,US,2020-12-21 13:27:30,,,0,0,0,0,"Out of AL, Alabama, US",,

$ grep --no-filename '2020-' *.csv | tail -n 3
99999,,Grand Princess,US,2020-08-04 02:27:56,,,103,3,,,"Grand Princess, US",,2.912621359223301
80023,Out of ME,Maine,US,2020-08-07 22:34:20,,,0,0,,,"Out of ME, Maine, US",,
90051,Unassigned,Virginia,US,2020-12-21 13:27:30,,,0,0,,,"Unassigned, Virginia, US",,

$ cat *.csv | grep -c '2020-' 
1055327

Example of one way to fix this:

Install Python csvkit, see https://csvkit.readthedocs.io/en/1.0.6/index.html

cd COVID-19/csse_covid_19_data/csse_covid_19_daily_reports
grep -l '/20 ' *.csv | while read f; do csvjson -I $f | in2csv -f json | sed 's/\([0-9]\)T\([0-9]\)/\1 \2/' > /tmp/tmp.csv && mv /tmp/tmp.csv $f; done
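
If you would rather not install csvkit, a rough plain-Python equivalent (a sketch, not the repo's own tooling) is to rewrite any M/D/YY or M/D/YYYY H:MM timestamp in place as YYYY-MM-DD HH:MM:SS and leave everything else untouched:

import re
from datetime import datetime
from pathlib import Path

OLD = re.compile(r"\b(\d{1,2}/\d{1,2}/\d{2,4} \d{1,2}:\d{2})\b")

def to_iso(match):
    text = match.group(1)
    for fmt in ("%m/%d/%y %H:%M", "%m/%d/%Y %H:%M"):
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d %H:%M:%S")
        except ValueError:
            pass
    return text  # leave anything unrecognised untouched

for path in Path(".").glob("*.csv"):
    original = path.read_text(encoding="utf-8-sig")
    fixed = OLD.sub(to_iso, original)
    if fixed != original:
        path.write_text(fixed, encoding="utf-8")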