CSSEGISandData / COVID-19

Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE
https://systems.jhu.edu/research/public-health/ncov/

Problematic to parse daily data #1199

Open mok0 opened 4 years ago

mok0 commented 4 years ago

I have parsed the data from the daily .csv files in csse_covid_19_data/csse_covid_19_daily_reports and found a ton of problems.

Many countries appear under varying names. For example, "Republic of the Congo" and "Congo (Brazzaville)" both appear although they are the same country, and South Korea appears as both "Republic of Korea" and "Korea, South". There are many instances of this, as well as misspellings of country names ("Cabo Verde").

Dates appear in 3 different formats: 1/22/2020, 2020-03-20T14:43:04, and 1/30/20 16:00.
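For anyone hitting the same problem, here is a minimal Python sketch of one way to read all three formats listed above and emit a single ISO timestamp. The function name `parse_report_date` and the format list are assumptions for illustration, not something the repository provides, and the list only covers the three variants named in this issue.

```python
from datetime import datetime

# Format strings for the three variants reported in this issue.
# This list is an assumption: other files may contain further variants.
KNOWN_FORMATS = (
    "%Y-%m-%dT%H:%M:%S",  # 2020-03-20T14:43:04
    "%m/%d/%Y",           # 1/22/2020
    "%m/%d/%y %H:%M",     # 1/30/20 16:00
)


def parse_report_date(text: str) -> datetime:
    """Try each known format in turn and return the first successful parse."""
    cleaned = text.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue  # not this format, try the next one
    raise ValueError(f"unrecognized date format: {text!r}")


if __name__ == "__main__":
    for raw in ("1/22/2020", "2020-03-20T14:43:04", "1/30/20 16:00"):
        print(raw, "->", parse_report_date(raw).isoformat())
```

Raising on an unknown format, rather than guessing, makes any new variant in a later daily file show up immediately instead of being silently misread.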

I suggest you add the ISO 3166-1 alpha-3 country codes so the data is easily interpretable, and also use a single standard date format that is easier to parse. The ISO 8601 format 2020-03-20T14:43:04, which is already used some of the time, would be fine.
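Until such codes are in the files, a consumer can approximate this with a hand-maintained alias table. The sketch below is only an illustration: `ALIAS_TO_ALPHA3` and `country_code` are hypothetical names, and the table holds just the name variants mentioned in this issue, so a real table would need many more entries.

```python
from typing import Optional

# Hypothetical alias table, limited to the name variants cited in this issue.
# Keys are lower-cased names with whitespace collapsed; values are
# ISO 3166-1 alpha-3 codes.
ALIAS_TO_ALPHA3 = {
    "republic of the congo": "COG",
    "congo (brazzaville)": "COG",
    "republic of korea": "KOR",
    "korea, south": "KOR",
    "cabo verde": "CPV",
}


def country_code(name: str) -> Optional[str]:
    """Return the alpha-3 code for a known name variant, or None if unknown."""
    key = " ".join(name.split()).casefold()  # collapse whitespace, ignore case
    return ALIAS_TO_ALPHA3.get(key)


if __name__ == "__main__":
    for raw in ("Republic of Korea", "Korea, South", "Congo (Brazzaville)"):
        print(raw, "->", country_code(raw))
```

Returning `None` for unknown names lets the caller collect and review unmapped variants instead of dropping rows.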

Thanks for your great work, it is a huge amount of work and much appreciated!

mok0 commented 4 years ago

This issue is related to #1125, but that one does not mention the date formatting problem.

AdamDanischewski commented 4 years ago

If you grab the raw JSON real-time data from ArcGIS, I think it's already formatted for you: http://scriptsandoneliners.blogspot.com/2020/03/covid-19-data-tracking.html

ivanMSC commented 4 years ago

Cabo Verde is an alternative name, not a typo.

rufuspollock commented 4 years ago

A couple of weeks back we prepared a cleaned-up version of the dataset that addresses some of these issues (it is updated nightly from this repo):

https://github.com/datasets/covid-19

It fixes the dates and consolidates the 3 separate time series into one single file. We also plan to normalize country codes (and maybe normalize out) - see https://github.com/datasets/covid-19/issues/1