Open mok0 opened 4 years ago
This issue is related to #1125 but that one does not mention the date formatting problem.
If you grab the raw JSON from the ArcGIS realtime feed, I think the data is already formatted for you: http://scriptsandoneliners.blogspot.com/2020/03/covid-19-data-tracking.html
Cabo Verde is an alternative name, not a typo.
A couple of weeks back we prepared a cleaned-up version of the dataset that addresses some of these issues (it is updated nightly from this repo):
https://github.com/datasets/covid-19
It fixes the dates and consolidates the 3 separate time series into a single file. We also plan to normalize country codes (and maybe normalize out) - see https://github.com/datasets/covid-19/issues/1
I have parsed the data from the daily .csv files in csse_covid_19_data/csse_covid_19_daily_reports and found a ton of problems. Many countries appear under varying names. For example, "Republic of the Congo" and "Congo (Brazzaville)" both appear although they are the same country, and South Korea appears as both "Republic of Korea" and "Korea, South". There are many instances of this, as well as misspellings of country names ("Cabo Verde").
Dates appear in 3 different formats: 1/22/2020, 2020-03-20T14:43:04, and 1/30/20 16:00.
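For anyone hitting the same problem, here is a minimal Python sketch of how the three formats listed above could be normalized to ISO 8601. The names `KNOWN_FORMATS`, `parse_report_date`, and `to_iso` are made up for illustration, and the format list only covers the three examples quoted here; the files may well contain other variants.

```python
from datetime import datetime

# Formats taken from the three examples quoted in this issue.
# Assumption: no other variants occur; extend the tuple if they do.
KNOWN_FORMATS = (
    "%m/%d/%Y",           # 1/22/2020
    "%Y-%m-%dT%H:%M:%S",  # 2020-03-20T14:43:04
    "%m/%d/%y %H:%M",     # 1/30/20 16:00
)


def parse_report_date(raw: str) -> datetime:
    """Try each known format in turn and return the first successful parse."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")


def to_iso(raw: str) -> str:
    """Return the date as an ISO 8601 string (naive, no timezone attached)."""
    return parse_report_date(raw).isoformat()
```

Calling `to_iso("1/30/20 16:00")` gives an ISO string for 30 January 2020 at 16:00. Anything that matches none of the formats raises `ValueError` rather than being silently guessed.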
I suggest you add the ISO 3166-1 alpha-3 country codes so the data is easily interpretable, and also use a single standard date format that is easier to parse. The ISO format 2020-03-20T14:43:04, which is already used some of the time, would be fine.
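As a rough illustration of what the country codes would buy, here is a small Python sketch that maps the name variants mentioned above to alpha-3 codes. The table `ALPHA3_BY_NAME` and the function `country_code` are hypothetical and hand-written for this example; the table only covers the names quoted in this issue and is nowhere near complete.

```python
from typing import Optional

# Hand-written, partial lookup covering only the variants quoted in this issue.
# Assumption: names are matched exactly after trimming whitespace.
ALPHA3_BY_NAME = {
    "Republic of the Congo": "COG",
    "Congo (Brazzaville)": "COG",
    "Republic of Korea": "KOR",
    "Korea, South": "KOR",
    "Cabo Verde": "CPV",
}


def country_code(name: str) -> Optional[str]:
    """Return the ISO 3166-1 alpha-3 code for a known name, else None."""
    return ALPHA3_BY_NAME.get(name.strip())
```

With a code column in place, "Republic of Korea" and "Korea, South" both resolve to the same key, so rows can be grouped without caring which spelling a given daily file used. Unknown names return `None` so they can be collected and reviewed by hand.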
Thanks for your great work. It is a huge effort and much appreciated!