Open r-lomba opened 4 years ago
Thanks , it gives idea for data correction when updating
I think also that the ECDC file could be usefull
Be aware that for France there are several lines and that your corrected number is the sum of the lines and not Mainland France only (but in the JH , some territories remain as Country : Martinique ...)
Province/State | Country/Region |
---|---|
France | France |
St Martin | France |
Saint Barthelemy | France |
French Polynesia | France |
French Guiana | France |
Mayotte | France |
Another comment , the values for recovered in France are not very reliable, I hope that more than 12 people recovered ....
In fact what is more important for the future will be to get the current number of people under ICU , and that values are not very much broadcasted apart by the italians
@JiPiBi I agree with both your comments... Regarding the Countries with many Regions belonging to them, I've indicated a warning. Regarding the reliability of data available on different sources, there's nothing we can do about it... We can just choose the most "appealing" source from time to time and do our best :)
Suggestion for a new fix : if you are aggregating at Country level , you have to take into account also Reunion, Martinique, Guadeloupe (Replace them by France before aggregating )
Diamond Princess Cruise Ship should be 697 on 3/12, 712 on 3/15.
@r-lomba
I used and modified your code to adapt it to corrections at the Province's level I dropped the lines about Australia because I need the detailled impacted Provinces
# idea for fixing bad values
confirmed_fixes_dict = {'Italy|3/12/20': 15113,
'Spain|3/12/20': 3146,
'France|3/12/20': 2876,
'United Kingdom|3/12/20': 590,
'Germany|3/12/20': 2745,
'Argentina|3/12/20': 19,
'Belgium|3/12/20': 314,
'Chile|3/12/20': 23,
'Greece|3/12/20': 98,
'Indonesia|3/12/20': 34,
'Ireland|3/12/20': 43,
'Japan|3/12/20': 620,
'Netherlands|3/12/20': 503,
'Qatar|3/12/20': 262,
'Singapore|3/12/20': 178,
'United Kingdom|3/12/20': 1391,
'France|3/15/20': 5423}
deaths_fixes_dict = {'Italy|3/12/20': 1016,
'Spain|3/12/20': 86,
'France|3/12/20': 61,
'Germany|3/12/20': 6,
'Argentina|3/12/20': 1,
'Greece|3/12/20': 1,
'Indonesia|3/12/20': 1,
'Ireland|3/12/20': 1,
'Japan|3/12/20': 15,
'Netherlands|3/12/20': 5,
'Switzerland|3/12/20': 4,
'United Kingdom|3/15/20': 35,
'France|3/15/20': 127}
recovered_fixes_dict = {'Italy|03/12/20': 1258,
'Spain|03/12/20': 189,
'Germany|03/12/20': 25}
def updated2(df,dictfix) :
for key in dictfix.keys():
country_to_be_fixed, date_to_be_fixed = key.split('|')
value_to_be_fixed = dictfix[key]
idxlist = df.index[df['Province/State']==country_to_be_fixed].tolist()
df.loc[idxlist[0],date_to_be_fixed] = value_to_be_fixed
listpd = [pdconfirmed,pddeaths,pdrecovered]
listdict = [confirmed_fixes_dict,deaths_fixes_dict,recovered_fixes_dict]
for df,dictfix in zip(listpd,listdict) :
updated2(df,dictfix)
@JiPiBi Very nice! Thanks...
Hey I'm building a transparent data pipeline: https://github.com/covid19-data/covid19-data Feel free to use/contribute.
I think you might have used a few wrong numbers there, at least they are in contradiction to what we previously "agreed" on (stated in multiple issues here), see here: https://github.com/CSSEGISandData/COVID-19/issues/599#issuecomment-598489792
confirmed_fixes_dict = { 'Spain|2020-03-12': 3146,
that seems to be rather 2950
'Germany|2020-03-12': 2745,
that seems to be rather 2369
'Belgium|2020-03-12': 314,
that might be 399 (unfortunately I cannot find my original source for that)
deaths_fixes_dict = { 'Spain|2020-03-12': 86, 'Germany|2020-03-12': 6,
for Spain that should be 84, and for Germany 5
recovered_fixes_dict = { 'France|2020-03-12': 12,
that seems to be 61
@r-lomba You have same date twice for UK confirmed.
FWIW, I have been using the Chinese site, and have snapshots taken every half an hour of their data (which is all on the main page, in the code).
If I use the numbers from 13 March just before midnight, then I get the following. This produces reasonable curves when plotted with the other data. There are "when is end of day" issues, and the Chinese don't update their site very often during the day, so some numbers may be a bit laggy behind other sites.
Anyway, this is what I got. Some agree with r-lomba's, some I tweaked. Japan was interpolated. This is basic assignments in PHP and probably not the most elegant way to do it. Sorry not in same format as above.
$confirmed_fixes_dict['Argentina']['2020-03-12']=31; $confirmed_fixes_dict['Australia']['2020-03-12']=144; $confirmed_fixes_dict['Belgium']['2020-03-12']=556; $confirmed_fixes_dict['Chile']['2020-03-12']=33; $confirmed_fixes_dict['Colombia']['2020-03-12']=9; $confirmed_fixes_dict['France']['2020-03-12']=2876; $confirmed_fixes_dict['France']['2020-03-15']=5423; $confirmed_fixes_dict['Germany']['2020-03-12']=3117; $confirmed_fixes_dict['Greece']['2020-03-12']=117; $confirmed_fixes_dict['Indonesia']['2020-03-12']=69; $confirmed_fixes_dict['Ireland']['2020-03-12']=70; $confirmed_fixes_dict['Italy']['2020-03-12']=15113; $confirmed_fixes_dict['Japan']['2020-03-12']=670; $confirmed_fixes_dict['Netherlands']['2020-03-12']=804; $confirmed_fixes_dict['Qatar']['2020-03-12']=262; $confirmed_fixes_dict['Singapore']['2020-03-12']=200; $confirmed_fixes_dict['Spain']['2020-03-12']=4209; $confirmed_fixes_dict['Switzerland']['2020-03-12']=1009; $confirmed_fixes_dict['United Kingdom']['2020-03-12']=590; $confirmed_fixes_dict['United Kingdom']['2020-03-15']=1391;
$deaths_fixes_dict['Argentina']['2020-03-12']=1; $deaths_fixes_dict['Australia']['2020-03-12']=3; $deaths_fixes_dict['France']['2020-03-12']=61; $deaths_fixes_dict['France']['2020-03-15']=127; $deaths_fixes_dict['Germany']['2020-03-12']=7; $deaths_fixes_dict['Greece']['2020-03-12']=1; $deaths_fixes_dict['Indonesia']['2020-03-12']=1; $deaths_fixes_dict['Ireland']['2020-03-12']=1; $deaths_fixes_dict['Italy']['2020-03-12']=1016; $deaths_fixes_dict['Japan']['2020-03-12']=19; $deaths_fixes_dict['Netherlands']['2020-03-12']=5; $deaths_fixes_dict['Spain']['2020-03-12']=120; $deaths_fixes_dict['Switzerland']['2020-03-12']=4; $deaths_fixes_dict['United Kingdom']['2020-03-12']=6; $deaths_fixes_dict['United Kingdom']['2020-03-15']=35;
$recovered_fixes_dict['France']['2020-03-12']=12; $recovered_fixes_dict['Germany']['2020-03-12']=14; $recovered_fixes_dict['Italy']['2020-03-12']=1258; $recovered_fixes_dict['Spain']['2020-03-12']=189; $recovered_fixes_dict['United Kingdom']['2020-03-12']=8;
Comments/improvements welcome.
Cheers, Ian
I'm glad you all guys are finding the idea useful. Basically, even more important than having the perfect exact quantities in place, what I hoped was to make it possible that with a single snippet of code everybody using python/pandas and streaming CSV on a regular basis had a better life...
The main point here is that applying this idea, everybody can have an easy-to-maintain PERSONAL dictionary that will "fill the holes" in old data that is never going to be corrected at the source, making it easier to fit splines or show a chart on a web page without macroscopic inconsistencies! And to provide a more reliable service to our end users :)
There as few more countries with bad data again. Eg Switzerland.
I can't baby-sit data daily to see what's broken and then try to fix it. Think I must revert to using the Chinese data, it's a bit saner.
For those using Pandas, I propose the following fix for the 3/12 and 3/15 problems (wrong data, that apparently is not going to be corrected). Sources are cited. Not all the corrections, but the vast majority of the missing quantities are inserted here.
This approach is good to manage possible new errors in the future too, and does not impact future data fromJHU
This allows you to:
Easily maintain a unique Dictionary of corrections that can be plugged in in your code
Pull the original CSVs from the CSSE source programatically, without worrying about having to fix them manually each time
After download in Pandas Dataframe, apply the below fix in your code and 3/12 issue is fixed
Proceed with your analysis further down in your code :)
WARNING: maybe the Date format I used won't fit everybody's need. Adjust accordingly!
WARNING: If you notice, I do not manage Regions in my code! That's because in my application I aggregate data at Country level. If you do need regions, change the "update" part of the code for multi-region Countries!
HTH
HARDCODED DICTIONARIES - FIXES TO MISSING DATA ON THE ORIGINAL DATASOURCE THAT NEVER GOT CORRECTED
SOURCES: https://www.worldometers.info/coronavirus/ https://ourworldindata.org/coronavirus-source-data