CSSEGISandData / COVID-19

Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE
https://systems.jhu.edu/research/public-health/ncov/

Pandas users - Code - Programmatically fix wrong CSV data after automated download #833

Open · r-lomba opened this issue 4 years ago

r-lomba commented 4 years ago

For those using Pandas, I propose the following fix for the 3/12 and 3/15 problems (wrong data that apparently is not going to be corrected). Sources are cited. Not all the corrections are included, but the vast majority of the missing quantities are inserted here.

This approach is also good for managing possible new errors in the future, and does not impact future data from JHU.

This allows you to keep an easy-to-maintain personal dictionary of fixes and apply it right after each automated download.

HTH

HARDCODED DICTIONARIES - FIXES TO MISSING DATA ON THE ORIGINAL DATASOURCE THAT NEVER GOT CORRECTED

SOURCES: https://www.worldometers.info/coronavirus/ https://ourworldindata.org/coronavirus-source-data

confirmed_fixes_dict = {'Italy|2020-03-12': 15113,
                        'Spain|2020-03-12': 3146,
                        'France|2020-03-12': 2876,
                        'United Kingdom|2020-03-12': 590,
                        'Germany|2020-03-12': 2745,
                        'Argentina|2020-03-12': 19,
                        'Australia|2020-03-12': 122,
                        'Belgium|2020-03-12': 314,
                        'Chile|2020-03-12': 23,
                        'Colombia|2020-03-12': 9,
                        'Greece|2020-03-12': 98,
                        'Indonesia|2020-03-12': 34,
                        'Ireland|2020-03-12': 43,
                        'Japan|2020-03-12': 620,
                        'Netherlands|2020-03-12': 503,
                        'Qatar|2020-03-12': 262,
                        'Singapore|2020-03-12': 178,
                        'United Kingdom|2020-03-15': 1391,
                        'France|2020-03-15': 5423}

deaths_fixes_dict = {'Italy|2020-03-12': 1016,
                     'Spain|2020-03-12': 86,
                     'France|2020-03-12': 61,
                     'Germany|2020-03-12': 6,
                     'Argentina|2020-03-12': 1,
                     'Australia|2020-03-12': 3,
                     'Greece|2020-03-12': 1,
                     'Indonesia|2020-03-12': 1,
                     'Ireland|2020-03-12': 1,
                     'Japan|2020-03-12': 15,
                     'Netherlands|2020-03-12': 5,
                     'Switzerland|2020-03-12': 4,
                     'United Kingdom|2020-03-15': 35,
                     'France|2020-03-15': 127}

recovered_fixes_dict = {'Italy|2020-03-12': 1258,
                        'Spain|2020-03-12': 189,
                        'France|2020-03-12': 12,
                        'Germany|2020-03-12': 25}

# Apply the fixes: keys are 'Country|date', values are the corrected figures.
# The frames are indexed by country, with one column per date.
for key, value_to_be_fixed in confirmed_fixes_dict.items():
    country_to_be_fixed, date_to_be_fixed = key.split('|')
    dataframe_confirmed_DF.at[country_to_be_fixed, date_to_be_fixed] = value_to_be_fixed

for key, value_to_be_fixed in deaths_fixes_dict.items():
    country_to_be_fixed, date_to_be_fixed = key.split('|')
    dataframe_deaths_DF.at[country_to_be_fixed, date_to_be_fixed] = value_to_be_fixed

for key, value_to_be_fixed in recovered_fixes_dict.items():
    country_to_be_fixed, date_to_be_fixed = key.split('|')
    dataframe_recovered_DF.at[country_to_be_fixed, date_to_be_fixed] = value_to_be_fixed
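
For reference, a minimal sketch of how such country-indexed frames might be built; the URL (the current global file name), the dropped columns and the ISO-date renaming are assumptions about the setup, not part of the snippet above:

import pandas as pd

# Hypothetical source: the current global confirmed-cases time series
# (at the time of this issue the file was named time_series_19-covid-Confirmed.csv)
CONFIRMED_URL = ('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/'
                 'csse_covid_19_data/csse_covid_19_time_series/'
                 'time_series_covid19_confirmed_global.csv')

raw = pd.read_csv(CONFIRMED_URL)

# Aggregate provinces/states up to country level so the frame can be
# indexed as dataframe_confirmed_DF.at['Italy', <date>]
dataframe_confirmed_DF = (raw.drop(columns=['Province/State', 'Lat', 'Long'])
                             .groupby('Country/Region')
                             .sum())

# The fix dictionaries use ISO dates ('2020-03-12') while the JHU column
# headers are 'M/D/YY'; renaming the columns makes the two match
dataframe_confirmed_DF.columns = (pd.to_datetime(dataframe_confirmed_DF.columns)
                                    .strftime('%Y-%m-%d'))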
JiPiBi commented 4 years ago

Thanks, it gives an idea for data correction when updating.

I also think that the ECDC file could be useful.

Be aware that for France there are several lines, and that your corrected number is the sum of those lines, not mainland France only (in the JHU data some territories also remain as a Country/Region of their own: Martinique, ...). A quick check in code follows the table below.

Province/State      Country/Region
France              France
St Martin           France
Saint Barthelemy    France
French Polynesia    France
French Guiana       France
Mayotte             France
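
Reusing the raw frame from the earlier sketch (the column name is the JHU 'M/D/YY' header), that check might look like:

# Sum every row whose Country/Region is France: mainland plus the territories listed above
france_total_2020_03_12 = raw.loc[raw['Country/Region'] == 'France', '3/12/20'].sum()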
JiPiBi commented 4 years ago

Another comment: the values for recovered in France are not very reliable; I hope that more than 12 people recovered...

In fact, what will be more important for the future is getting the current number of people in ICU, and those values are not broadcast very much, apart from by the Italians.

r-lomba commented 4 years ago

@JiPiBi I agree with both your comments... Regarding the countries with many regions belonging to them, I've added a warning. Regarding the reliability of the data available from different sources, there's nothing we can do about it... We can just choose the most "appealing" source from time to time and do our best :)

JiPiBi commented 4 years ago

Suggestion for a new fix: if you are aggregating at country level, you also have to take Reunion, Martinique and Guadeloupe into account (replace them by France before aggregating), for example as sketched below.
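
A possible way to implement that, again reusing the raw frame from the earlier sketch (the territory spellings in the CSV are an assumption):

# Fold the overseas territories that appear as their own Country/Region back
# into France, then aggregate at country level as before
overseas_as_country = ['Reunion', 'Martinique', 'Guadeloupe']
raw.loc[raw['Country/Region'].isin(overseas_as_country), 'Country/Region'] = 'France'
by_country = (raw.drop(columns=['Province/State', 'Lat', 'Long'])
                 .groupby('Country/Region')
                 .sum())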

nanami-773 commented 4 years ago

Diamond Princess Cruise Ship should be 697 on 3/12, 712 on 3/15.
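
If one wants to patch that with the same mechanism, the fix dictionaries could simply be extended; the row label for the ship is an assumption, since its naming in the JHU files changed over time:

# Hypothetical extra entries for the cruise ship, using the values reported above
confirmed_fixes_dict['Diamond Princess|2020-03-12'] = 697
confirmed_fixes_dict['Diamond Princess|2020-03-15'] = 712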

JiPiBi commented 4 years ago

@r-lomba

I used and modified your code to adapt it to corrections at the province level. I dropped the lines about Australia because I need the detailed impacted provinces.

# idea for fixing bad values
confirmed_fixes_dict = {'Italy|3/12/20': 15113,
                        'Spain|3/12/20': 3146,
                        'France|3/12/20': 2876,
                        'United Kingdom|3/12/20': 590,
                        'Germany|3/12/20': 2745,
                        'Argentina|3/12/20': 19,
                        'Belgium|3/12/20': 314,
                        'Chile|3/12/20': 23,
                        'Greece|3/12/20': 98,
                        'Indonesia|3/12/20': 34,
                        'Ireland|3/12/20': 43,
                        'Japan|3/12/20': 620,
                        'Netherlands|3/12/20': 503,
                        'Qatar|3/12/20': 262,
                        'Singapore|3/12/20': 178,
                        'United Kingdom|3/15/20': 1391,
                        'France|3/15/20': 5423}

deaths_fixes_dict = {'Italy|3/12/20': 1016,
                     'Spain|3/12/20': 86,
                     'France|3/12/20': 61,
                     'Germany|3/12/20': 6,
                     'Argentina|3/12/20': 1,
                     'Greece|3/12/20': 1,
                     'Indonesia|3/12/20': 1,
                     'Ireland|3/12/20': 1,
                     'Japan|3/12/20': 15,
                     'Netherlands|3/12/20': 5,
                     'Switzerland|3/12/20': 4,
                     'United Kingdom|3/15/20': 35,
                     'France|3/15/20': 127}

recovered_fixes_dict = {'Italy|3/12/20': 1258,
                        'Spain|3/12/20': 189,
                        'Germany|3/12/20': 25}

def updated2(df, dictfix):
    # keys are 'Name|date'; the lookup is done on the Province/State column,
    # since these fixes target province-level rows
    for key, value_to_be_fixed in dictfix.items():
        country_to_be_fixed, date_to_be_fixed = key.split('|')
        idxlist = df.index[df['Province/State'] == country_to_be_fixed].tolist()
        if idxlist:
            df.loc[idxlist[0], date_to_be_fixed] = value_to_be_fixed

listpd = [pdconfirmed, pddeaths, pdrecovered]
listdict = [confirmed_fixes_dict, deaths_fixes_dict, recovered_fixes_dict]
for df, dictfix in zip(listpd, listdict):
    updated2(df, dictfix)
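
One caveat, as an assumption about the CSV layout rather than something stated above: in snapshots where a country is reported as a single row, Province/State can be empty, so the lookup in updated2 would find nothing for it. A hypothetical fallback inside the function could be:

# If no province-level row matches, try the Country/Region column instead
idxlist = df.index[df['Province/State'] == country_to_be_fixed].tolist()
if not idxlist:
    idxlist = df.index[df['Country/Region'] == country_to_be_fixed].tolist()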
r-lomba commented 4 years ago

@JiPiBi Very nice! Thanks...

yy commented 4 years ago

Hey, I'm building a transparent data pipeline: https://github.com/covid19-data/covid19-data Feel free to use/contribute.

alfkoehn commented 4 years ago

I think you might have used a few wrong numbers there; at least they contradict what we previously "agreed" on (stated in multiple issues here), see: https://github.com/CSSEGISandData/COVID-19/issues/599#issuecomment-598489792

confirmed_fixes_dict:
    'Spain|2020-03-12': 3146     -> that seems to be rather 2950
    'Germany|2020-03-12': 2745   -> that seems to be rather 2369
    'Belgium|2020-03-12': 314    -> that might be 399 (unfortunately I cannot find my original source for that)

deaths_fixes_dict:
    'Spain|2020-03-12': 86       -> that should be 84
    'Germany|2020-03-12': 6      -> that should be 5

recovered_fixes_dict:
    'France|2020-03-12': 12      -> that seems to be 61

iandoug commented 4 years ago

@r-lomba You have the same date twice for UK confirmed.

FWIW, I have been using the Chinese site, and I have snapshots of their data taken every half hour (it is all on the main page, in the code).

If I use the numbers from 13 March just before midnight, then I get the following. It produces reasonable curves when plotted with the other data. There are "when is end of day" issues, and the Chinese don't update their site very often during the day, so some numbers may lag a bit behind other sites.

Anyway, this is what I got. Some agree with r-lomba's, some I tweaked. Japan was interpolated. These are basic assignments in PHP and probably not the most elegant way to do it. Sorry it's not in the same format as above.

$confirmed_fixes_dict['Argentina']['2020-03-12']=31; $confirmed_fixes_dict['Australia']['2020-03-12']=144; $confirmed_fixes_dict['Belgium']['2020-03-12']=556; $confirmed_fixes_dict['Chile']['2020-03-12']=33; $confirmed_fixes_dict['Colombia']['2020-03-12']=9; $confirmed_fixes_dict['France']['2020-03-12']=2876; $confirmed_fixes_dict['France']['2020-03-15']=5423; $confirmed_fixes_dict['Germany']['2020-03-12']=3117; $confirmed_fixes_dict['Greece']['2020-03-12']=117; $confirmed_fixes_dict['Indonesia']['2020-03-12']=69; $confirmed_fixes_dict['Ireland']['2020-03-12']=70; $confirmed_fixes_dict['Italy']['2020-03-12']=15113; $confirmed_fixes_dict['Japan']['2020-03-12']=670; $confirmed_fixes_dict['Netherlands']['2020-03-12']=804; $confirmed_fixes_dict['Qatar']['2020-03-12']=262; $confirmed_fixes_dict['Singapore']['2020-03-12']=200; $confirmed_fixes_dict['Spain']['2020-03-12']=4209; $confirmed_fixes_dict['Switzerland']['2020-03-12']=1009; $confirmed_fixes_dict['United Kingdom']['2020-03-12']=590; $confirmed_fixes_dict['United Kingdom']['2020-03-15']=1391;

$deaths_fixes_dict['Argentina']['2020-03-12']=1; $deaths_fixes_dict['Australia']['2020-03-12']=3; $deaths_fixes_dict['France']['2020-03-12']=61; $deaths_fixes_dict['France']['2020-03-15']=127; $deaths_fixes_dict['Germany']['2020-03-12']=7; $deaths_fixes_dict['Greece']['2020-03-12']=1; $deaths_fixes_dict['Indonesia']['2020-03-12']=1; $deaths_fixes_dict['Ireland']['2020-03-12']=1; $deaths_fixes_dict['Italy']['2020-03-12']=1016; $deaths_fixes_dict['Japan']['2020-03-12']=19; $deaths_fixes_dict['Netherlands']['2020-03-12']=5; $deaths_fixes_dict['Spain']['2020-03-12']=120; $deaths_fixes_dict['Switzerland']['2020-03-12']=4; $deaths_fixes_dict['United Kingdom']['2020-03-12']=6; $deaths_fixes_dict['United Kingdom']['2020-03-15']=35;

$recovered_fixes_dict['France']['2020-03-12']=12; $recovered_fixes_dict['Germany']['2020-03-12']=14; $recovered_fixes_dict['Italy']['2020-03-12']=1258; $recovered_fixes_dict['Spain']['2020-03-12']=189; $recovered_fixes_dict['United Kingdom']['2020-03-12']=8;

Comments/improvements welcome.

Cheers, Ian

r-lomba commented 4 years ago

I'm glad you're all finding the idea useful. Even more important than having perfectly exact quantities in place, what I hoped was to make life a bit easier, with a single snippet of code, for everybody who uses Python/pandas and pulls the CSVs on a regular basis...

The main point is that by applying this idea, everybody can keep an easy-to-maintain PERSONAL dictionary that "fills the holes" in old data that is never going to be corrected at the source, making it easier to fit splines or show a chart on a web page without macroscopic inconsistencies, and to provide a more reliable service to our end users :)
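
As a rough illustration of that workflow (the helper name is illustrative, not from the snippets above), the personal dictionary lives in your own code and is re-applied after every download:

def apply_fixes(df, fixes):
    # df: country-indexed frame with one column per date, as built earlier
    # fixes: personal dictionary of 'Country|date' -> corrected value
    for key, value in fixes.items():
        country, date = key.split('|')
        df.at[country, date] = value
    return df

# after each automated download / rebuild of the frames:
dataframe_confirmed_DF = apply_fixes(dataframe_confirmed_DF, confirmed_fixes_dict)
dataframe_deaths_DF = apply_fixes(dataframe_deaths_DF, deaths_fixes_dict)
dataframe_recovered_DF = apply_fixes(dataframe_recovered_DF, recovered_fixes_dict)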

iandoug commented 4 years ago

There are a few more countries with bad data again, e.g. Switzerland.

I can't babysit the data daily to see what's broken and then try to fix it. I think I must revert to using the Chinese data; it's a bit saner.