CSSEGISandData / COVID-19

Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE
https://systems.jhu.edu/research/public-health/ncov/

(I can help) Data is not clean nor standardized leading to difficulties in analysis #1857

Open felipeblazing opened 4 years ago

felipeblazing commented 4 years ago

I would LOVE to help with this. I have been analyzing the data, and on a few days the way the information is represented changes. At first there is less geographic granularity, then granularity increases (adding Admin and Admin2, for example), but we still see issues such as countries appearing under multiple names, e.g. "Republic of Ireland" and "Ireland" appearing separately. I would also suggest a single format for all months, even if in some cases the data will be NULL, because we can't retroactively add those data points without someone manually entering the information from wherever it originated.

For myself, I have basically done this in Python so that I have something a bit "cleaner" to work with.

Lucas-Czarnecki commented 4 years ago

(I can also help) I'm an R user, but it sounds to me like we are both working towards the same thing. I created a repository that wrangles the JHU CSSE data with the goal of creating a stable data set that uses consistent naming conventions, plus fixes some other problems. That part I've managed to do. (Depending on how important it is to folks, some of the NULL values could also be fixed by mapping the data rather than entering it manually.) You can see my repo here: https://github.com/Lucas-Czarnecki/COVID-19-CLEANED-JHUCSSE

The part that I have found too overwhelming to even start on, as just one person, is logging and addressing the various discrepancies in the data, such as the one you mentioned about "Republic of Ireland" also being coded as "Ireland".

Do folks think it is possible (and worth it) to work together on logging these discrepancies? I'm not sure where to start.
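
One way to start without doing it all by hand is to scan every daily report and flag country labels that only show up in a handful of files; those are usually variant spellings of a more common name. A minimal pandas sketch (the repo path and the 10% threshold are my assumptions):

import glob
import os
import pandas as pd

files = glob.glob('COVID-19/csse_covid_19_data/csse_covid_19_daily_reports/*.csv')

# Collect the set of report dates each distinct country label appears in.
seen = {}
for path in files:
    df = pd.read_csv(path)
    # The column name changed over time in the daily reports.
    col = 'Country/Region' if 'Country/Region' in df.columns else 'Country_Region'
    date = os.path.splitext(os.path.basename(path))[0]
    for country in df[col].dropna().unique():
        seen.setdefault(str(country).strip(), set()).add(date)

# Labels appearing in only a few reports are candidate variants of another
# name ("Republic of Ireland" vs "Ireland", "Mainland China" vs "China", ...).
for name, dates in sorted(seen.items(), key=lambda kv: len(kv[1])):
    if len(dates) < 0.1 * len(files):
        print(f"{name!r} appears in only {len(dates)} of {len(files)} reports")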

felipeblazing commented 4 years ago

I am wondering if we should switch to a time-series format. For lack of a better way of doing this, I am posting the Python code I use to do the cleaning here:

import glob
import os
import datetime
import cudf

# Column names vary across the daily reports; map every variant onto one schema.
prov_state = ['Province/State', 'Province_State']
country_region = ['Country/Region', 'Country_Region']
city = ['Admin2']

tables = []
for name in glob.glob('COVID-19/csse_covid_19_data/csse_covid_19_daily_reports/*.csv'):
    # File names are MM-DD-YYYY.csv; recover the report date from the name.
    date_str = os.path.splitext(os.path.basename(name))[0]
    report_date = datetime.datetime.strptime(date_str, "%m-%d-%Y")
    table = cudf.read_csv(name)
    cols_to_remove = []
    # Newer reports carry a separate Admin2 (county/city) column.
    has_city = any(col.strip() in city for col in table.columns)

    for orig_col in table.columns:
        col = orig_col.strip()
        if col in prov_state:
            # Drop parenthesized notes and surrounding whitespace.
            table["prov_state"] = table[orig_col].str.replace(r"\(.*\)", "").str.strip()
            if not has_city:
                # Older reports pack "City, State" into one field; give every
                # row a comma, then split into city and state.
                mask = table["prov_state"].str.contains(',')
                table["prov_state"].loc[~mask] = table["prov_state"][~mask].str.insert(repl=',')
                split_data = table["prov_state"].str.split(",")
                table["city"] = split_data[0].str.strip()
                table["prov_state"] = split_data[1].str.strip()
            cols_to_remove.append(orig_col)
        if col in country_region:
            # Collapse known aliases onto one canonical country name.
            table['country_region'] = (
                table[orig_col].str.replace(r"\(.*\)", "")
                .str.replace("occupied Palestinian territory", "Palestine")
                .str.replace("the Congo", "Congo")
                .str.replace("Macao SAR", "Macau")
                .str.replace("Republic of", "")
                .str.replace("Hong Kong SAR", "Hong Kong")
                .str.replace(r"Taiwan\*", "Taiwan")
                .str.replace("Mainland China", "China")
                .str.replace("Bahamas, The", "Bahamas")
                .str.replace(" The", "")
                .str.replace("The ", "")
                .str.replace(",", "")
                .str.strip()
            )
            cols_to_remove.append(orig_col)
        if col in city:
            table["city"] = table[orig_col].str.replace(r"\(.*\)", "").str.strip()
            cols_to_remove.append(orig_col)

    for col in cols_to_remove:
        del table[col]
    # Stamp every row with the report date taken from the file name.
    table["report_date"] = [report_date] * len(table)
    table = table[["country_region", "prov_state", "city",
                   "Confirmed", "Deaths", "Recovered", "report_date"]]
    tables.append(table)

data = cudf.concat(tables)

You can see some of the wrangling there. I will try to do a better job of documenting this.
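
On the time-series question above: once the daily files are stacked into one long table like "data", a per-country series is a single pivot away. A minimal sketch, assuming the "data" frame built by the script; to_pandas() hands the pivot off to pandas:

# Reshape the cleaned long-format table into a time series: one row per
# report_date, one column per country, confirmed cases summed over regions.
ts = (
    data.to_pandas()  # move from cudf to pandas for pivot_table
        .pivot_table(index="report_date",
                     columns="country_region",
                     values="Confirmed",
                     aggfunc="sum")
        .sort_index()
)
# Column names depend on how the cleaning normalized the country labels.
print(ts[["China", "Italy", "US"]].tail())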

felipeblazing commented 4 years ago

These are the country names I found that can be inconsistent:

table['country_region'] = (
    table[col].str.replace(r"\(.*\)", "")
    .str.replace("occupied Palestinian territory", "Palestine")
    .str.replace("the Congo", "Congo")
    .str.replace("Macao SAR", "Macau")
    .str.replace("Republic of", "")
    .str.replace("Hong Kong SAR", "Hong Kong")
    .str.replace(r"Taiwan\*", "Taiwan")
    .str.replace("Mainland China", "China")
    .str.replace("Bahamas, The", "Bahamas")
    .str.replace(" The", "")
    .str.replace("The ", "")
    .str.replace(",", "")
    .str.strip()
)
felipeblazing commented 4 years ago

I have not attempted to do this for cities and states yet. That process is manual and slow.

oltdaniel commented 4 years ago

@felipeblazing This is why developers need to convert this data to their own data structure to fit their needs. I personally do this on https://c.oltdanie.at every night (German time zone) to have a good data source. @CSSEGISandData delivers the best data we can find, and since they have different data sources delivering data in different formats, we need to accept their changes, possibly on short notice or without notice. It's a crisis, and @CSSEGISandData is trying to handle an event that hasn't happened with this kind of spread before (with the known tech of today). Please close this issue and let @CSSEGISandData get on with their work. Convert the data before feeding it to your application.

GeoRover commented 4 years ago

> This is why developers need to convert this data to their own data structure to fit their needs. […] @CSSEGISandData is trying to handle an event that hasn't happened with this kind of spread before (with the known tech of today).

Actually, the tech and tools are available to handle petabytes of data and above; the engineers/developers at @CSSEGISandData may simply not have had the foresight to put together structured data like this on very short notice. I am contemplating developing a REST API that offers cleaned-up data to everyone, based off of this one. I quickly built a tool for my own use that does that already, but I haven't verified the data quality yet. Check it out at http://covid19.geomentary.com/
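
For what it's worth, serving the cleaned data doesn't need much machinery. A minimal sketch of such an endpoint using Flask; the route, file name ("cleaned_covid19.csv"), and column names are assumptions, not the actual service above:

from flask import Flask, jsonify
import pandas as pd

app = Flask(__name__)

# Assumes a pre-cleaned file produced by a wrangling script like the one above.
DATA = pd.read_csv("cleaned_covid19.csv", parse_dates=["report_date"])

@app.route("/countries/<name>")
def country_series(name):
    rows = DATA[DATA["country_region"] == name]
    if rows.empty:
        return jsonify(error="unknown country"), 404
    # Sum over provinces/counties to get one row per report date.
    daily = (rows.groupby("report_date")[["Confirmed", "Deaths", "Recovered"]]
                 .sum()
                 .reset_index())
    daily["report_date"] = daily["report_date"].dt.strftime("%Y-%m-%d")
    return jsonify(daily.to_dict(orient="records"))

if __name__ == "__main__":
    app.run()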

oltdaniel commented 4 years ago

@GeoRover It all comes down to officially reported numbers versus numbers estimated by newspapers and the like; it is hard to pin things down to a single figure.

cipriancraciun commented 4 years ago

For those interested, I have already spent a few days cleaning, normalizing, and augmenting the data into a derived dataset (it covers countries, US states, and US counties):

(As a bonus, I've also included the NY Times and ECDC datasets as alternative data sources for those who want them.)