CSSEGISandData / COVID-19

Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE
https://systems.jhu.edu/research/public-health/ncov/

Provide a link to covid-19-growth repo for parsing the data into friendly data structures. #1202

Open willhaslett opened 4 years ago

willhaslett commented 4 years ago

Many issues that are posted here pertain to difficulties importing, parsing, and working with these data. I have developed a tool that is being widely used to make that process painless. It gives you parsed and sensible data structures in CSV, JSON, and Pandas dataframes, synchronized with the CSV files here at runtime.

The repo is here: https://github.com/willhaslett/covid-19-growth

If you link to this repo in your README, people will be able to get started with these data much more efficiently.
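For readers who have not worked with the time-series files, here is a minimal sketch of the kind of reshaping involved: turning the wide layout (one column per date) into a long dataframe and exporting it as CSV and JSON. This is not the code from the repo above. The function name `wide_to_long` and the inline sample rows are illustrative assumptions, and the counts are placeholders rather than real figures.

```python
import io

import pandas as pd

# Tiny stand-in for a wide time-series CSV (one column per date).
# Placeholder counts for illustration only, not real figures.
SAMPLE_CSV = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20
Washington,US,47.4009,-121.4905,0,1
New York,US,42.1657,-74.9481,0,0
"""


def wide_to_long(csv_source, value_name="cases"):
    """Reshape a wide time-series CSV into one row per place per date."""
    wide = pd.read_csv(csv_source)
    id_cols = ["Province/State", "Country/Region", "Lat", "Long"]
    # melt() turns every non-id column (the dates) into rows.
    long_df = wide.melt(id_vars=id_cols, var_name="date", value_name=value_name)
    long_df["date"] = pd.to_datetime(long_df["date"], format="%m/%d/%y")
    return long_df.sort_values(["date", "Province/State"]).reset_index(drop=True)


# csv_source can be a file path, a URL, or a file-like object.
df = wide_to_long(io.StringIO(SAMPLE_CSV))
csv_text = df.to_csv(index=False)
json_text = df.to_json(orient="records", date_format="iso")
```

Because `pd.read_csv` also accepts a URL, passing the raw URL of one of the CSV files in this repository instead of the `StringIO` object would read the current data each time the script runs.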

Will Haslett https://www.c4tbh.org/meet-our-team/william-haslett/

rufuspollock commented 4 years ago

@willhaslett that's great. We actually did something like this a couple of weeks ago for Open Data Day, and the repo is here (details in https://www.datopian.com/blog/2020/03/17/odd-covid-19/):

https://github.com/datasets/covid-19

We also auto-generate JSON data from the CSV files, plus instructions for use in tools like pandas, on the DataHub.io page here:

https://datahub.io/core/covid-19#data
https://datahub.io/core/covid-19#data-cli

Hope that is useful.

willhaslett commented 4 years ago

@rufuspollock Thank you for those links. What I'm doing is functionally different, particularly in handling the US data.

The output US dataframes, CSV files, and JSON files are not mirrors of the original data. They parse the Province/State data into appropriate columns and include region, sub-region, and population data for each state. There are a couple of examples below.

Also, process.py from https://github.com/datasets/covid-19 took 2:21 to run on my machine, whereas my tool, which does quite a bit more, takes 0:13 thanks to vectorized Pandas operations.
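As a rough illustration of the approach (a sketch under stated assumptions, not the tool's actual code), the Province/State column can be split into state and county columns with vectorized string operations and then joined to a lookup table. The names `STATE_INFO`, `ABBREVIATIONS`, and `parse_province_state` are hypothetical, the lookup table holds only two states, and the "County, ST" input format is an assumption about how county rows are written.

```python
import pandas as pd

# Hypothetical lookup table with just two states; population figures are
# the ones shown in the example output in this thread.
STATE_INFO = pd.DataFrame({
    "state": ["Washington", "Alabama"],
    "region": ["west", "south"],
    "sub_region": ["pacific", "east_south_central"],
    "population": [7614893.0, 4903185.0],
})

# Hypothetical abbreviation map, deliberately incomplete.
ABBREVIATIONS = {"DE": "Delaware", "WA": "Washington"}


def parse_province_state(df):
    """Split 'Province/State' into state and county columns, then add region data."""
    # One regex pass over the whole column: a name, optionally followed by ", ST".
    parts = df["Province/State"].str.extract(
        r"^(?P<name>[^,]+?)\s*(?:,\s*(?P<abbrev>[A-Z]{2}))?\s*$"
    )
    has_abbrev = parts["abbrev"].notna()

    out = df.copy()
    # Rows written as "County, ST" carry a county name; all others do not.
    out["county"] = parts["name"].where(has_abbrev)
    # For county rows the state comes from the abbreviation, otherwise from the name.
    out["state"] = parts["name"].mask(has_abbrev, parts["abbrev"].map(ABBREVIATIONS))
    out["is_state"] = ~has_abbrev & out["state"].isin(STATE_INFO["state"])
    # A left join keeps every input row; unknown states get NaN region data.
    return out.merge(STATE_INFO, on="state", how="left")


# Placeholder rows for illustration only.
raw = pd.DataFrame({
    "Province/State": ["Washington", "New Castle, DE", "Diamond Princess"],
    "cases": [0, 0, 0],
})
parsed = parse_province_state(raw)
```

Every step operates on whole columns rather than looping over rows in Python, which is generally where the speed difference comes from.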

>>> print(df_us_region_and_state['cases'])

           date   region          sub_region       state  population  cases
0    2020-01-22  midwest  east_north_central    Illinois  12671821.0      0
1    2020-01-22  midwest  east_north_central     Indiana   6732219.0      0
2    2020-01-22  midwest  east_north_central    Michigan   9986857.0      0
3    2020-01-22  midwest  east_north_central        Ohio  11689100.0      0
4    2020-01-22  midwest  east_north_central   Wisconsin   5822434.0      0
...         ...      ...                 ...         ...         ...    ...
2845 2020-03-18     west             pacific      Alaska    731545.0      6
2846 2020-03-18     west             pacific  California  39512223.0    751
2847 2020-03-18     west             pacific      Hawaii   1415872.0     14
2848 2020-03-18     west             pacific      Oregon   4217737.0     68
2849 2020-03-18     west             pacific  Washington   7614893.0   1014

[2850 rows x 6 columns]
>>>  
            date  day  cases          state     county       territory             other is_state      lat      long          sub_region     region  population
0     2020-01-22    0      0     Washington       None            None              None     True  47.4009 -121.4905             pacific       west   7614893.0
1     2020-01-22    0      0       New York       None            None              None     True  42.1657  -74.9481        mid_atlantic  northeast  19453561.0
2     2020-01-22    0      0     California       None            None              None     True  36.1162 -119.6816             pacific       west  39512223.0
3     2020-01-22    0      0  Massachusetts       None            None              None     True  42.2302  -71.5301         new_england  northeast   6892503.0
4     2020-01-22    0      0           None       None            None  Diamond Princess    False  35.4437  139.6380                 NaN        NaN         NaN
...          ...  ...    ...            ...        ...             ...               ...      ...      ...       ...                 ...        ...         ...
13333 2020-03-15   53      0       Delaware  NewCastle            None              None    False  39.5393  -75.6674                 NaN        NaN         NaN
13334 2020-03-15   53     12        Alabama       None            None              None     True  32.3182  -86.9023  east_south_central      south   4903185.0
13335 2020-03-15   53      3           None       None     Puerto Rico              None    False  18.2208  -66.5901                 NaN        NaN         NaN
13336 2020-03-15   53      1           None       None  Virgin Islands              None    False  18.3358  -64.8963                 NaN        NaN         NaN
13337 2020-03-15   53      3           None       None            Guam              None    False  13.4443  144.7937                 NaN        NaN         NaN

[13338 rows x 13 columns]

willhaslett commented 4 years ago

I'm also in the process of adding automated uploading of these data structures to Google Firebase, perhaps the leading backend solution for mobile apps. I want app developers to be able to just write apps and not have to worry about the data.

willhaslett commented 4 years ago

Once I'm done with features for the country that I'm in, I'm going to go back and develop data structures for the global data, with similar options for working with those data.