Open willhaslett opened 4 years ago
@willhaslett that's great. We actually did something like this a couple of weeks ago for Open Data Day, and the repo is here (details in https://www.datopian.com/blog/2020/03/17/odd-covid-19/):
https://github.com/datasets/covid-19
We also auto-generate JSON data from CSV plus instructions for use in tools like pandas on the DataHub.io page here:
https://datahub.io/core/covid-19#data https://datahub.io/core/covid-19#data-cli
Hope that is useful.
@rufuspollock Thank you for those links. What I'm doing is functionally different, particularly in handling the US data.
The output US dataframes, CSV files and JSON files are not mirrors of the original data: they parse the Province/State field into appropriate columns and add region, sub-region, and population data for each state. There are a couple of examples below.
Also, process.py from https://github.com/datasets/covid-19 took 2:21 to run on my machine, whereas my tool, which does quite a bit more, takes 0:13 thanks to vectorized Pandas functions.
>>> print(df_us_region_and_state['cases'])
            date   region          sub_region       state  population  cases
0     2020-01-22  midwest  east_north_central    Illinois  12671821.0      0
1     2020-01-22  midwest  east_north_central     Indiana   6732219.0      0
2     2020-01-22  midwest  east_north_central    Michigan   9986857.0      0
3     2020-01-22  midwest  east_north_central        Ohio  11689100.0      0
4     2020-01-22  midwest  east_north_central   Wisconsin   5822434.0      0
...          ...      ...                 ...         ...         ...    ...
2845  2020-03-18     west             pacific      Alaska    731545.0      6
2846  2020-03-18     west             pacific  California  39512223.0    751
2847  2020-03-18     west             pacific      Hawaii   1415872.0     14
2848  2020-03-18     west             pacific      Oregon   4217737.0     68
2849  2020-03-18     west             pacific  Washington   7614893.0   1014
[2850 rows x 6 columns]
>>>
             date  day  cases          state     county       territory             other  is_state      lat       long          sub_region     region  population
0      2020-01-22    0      0     Washington       None            None              None      True  47.4009  -121.4905             pacific       west   7614893.0
1      2020-01-22    0      0       New York       None            None              None      True  42.1657   -74.9481        mid_atlantic  northeast  19453561.0
2      2020-01-22    0      0     California       None            None              None      True  36.1162  -119.6816             pacific       west  39512223.0
3      2020-01-22    0      0  Massachusetts       None            None              None      True  42.2302   -71.5301         new_england  northeast   6892503.0
4      2020-01-22    0      0           None       None            None  Diamond Princess     False  35.4437   139.6380                 NaN        NaN         NaN
...           ...  ...    ...            ...        ...             ...               ...       ...      ...        ...                 ...        ...         ...
13333  2020-03-15   53      0       Delaware  NewCastle            None              None     False  39.5393   -75.6674                 NaN        NaN         NaN
13334  2020-03-15   53     12        Alabama       None            None              None      True  32.3182   -86.9023  east_south_central      south   4903185.0
13335  2020-03-15   53      3           None       None     Puerto Rico              None     False  18.2208   -66.5901                 NaN        NaN         NaN
13336  2020-03-15   53      1           None       None  Virgin Islands              None     False  18.3358   -64.8963                 NaN        NaN         NaN
13337  2020-03-15   53      3           None       None            Guam              None     False  13.4443   144.7937                 NaN        NaN         NaN
[13338 rows x 13 columns]
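For readers wondering what vectorized parsing of the Province/State column could look like, here is a minimal sketch. It is not the code from the covid-19-growth repo. The function name `parse_province_state`, the tiny `STATE_INFO` lookup table, the `ABBREVIATIONS` mapping, and the assumed "County, ST" input format are all hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical lookup table (illustration only). Population figures are
# copied from the example output above.
STATE_INFO = pd.DataFrame({
    "state": ["Washington", "California", "New York"],
    "sub_region": ["pacific", "pacific", "mid_atlantic"],
    "region": ["west", "west", "northeast"],
    "population": [7614893.0, 39512223.0, 19453561.0],
})

# Hypothetical mapping from postal abbreviations to state names.
ABBREVIATIONS = {"WA": "Washington", "CA": "California", "NY": "New York"}


def parse_province_state(df: pd.DataFrame) -> pd.DataFrame:
    """Split a 'Province/State' column into state/county columns and
    attach region data, using column-wise operations instead of row loops."""
    out = df.copy()
    raw = out["Province/State"]

    # Assumed format for county rows: "King County, WA".
    # Rows that do not match get NaN in both extracted columns.
    parts = raw.str.extract(r"^(?P<county>[^,]+),\s*(?P<abbr>[A-Z]{2})$")

    # A row is a whole-state row when the raw value is a known state name.
    is_state = raw.isin(STATE_INFO["state"])

    out["county"] = parts["county"]
    # Keep the raw name for state rows; otherwise expand the abbreviation.
    out["state"] = raw.where(is_state, parts["abbr"].map(ABBREVIATIONS))
    out["is_state"] = is_state

    # A left join keeps every input row; unknown states get NaN region data.
    return out.merge(STATE_INFO, on="state", how="left")


if __name__ == "__main__":
    sample = pd.DataFrame({
        "Province/State": ["Washington", "King County, WA", "Diamond Princess"],
        "cases": [1, 2, 3],
    })
    print(parse_province_state(sample))
```

Note that in this sketch county rows inherit their state's region and population through the join, whereas the output shown above leaves those columns as NaN for non-state rows.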
I'm also in the process of adding automated uploading of these data structures into Google Firebase, perhaps the leading backend solution for mobile apps. I want app developers to be able to just write apps and not have to worry about the data.
Once I'm done with features for the country that I'm in, I'm going to go back and develop data structures for the global data that provide options for working with those data as well.
Many issues that are posted here pertain to difficulties importing, parsing, and working with these data. I have developed a tool that is being widely used to make that process painless. It gives you parsed and sensible data structures in CSV, JSON, and Pandas dataframes, synchronized with the CSV files here at runtime.
The repo is here: https://github.com/willhaslett/covid-19-growth
If you link to this repo in your README, people will be able to get started with these data much more efficiently.
Will Haslett https://www.c4tbh.org/meet-our-team/william-haslett/