Use JHU data instead of ourworldindata

vpontis commented 4 years ago

Pull data from: https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series

I wrote a Python script to process the JHU data and convert it to JSON.
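The script itself isn't posted in this thread, so the following is only a minimal sketch of what a CSV-to-JSON conversion could look like, not the actual implementation. It assumes the wide JHU layout (`Province/State`, `Country/Region`, `Lat`, `Long`, then one column per date); the output keys (`province`, `country`, `confirmed`) and the sample rows are hypothetical.

```python
import csv
import io
import json

# Non-date columns in the JHU time-series CSVs.
META_COLUMNS = ["Province/State", "Country/Region", "Lat", "Long"]


def csv_to_records(csv_text):
    """Turn the wide CSV (one column per date) into a list of dicts."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        records.append({
            "province": row["Province/State"],
            "country": row["Country/Region"],
            # Every non-metadata column is a date holding a cumulative count.
            "confirmed": {
                date: int(count)
                for date, count in row.items()
                if date not in META_COLUMNS and count != ""
            },
        })
    return records


# Illustrative sample in the same shape as the JHU file (values are made up).
SAMPLE = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20
,Thailand,15.0,101.0,2,3
Hubei,China,30.9,112.2,444,444
"""

records = csv_to_records(SAMPLE)
print(json.dumps(records, indent=2))
```

Keeping the dates as a nested object means the front end can look up a day by its column name without caring how many days the file currently has.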

TODO

[ ] Fix daily cases
[ ] Map new country names ("US", "Korea, South") to old country names ("United States", "South Korea")
[ ] Update footer with the new data source
[ ] Update the Python script to fetch the CSVs from GitHub
[ ] Verify data against the JHU dashboard (the US data for today looked slightly off)
[ ] Add state data in addition to country data
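For the "Fix daily cases" item: the JHU time series are cumulative totals, so one possible fix is to take day-over-day differences. This is a sketch under that assumption. The function name `daily_from_cumulative` is hypothetical, and clamping negative differences to zero is a choice made here for illustration, not something the source data guarantees is right.

```python
def daily_from_cumulative(cumulative):
    """Return per-day new cases from a list of cumulative totals.

    The first day has no previous value, so its total is used as-is.
    Negative differences (downward corrections in the source) are
    clamped to 0; that clamping is an assumption of this sketch.
    """
    daily = []
    previous = 0
    for total in cumulative:
        daily.append(max(total - previous, 0))
        previous = total
    return daily


# Illustrative cumulative series, not real data.
print(daily_from_cumulative([1, 1, 4, 9, 8, 12]))  # → [1, 0, 3, 5, 0, 4]
```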

danqing commented 4 years ago

I took a look at JHU data and I think we can do the following:

Exclude cruise ships from the country list (we can mention this in the footnote). Apart from the row whose country is "Cruise Ship", ignore every row whose province contains "Princess".
Only a few countries have province data. For those we should:
- For China, exclude Hong Kong and Macau and rename the remainder to "China mainland". Hong Kong and Macau can be their own records.
- Taiwan is already separate from China in the CSV. Its name has an asterisk ("Taiwan*") that we need to remove.
- For the US, remove every row with a comma in the province column, since these are county-level data. Then take Puerto Rico, Guam and the Virgin Islands out of the US and let them be their own records.
- Denmark, the Netherlands, France and the UK each have one record for themselves plus records for their overseas territories. Let them all be separate records.
Every province can also be its own record as a separate entry. Georgia (the US state) may need to be renamed to "Georgia (US)" so it doesn't clash with the country.

danqing commented 4 years ago

The script below handles name conversions by:

- making the Province/State column the display name we can use directly
- making the Country/Region column the key we can use for aggregation

I think if you run this before your script, it should mostly do the job?

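from csv import DictReader, DictWriter

PROVINCE = "Province/State"
REGION = "Country/Region"

dst = []

# newline="" lets the csv module handle line endings itself.
with open("time_series_19-covid-Confirmed.csv", newline="") as f:
  src = DictReader(f)
  fieldnames = src.fieldnames

  for r in src:
    # Use "Cruise Ship" as the display name for the cruise-ship region.
    if r[REGION] == "Cruise Ship":
      r[PROVINCE] = "Cruise Ship"

    # Drop US county rows ("County, ST") and other rows naming a Princess ship.
    if "," in r[PROVINCE] or "Princess" in r[PROVINCE]:
      continue

    # Display names for regions with awkward names in the source.
    if r[REGION] == "Taiwan*":
      r[PROVINCE] = "Taiwan"
    if r[REGION] == "Korea, South":
      r[PROVINCE] = "South Korea"

    # Overseas territories become their own records. The empty-string check
    # keeps the region from being blanked if a country's own row has no province.
    if r[REGION] in ["France", "Denmark", "Netherlands", "United Kingdom"] and r[PROVINCE] not in ("", r[REGION]):
      r[REGION] = r[PROVINCE]

    # These are split out of China and the US as their own records.
    if r[PROVINCE] in ["Hong Kong", "Macau", "Puerto Rico", "Guam", "Virgin Islands"]:
      r[REGION] = r[PROVINCE]

    # Disambiguate the US state from the country of Georgia.
    if r[PROVINCE] == "Georgia" and r[REGION] == "US":
      r[PROVINCE] = "Georgia (US)"

    # Rows without a province use the region as their display name.
    if r[PROVINCE] == "":
      r[PROVINCE] = r[REGION]

    if r[REGION] == "US":
      r[REGION] = "United States"

    dst.append(r)

with open("out.csv", "w", newline="") as f:
  w = DictWriter(f, fieldnames=fieldnames)
  w.writeheader()
  w.writerows(dst)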
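To show how the Country/Region column could then be used for aggregation, here is a rough sketch that sums the date columns per region. It assumes the output keeps the JHU layout (`Province/State`, `Country/Region`, `Lat`, `Long`, then one column per date). The helper name `totals_by_region` and the sample rows are hypothetical, and the numbers are made up.

```python
from collections import defaultdict
from csv import DictReader
import io

REGION = "Country/Region"
# Columns that are not daily counts (layout of the JHU time-series CSV).
META_COLUMNS = ["Province/State", REGION, "Lat", "Long"]


def totals_by_region(rows):
    """Sum every date column across rows sharing the same Country/Region."""
    totals = defaultdict(lambda: defaultdict(int))
    for row in rows:
        for column, value in row.items():
            if column not in META_COLUMNS and value != "":
                totals[row[REGION]][column] += int(value)
    return {region: dict(days) for region, days in totals.items()}


# Illustrative rows in the shape of out.csv; the numbers are made up.
SAMPLE = """Province/State,Country/Region,Lat,Long,3/9/20,3/10/20
Washington,United States,47.4,-121.5,10,12
New York,United States,42.2,-74.9,5,9
Hong Kong,Hong Kong,22.3,114.2,3,4
"""

totals = totals_by_region(DictReader(io.StringIO(SAMPLE)))
print(totals["United States"])  # → {'3/9/20': 15, '3/10/20': 21}
```

Because Hong Kong's region was rewritten to its own name, it ends up with its own total instead of being folded into China.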

danqing / covid19

Use JHU data instead of ourworldindata #23