danqing / covid19

Visualize and compare COVID 19 growth rates of different countries
https://cream.io
MIT License

Use JHU data instead of ourworldindata #23

Closed vpontis closed 4 years ago

vpontis commented 4 years ago

Pull data from: https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series

I wrote a Python script to process the JHU data and convert it to JSON.
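The script itself is not shown in this thread, so the following is only a minimal sketch of what such a conversion could look like. It assumes the JHU time-series CSV is in wide format, with four leading metadata columns followed by one column per date. The function name `series_to_json`, the output shape, and the sample rows are all hypothetical.

```python
import csv
import io
import json

# Assumed (not from this thread): the four leading metadata columns of the JHU
# time-series CSV; every other column is treated as a date.
META_COLUMNS = ("Province/State", "Country/Region", "Lat", "Long")


def series_to_json(csv_text):
    """Convert a wide time-series CSV into a JSON list of per-location records."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Use the province when present, otherwise fall back to the country.
        name = row["Province/State"] or row["Country/Region"]
        # Keep only non-empty date columns and store their counts as integers.
        counts = {
            column: int(float(value))
            for column, value in row.items()
            if column and column not in META_COLUMNS and value
        }
        records.append(
            {"name": name, "country": row["Country/Region"], "cases": counts}
        )
    return json.dumps(records, indent=2)


# Hypothetical sample rows, for illustration only (not real case counts).
SAMPLE = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20
,Italy,43.0,12.0,0,2
Hubei,China,30.9,112.2,10,20
"""

print(series_to_json(SAMPLE))
```

Keying each record by a single display name keeps the frontend lookup simple, but that is a design assumption rather than something decided in this issue.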

TODO

danqing commented 4 years ago

I took a look at the JHU data and I think we can do the following:

  1. Exclude cruise ships from countries (we can mention this in a footnote). Other than the row with country = Cruise Ship, ignore all rows whose province contains "Princess".
  2. Only a few countries have province data, and we should:
    • For China, exclude HK and Macau and rename the remainder to China mainland. HK and Macau can be their own records.
    • Taiwan is already excluded from China in the CSV. There's an asterisk that we need to remove.
    • For the US, remove every row with a comma in the province column (these are county data). Then take Puerto Rico, Guam and the Virgin Islands out of the US and let them be their own records.
    • Denmark, the Netherlands, France and the UK each have one record for themselves and then records for their overseas territories. Let them all be separate records.
  3. All provinces can also be their own records as separate entries. Georgia may need to be renamed to Georgia (US).
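Once provinces become records alongside countries, name clashes like the Georgia one are easy to miss. A rough sketch of a check for them, assuming each record is named after its province, or after its country when the province is empty. The helper `find_name_collisions` and the sample rows are hypothetical.

```python
from collections import Counter


def find_name_collisions(rows):
    """Return display names that would be shared by more than one record."""
    names = Counter(
        # A record is named after its province, or its country if the province is empty.
        row["Province/State"] or row["Country/Region"]
        for row in rows
    )
    return sorted(name for name, count in names.items() if count > 1)


# Hypothetical rows, for illustration only.
sample_rows = [
    {"Province/State": "", "Country/Region": "Georgia"},
    {"Province/State": "Georgia", "Country/Region": "US"},
    {"Province/State": "Hubei", "Country/Region": "China"},
]

print(find_name_collisions(sample_rows))
```

Running a check like this on the cleaned rows would show whether any names besides Georgia need a suffix.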
danqing commented 4 years ago

The script below handles the name conversions. I think if you apply it before your script runs, it should mostly do it?

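from csv import DictReader, DictWriter

PROVINCE = "Province/State"
REGION = "Country/Region"

# Territories that become their own record instead of staying under China or the US.
SEPARATE = ["Hong Kong", "Macau", "Puerto Rico", "Guam", "Virgin Islands"]
# Countries whose overseas territories are listed as provinces.
WITH_TERRITORIES = ["France", "Denmark", "Netherlands", "United Kingdom"]

dst = []

with open("time_series_19-covid-Confirmed.csv", newline="") as f:
    src = DictReader(f)
    fieldnames = src.fieldnames

    for r in src:
        # Keep the cruise ship row; relabel it so the "Princess" filter below does not drop it.
        if r[REGION] == "Cruise Ship":
            r[PROVINCE] = "Cruise Ship"

        # Drop US county rows (comma in the province) and cruise rows filed under a country.
        if "," in r[PROVINCE] or "Princess" in r[PROVINCE]:
            continue

        # Normalize country names in both columns so the region loses the asterisk too.
        if r[REGION] == "Taiwan*":
            r[REGION] = r[PROVINCE] = "Taiwan"
        if r[REGION] == "Korea, South":
            r[REGION] = r[PROVINCE] = "South Korea"

        # Overseas territories become their own records. The empty-province check
        # keeps a country row from ending up with a blank region.
        if r[REGION] in WITH_TERRITORIES and r[PROVINCE] and r[PROVINCE] != r[REGION]:
            r[REGION] = r[PROVINCE]

        if r[PROVINCE] in SEPARATE:
            r[REGION] = r[PROVINCE]

        # Avoid a name clash with the country Georgia.
        if r[PROVINCE] == "Georgia" and r[REGION] == "US":
            r[PROVINCE] = "Georgia (US)"

        # Country-level rows have no province; use the country name as the record name.
        if r[PROVINCE] == "":
            r[PROVINCE] = r[REGION]

        if r[REGION] == "US":
            r[REGION] = "United States"

        dst.append(r)

# newline="" prevents blank lines between rows on Windows.
with open("out.csv", "w", newline="") as f:
    w = DictWriter(f, fieldnames=fieldnames)
    w.writeheader()
    w.writerows(dst)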