ckingbailey / covid-etl

Load COVID data sources into a Google Sheet
GNU Affero General Public License v3.0

Which data sources to use? #3

Open ckingbailey opened 4 years ago

ckingbailey commented 4 years ago

What data sources do we want to use?

There are many out there. Two are

https://github.com/datasets/covid-19

and

https://github.com/CSSEGISandData/COVID-19 (from Johns Hopkins Uni)

Also, the NY Times maintains a GitHub repo of US-only data.

There's this from the Atlantic

This tweet has a few more

Which data sources we choose depends on what we're interested in.

I want a world total.

I want a US total.

I want some US county- or region-level data, such as Bay Area and New York. I may want other US regions later, such as less populous states that may soon see infection rates rising.

I want certain countries. I was interested in Italy. Now I'm more interested in Spain. I'd like to see South Korea, Japan, and maybe China. I may want to keep tabs on India and Mexico in the future.
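To make that wish list concrete, here's a rough sketch of how a world total and a per-country watch list could be pulled out of one day's rows. The row shape (`country` and `confirmed` keys) and the names `COUNTRIES_OF_INTEREST` and `summarize` are made up for illustration; none of the sources above is guaranteed to use this schema, and country spellings differ between sources, so the names would need normalizing first.

```python
from collections import defaultdict

# Hypothetical watch list, taken from the countries named in this comment.
COUNTRIES_OF_INTEREST = {"US", "Italy", "Spain", "South Korea", "Japan", "China"}


def summarize(rows):
    """Return (world_total, per_country_totals) for one day's rows.

    Each row is assumed to be a dict with "country" and "confirmed" keys.
    That schema is an assumption, not something any source promises.
    """
    world_total = 0
    per_country = defaultdict(int)
    for row in rows:
        count = int(row["confirmed"])  # CSV readers hand back strings
        world_total += count           # every row counts toward the world total
        if row["country"] in COUNTRIES_OF_INTEREST:
            per_country[row["country"]] += count
    return world_total, dict(per_country)
```

Changing the watch list later (say, adding India and Mexico) would only mean editing the set, not the function.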

ckingbailey commented 4 years ago

I realized there's some extra complexity here because we need two data sources: one for the US and one for the world. So we'll need two scheduled functions, one to fetch each of those data sources.

We'll probably want two buckets, too, one for each data set, US and world. In that case we'll need two transform functions, too.

Once the data is transformed into the shape we want, it can all go into one bucket, the processed-data bucket we've already created.
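As a sketch of the layout described so far, the flow could be written down as plain data: one entry per source, each with its own raw bucket and transform, all writing to the shared processed bucket. Every name here (`Source`, `plan`, the bucket names, the function names) is hypothetical, and the URLs are placeholders since the sources haven't been chosen yet.

```python
from dataclasses import dataclass

# Hypothetical name for the processed-data bucket that already exists.
PROCESSED_BUCKET = "processed-data"


@dataclass(frozen=True)
class Source:
    name: str        # short label, used to name the fetch function
    url: str         # where the scheduled fetch pulls from (placeholder)
    raw_bucket: str  # bucket holding the untransformed download
    transform: str   # name of the function that reshapes this source


# Placeholder entries: the actual sources are still undecided in this thread.
SOURCES = [
    Source("world", "https://github.com/CSSEGISandData/COVID-19", "raw-world", "transform_world"),
    Source("us", "<US-only source URL>", "raw-us", "transform_us"),
]


def plan(sources):
    """List each step as (function_name, reads_from, writes_to)."""
    steps = []
    for s in sources:
        steps.append((f"fetch_{s.name}", s.url, s.raw_bucket))
        steps.append((s.transform, s.raw_bucket, PROCESSED_BUCKET))
    return steps
```

Calling `plan(SOURCES)` gives four steps: two fetches into separate raw buckets and two transforms that both write to the processed bucket.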

How should we fire the last function: fire it on a timer, or fire it on bucket PUT?
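For comparison, here's a minimal sketch of what the PUT-triggered option might look like. It assumes AWS Lambda with S3 event notifications, which this thread doesn't actually confirm; `objects_from_put_event` and `handler` are hypothetical names, and the load step is only a print. One thing to weigh: a PUT trigger fires once per object written, so with two transforms the loader would run twice per cycle, whereas a timer would run it once.

```python
from urllib.parse import unquote_plus


def objects_from_put_event(event):
    """Extract (bucket, key) pairs from an S3-style notification event."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(s3["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def handler(event, context):
    """Hypothetical entry point: report which objects would be loaded."""
    for bucket, key in objects_from_put_event(event):
        print(f"would load s3://{bucket}/{key} into the sheet")
```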

[Image: covid-etl data-flow diagram]

ckingbailey commented 4 years ago

Here's another data source: https://coronadatascraper.com/timeseries.csv