InstituteforDiseaseModeling / covasim

COVID-19 Agent-based Simulator (Covasim): a model for exploring coronavirus dynamics and interventions
https://covasim.org
MIT License

Scrape best available epi data #43

Closed cliffckerr closed 4 years ago

cliffckerr commented 4 years ago

This is an involved project and may even require its own repo, but creating an issue here to get the conversation started. The task is:

We need the best available auto-updated epidemiological data at as fine a geographical resolution as possible.

Specifically, the data we need is as many of the following as possible, in order of importance:

  1. Number of deaths (on date died)
  2. Number of positive diagnoses (on date test performed)
  3. Number of importations (especially at start of outbreak)
  4. Number of people hospitalized (on date hospitalized)
  5. Number of people in ICU (on date admitted to ICU)
  6. Number of negative diagnoses (on date test performed)

There are various tools that already collate some of this, e.g. https://neherlab.org/covid19/ and https://coronavirus.jhu.edu/map.html. The task is to find the best available data sources and collate everything into a consistent format. Top priority is Africa and LMIC countries, but coverage should be as broad as possible.

inc0 commented 4 years ago

https://github.com/CSSEGISandData/COVID-19 is the best dataset I know of right now

ckerr-IDM commented 4 years ago

Agree -- can we get that in a form ingestible by https://github.com/InstituteforDiseaseModeling/covasim/blob/develop/covasim/parameters.py#L109 ?

gwincr11 commented 4 years ago

Do we have anything other than code that defines the cols we want and what they should be called?

ckerr-IDM commented 4 years ago

@gwincr11 The format should look like this (xlsx or csv):

day | date | new_positives | new_negatives | new_tests | new_hosp | new_icu | new_death
--- | --- | --- | --- | --- | --- | --- | ---
0 | 2/26/2020 | 0 | 2 | 2 | 0 | 0 | 0
1 | 2/27/2020 | 1 | 0 | 1 | 0 | 0 | 0
2 | 2/28/2020 | 0 | 1 | 1 | 0 | 0 | 0
3 | 2/29/2020 | 1 | 11 | 12 | 1 | 0 | 0
4 | 3/1/2020 | 1 | 5 | 6 | 0 | 0 | 0
5 | 3/2/2020 | 0 | 16 | 16 | 0 | 0 | 0
6 | 3/3/2020 | 0 | 23 | 23 | 0 | 0 | 0
7 | 3/4/2020 | 0 | 16 | 16 | 0 | 0 | 0
8 | 3/5/2020 | 3 | 39 | 42 | 0 | 0 | 0
9 | 3/6/2020 | 6 | 34 | 40 | 4 | 2 | 0
10 | 3/7/2020 | 4 | 55 | 59 | 2 | 0 | 0
11 | 3/8/2020 | 1 | 41 | 42 | 1 | 1 | 0
12 | 3/9/2020 | 5 | 95 | 100 | 1 | 0 | 0

inc0 commented 4 years ago

let's add region to this too, so we can select per-country and per-state
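
A rough sketch of what that could look like with pandas, in case it helps the discussion. The column name `region`, the region labels, and the helper `select_region` are hypothetical, and the numbers are made-up placeholders rather than real data. The idea is simply that one file holds several regions and `day` is recomputed from each region's first date after filtering.

```python
import io
import pandas as pd

# Illustrative rows only; region labels and counts are invented for this example.
csv_text = """region,date,new_positives,new_negatives,new_tests,new_hosp,new_icu,new_death
A,2/26/2020,0,2,2,0,0,0
A,2/27/2020,1,0,1,0,0,0
B,2/27/2020,3,7,10,1,0,0
B,2/28/2020,2,8,10,0,1,0
"""

def select_region(df, region):
    """Return rows for one region, with `day` counted from that region's first date."""
    out = df[df["region"] == region].copy()
    out["date"] = pd.to_datetime(out["date"], format="%m/%d/%Y")
    out = out.sort_values("date").reset_index(drop=True)
    # Day 0 is the first date present for this region (assumes at least one row).
    out["day"] = (out["date"] - out["date"].iloc[0]).dt.days
    return out

data = pd.read_csv(io.StringIO(csv_text))
region_b = select_region(data, "B")
```

The same filter would work for per-country or per-state selection if the region values encode that level; how to name nested regions is left open here.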

willf commented 4 years ago

To me, it looks like the best shared resource is https://github.com/covidatlas/coronadatascraper, which is collecting and validating coronavirus data; any new data sources we find might be usefully brought through them.

Their time series data has the following columns (in addition to geolocation and sourcing data):

I am working on a (pandas-based) script to create 'new_x' columns (new_death, etc) and day columns.
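
Roughly, the conversion could look like the sketch below. It assumes the source time series holds cumulative counts in columns named `cases`, `deaths`, and `tested`, keyed by `date`; those input names, the mapping to target columns, the helper `add_new_columns`, and the sample numbers are all assumptions for illustration, not the scraper's confirmed schema.

```python
import pandas as pd

# Hypothetical mapping from cumulative source columns to the target 'new_x' names.
CUMULATIVE_TO_NEW = {
    "cases": "new_positives",
    "deaths": "new_death",
    "tested": "new_tests",
}

def add_new_columns(df):
    """Derive daily 'new_x' counts and a 'day' index from cumulative counts."""
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"])
    out = out.sort_values("date").reset_index(drop=True)
    out["day"] = (out["date"] - out["date"].iloc[0]).dt.days
    for cum_col, new_col in CUMULATIVE_TO_NEW.items():
        # The first row has no previous day, so its daily count is the cumulative value.
        daily = out[cum_col].diff().fillna(out[cum_col])
        # Cumulative series are sometimes revised downward; clip to keep counts non-negative.
        out[new_col] = daily.clip(lower=0).astype(int)
    return out

# Made-up cumulative counts, for illustration only.
raw = pd.DataFrame({
    "date": ["2020-04-01", "2020-04-02", "2020-04-03"],
    "cases": [10, 14, 13],
    "deaths": [0, 1, 1],
    "tested": [100, 150, 210],
})
converted = add_new_columns(raw)
```

Clipping negative differences to zero is one possible way to handle downward revisions; flagging those rows instead may be preferable. For data covering many locations, the function would need to be applied per location, for example inside a `groupby`.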

However, this particular dataset will not provide new_negatives, new_hosp, or new_icu data.

As of 4/4, it has data from 179 countries, 336 state-level divisions, 3078 county-level divisions, and 39 cities.

I will have the 4/4 data available later today, and the conversion script as well.

Let me know if any of this is problematic.

cliffckerr commented 4 years ago

Closed by @willf (#58)