describe the data

hnykda commented 4 years ago

To keep everyone from duplicating the work of understanding the data, it would be nice to have a data description in the repository (the data can be downloaded here). That means every data file (or file type) should come with a description of:

what the name of the datafile means (e.g. cities/123-3.tsv - is that a city with ID 123 from the md_cities.tsv? What's the suffix -3? - UPDATE: Jan Kulveit will tell us which is the correct one!)
what the specific datafile represents/stores (e.g. "modelled data for a specific region")
what each column means (e.g. "Median means number of infected people", "Timestep corresponds to a day", ...)
what datatypes are in the columns (string, category, int32, float64, ...)
how to fix the data if there are any errors (partly done)
how to load each dataset (partly done)
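
As a rough illustration of the column-meaning, datatype, and loading points above, a loader could keep the description and the types in one place. This is only a sketch: the `SCHEMA` dict, the `load_tsv` helper, the chosen types, and the sample rows are assumptions made for illustration; only the column names `Timestep` and `Median` come from the examples in this issue.

```python
import csv
import io

# Hypothetical schema: column name -> (type, description).
# The types and descriptions are assumptions, not the confirmed data spec.
SCHEMA = {
    "Timestep": (int, "day index of the simulation (assumed)"),
    "Median": (float, "median number of infected people (assumed)"),
}


def load_tsv(handle, schema=SCHEMA):
    """Read a tab-separated file and cast each described column to its declared type."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = set(schema) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for row in reader:
        # Keep only the described columns, converted to their declared types.
        rows.append({name: cast(row[name]) for name, (cast, _) in schema.items()})
    return rows


# Made-up sample data, not taken from the real files.
sample = io.StringIO("Timestep\tMedian\n0\t1.5\n1\t2.25\n")
rows = load_tsv(sample)
print(rows)
```

Keeping the description next to the cast function means the README table and the loader cannot silently drift apart, which is the main reason to structure it this way.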

Bonus

automating what can be done (e.g. the data prep, or the loading)
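
One part that could plausibly be automated is drafting the datatype column of the description. The sketch below guesses a type for each column of a tab-separated file; the helper names `infer_type` and `draft_description`, the `Name` column, and the sample rows are hypothetical, and the guesses would still need a human check against what the columns actually mean.

```python
import csv
import io


def infer_type(values):
    """Return the narrowest of 'int', 'float', 'string' that parses every value."""
    for cast, name in ((int, "int"), (float, "float")):
        try:
            for value in values:
                cast(value)
            return name
        except ValueError:
            continue
    return "string"


def draft_description(handle):
    """Map each column name of a TSV file to a guessed datatype."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    columns = list(zip(*reader))  # transpose the data rows into columns
    return {name: infer_type(column) for name, column in zip(header, columns)}


# Made-up sample data, not taken from the real files.
sample = io.StringIO("Timestep\tMedian\tName\n0\t1.5\tPrague\n1\t2\tBrno\n")
draft = draft_description(sample)
print(draft)
```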

Hints

what "active cases" means in the {area-type}/{filename}.tsv - median or maybe this difference?
Daniel's proposal
from the call agenda - basically, we'll have e.g. 4 or up to 8 different versions of the same file, each corresponding to a different model
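
Until the naming questions are settled, a small helper can at least split a path of the form `{area-type}/{filename}.tsv` into its parts so they can be documented consistently. This is a sketch under assumptions: `split_data_path` is a hypothetical name, and treating the text after the last `-` as a separate suffix is a guess; what that suffix encodes (for example, whether it identifies a model version) is not confirmed in this issue.

```python
from pathlib import PurePosixPath


def split_data_path(path):
    """Split '{area-type}/{filename}.tsv' into area type, base name, and suffix."""
    p = PurePosixPath(path)
    stem = p.stem  # file name without the '.tsv' extension
    # Assumption: anything after the last '-' is a suffix of unknown meaning.
    base, sep, suffix = stem.rpartition("-")
    if not sep:
        base, suffix = stem, None
    return {"area_type": p.parent.name, "name": base, "suffix": suffix}


parts = split_data_path("cities/123-3.tsv")
print(parts)
```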

AC

the data-prep/README.md clearly describes the available datasets and explains how to load and work with them

gavento commented 4 years ago

There is a new tool that does the wrangling directly from GleamViz HDF5 files; see #29.

hnykda commented 4 years ago

https://www.notion.so/Data-specs-b0a76e480f3c4eedacf7b9384ca3aa67#82bfb22bb192478481b602b0246950b9

epidemics / covid

describe the data #12
