To prevent everyone from duplicating the work of understanding the data, it would be nice to have a data description in the repository (the data can be downloaded here). That means every data file (or file type) should be described in terms of:
what the name of the datafile means (e.g. cities/123-3.tsv: is that the city with ID 123 from md_cities.tsv? What does the suffix -3 mean? UPDATE: Jan Kulveit will tell us which is the correct one!)
what the specific datafile represents/stores (e.g. "modelled data for a specific region")
what each column means (e.g. "Median means number of infected people", "Timestep corresponds to a day", ...)
what datatypes are in the columns (string, category, int32, float64, ...)
how to fix the data if there are any errors (partly done)
how to load each dataset (partly done)
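As a starting point for the column and datatype descriptions, here is a minimal sketch using only Python's standard `csv` module. It guesses a coarse type (int, float or string) for each column of a tab-separated file. The function names `infer_type` and `describe_tsv` are hypothetical, and the sample rows are made up; only the column names are borrowed from the examples above. The real files may need different handling (e.g. missing-value markers).

```python
import csv
import io


def infer_type(values):
    """Return the narrowest of 'int', 'float', 'string' that fits all non-empty values."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "empty"
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in non_empty:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"


def describe_tsv(handle):
    """Map each column name of a tab-separated file to its inferred type."""
    reader = csv.DictReader(handle, delimiter="\t")
    columns = {name: [] for name in reader.fieldnames or []}
    for row in reader:
        for name in columns:
            # Short rows yield None for missing cells; treat them as empty
            columns[name].append(row[name] or "")
    return {name: infer_type(values) for name, values in columns.items()}


# Made-up sample data, not taken from the real dataset
sample = io.StringIO("Timestep\tMedian\tName\n0\t1.5\tA\n1\t2.0\tB\n")
description = describe_tsv(sample)
```

For a file on disk, something like `describe_tsv(open(path, newline="", encoding="utf-8"))` should work. The inferred types are only a hint; the meaning of each column still has to be written down by someone who knows the data.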
Bonus
automating what can be done (e.g. the data prep, or the loading)
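A possible shape for the automated loading, as a sketch only: it assumes the files sit in a `{area-type}/{filename}.tsv` layout as mentioned in the hints, and it keeps the file name as an opaque string because the meaning of suffixes such as -3 is still open. The function name `load_all` and the tiny example tree are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def load_all(data_dir):
    """Load every {area-type}/{filename}.tsv under data_dir into lists of row dicts."""
    datasets = {}
    for path in sorted(Path(data_dir).glob("*/*.tsv")):
        with path.open(newline="", encoding="utf-8") as handle:
            rows = list(csv.DictReader(handle, delimiter="\t"))
        # Key by (area-type, file stem); the stem is not interpreted further
        datasets[(path.parent.name, path.stem)] = rows
    return datasets


# Tiny made-up directory tree to show the call
with tempfile.TemporaryDirectory() as tmp:
    city_dir = Path(tmp) / "cities"
    city_dir.mkdir()
    (city_dir / "123-3.tsv").write_text("Timestep\tMedian\n0\t1.5\n", encoding="utf-8")
    loaded = load_all(tmp)
```

All values come back as strings here; converting them to the documented datatypes could be layered on top once the column descriptions exist.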
Hints
{area-type}/{filename}.tsv
-median
or maybe this difference?AC