Data handling and documentation

stijnvanhoey commented 4 years ago

@JennaVergeynst and @twallema

While preparing the documentation on the new layout of the repository, I'm trying to make sure the data folder gets more structured, see https://github.com/stijnvanhoey/COVID19-Model/tree/cookiecutter#using-data

For the moment I just moved the data into the raw folder, except for incubation.csv. However, I'm not sure if this is correct and I have some additional questions:

1. The notebook DataExtraction does not write anything to disk, so is it effectively used in the analysis?
2. I think @CyrilGarneau added the economical data sets? What are the URLs or download points for each of them? Are these the raw formats as downloaded from the website? Is my grouping into this economical directory a good division?
3. The Interaction_matrices data comes from https://lwillem.shinyapps.io/socrates_rshiny/ according to the notebook. Have they been downloaded manually? Are these the raw formats, or has any transformation been done already? Could we maybe download them using code and write a small snippet for it?
4. The incubation.csv data set apparently comes from "incubation period is assumed to be Erlang distributed as reported by Li et al. (2020a)". Is there a small code snippet of its creation somewhere? Is it actually required to have it as a data file, or could we do the extraction by creating a function that samples the distribution, e.g. using https://docs.scipy.org/doc/scipy-0.16.0/reference/generated/scipy.stats.erlang.html?
5. Are the data sets Age pyramid of Belgium.csv, contacts.Rdata and imperialCollegeAgeDist.csv actually used, or can these be removed?

├── interim
│   └── incubation.csv
├── raw
│   ├── Age pyramid of Belgium.csv
│   ├── contacts.Rdata
│   ├── imperialCollegeAgeDist.csv
│   ├── economical
│   │   ├── Employment - annual detailed data - Domestic concept - A38.xlsx
│   │   ├── GDP_Belgium_per sector.xlsx
│   │   ├── input-output.xlsx
│   │   ├── Sectoral_data.xlsx
│   │   ├── Staff distribution by sector.xlsx
│   │   └── Supply and use table - Belgium.xlsx
│   └── Interaction_matrices
│       ├── Belgium
│       ├── France
│       ├── GBR
│       ├── Germany
│       ├── Italy
│       ├── Spain
│       └── USA

JennaVergeynst commented 4 years ago

The information from Age pyramid of Belgium.csv is hardcoded in the file model.py (FullPop, StudentPop, ElderPop, with 20 and 70 years as limits). @twallema might still need these data for the age-layered model?
Sectoral_data.xlsx we received via mail from Gert: he "downloaded sectoral data for i) value added and ii) employment that could be used for your calibrations. The sources are the "national accounts" and "employment" statistics of the National Bank of Belgium (NBB). These are annual values for 2018 (2019 is not yet available). In the spreadsheet, I have also included the value added for 2015 (in case you want to compare with input-output below)".
input-output data idem, but this can also be downloaded directly from the source: https://www.plan.be/databases/io2015/vr64_en_20181217.xlsx (only constructed every 5 years, so these are for 2015)
staff distribution is data collected by the GEES (also received via mail) and represents the situation in mid-April. (The data shows, for each sector, the distribution of staff working from home (telework), at the workplace, being unemployed, etc. The sectors cover about 70% of total private employment.)
other data I don't know... @CyrilGarneau ? Remaining questions need a look from @twallema

JennaVergeynst commented 4 years ago

These data descriptions should probably come in a readme with a section on data description, I suppose?

jorisvandenbossche commented 4 years ago

> These data descriptions should probably come in a readme with a section on data description, I suppose?

Or a dedicated README in the /data directory

stijnvanhoey commented 4 years ago

As @jorisvandenbossche mentions, https://github.com/stijnvanhoey/COVID19-Model/tree/cookiecutter/data is currently being prepared in the PR. I should have mentioned that.

Putting these descriptions inside the general readme can quickly overload it with info, so I'd rather put them close to the data.

twallema commented 4 years ago

@stijnvanhoey
1) The DataExtraction notebook is not used for analysis; it is a demo of Sciensano data extraction, so it's a form of 'documentation' (?).
2) Grouping of economic files is correct.
3) Contact data originates from the file 'contacts.Rdata', which was made public in the following publication: https://www.thelancet.com/journals/lanpub/article/PIIS2468-2667(20)30073-6/fulltext . With regard to our age-layered deterministic model, this is a replication of said paper in the Lancet.

twallema commented 4 years ago

4) I think the Erlang parameters were not given in Li et al., so I decided to convert the figure of the distribution into a CSV using an online tool. Next, the following code is used to sample from the distribution.

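    import numpy
    import pandas as pd
    from random import choices

    def sampleFromDistribution(self, filename, k):
        df = pd.read_csv(filename)
        x = df.iloc[:, 0]  # digitized x-values of the distribution
        y = df.iloc[:, 1]  # digitized y-values, used as sampling weights
        # Draw k values from x with replacement, weighted by y
        return numpy.asarray(choices(x, y, k=k))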

First, this is not very elegant. Second, the use of this distribution will most likely be changed or omitted in future work. Ideally, there would be a non-hardcoded option to sample selected parameters from a gamma/Erlang distribution.
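
A parametric version of that idea could look roughly like the sketch below. It assumes NumPy's `Generator.gamma`; the function name `sample_incubation_period` and the shape/scale values in the usage note are placeholders for illustration and are not taken from Li et al. An Erlang distribution is a gamma distribution with an integer shape parameter, so the same function covers both cases.

```python
import numpy as np


def sample_incubation_period(shape, scale, k, seed=None):
    """Draw k samples from a gamma distribution (hypothetical helper).

    With an integer `shape`, the samples follow an Erlang distribution.
    """
    if shape <= 0 or scale <= 0:
        raise ValueError("shape and scale must be positive")
    # A seeded generator makes the draws reproducible
    rng = np.random.default_rng(seed)
    return rng.gamma(shape, scale, size=k)
```

For example, `sample_incubation_period(shape=2, scale=2.6, k=1000, seed=0)` returns 1000 positive values whose mean is close to `shape * scale`. The actual parameters would still have to be fitted to the digitized curve in incubation.csv or taken from a source that reports them.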

5) The age pyramid of Belgium is used by Cyril's economic model but may be omitted. The Imperial College age-distributed parameters are not yet used but should be retained.

JennaVergeynst commented 4 years ago

@twallema I think most of this issue has been solved? Or are there any elements that still need some work? If not, it can be closed.

UGentBiomath / COVID19-Model

Data handling and documentation #32