UGentBiomath / COVID19-Model

Compartmental SEIQRD model to model the effects of government policies on SARS-CoV-2 spread in Belgium. Macro-economic Input-Output model to assess the economic impact of sector closure and changes in consumption patterns. Quality-adjusted life-years model to assess the health economic impact of SARS-CoV-2.
MIT License

Data handling and documentation #32

Closed stijnvanhoey closed 4 years ago

stijnvanhoey commented 4 years ago

@JennaVergeynst and @twallema

While preparing the documentation on the new layout of the repository, I'm trying to make sure the data folder gets more structured, see https://github.com/stijnvanhoey/COVID19-Model/tree/cookiecutter#using-data

For the moment I just moved the data into the raw folder, except for incubation.csv. However, I'm not sure this is correct, and I have some additional questions:

├── interim
│   └── incubation.csv
├── raw
│   ├── Age pyramid of Belgium.csv
│   ├── contacts.Rdata
│   ├── imperialCollegeAgeDist.csv
│   ├── economical
│   │   ├── Employment - annual detailed data - Domestic concept - A38.xlsx
│   │   ├── GDP_Belgium_per sector.xlsx
│   │   ├── input-output.xlsx
│   │   ├── Sectoral_data.xlsx
│   │   ├── Staff distribution by sector.xlsx
│   │   └── Supply and use table - Belgium.xlsx
│   └── Interaction_matrices
│       ├── Belgium
│       ├── France
│       ├── GBR
│       ├── Germany
│       ├── Italy
│       ├── Spain
│       └── USA
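
One way code could address this raw/interim layout is sketched below. It assumes the tree above sits in a top-level `data` directory; the `data_path` helper and the `DATA_DIR` constant are hypothetical names used for illustration, not part of the repository.

```python
from pathlib import Path

# Assumption: the tree above lives in a top-level "data" directory.
DATA_DIR = Path("data")


def data_path(stage, *parts, root=DATA_DIR):
    """Build a path inside data/raw or data/interim (hypothetical helper)."""
    if stage not in ("raw", "interim"):
        raise ValueError(f"unknown data stage: {stage!r}")
    return Path(root).joinpath(stage, *parts)


# File names taken from the tree above.
incubation_csv = data_path("interim", "incubation.csv")
io_table = data_path("raw", "economical", "input-output.xlsx")
```

Keeping the stage name as an explicit argument makes it visible at each call site whether a script reads untouched source data or a derived file.
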
JennaVergeynst commented 4 years ago

These data descriptions should probably come in a readme with a section on data description, I suppose?

jorisvandenbossche commented 4 years ago

> These data descriptions should probably come in a readme with a section on data description, I suppose?

Or a dedicated README in the /data directory

stijnvanhoey commented 4 years ago

As @jorisvandenbossche mentions, https://github.com/stijnvanhoey/COVID19-Model/tree/cookiecutter/data is currently being prepared in the PR. I should have mentioned that.

Putting these descriptions inside the general readme would quickly overload it with information, so I'd rather put them close to the data.

twallema commented 4 years ago

@stijnvanhoey

1) The DataExtraction notebook is not used for analysis; it is a demo of Sciensano data extraction, so it's a form of 'documentation' (?).

2) The grouping of the economic files is correct.

3) The contact data originates from the file 'contacts.Rdata', which was made public in the following publication: https://www.thelancet.com/journals/lanpub/article/PIIS2468-2667(20)30073-6/fulltext . With regard to our age-layered deterministic model, this is a replication of said paper in the Lancet.

twallema commented 4 years ago

4) I think the Erlang parameters were not given in Li et al., so I decided to convert the figure of the distribution into a csv using an online tool. Next, the following code is used to sample from the distribution.

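    import numpy
    import pandas as pd
    from random import choices

    def sampleFromDistribution(self, filename, k):
        # Column 0 holds the digitized x values, column 1 their relative weights
        df = pd.read_csv(filename)
        x = df.iloc[:, 0]
        y = df.iloc[:, 1]
        return numpy.asarray(choices(x, weights=y, k=k))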

First, this is not very elegant. Second, the use of this distribution will most likely be changed or omitted in future work. Ideally, there would be a non-hardcoded option to sample selected parameters from a gamma/Erlang distribution.
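
A minimal sketch of what such a non-hardcoded option could look like, assuming the shape and scale of the distribution are known or can be fitted. The function name `sample_gamma` and the parameter values are hypothetical and are not taken from Li et al.; an Erlang distribution is the special case of a gamma distribution with a positive integer shape.

```python
import numpy as np


def sample_gamma(shape, scale, k, seed=None):
    """Draw k samples from a gamma distribution (Erlang if shape is a positive integer)."""
    if shape <= 0 or scale <= 0:
        raise ValueError("shape and scale must be positive")
    rng = np.random.default_rng(seed)
    return rng.gamma(shape, scale, size=k)


# Illustrative values only; replace with parameters fitted to the actual data.
samples = sample_gamma(shape=3, scale=1.7, k=1000, seed=0)
```

Passing the parameters explicitly would remove the dependency on a digitized csv file, and the optional seed keeps simulation runs reproducible.
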

5) The age pyramid of Belgium is used by Cyril's economic model but may be omitted. The Imperial College age-distributed parameters are not used yet but should be retained.

JennaVergeynst commented 4 years ago

@twallema I think most of this issue has been solved? Or are there any elements that still need some work? If not, it can be closed.