As currently coded, the population data and population structure are read from the input data in several different ways:
in scot_data: the total population size is provided for each Scottish health board and for Scotland (first column), together with the daily cumulative number of cases.
in scot_death: the total population size for each Scottish health board and for Scotland is also recorded in the first column, together with the daily cumulative number of deaths. This duplicates what is recorded in scot_data. scot_death is, however, not required in prediction mode.
in scot_age: the age structure is given, per age group, for each Scottish health board and for Scotland.
in parameters.ini: the total number of HCW in Scotland is given, and post-processing within the model computes the number of HCW in a health board as a direct function of the number of inhabitants in that health board.
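To make the HCW post-processing concrete, here is a minimal sketch that assumes the "direct function" is a simple proportional scaling by population. That assumption, the function name `hcw_in_region`, and the numbers used are all hypothetical and are not taken from the model code or its input files.

```python
def hcw_in_region(total_hcw, total_population, region_population):
    """Scale a national HCW count to one region in proportion to population.

    Assumption: HCW are distributed proportionally to inhabitants. The actual
    model may use a different function.
    """
    if total_population <= 0:
        raise ValueError("total_population must be positive")
    if not 0 <= region_population <= total_population:
        raise ValueError("region_population must lie between 0 and total_population")
    return round(total_hcw * region_population / total_population)


# Illustrative numbers only.
print(hcw_in_region(total_hcw=1000, total_population=50000, region_population=5000))
```

If the HCW count for the region were instead supplied directly, as proposed further down, this scaling step would no longer be needed.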
These features cause long-term issues:
there is duplication of information
it limits our ability to extend the model to other health boards across the UK
it limits our ability to perform cross-validation and UQ
it complicates the integration with the pipeline
it limits our ability to use better information on HCW
I propose reforming our current input data structure to simplify the data requirements:
total population size and population of HCW should be provided in parameters.ini or supplied directly by the data pipeline
age structure of the study population should be given in scot_data, but not together with other populations (that is, only one set of values, corresponding to the proportion of people per age group in the region of interest)
the first column of scot_data and scot_death should be removed; as a result, neither file will be needed in prediction mode
scot_data and scot_death could be combined into a single file giving the epidemiological data required for inference for the region of interest (health board A, B, ..., Scotland, or the UK), with a structure such as:
| day | cumul_n_cases | cumul_n_death |
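As a rough illustration of the combined file, the sketch below merges two cumulative series for a single region into rows with the three columns above. The function names, the zero-based day index, the CSV output, and the sample values are assumptions made for the example, not part of the existing code or data.

```python
import csv
import io


def combine_series(cumul_cases, cumul_deaths):
    """Merge per-day cumulative cases and deaths for one region into rows."""
    if len(cumul_cases) != len(cumul_deaths):
        raise ValueError("both series must cover the same number of days")
    rows = [("day", "cumul_n_cases", "cumul_n_death")]
    # Assumption: days are numbered from 0 in the order they appear.
    for day, (cases, deaths) in enumerate(zip(cumul_cases, cumul_deaths)):
        rows.append((day, cases, deaths))
    return rows


def to_csv(rows):
    """Serialise the rows as CSV text (the output format is an assumption)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Made-up values for a single region of interest.
print(to_csv(combine_series([1, 3, 6], [0, 0, 1])))
```

A file of this shape holds only the region of interest, so the same reader would work for a health board, Scotland, or the UK without any change to the column layout.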