This repository provides an RO-Crate of the BY-COVID WP5 T5.2 baseline use case on "SARS-CoV-2 Vaccine(s) effectiveness in preventing SARS-CoV-2 infection".
The RO-Crate can be browsed at https://by-covid.github.io/BY-COVID_WP5_T5.2_baseline-use-case/ (HTML) and as `ro-crate-metadata.json`.
Task 5.2 aims to demonstrate how mobilisation of real-world population, health and care data across national borders can provide answers to policy-relevant research questions. Eventually, it aims to prototype a standard workflow for population health research. Here, the research question is approached by identifying a causal effect that allows a public health intervention to be evaluated. As such, a methodology for approaching causal inference when conducting federated research is proposed and demonstrated, guaranteeing different layers of interoperability (i.e., legal, organisational, and semantic interoperability).
The methodological framework comprises the following steps:
The current use case aims to answer the following research question: "How effective have the SARS-CoV-2 vaccination programmes been in preventing SARS-CoV-2 infections?"
For more information, please consult the study protocol.
The current repository contains the following pieces:
[^readme-1]: For illustrative purposes, the interactive reports contain the output of the scripts of the analytical pipeline applied to the synthetic dataset.
The first set of digital objects provides the conceptual basis for developing the analytical pipeline; these objects have been published together as the Common Data Model Specification.
A research team constructs a causal model responding to the proposed research question. A Quarto RMarkdown script (`vaccine_effectiveness.QMD`) produces the structural causal model (DAG) in an interactive HTML report (`vaccine_effectiveness_causal_model.html`).
A human translates the causal model into data requirements (no technical link) and constructs a Common Data Model (CDM) in a human-readable version (`vaccine_effectiveness_data_model_specification.xlsx`). A metadata file compliant with schema.org is produced using dataspice in both human-readable (`vaccine_effectiveness_synthetic_dataset_spice.html`) and machine-readable (`dataspice.json`) format.
The data model specification is translated by a human (no technical link) into a Python script (Jupyter notebook) to generate the synthetic data (`by-covid_wp5_baseline_generate_synthetic_data_v.1.1.0.ipynb`). The script outputs (technical link) the synthetic data (`vaccines_effectiveness_synthetic_dataset_pop_650k.csv`). An interactive report with the Exploratory Data Analysis (EDA) of the synthetic data is created using pandas-profiling (now ydata-profiling) in both human-readable (`vaccine_effectiveness_synthetic_dataset_eda.html`) and machine-readable (`vaccine_effectiveness_synthetic_dataset_eda.json`) format.
The next set of digital objects are the consecutive scripts of the analytical pipeline. The individual scripts of the analytical pipeline are technically linked to each other. More information on the methodology can be found in the documentation. For illustrative purposes, the interactive reports as output of the analytical pipeline when applied to the synthetic dataset are provided.
Script: `0_global.R`
Input: `vaccine_effectiveness_synthetic_pop_10k_v.1.1.1.csv`
Output: `cohort_data`
A DuckDB database file is created (`BY-COVID-WP5-BaselineUseCase-VE.duckdb`). Data are imported from a csv file (e.g. `vaccine_effectiveness_synthetic_pop_10k_v.1.1.1.csv`) using the R package `Arrow` and inserted into the `cohort_data` database table within `BY-COVID-WP5-BaselineUseCase-VE.duckdb`. When reading the data, data types are manually specified in a schema according to the Common Data Model Specification.
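The idea of reading a CSV file against an explicit schema can be illustrated with a minimal, standard-library Python sketch. It is not the pipeline's code (the pipeline performs this step in R with Arrow and DuckDB). The function names and every column in `SCHEMA` except `previous_infection_bl` are hypothetical placeholders, not the Common Data Model Specification.

```python
import csv
import io
from datetime import date


def parse_bool(text):
    """Parse 'TRUE'/'FALSE' (case-insensitive) into a Python bool."""
    value = text.strip().upper()
    if value not in ("TRUE", "FALSE"):
        raise ValueError(f"not a logical value: {text!r}")
    return value == "TRUE"


# Hypothetical schema (assumption): column name -> converter for that column.
SCHEMA = {
    "person_id": str,
    "age_nm": int,
    "fully_vaccinated_dt": date.fromisoformat,
    "previous_infection_bl": parse_bool,
}


def read_typed_csv(handle, schema):
    """Read CSV rows, converting each column as declared; empty cells become None."""
    rows = []
    for raw in csv.DictReader(handle):
        row = {}
        for column, convert in schema.items():
            cell = raw[column]
            row[column] = convert(cell) if cell != "" else None
        rows.append(row)
    return rows


# Small in-memory example standing in for the input csv file.
sample = io.StringIO(
    "person_id,age_nm,fully_vaccinated_dt,previous_infection_bl\n"
    "p1,54,2021-03-01,FALSE\n"
    "p2,,,TRUE\n"
)
cohort_data = read_typed_csv(sample, SCHEMA)
```

Declaring the types up front means a malformed value raises an error at import time instead of silently becoming text.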
Script: `1_DQA.QMD`
Input: `cohort_data`
Output: `DQA.html`
A data quality assessment on the `cohort_data` is performed and an interactive html report (`DQA.html`) is created. This report provides an overview of the data and includes dataset statistics, variable types, missing data profiles and potential alerts.
Script: `2_validation.QMD`
Input: `cohort_data`
Output: `cohort_data` (including `flag_violation_val`); `validation.html`
The `cohort_data` are tested against a set of validation rules (as specified in the Common Data Model Specification) and the results of this validation process are summarised in an interactive html report (`validation.html`). A logical variable `flag_violation_val` is created in the `cohort_data` table in the `BY-COVID-WP5-BaselineUseCase-VE.duckdb` DuckDB database and set to `TRUE` when at least one of the validation rules in the pre-specified set is violated (otherwise this variable is set to `FALSE`).
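The flagging logic (a record is flagged as soon as at least one rule fails) can be sketched in plain Python as follows. This is an illustration only: the two rules, the column names `age_nm`, `dose_1_dt` and `dose_2_dt`, and the function name are assumptions and do not reproduce the rule set of the Common Data Model Specification.

```python
from datetime import date

# Hypothetical rules (assumption): name -> predicate that is True when the record passes.
# Missing values pass here; missingness is handled in a separate step.
RULES = {
    "age_in_range": lambda r: r["age_nm"] is None or 0 <= r["age_nm"] <= 115,
    "dose_order": lambda r: (
        r["dose_1_dt"] is None
        or r["dose_2_dt"] is None
        or r["dose_1_dt"] <= r["dose_2_dt"]
    ),
}


def flag_violations(records, rules):
    """Set flag_violation_val to True when at least one rule is violated."""
    for record in records:
        record["flag_violation_val"] = any(
            not passes(record) for passes in rules.values()
        )
    return records


records = flag_violations(
    [
        {"age_nm": 54, "dose_1_dt": date(2021, 1, 10), "dose_2_dt": date(2021, 2, 1)},
        {"age_nm": 130, "dose_1_dt": None, "dose_2_dt": None},
        {"age_nm": 40, "dose_1_dt": date(2021, 3, 1), "dose_2_dt": date(2021, 2, 1)},
    ],
    RULES,
)
```

Keeping the rules in a named mapping makes it straightforward to report which rule failed, in addition to the single logical flag.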
Script: `3_imputation.QMD`
Input: `cohort_data`
Output: `cohort_data` (including `flag_listwise_del`); `cohort_data_imputed`; `imputation_methods`; `imputation.html`
For each variable in `cohort_data`, different checks are conducted, based on which a decision is made on how to handle missing values. A logical variable `flag_listwise_del` is created in the `cohort_data` table in the `BY-COVID-WP5-BaselineUseCase-VE.duckdb` DuckDB database and set to `TRUE` for records in which the value of the variable under consideration is missing and `imputation_method=='Listwise deletion where core variable has missing values (MCAR reasonable)'`. Missing values of the variables selected for imputation are imputed using the R package `mice`, resulting in an imputed dataset. From this dataset, the records with imputed values are filtered and saved in a separate database table `cohort_data_imputed` in the `BY-COVID-WP5-BaselineUseCase-VE.duckdb` DuckDB database. Variables with a high degree of missingness are not included as matching variables. A report (`imputation.html`) is generated summarising the results of the different checks and the methods used for dealing with missing values.
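A minimal Python sketch of the listwise-deletion flag is given below. It is illustrative only (the pipeline implements this in R). Representing the imputation methods as a dictionary, the variable names `sex_cd` and `age_nm`, and the label `"Imputation"` are assumptions; only the listwise-deletion label is taken from the description above.

```python
LISTWISE = "Listwise deletion where core variable has missing values (MCAR reasonable)"


def flag_listwise_deletion(records, imputation_methods):
    """Flag records with a missing value in any variable handled by listwise deletion.

    imputation_methods maps a variable name to the method chosen for it
    (a hypothetical in-memory stand-in for a table of chosen methods).
    """
    listwise_vars = [
        variable
        for variable, method in imputation_methods.items()
        if method == LISTWISE
    ]
    for record in records:
        record["flag_listwise_del"] = any(
            record.get(variable) is None for variable in listwise_vars
        )
    return records


# Hypothetical example: sex_cd is a core variable, age_nm is imputed instead.
methods = {"sex_cd": LISTWISE, "age_nm": "Imputation"}
records = flag_listwise_deletion(
    [
        {"sex_cd": "1", "age_nm": 54},
        {"sex_cd": None, "age_nm": 61},
        {"sex_cd": "2", "age_nm": None},
    ],
    methods,
)
```

In this example only the second record is flagged: its missing value is in a variable assigned to listwise deletion, whereas the missing `age_nm` in the third record is left for imputation.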
Script: `4_matching.QMD` (sourcing `4_matching.R`)
Input: `cohort_data` (including `flag_violation_val` and `flag_listwise_del`); `cohort_data_imputed`; `imputation_methods`
Output: `group_similarity`; `result_matching_alg`; `matched_data`; `matching.html`
In the script `4_matching.R`, the variables needed for matching are created from existing variables in `cohort_data` and `cohort_data_imputed`. Records from individuals with a previous infection (`previous_infection_bl==TRUE`), records violating one of the 'essential' validation rules (`flag_violation_val==TRUE`) and records set to be listwise deleted (`flag_listwise_del==TRUE`) are excluded. The matching is conducted using the R package `MatchIt`. A new table `group_similarity` is created in the `BY-COVID-WP5-BaselineUseCase-VE.duckdb` DuckDB database, containing for each `group_id` the 10 nearest matched groups and the corresponding distances. The matching algorithm iterates over the set of unique days during the enrollment period on which a newly vaccinated individual (i.e. one completing a primary vaccination schedule) is identified. The results obtained for each date are appended to a database table `result_matching_alg` in the `BY-COVID-WP5-BaselineUseCase-VE.duckdb` DuckDB database, in which one record corresponds to one matched pair. A new table `matched_data` is subsequently created in the same database, with two records per match (i.e., one for the case and one for the control). After matching (i.e., once `4_matching.R` has finished), the covariate balance is assessed and summarised in an interactive report (`matching.html`).
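Two of the steps described above, the exclusion of ineligible records and the reshaping from one record per matched pair to two records per match, can be sketched in plain Python. This is not the matching algorithm itself (which relies on `MatchIt` in R). The function names and the column names `person_id`, `case_id`, `control_id`, `match_id` and `role` are hypothetical.

```python
def is_eligible(record):
    """A record is kept only if none of the three exclusion conditions holds."""
    return not (
        record["previous_infection_bl"]
        or record["flag_violation_val"]
        or record["flag_listwise_del"]
    )


def expand_pairs(pairs):
    """Turn one record per matched pair into two records per match."""
    matched = []
    for match_id, pair in enumerate(pairs, start=1):
        matched.append(
            {"match_id": match_id, "person_id": pair["case_id"], "role": "case"}
        )
        matched.append(
            {"match_id": match_id, "person_id": pair["control_id"], "role": "control"}
        )
    return matched


cohort = [
    {"person_id": "p1", "previous_infection_bl": False,
     "flag_violation_val": False, "flag_listwise_del": False},
    {"person_id": "p2", "previous_infection_bl": True,
     "flag_violation_val": False, "flag_listwise_del": False},
    {"person_id": "p3", "previous_infection_bl": False,
     "flag_violation_val": False, "flag_listwise_del": False},
]
eligible_ids = [r["person_id"] for r in cohort if is_eligible(r)]

# Hypothetical matching result: one record per matched pair.
pairs = [{"case_id": "p1", "control_id": "p3"}]
matched_data = expand_pairs(pairs)
```

The long format (one row per person, linked by a match identifier) is the shape typically expected by survival-analysis routines that compare cases with controls.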
Script: `5_descriptives.QMD`
Input: `cohort_data` (including `flag_violation_val` and `flag_listwise_del`); `cohort_data_imputed`; `matched_data`; `imputation_methods`
Output: `descriptive.html`
The descriptive analysis contains four elements, which are reported in `descriptive.html`: a description of the considered time periods (data extraction period, enrollment period and study period), the results of a survival analysis in the unmatched population (adjusted and unadjusted), a flowchart describing the study population selection (CONSORT diagram) and a table with the baseline characteristics of the matched study population by intervention group.
Script: `6_survival-analysis.QMD`
Input: `matched_data`
Output: `survival-analysis.html`; `results-survival-analysis-<country>.xlsx`
A survival analysis is conducted in the matched study population (`matched_data`). A hazard ratio (HR), the Restricted Mean Survival Time (RMST) and the Restricted Mean Time Lost (RMTL) are reported in `survival-analysis.html`. Aggregated non-sensitive results for meta-analysis are written to `results-survival-analysis-<country>.xlsx`.
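For readers unfamiliar with RMST and RMTL, the following standard-library Python sketch shows the usual definitions: the RMST is the area under the Kaplan-Meier survival curve up to a time horizon `tau`, and the RMTL is `tau` minus the RMST. It is a didactic sketch, not the pipeline's R implementation. The function names, the toy data and the choice of `tau` are assumptions.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event.

    times: follow-up time per individual; events: 1 if the event occurred, 0 if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_removed = 0
        # Group all individuals sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_removed += 1
            i += 1
        if n_events:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_removed
    return curve


def rmst(curve, tau):
    """Area under the step-shaped survival curve between 0 and tau."""
    area = 0.0
    prev_t, prev_s = 0.0, 1.0
    for t, s in curve:
        if t >= tau:
            break
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    return area + prev_s * (tau - prev_t)


# Toy data (assumption): five individuals, two of them censored.
times = [2, 4, 4, 6, 8]
events = [1, 1, 0, 1, 0]
tau = 7
curve = kaplan_meier(times, events)
rmst_value = rmst(curve, tau)
rmtl_value = tau - rmst_value
```

With these toy data the survival curve drops to 0.8, 0.6 and 0.3 at times 2, 4 and 6, which gives an RMST of 5.1 and an RMTL of 1.9 time units at `tau = 7`.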
This analytical pipeline has been developed and tested in R (version 4.2.1) using RStudio Desktop as an IDE (version 2022.07.1). To execute the analytical pipeline using the synthetic data or your own input data compliant with the Common Data Model specification, the dependencies and required installation steps are described below.
For testing purposes we assume a similar environment to the development environment.
The development environment included several R packages (from base R or CRAN) and the use of the R project file (.Rproj) included with the scripts. The required R packages and the version used for developing and testing the analytical pipeline:
To run the analytical pipeline with the required dependencies, one of four methods can be adopted: (1) installing the R packages manually, (2) using the renv reproducible environment, (3) running the Docker image, or (4) using Conda/Mamba.
Download the ZIP file of the repository using the following link: https://github.com/MarjanMeurisse/BY-COVID_WP5_T5.2_baseline-use-case/archive/refs/heads/main.zip
Extract all files from the ZIP file and open the R project file contained within the folder `BY-COVID_WP5_T5.2_baseline-use-case-main/vaccine_effectiveness_analytical_pipeline` in RStudio.
Install the required R packages. To install a specific version of an R package from source, the following R command can be used (example for the R package dplyr, version 1.1.2):
packageurl <- "https://cran.r-project.org/src/contrib/Archive/dplyr/dplyr_1.1.2.tar.gz"
install.packages(packageurl, repos=NULL, type="source")
Input data, compliant with the Common Data Model specification, should be provided as the only file within the folder `BY-COVID_WP5_T5.2_baseline-use-case-main/vaccine_effectiveness_analytical_pipeline/input`. You can test the analytical pipeline with the synthetic data already provided within this folder (`vaccine_effectiveness_synthetic_pop_10k_v.1.1.1.csv`) or replace this file with your own real-world data.
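Because the input folder is expected to hold exactly one file, a small pre-flight check can catch a forgotten or leftover file before the pipeline is rendered. The Python sketch below is an optional helper, not part of the repository; the function name `find_input_file` is hypothetical.

```python
from pathlib import Path


def find_input_file(input_dir):
    """Return the single file in input_dir, or raise if there is not exactly one.

    Note: every regular file counts, including hidden files.
    """
    files = sorted(p for p in Path(input_dir).iterdir() if p.is_file())
    if len(files) != 1:
        raise ValueError(
            f"expected exactly one file in {input_dir}, found {len(files)}"
        )
    return files[0]
```

For example, calling `find_input_file("input")` from within the `vaccine_effectiveness_analytical_pipeline` folder would return the path of the input file, or raise a `ValueError` if the folder is empty or holds several files.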
The file `BY-COVID_WP5_T5.2_baseline-use-case-main/vaccine_effectiveness_analytical_pipeline/scripts/analytical-pipeline.QMD` can be opened in RStudio and rendered to run the sequential steps of the analytical pipeline. Output files of the analytical pipeline (interactive html reports, xlsx file with aggregated output) are generated within the folder `BY-COVID_WP5_T5.2_baseline-use-case-main/vaccine_effectiveness_analytical_pipeline/output`.
Download the ZIP file of the repository using the following link: https://github.com/MarjanMeurisse/BY-COVID_WP5_T5.2_baseline-use-case/archive/refs/heads/main.zip
Extract all files from the ZIP file and open the R project file contained within the folder `BY-COVID_WP5_T5.2_baseline-use-case-main/vaccine_effectiveness_analytical_pipeline` in RStudio.
The R project uses renv. Use the following R command to check which packages are recorded in the lockfile but not installed:
renv::status()
Reproduce the testing environment by running the following R command:
renv::restore()
The metadata from the lockfile are used to install exactly the same version of every package.
Input data, compliant with the Common Data Model specification, should be provided as the only file within the folder `BY-COVID_WP5_T5.2_baseline-use-case-main/vaccine_effectiveness_analytical_pipeline/input`. You can test the analytical pipeline with the synthetic data already provided within this folder (`vaccine_effectiveness_synthetic_pop_10k_v.1.1.1.csv`) or replace this file with your own real-world data.
The file `BY-COVID_WP5_T5.2_baseline-use-case-main/vaccine_effectiveness_analytical_pipeline/scripts/analytical-pipeline.QMD` can be opened in RStudio and rendered to run the sequential steps of the analytical pipeline. Output files of the analytical pipeline (interactive html reports, xlsx file with aggregated output) are generated within the folder `BY-COVID_WP5_T5.2_baseline-use-case-main/vaccine_effectiveness_analytical_pipeline/output`.
It is possible to run the analytical pipeline as an isolated application using Docker.
You may skip the `docker build` command, in which case the latest container image is downloaded from GitHub.
cd vaccine_effectiveness_analytical_pipeline
# Enable below if you have modified scripts dependencies
#docker build -t ghcr.io/by-covid/vaccine_effectiveness_analytical_pipeline .
docker run -v `pwd`/input:/pipeline/input -v `pwd`/output:/pipeline/output -it ghcr.io/by-covid/vaccine_effectiveness_analytical_pipeline
Note that when using Docker in this way, the file permissions on your `output` folder may not match the permissions the container uses when writing outputs (tip: `chmod -R 777 output`).
Instead of using containers, it can be more convenient during development to use a Conda environment. The steps below assume that Miniconda has been installed and activated. To install the R packages listed in `environment.yml`, use:
cd vaccine_effectiveness_analytical_pipeline
conda env create
The above installs most of the R packages from Conda-Forge, avoiding a compilation phase. To install the remaining R packages from CRAN:
conda activate vaccine_effectiveness
Rscript install.R
Finally, to execute the main pipeline using Quarto:
conda activate vaccine_effectiveness
cd scripts
quarto render analytical-pipeline.QMD --execute --output-dir ../output/
This should populate `output/` with a series of HTML files.
Note: The `environment.yml` file is also used by the `Dockerfile` to install its dependencies, and may list R packages in newer versions than those listed above.
An Excel file with multiple sheets, each providing different outputs from the survival analysis, is generated by running the analytical pipeline locally (`results-survival-analysis-<country>.xlsx`). This Excel file contains information that can be used to compare vaccine effectiveness across sites (see Zenodo publication for local outputs and comparative analysis of three sites).
The Excel file contains the following sheets:
The survival probabilities, HR, RMST/RMTL, and ATE are estimated within subgroups determined by the vaccination schedule received by the case in the matched population (Vaccination_schedule):
The survival probabilities, HR, RMST/RMTL, and ATE are estimated within subgroups determined by the NUTS3 residence area of individuals (Residence_area):
v.1.0.0: Initial iteration of the analytical pipeline scripts
v.1.0.1: Minor adjustments
v.1.0.2:
A log file (logfile.txt) is created in the ./logs folder (system settings, timing, errors)
Sex and age group are no longer handled as continuous variables
Implement handling large proportions of missing data in core variables
Add aggregated output for meta-analysis
Matching based on individual-level SES when available
Adjusted hover in plotly graphs
Additional analysis implemented: survival in subgroups determined by vaccination schedule
Adjusted documentation
BY-COVID (Beyond COVID) is a Horizon Europe funded project (101046203).
Marjan Meurisse - marjan.meurisse@sciensano.be
Please note that we provide these scripts as they are, complying with the specifications of the BY-COVID WP5 baseline use case for the purposes and objectives specified within the baseline use case protocol. The software is provided as-is, without further support beyond the scope of the partners participating in BY-COVID WP5.