RO-Crate of BY-COVID Baseline Use Case
https://by-covid.github.io/BY-COVID_WP5_T5.2_baseline-use-case/
Creative Commons Attribution 4.0 International

BY-COVID - WP5 - Baseline Use Case: SARS-CoV-2 vaccine effectiveness assessment

This repository packages the BY-COVID WP5 T5.2 baseline use case on "SARS-CoV-2 Vaccine(s) effectiveness in preventing SARS-CoV-2 infection" as an RO-Crate.

RO-Crate preview

The RO-Crate can be browsed at https://by-covid.github.io/BY-COVID_WP5_T5.2_baseline-use-case/ (HTML) and as ro-crate-metadata.json

Description

Background information

Task 5.2 aims to demonstrate how mobilisation of real-world population, health and care data across national borders can provide answers to policy-relevant research questions. Ultimately, it aims to prototype a standard workflow for population health research. Here, the research question is approached by identifying a causal effect that allows a public health intervention to be evaluated. To that end, a methodology for causal inference in federated research is proposed and demonstrated, guaranteeing different layers of interoperability (i.e., legal, organisational, and semantic interoperability).

The methodological framework comprises the following steps:

Use case

The current use case aims to answer the following research question: "How effective have the SARS-CoV-2 vaccination programmes been in preventing SARS-CoV-2 infections?"

For more information, please consult the study protocol.

Overview of content

The current repository contains the following pieces:

[^readme-1]: For illustrative purposes, the interactive reports contain the output of the scripts of the analytical pipeline applied to the synthetic dataset.

Step-by-step

Conceptual phase: the Common Data Model Specification

The first set of digital objects describes, at a conceptual level, what is needed to develop the analytical pipeline. These objects have been published together as the Common Data Model Specification.

The causal model

A research team constructs a causal model responding to the proposed research question. A Quarto RMarkdown script (vaccine_effectiveness.QMD) produces the structural causal model (DAG) in an interactive HTML report (vaccine_effectiveness_causal_model.html).

The data model specification

A human translates the causal model into data requirements (no technical link) and constructs a CDM in a human-readable version (vaccine_effectiveness_data_model_specification.xlsx). A metadata file compliant with schema.org is produced using dataspice in both human-readable (vaccine_effectiveness_synthetic_dataset_spice.html) and machine-readable (dataspice.json) format.

The synthetic data

The data model specification is translated by a human (no technical link) into a Python script (Jupyter notebook) to generate the synthetic data (by-covid_wp5_baseline_generate_synthetic_data_v.1.1.0.ipynb). The script outputs (technical link) the synthetic data (vaccines_effectiveness_synthetic_dataset_pop_650k.csv). An interactive report with the Exploratory Data Analysis (EDA) of the synthetic data is created using pandas-profiling (now ydata-profiling) in both human-readable (vaccine_effectiveness_synthetic_dataset_eda.html) and machine-readable format (vaccine_effectiveness_synthetic_dataset_eda.json).

Implementation: the Analytical Pipeline

The next set of digital objects comprises the consecutive scripts of the analytical pipeline. The individual scripts are technically linked to each other. More information on the methodology can be found in the documentation. For illustrative purposes, the interactive reports produced by the analytical pipeline when applied to the synthetic dataset are provided.

Loading of data

Script: 0_global.R

A DuckDB database file is created (BY-COVID-WP5-BaselineUseCase-VE.duckdb). Data are imported from a CSV file (e.g. vaccine_effectiveness_synthetic_pop_10k_v.1.1.1.csv) using the R package Arrow and inserted into the cohort_data database table within BY-COVID-WP5-BaselineUseCase-VE.duckdb. When reading the data, data types are specified manually through a schema that follows the Common Data Model Specification.
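The idea of reading a CSV file against an explicit schema can be illustrated outside R. The following is a minimal Python sketch using only the standard library, not the pipeline's own code (which relies on Arrow and DuckDB). The column names person_id, age_nm and fully_vaccinated_dt, as well as the helper names, are hypothetical; only previous_infection_bl appears in this README.

```python
import csv
import io
from datetime import date


def parse_bool(text):
    """Map 'TRUE'/'FALSE' strings to bool; an empty string means missing."""
    if text == "":
        return None
    return text.strip().upper() == "TRUE"


def parse_date(text):
    """Parse an ISO date (YYYY-MM-DD); an empty string means missing."""
    return date.fromisoformat(text) if text else None


def parse_int(text):
    """Parse an integer; an empty string means missing."""
    return int(text) if text else None


# Hypothetical excerpt of a schema: one converter per expected column.
SCHEMA = {
    "person_id": str,
    "age_nm": parse_int,
    "previous_infection_bl": parse_bool,
    "fully_vaccinated_dt": parse_date,
}


def read_with_schema(handle, schema):
    """Read CSV rows and cast every column listed in the schema."""
    reader = csv.DictReader(handle)
    missing = set(schema) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"columns missing from input: {sorted(missing)}")
    return [
        {name: cast(row[name]) for name, cast in schema.items()}
        for row in reader
    ]


# Tiny in-memory stand-in for the input CSV file.
sample = io.StringIO(
    "person_id,age_nm,previous_infection_bl,fully_vaccinated_dt\n"
    "a1,34,FALSE,2021-03-01\n"
    "a2,,TRUE,\n"
)
cohort_data = read_with_schema(sample, SCHEMA)
print(cohort_data)
```

Declaring the types up front, rather than letting the reader guess them, means that a malformed value fails loudly at load time instead of silently becoming text.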

Data quality assessment

Script: 1_DQA.QMD

A data quality assessment on the cohort_data is performed and an interactive html report (DQA.html) is created. This report provides an overview of the data and includes dataset statistics, variable types, missing data profiles and potential alerts.

Validation

Script: 2_validation.QMD

The cohort_data are tested against a set of validation rules (as specified in the Common Data Model Specification) and the results of this validation process are summarised in an interactive html report (validation.html). A logical variable flag_violation_val is created in the cohort_data table in the BY-COVID-WP5-BaselineUseCase-VE.duckdb DuckDB database and set to TRUE when at least one of the validation rules in the pre-specified set is violated (otherwise this variable is set to FALSE).
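The flagging logic described above (TRUE when at least one rule is violated, FALSE otherwise) can be sketched as follows. This is an illustrative Python sketch, not the pipeline's R implementation; the two rules and the column names age_nm, dose_1_dt and dose_2_dt are hypothetical examples, and the actual rules are those listed in the Common Data Model Specification. Missing values are assumed to pass a rule here, since they are handled in the imputation step.

```python
from datetime import date

# Hypothetical validation rules: each returns True when the record passes.
RULES = {
    "age_in_range": lambda r: r["age_nm"] is None or 0 <= r["age_nm"] <= 115,
    "dose_order": lambda r: (
        r["dose_1_dt"] is None
        or r["dose_2_dt"] is None
        or r["dose_1_dt"] <= r["dose_2_dt"]
    ),
}


def flag_violations(records, rules):
    """Set flag_violation_val to True when at least one rule fails."""
    for record in records:
        record["flag_violation_val"] = any(
            not rule(record) for rule in rules.values()
        )
    return records


cohort_data = flag_violations(
    [
        {"age_nm": 34, "dose_1_dt": date(2021, 3, 1), "dose_2_dt": date(2021, 3, 22)},
        {"age_nm": 130, "dose_1_dt": date(2021, 3, 1), "dose_2_dt": date(2021, 3, 22)},
        {"age_nm": 50, "dose_1_dt": date(2021, 4, 1), "dose_2_dt": date(2021, 3, 1)},
    ],
    RULES,
)
print([r["flag_violation_val"] for r in cohort_data])
```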

Imputation

Script: 3_imputation.QMD

For each variable in the cohort_data, different checks are conducted, on the basis of which a decision is made on how to handle missing values. A logical variable flag_listwise_del is created in the cohort_data table in the BY-COVID-WP5-BaselineUseCase-VE.duckdb DuckDB database and set to TRUE for records in which the value of the variable is missing and the imputation_method=='Listwise deletion where core variable has missing values (MCAR reasonable)'. Missing values of the variables for which it was decided to impute are imputed using the R package mice, resulting in an imputed dataset. From this dataset, the records with imputed values are filtered and saved in a separate database table cohort_data_imputed in the BY-COVID-WP5-BaselineUseCase-VE.duckdb DuckDB database. Variables with a high degree of missingness are not included as matching variables. A report (imputation.html) is generated summarising the results of the different checks and the methods used for dealing with missing values.
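The listwise deletion flag can be illustrated with a short Python sketch. It is not the pipeline's R code and does not cover the imputation itself (done with mice). The per-variable decision table, the variable names and the label "Multiple imputation" are hypothetical; only the listwise deletion label is taken from the text above.

```python
# Label quoted in this README for variables handled by listwise deletion.
LISTWISE = "Listwise deletion where core variable has missing values (MCAR reasonable)"

# Hypothetical outcome of the per-variable checks: variable -> chosen method.
imputation_method = {
    "sex_cd": LISTWISE,
    "age_nm": LISTWISE,
    "residence_area_cd": "Multiple imputation",  # hypothetical label
}


def flag_listwise_deletion(records, methods):
    """Set flag_listwise_del to True when a listwise-deletion variable is missing."""
    listwise_vars = [var for var, method in methods.items() if method == LISTWISE]
    for record in records:
        record["flag_listwise_del"] = any(
            record.get(var) is None for var in listwise_vars
        )
    return records


cohort_data = flag_listwise_deletion(
    [
        {"sex_cd": "1", "age_nm": 40, "residence_area_cd": None},
        {"sex_cd": None, "age_nm": 40, "residence_area_cd": "BE100"},
    ],
    imputation_method,
)
print([r["flag_listwise_del"] for r in cohort_data])
```

In this sketch the first record keeps its flag at False even though residence_area_cd is missing, because that variable is assumed to be imputed rather than deleted.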

Matching

Script: 4_matching.QMD (sourcing 4_matching.R)

In the script 4_matching.R, the variables needed for matching are created from existing variables in cohort_data and cohort_data_imputed. Records from individuals with a previous infection (previous_infection_bl==TRUE), records violating one of the 'essential' validation rules (flag_violation_val==TRUE) and records set to be listwise deleted (flag_listwise_del==TRUE) are excluded. The matching is conducted using the R package MatchIt. A new table group_similarity is created in the BY-COVID-WP5-BaselineUseCase-VE.duckdb DuckDB database, containing for each group_id the 10 nearest matched groups and the corresponding distances. The matching algorithm iterates over the set of unique days during the enrollment period on which a newly vaccinated individual (i.e., one completing a primary vaccination schedule) is identified. The results obtained for each date are appended to a database table result_matching_alg in the BY-COVID-WP5-BaselineUseCase-VE.duckdb DuckDB database, in which one record corresponds to one matched pair. A new table matched_data is subsequently created in the BY-COVID-WP5-BaselineUseCase-VE.duckdb DuckDB database, with two records per match (i.e., one for the case and one for the control). After matching (i.e., once 4_matching.R has finished), the covariate balance is assessed and summarised in an interactive report (matching.html).
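Two of the bookkeeping steps above, the exclusion of records before matching and the reshaping of one record per matched pair into two records per match, can be sketched in Python. This sketch does not perform the matching itself (done with MatchIt in the pipeline). The column names person_id, case_id, control_id, match_id and role are hypothetical; the three flag names come from the text above.

```python
def is_eligible(record):
    """Keep records with no previous infection and no exclusion flag set."""
    return not (
        record["previous_infection_bl"]
        or record["flag_violation_val"]
        or record["flag_listwise_del"]
    )


def expand_pairs(pairs):
    """Turn one record per matched pair into two records per match."""
    matched = []
    for match_id, pair in enumerate(pairs, start=1):
        matched.append(
            {"match_id": match_id, "person_id": pair["case_id"], "role": "case"}
        )
        matched.append(
            {"match_id": match_id, "person_id": pair["control_id"], "role": "control"}
        )
    return matched


def make_record(person_id, infected=False, violation=False, listwise=False):
    """Build a toy cohort record for the demonstration below."""
    return {
        "person_id": person_id,
        "previous_infection_bl": infected,
        "flag_violation_val": violation,
        "flag_listwise_del": listwise,
    }


cohort = [
    make_record("a1"),
    make_record("a2", infected=True),
    make_record("a3", violation=True),
    make_record("a4"),
]
eligible_ids = [r["person_id"] for r in cohort if is_eligible(r)]

# Stand-in for result_matching_alg: one record per matched pair.
result_matching_alg = [{"case_id": "a1", "control_id": "a4"}]
matched_data = expand_pairs(result_matching_alg)
print(eligible_ids, matched_data)
```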

Descriptive analysis

Script: 5_descriptives.QMD

The descriptive analysis contains four elements which are reported in descriptive.html: a description of the considered time periods (data extraction period, enrollment period and study period), the results of a survival analysis in the unmatched population (adjusted and unadjusted), a flowchart describing the study population selection (CONSORT diagram) and a table with the baseline characteristics of the matched study population by intervention group.

Survival analysis

Script: 6_survival-analysis.QMD

A survival analysis is conducted in the matched study population matched_data. A hazard ratio (HR), the Restricted Mean Survival Time (RMST) and Restricted Mean Time Lost (RMTL) are reported in survival-analysis.html. Aggregated non-sensitive results for meta-analysis are written to results-survival-analysis-<country>.xlsx.
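To make the RMST and RMTL concrete: the RMST up to a time horizon tau is the area under the survival curve between 0 and tau, and the RMTL is tau minus the RMST. The following Python sketch computes both from a hand-rolled Kaplan-Meier estimate on toy data. It is an illustration under these assumptions, not the pipeline's R implementation, and the function names are hypothetical. It does not estimate the hazard ratio.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    events uses 1 for an observed event and 0 for a censored observation.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_removed = 0
        # Group all observations sharing the same time.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_removed += 1
            i += 1
        if n_events:
            survival *= 1 - n_events / n_at_risk
            curve.append((t, survival))
        n_at_risk -= n_removed
    return curve


def rmst(curve, tau):
    """Area under the step function S(t) between 0 and tau."""
    area = 0.0
    prev_t, prev_s = 0.0, 1.0
    for t, s in curve:
        if t >= tau:
            break
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    return area + prev_s * (tau - prev_t)


# Toy follow-up times (days) and event indicators for one group.
times = [2, 4, 4, 6, 8]
events = [1, 1, 0, 1, 0]
tau = 8

curve = kaplan_meier(times, events)
rmst_value = rmst(curve, tau)
rmtl_value = tau - rmst_value
print(curve, rmst_value, rmtl_value)
```

In a matched design, the same calculation would be run separately for each intervention group, and the groups compared through the difference or ratio of their RMST values.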

Getting Started

This analytical pipeline has been developed and tested in R (version 4.2.1) using RStudio Desktop as an IDE (version 2022.07.1). To execute the analytical pipeline using the synthetic data or your own input data compliant with the Common Data Model specification, the dependencies and required installation steps are described below.

Dependencies

For testing purposes we assume a similar environment to the development environment.

The development environment included several R packages (from base R or CRAN) and the use of the R project file (.Rproj) included with the scripts. The required R packages and the version used for developing and testing the analytical pipeline:

To run the analytical pipeline with the required dependencies, different methods can be adopted: (1) installing R packages manually, (2) using the renv reproducible environment, (3) running the docker image, or (4) using Conda/Mamba.

1. Installing R packages manually

Obtain source code

Download the ZIP file of the repository using the following link: https://github.com/MarjanMeurisse/BY-COVID_WP5_T5.2_baseline-use-case/archive/refs/heads/main.zip

Extract all from the ZIP file and open the R project file contained within the folder BY-COVID_WP5_T5.2_baseline-use-case-main/vaccine_effectiveness_analytical_pipeline in RStudio.

Environment

Install the required R packages. To install a specific version of an R package from source, the following R command can be used (example for the R package dplyr, version 1.1.2):

packageurl <- "https://cran.r-project.org/src/contrib/Archive/dplyr/dplyr_1.1.2.tar.gz"
install.packages(packageurl, repos=NULL, type="source")

Execute the analytical pipeline

Input data, compliant with the Common Data Model specification, should be provided as the only file within the folder BY-COVID_WP5_T5.2_baseline-use-case-main/vaccine_effectiveness_analytical_pipeline/input. You can test the analytical pipeline with the synthetic data already provided within this folder (vaccine_effectiveness_synthetic_pop_10k_v.1.1.1.csv) or replace this file with your own real-world data.
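Because the input folder is expected to hold exactly one file, a quick check before rendering the pipeline can catch a leftover file early. The following Python sketch is an optional helper and not part of the repository; the function name is hypothetical, and ignoring hidden files (names starting with a dot) is an assumption made here for convenience.

```python
import tempfile
from pathlib import Path


def find_input_file(input_dir):
    """Return the single data file in input_dir, or raise if there is not exactly one."""
    files = [
        path
        for path in Path(input_dir).iterdir()
        # Assumption: hidden files such as .gitkeep are not input data.
        if path.is_file() and not path.name.startswith(".")
    ]
    if len(files) != 1:
        names = sorted(path.name for path in files)
        raise ValueError(f"expected exactly one input file, found {len(files)}: {names}")
    return files[0]


# Demonstration on a temporary folder standing in for the input folder.
with tempfile.TemporaryDirectory() as folder:
    (Path(folder) / "vaccine_effectiveness_synthetic_pop_10k_v.1.1.1.csv").write_text("x\n")
    input_name = find_input_file(folder).name
    print(input_name)
```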

The file BY-COVID_WP5_T5.2_baseline-use-case-main/vaccine_effectiveness_analytical_pipeline/scripts/analytical-pipeline.QMD can be opened in RStudio and rendered to run the sequential steps of the analytical pipeline. Output files of the analytical pipeline (interactive HTML reports, xlsx file with aggregated output) are generated within the folder BY-COVID_WP5_T5.2_baseline-use-case-main/vaccine_effectiveness_analytical_pipeline/output.

2. Using the renv reproducible environment

Obtain source code

Download the ZIP file of the repository using the following link: https://github.com/MarjanMeurisse/BY-COVID_WP5_T5.2_baseline-use-case/archive/refs/heads/main.zip

Extract all from the ZIP file and open the R project file contained within the folder BY-COVID_WP5_T5.2_baseline-use-case-main/vaccine_effectiveness_analytical_pipeline in RStudio.

Environment

The R project uses renv. Use the following R command to check which packages are recorded in the lockfile but not installed:

renv::status()

Reproduce the testing environment by running the following R command:

renv::restore() 

The metadata from the lockfile are used to install exactly the same version of every package.

Execute the analytical pipeline

Input data, compliant with the Common Data Model specification, should be provided as the only file within the folder BY-COVID_WP5_T5.2_baseline-use-case-main/vaccine_effectiveness_analytical_pipeline/input. You can test the analytical pipeline with the synthetic data already provided within this folder (vaccine_effectiveness_synthetic_pop_10k_v.1.1.1.csv) or replace this file with your own real-world data.

The file BY-COVID_WP5_T5.2_baseline-use-case-main/vaccine_effectiveness_analytical_pipeline/scripts/analytical-pipeline.QMD can be opened in RStudio and rendered to run the sequential steps of the analytical pipeline. Output files of the analytical pipeline (interactive HTML reports, xlsx file with aggregated output) are generated within the folder BY-COVID_WP5_T5.2_baseline-use-case-main/vaccine_effectiveness_analytical_pipeline/output.

3. Using Docker

It is possible to run the analytical pipeline as an isolated application using Docker.

You may skip the docker build command; in that case, the latest container image is downloaded from GitHub.

cd vaccine_effectiveness_analytical_pipeline
# Uncomment the line below if you have modified the scripts or their dependencies
#docker build -t ghcr.io/by-covid/vaccine_effectiveness_analytical_pipeline .
docker run -v "$(pwd)/input:/pipeline/input" -v "$(pwd)/output:/pipeline/output" -it ghcr.io/by-covid/vaccine_effectiveness_analytical_pipeline

Note that when using Docker in this way, the file permissions on your output folder may not match the permissions the container uses when writing outputs (tip: chmod -R 777 output).

4. Using Conda/Mamba

Instead of using containers, it can be more convenient during development to use a Conda environment. The steps below assume Miniconda has been installed and activated. To install the R packages listed in environment.yml, use:

cd vaccine_effectiveness_analytical_pipeline
conda env create

The above installs most of the R packages from Conda-Forge, avoiding a compilation phase. To install the remaining R packages from CRAN:

conda activate vaccine_effectiveness
Rscript install.R

Finally, to execute the main pipeline using Quarto:

conda activate vaccine_effectiveness
cd scripts
quarto render analytical-pipeline.QMD --execute --output-dir ../output/

This should populate output/ with a series of HTML files.

Note: The environment.yml is also used by the Dockerfile to install its dependencies, and may install newer versions of the R packages than those listed above.

Description output for comparative analysis

An Excel file with multiple sheets, each providing different outputs from the survival analysis, is generated by locally running the analysis pipeline (results-survival-analysis-<country>.xlsx). This Excel file contains information that can be used to compare vaccine effectiveness across sites (see Zenodo publication for local outputs and comparative analysis of three sites).

The Excel file contains the following sheets:

The survival probabilities, HR, RMST/RMTL, and ATE are estimated within subgroups determined by the vaccination schedule received by the case in the matched population (Vaccination_schedule):

The survival probabilities, HR, RMST/RMTL, and ATE are estimated within subgroups determined by the NUTS3 residence area of individuals (Residence_area):

Version history

Authors

Funding

BY-COVID (Beyond COVID) is a Horizon Europe funded project (101046203).

Acknowledgements

Contact

Marjan Meurisse - marjan.meurisse@sciensano.be

Disclaimer

Please note that we provide these scripts as they are, complying with the specifications of the BY-COVID WP5 baseline use case for the purposes and objectives specified within the baseline use case protocol. The software is provided as-is, without further support beyond the scope of the partners participating in BY-COVID WP5.