David MacDonald
This project involves using ComBat, an open-source library for multi-site data harmonization, to remove site effects from subcortical volumetric data derived from a subset of the ABIDE dataset, and to compare this method to using linear models and mixed effects models without harmonization.
I have been working with a large, multi-site dataset to examine subcortical anatomy in autism. Multi-site MRI datasets can be difficult to work with, as scans are typically acquired on different equipment, using different protocols, by different operators. In addition, different sites may be sampling different populations. This can result in a dataset in which a significant portion of the variance is due to non-biological factors. Up to now, I have been using meta-analytic techniques to combine the data from the different sites.
The purpose of this project is to learn and apply two other techniques for inter-site data harmonization:
Other goals are to:
Tools learned and used during this project include:
ComBat is a tool for mitigating batch effects in genomic data. It has since been adapted for use in neuroimaging, and is now available as an open source library in R, Python, and MATLAB for use in multi-site neuroimaging studies. ComBat harmonizes data by fitting a linear model, with location and scale (L/S) terms accounting for site-specific differences, to the data, then reconstructing each data-point without the location and scale terms. These terms are fit using an Empirical Bayes approach, allowing pooling across features to improve the estimate of site effects.
This project made use of subcortical volumes, previously derived from a subset (n = 359, across three sites and two releases) of the ABIDE dataset using the MAGeT Brain pipeline.
The workflow encapsulated in the pipeline and jupyter notebook are described in the figure below. T1w structural MRI scans from several sites in the ABIDE dataset were preprocessed, then segmented using the MAGeTBrain segmentation pipeline, which provided volumes for left and right striatum, thalamus, and globus pallidus. The pipeline developed here then masked out volumes for structures that did not pass quality control checks (QC). This data was examined using interactive visualizations in Python and Jupyter notebook. The same masked data was processed in three streams by the pipeline. In all three cases, linear models were used to quantify the effect of an autism spectrum disorder diagnosis on the volumes of the subcortical structures listed above. Age, sex, and total brain volume were used as covariates.
In the first stream, linear models were fit, as shown in the figure, to the unharmonized data, with imaging site added as a covariate. In the second, linear mixed effects models were fit, with site as a random effect (random intercept). In the third, the raw volumes were harmonized using ComBat, and linear models were fit to this harmonized data. The harmonization included age and sex as covariates, to preserve variation due to those factors. Because ComBat was expected to remove site-specific effects, site was not included as a covariate in these models.
Cohen's d effect sizes were computed from the linear models in all three streams. These were used to generate a forest plot, to allow comparison between the three methods.
Note that the Jupyter notebook also supports the examination of the harmonized data.
Data harmonization and effect size visualization were done in R, where the available tools were more sophisticated.
Initially, the project was meant to run entirely in Python. However, two challenges arose. First, the current Python version of neuroCombat is not able to accept data with missing values. Since this data is masked based on segmentation quality (structures whose segementation failed are not used), the data contains missing values. The R version of neuroCombat does support missing values. Also, R has much more sophisticated packages available to generate forest plots, which are used here to compare the results of the different harmonization methods. Since the language of the Brainhack School is Python, the project was reconceived as a mini-pipeline, using both R and Python.
Jupyter notebooks are used in this project both for data visualization and for presentation (using RISE). Initially, data harmonization was done in Python, and all of the code was run inside of Jupyter notebooks. When QC masking was added, it was necessary to move the ComBat harmonization code to R. For this reason, the Jupyter Notebooks depend on having access to the harmonized data from the pipeline. Several interactive visualizations are provided in the Jupyter notebooks, and conda environments are provided to allow them to be run on other machines without version conflicts.
neurodocker 0.7.0 was used to create the dockerfile that contained the specifications for the container. This was fairly straightforward:
The docker container was configured to run the pipeline bash script at startup in non-interactive mode. Input and output directories are set on the command line. All code that is run in the Docker container is provided on the command line (i.e. it has not been built in to the container). This is to allow for modifications, for example to use it on a different dataset, while maintaining the same environment.
Note that the deliverables changed somewhat over the course of the project, and are more extensive than the original conception.
The Github repository contains the following:
The structure of the repository is shown below:
File/Directory | Content |
---|---|
code | Jupyter notebooks and pipeline code |
docker | Contains scripts used to create docker container |
images | Images used in this README.md |
input | Default input directory for the pipeline. Includes input data and copy of pipeline code. |
output | Default output directory for the pipeline |
Week_3_Deliverables_Data_Visualization | Deliverables for Week 3 of Brainhack School |
harmonization.yml | conda environment description for the pipeline and the Jupyter notebooks |
The pipeline takes as input a .csv file containing the data: volumes for each subcortical structure for each participant, as well as ASD diagnosis, Age, Sex, Total Brain Volume, Imaging Site, and Quality Control (QC) values describing the quality of each segmentation. Its output consists of two files: a .csv file in which the subcortical volumes and total brain volume have been harmonized using ComBat, and a forest plot showing the effect sizes of diagnosis on subcortical volumes, while controlling for Age, Sex, and Total Brain Volume. The forest plot shows the results of a linear model fit on the harmonized data, a linear model fit on the unharmonized data with Site as a covariate, and a linear mixed-effects model fit on the unharmonized data with Site as a random factor (random intercept).
The pipeline consists of a bash script, which calls several R and Python scripts to harmonize the input data using ComBat, fit linear models to the harmonized and unharmonized data, fit linear mixed effect models with site as a random factor to the unharmonized data, compute effect size measures in all three cases, and display the effect sizes and confidence intervals obtained in a forest plot for comparison. It consists of: | File | Description |
---|---|---|
harmonize_cmd.sh | Contains the bash command used to start the pipeline. This script is called on startup by the Docker container. All command-line options are specified here. | |
harmonize_and_show_effect_sizes.sh | Bash script that manages data flow through the pipeline | |
harmonize_data_prep.py | Python script to load the data, remove rows with missing values, and mask the dependent variables according to quality control (QC) results | |
harmonize.R | R script that takes the masked data and performs ComBat harmonization | |
harmonize_fit_models.py | Python script that fits linear models to both harmonized and unharmonized data, adding site as a covariate to the unharmonized models, and linear mixed effects models on the unharmonized data, with site as a random factor. | |
forest.R | R script that operates on the effect size values computed in the previous step. Saves a forest plot comparing the three methods on each of the dependent variables. |
Note that, although the pipeline is here run non-interactively inside a Docker container, each segment is built in such a way that it can be run independently, and the command line arguments can be specified at runtime.
Jupyter notebooks were written to facilitate interactive data exploration. Several interactive plots were created, using seaborn and ipywidgets. These plots allow the user to explore both the harmonized and unharmonized distributions, overlaid with different covariates (such as site and diagnosis). Note that these notebooks require harmonized data from the pipeline, as the Python version of ComBat is not able to work with masked data.
The conda environment file harmonization.yml is included in the repository, allowing the Python environment used in this project to be recreated. Note that the R environment was not virtualized in the same way, necessitating the use of Docker.
The pipeline uses both Python and R. While virtualizing Python environments is readily done with conda, it is more difficult to do so with R. For this reason, the entire environment was constructed in a Docker container in which the pipeline runs. See above for more details. The docker folder in the repository contains:
Note that this container does NOT contain the data or code. When you run it (see below), you will need to specify the input and output directories, and the input directory must contain the pipeline code and the data. This was done to allow for greater flexibility, to make it easier to use the pipeline with other data or to modify it. For maximum reproducibility, it would make sense to produce a Docker container that incorporates the pipeline code as well.
To run the code yourself, you will need to have Docker and conda installed on your system. To run the pipeline:
cd input
docker run --rm -v path_to_repository/input/:/input/:ro -v path_to_repository/output/:/output/ dnmacdon/harmonizer:environment_only
./harmonize_and_show_effect_sizes.sh -i /input/data.csv \ -o /output \ -s Site \ -x DX \ -c Age,Sex,TBV \ -l Control \ -q "L_str,L_thal,L_GP,R_str,R_GP,R_thal" \ -z "Age,Sex" \ -t 0.5
where the options are: | Argument | Meaning |
---|---|---|
-i | Input filename | |
-o | Output directory | |
-s | Site / Name of feature to harmonize | |
-x | Name of independant variable / regressor, here diagnosis. Should be categorical | |
-l | Name of control condition for independant variable | |
-c | Comma-separated names of covariates for linear models | |
-q | Names of QC variables / columns | |
-t | QC threshold. All features below this QC value will be masked | |
-z | ComBat covariates: covariates for data harmonization, not linear modeling |
The dependent variables must be named the same as the QC variables, with the suffix _vol. For example, L_str and L_str_vol. They are not specified on the command line.
If you wish to build the Docker container that was built in this project:
./bdf.sh
docker build -t harmonizer .
Please be aware that building the container will use current versions of the software and libraries, which may affect results. Building the container should not be necessary to reproduce the results reported here. Only rebuild the container if you wish to make changes to the pipeline.If you wish to run the Jupyter notebook for data exploration and presentation:
conda env create -f harmonization.yml && conda activate harmonization
jupyter notebook
A number of improvements are possible.
Thanks to all of the instructors, teaching assistants, and participants of the Brainhack School 2020! Thanks to the organizers, J.B. Poline, Pierre Belec, Tristan Glatard, and Benjamin de Leener, with particular thanks to Pierre for his patience.