LeonDLotter / ABAnnotate

A toolbox for ensemble-based multimodal gene-category enrichment analysis of human neuroimaging data
GNU General Public License v3.0

ABAnnotate - a toolbox for ensemble-based multimodal gene-category enrichment analysis of human neuroimaging data

Are you interested in contextualising brain maps, for example those derived from case-control comparisons, fMRI tasks, or spatial meta-analyses, across biological systems ranging from the molecular and cellular level to disease associations? ABAnnotate uses spatial gene expression patterns to derive neuroimaging phenotype-gene associations and assess the overrepresentation of associated genes in several multimodal gene-category datasets.

License: GNU General Public License v3.0

(Note: ABAnnotate inherited its license from its source toolbox. Integrated datasets, especially data from the Allen Institute for Brain Science, are provided under non-commercial licenses, which must be considered when using ABAnnotate.)


ABAnnotate is a Matlab-based toolbox to perform ensemble-based gene-category enrichment analysis (GCEA) on volumetric human neuroimaging data via brain-wide gene expression patterns derived from the Allen Human Brain Atlas (ABA). It applies a nonparametric method developed by Fulcher et al. (2021) that uses spatial autocorrelation-corrected phenotype null maps to estimate gene-category null ensembles. ABAnnotate was adapted from Fulcher et al.'s toolbox, which was originally designed for annotating imaging data with GeneOntology categories. The function to generate null models, along with some utility functions, was taken from the JuSpace toolbox by Dukart et al. (2021).

ABAnnotate is under development. It works out of the box, but you may well encounter bugs when using it. Please feel free to report these by opening an issue or contacting me.




Method

The method basically consists of the following steps:

  1. An input volume ("phenotype") is parcellated according to a given parcellation, and null models corrected for spatial autocorrelation are generated (if autocorrelation was detected in the data).
  2. For each null phenotype and each gene category, a null "category score" is obtained by correlating the null phenotype with the spatial mRNA expression pattern of each gene and averaging the z-transformed correlation coefficients of all genes annotated to that category (= null categories).
  3. The null category scores are then compared to the "real" category score, obtained by correlating the "real" phenotype with all genes in each category and averaging the z-transformed correlation coefficients per category.
  4. One-sided p-values for each category are obtained from the estimated null distribution of category scores and are then FDR-corrected.
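
The four steps above can be sketched in a few lines of Matlab. This is a minimal illustration on synthetic data, not the toolbox's implementation: all variable names and sizes are made up, the null maps are plain random vectors (ABAnnotate generates autocorrelation-corrected nulls instead), and the Benjamini-Hochberg procedure is assumed as the FDR correction.

```matlab
% Synthetic stand-ins (all names and sizes are illustrative assumptions)
rng(1);                                   % reproducible random numbers
nParcels = 116; nGenes = 50; nNulls = 1000;
expression = randn(nParcels, nGenes);     % parcel-by-gene mRNA expression
phenotype  = expression(:, 1:10) * ones(10, 1) + randn(nParcels, 1); % "real" parcellated map
nullMaps   = randn(nParcels, nNulls);     % placeholder nulls, NOT autocorrelation-corrected
categories = {1:10, 11:30, 31:50};        % gene indices annotated to each category

% Rank-transform columns: Pearson on ranks equals Spearman (no ties in this data)
[~, order] = sort([phenotype, nullMaps]); [~, mapRanks]  = sort(order);
[~, order] = sort(expression);            [~, geneRanks] = sort(order);

% Correlate every map with every gene, then apply Fisher's r-to-z transform
zs = @(X) (X - mean(X)) ./ std(X);        % column-wise standardisation
r  = zs(mapRanks)' * zs(geneRanks) / (nParcels - 1);   % (1 + nNulls)-by-nGenes
z  = atanh(r);

% Category score = mean z across the category's genes; row 1 is the real map
nCat = numel(categories);
pValPerm = zeros(nCat, 1);
for c = 1:nCat
    scores = mean(z(:, categories{c}), 2);
    % right-tailed p-value: share of null scores at least as large as the real one
    pValPerm(c) = mean(scores(2:end) >= scores(1));
end

% Benjamini-Hochberg FDR correction (assumed here)
[pSorted, sortIdx] = sort(pValPerm);
q = pSorted .* nCat ./ (1:nCat)';
q = min(1, flipud(cummin(flipud(q))));    % enforce monotonic q-values
pValPermCorr(sortIdx, 1) = q;
```

Because the synthetic phenotype is built from the genes of the first category, that category should receive the smallest p-value.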

ABAnnotate extends Fulcher et al.'s toolbox by:

  1. shifting the focus away from GeneOntology categories to a dataset-independent format that allows the integration of any dataset annotating genes to categories, and
  2. increasing user-friendliness through integration of brain parcellations, associated ABA mRNA expression data, automated neuroimaging null volume generation, and multiple GCEA datasets annotating genes to functional, disease-related, developmental, and neurobiological categories.

Datasets

All datasets (atlases, ABA data, GCEA datasets) are stored on an OSF server. Source information is provided in dataset_sources.csv which can be loaded and updated from OSF via:

sources_table = abannotate_get_sources;

ABAnnotate automatically downloads selected datasets to the two folders \atlas (parcellation volumes and parcel-wise ABA data) and \datasets (GCEA datasets). You can also download the data manually from OSF and save it in the respective folders.

Atlases & ABA data

The toolbox relies heavily on ABA data which was imported through the abagen toolbox using the default settings. For each parcellation, there is an associated {atlas_name}_report.md file with information on the processing done by abagen.
Currently, three parcellations are implemented: a functionally defined parcellation combining 100 cortical (Schaefer et al., 2018) and 16 subcortical parcels (Tian et al., 2020); a second version of this parcellation with only the 100 cortical parcels; and the anatomically defined whole-brain Neuromorphometrics atlas with 111 parcels (8 regions without ABA data, namely 31, 72, 118, 121, 148, 149, 156, and 174, were removed).
See example/customization.md for information on how to import your own ABA data (e.g., if you want to use a custom parcellation or alter ABA mRNA expression data processing).

GCEA datasets

Current GCEA datasets include:

To get a list of all available GCEA datasets run:

abannotate_get_datasets;

Output:

Available GCEA datasets:
- ABA-brainSpan-weights
- DAVID-chromosome-discrete
- DAVID-cytogenicLocation-discrete
- DisGeNET-diseaseCuratedAll-discrete
- DisGeNET-diseaseCuratedMental-discrete
- DisGeNET-diseaseAllAll-discrete
- DisGeNET-diseaseAllMentalBehav-discrete
- GO-biologicalProcessDirect-discrete
- GO-biologicalProcessProp-discrete
- GO-molecularFunctionDirect-discrete
- GO-molecularFunctionProp-discrete
- GO-cellularComponentDirect-discrete
- GO-cellularComponentProp-discrete
- PsychEncode-cellTypesTPM-discrete
- PsychEncode-cellTypesUMI-discrete

Please note that, while ABAnnotate is published under a GPL-3.0 license, which allows for commercial use, the associated datasets are protected by other licenses (e.g., ABA data may not be used commercially, and DisGeNET data are protected under a CC BY-NC-SA 4.0 license). Where available, these licenses are listed in dataset_sources.csv. This effectively renders ABAnnotate, if used as is, unsuitable for commercial use!

Dependencies

ABAnnotate was coded in Matlab R2021a. For the generation of phenotype null maps, it depends on the SPM12 image calculator. It uses the Parallel Computing Toolbox for the generation of null phenotypes and the calculation of correlations. It requires an internet connection to download parcellations, ABA data, and GCEA datasets from OSF.
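
A quick way to check these dependencies before the first run is sketched below. It is not part of ABAnnotate. It assumes that SPM12 exposes its usual spm.m entry point on the Matlab path and uses the license feature name Distrib_Computing_Toolbox, under which Matlab registers the Parallel Computing Toolbox. Older Matlab releases may still work; the version test only flags them.

```matlab
% Minimal pre-flight check using only standard Matlab functions
deps = struct();
deps.matlab_ok   = ~verLessThan('matlab', '9.10');   % R2021a corresponds to version 9.10
deps.spm_on_path = exist('spm', 'file') == 2;        % true if spm.m is found on the path
deps.parallel_ok = license('test', 'Distrib_Computing_Toolbox') == 1; % license available
disp(deps);
```

Any field that shows 0 points to a dependency worth checking before starting an analysis.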

Usage

Simple

The simplest use case requires only a NIfTI volume in MNI space and the selection of one of the GCEA datasets provided with ABAnnotate. The code below performs a GCEA on an input volume with 1000 null maps corrected for spatial autocorrelation, using GeneOntology "Biological Process" categories with annotated genes propagated upwards through the GeneOntology hierarchy (as opposed to using only direct annotations between categories and genes). Phenotype-gene associations are computed using Spearman correlations, and category scores are estimated as average r-to-Z-transformed correlation coefficients.

Download the toolbox and add it to the Matlab path:

startup;

All options are defined in a struct:

opt.analysis_name = 'GCEA_GeneOntology'; % name for analysis
opt.phenotype = '/path/to/input/volume.nii'; % input "phenotype" volume
opt.dir_result = '/path/to/save/output'; % output directory
opt.GCEA.dataset = 'GO-biologicalProcessProp-discrete'; % selected GCEA dataset

Run:

results_table = ABAnnotate(opt);

Advanced

You can define various options and provide precomputed data (see below). You can also use your own parcellation, but will then have to generate a custom ABA gene expression dataset. All options are shown in example/customization.md.

opt.analysis_name = 'GCEA_GeneOntology'; % name for analysis
opt.phenotype = '/path/to/input/volume.nii'; % input "phenotype" volume
opt.phenotype_nulls = '/path/to/precomputed/phenotype_nulls.mat'; % use already computed phenotype nulls
opt.n_nulls = 1000; % number of null phenotypes/categories, will be overwritten with n nulls from .phenotype_nulls
opt.atlas = 'SchaeferTian'; % one of {'SchaeferTian', 'Neuromorphometrics', 'Schaefer'} 
opt.dir_result = '/path/to/save/output'; % output directory
opt.GCEA.dataset = 'GO-biologicalProcessProp-discrete'; % selected GCEA dataset
opt.GCEA.size_filter = [5, 200]; % select categories with between 5 and 200 annotated genes
opt.GCEA.correlation_method = 'Spearman';  % one of {'Spearman', 'Pearson'}
opt.GCEA.aggregation_method = 'mean'; % one of {'mean', 'absmean', 'median', 'absmedian', 'weightedmean', 'absweightedmean'}
opt.GCEA.p_tail = 'right'; % one of {'right', 'left'}
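
What the size filter and the tail option do can be illustrated with a toy example. The numbers below are made up, and treating the filter bounds as inclusive is an assumption about the toolbox's behaviour rather than something stated in its documentation.

```matlab
% Hypothetical categories with differing numbers of annotated genes
cSize = [3; 12; 250; 48];
size_filter = [5, 200];
keep = cSize >= size_filter(1) & cSize <= size_filter(2);  % keeps categories 2 and 4

% Tail choice for a single category (made-up scores)
nullScores = [-0.10, -0.02, 0.01, 0.03, 0.08];
realScore  = 0.05;
p_right = mean(nullScores >= realScore);  % positive association: 1 of 5 nulls -> 0.2
p_left  = mean(nullScores <= realScore);  % negative association: 4 of 5 nulls -> 0.8
```

A right-tailed test asks whether the real category score is unusually high relative to the nulls, a left-tailed test whether it is unusually low.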

ABAnnotate can incorporate "continuous" GCEA datasets with gene expression values across the whole genome for each category. This currently applies only to the BrainSpan dataset. You can choose your own thresholding settings to define marker genes and weight each gene-phenotype correlation by the gene's expression value when calculating category scores:

opt.analysis_name = 'GCEA_BrainSpan'; % name for analysis
opt.phenotype = '/path/to/input/volume.nii'; % input "phenotype" volume
opt.dir_result = '/path/to/save/output'; % output directory
opt.GCEA.dataset = 'ABA-brainSpan-weights'; % selected GCEA dataset
opt.GCEA.aggregation_method = 'weightedmean'; % one of {'mean', 'absmean', 'median', 'absmedian', 'weightedmean', 'absweightedmean'}
opt.GCEA.weights_quant = 0.90; % retain only genes with expression values above the 0.9 quantile (90th percentile) of the whole dataset
opt.GCEA.weights_cutoff = false; % if true, binarize expression values -> standard mean will be calculated. If false, use weighted mean
opt.GCEA.gene_coocc_thresh = 0.2; % retain only genes annotated to 20% or less of categories after weight thresholding
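
The interplay of these three options can be sketched on a toy weight matrix. This is an illustration under assumptions, not ABAnnotate's code: the matrix, the correlation values, and the looser thresholds are made up to suit the small example, and the quantile is a simple empirical one that may differ from the interpolation the toolbox uses.

```matlab
% Made-up expression weights: 6 genes (rows) by 5 categories (columns)
weights = [9 1 1 1 1;
           8 7 1 1 1;
           1 9 1 1 1;
           9 9 9 9 1;
           1 1 1 1 8;
           1 1 1 1 1];
weights_quant     = 0.65;   % looser than 0.90, to suit the toy data
gene_coocc_thresh = 0.5;    % looser than 0.2, to suit the toy data

% 1) Threshold weights at a quantile of the whole dataset
vals = sort(weights(:));
thr  = vals(ceil(weights_quant * numel(vals)));  % simple empirical quantile
w    = weights .* (weights > thr);               % zero out sub-threshold weights

% 2) Drop genes that mark too large a share of categories (here: gene 4)
coocc = mean(w > 0, 2);
w(coocc > gene_coocc_thresh, :) = 0;

% 3) Weighted mean of made-up Fisher-z phenotype-gene correlations per category
z = [0.3; 0.1; -0.2; 0.5; 0.0; 0.4];
scores = (w' * z) ./ sum(w, 1)';                 % NaN for categories left without genes
```

With binarized weights (the weights_cutoff = true case), the same expression reduces to a plain mean over the retained marker genes.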

Default GCEA options are imported from gcea_default_settings.m.

Output

ABAnnotate's main output consists of a table with as many rows as there are categories in the current dataset.
Below you see an example output from the neuronal cell type dataset (transcripts per kilobase million; TPM). Here, we have marker sets for 24 cell types (Ex/In = excitatory/inhibitory neuron subclasses; see Lake et al. (2016) for detailed information). The three top categories are significant at FDR-corrected p < .05 using the nonparametric procedure.

cLabel = category name; cDesc = category descriptions; cSize = number of genes annotated to category; cGenes = official gene symbols; cWeights = expression values for each gene, will be vector of ones if discrete dataset (most cases); cScoresNull = null category scores (here, 5000 null samples); cScorePheno = e.g., mean of r-to-z-transformed phenotype-gene Spearman correlation coefficients for all genes in category; pValPerm = exact p-value derived from the null distribution of category scores; pValPermCorr = FDR-corrected "q"-value; pValZ(Corr) = p-value derived from Z-distribution fitted to the null data to approximate very small p-values.
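
The difference between pValPerm and pValZ can be seen in a short sketch. The values are made up, and the formula for pValZ (upper tail of a normal distribution fitted to the null scores) is an assumption about how such an approximation is typically computed, not a copy of the toolbox's code.

```matlab
rng(2);
cScoresNull = 0.03 * randn(5000, 1);   % made-up null category scores
cScorePheno = 0.15;                    % made-up observed score, about 5 null SDs above zero

% Permutation p-value: resolution is limited to multiples of 1/5000
pValPerm = mean(cScoresNull >= cScorePheno);

% Normal approximation: can resolve p-values far below 1/5000
zScore = (cScorePheno - mean(cScoresNull)) / std(cScoresNull);
pValZ  = 0.5 * erfc(zScore / sqrt(2)); % upper tail of the standard normal
fprintf('pValPerm = %g, pValZ = %g\n', pValPerm, pValZ);
```

When the observed score lies far outside the null distribution, pValPerm bottoms out at the resolution of the null sample, whereas pValZ still yields a small positive value.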

example output table

Output files: a .mat file with the table (see above) and the input options struct, a .csv file with a reduced version of the table, an .xml file generated from the options struct, and a log file with the Matlab terminal output.
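
Once the results table is available, significant categories can be pulled out with standard table operations. The sketch below uses a mock table with made-up values; it assumes only the column names described above (cLabel, cScorePheno, pValPermCorr).

```matlab
% Mock stand-in for the table returned by ABAnnotate (values are made up)
results_table = table( ...
    {'catA'; 'catB'; 'catC'}, [0.12; 0.31; 0.05], [0.040; 0.001; 0.600], ...
    'VariableNames', {'cLabel', 'cScorePheno', 'pValPermCorr'});

% Sort by FDR-corrected p-value and keep categories below the threshold
sorted_table = sortrows(results_table, 'pValPermCorr');
significant  = sorted_table(sorted_table.pValPermCorr < 0.05, :);
disp(significant(:, {'cLabel', 'cScorePheno', 'pValPermCorr'}));
```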

Example visualizations

A/B: neuroimaging phenotype — neuronal cell type associations (see table above); A: bars representing category scores, color showing the negative base 10 logarithm of the uncorrected p-values derived from the z-distribution, *FDR-significant (nonparametric); B: gene-wise spatial correlation patterns for each gene annotated to one of the three significantly associated cell types.
C: neuroimaging phenotype — developmental brain-regional gene expression (BrainSpan); point size representing category scores, color showing the negative base 10 logarithm of the uncorrected p-values derived from the z-distribution, squares mark FDR-significant (nonparametric) categories.

example output figure

Working Example

In example/example_pain.md, I provide example analyses using ABAnnotate to relate a meta-analytic brain map of pain processing to the integrated neuronal cell type markers, BrainSpan, and GeneOntology "biological process" datasets. In example/customization.md, I outline several implemented customization options.

What to cite

If you use ABAnnotate in publications, please cite the following sources:

Contact

Do you have questions, comments or suggestions, would like to contribute to the toolbox, or would like to see a certain gene-category dataset added to the toolbox? Open an issue or contact me!
