MintTea: A pipeline for identifying multi-omic disease associated modules in microbiome data

MintTea overview

MintTea is a method for identifying multi-omic modules of features that are both associated with a disease state and strongly associated with each other across omics. It is based on sparse generalized canonical correlation analysis (sGCCA), where the disease label is encoded as an additional 'dummy' omic, as previously suggested by Gross & Tibshirani (2015) [1], Singh et al. (2019) [2] (see DIABLO), and others.

For further details see: Muller, Efrat, Itamar Shiryan, and Elhanan Borenstein. "Multi-omic integration of microbiome data for identifying disease-associated modules." Nature Communications 15.1 (2024): 2621.


Installation

MintTea can be installed directly from GitHub, by running the following:

install.packages("devtools")
library(devtools)
install_github("efratmuller/MintTea")
library(MintTea)

Instructions - Running MintTea on your own data

  1. Open an R script from which the MintTea function will be executed.

  2. Organize your input data in a single data.frame object, following these guidelines:

    • Rows represent samples and columns are features;
    • The dataframe should include two special columns: a column holding sample identifiers and a column holding study groups ("healthy" and "disease" labels);
    • Feature names from each omic should start with an omic-specific prefix (for example: 'T__' for taxonomy, 'P__' for pathways, 'M__' for metabolites, etc. Note the two consecutive underscores);
    • Features in each view should be pre-processed in advance, according to common practices;
    • It is highly recommended to remove rare features and to cluster highly correlated features.
  3. Optionally, edit the default pipeline parameters. MintTea supports running the pipeline with multiple parameter combinations, to encourage sensitivity analysis and let the user check which settings generate the most informative modules. For the full list of MintTea parameters, see: ?MintTea.

  4. Pipeline results are returned as a list of multi-view modules, given for each MintTea pipeline setting requested. For each module, the following properties are returned:

    | Module property | Details |
    | --- | --- |
    | module_size | The number of features in this module. |
    | features | 1st principal component (PC) of each module, for each pipeline setting. |
    | module_edges | Edge weights for every pair of features in this module that co-occurred in sGCCA components at least once. Edge weights are calculated as the number of times each pair co-occurred in the same sGCCA component, divided by param_n_repeats * param_n_folds. These weights are given in case the user wants to draw the module as a network. |
    | auroc | AUROC of each module by itself, describing the module's association with the disease. Computed using its first PC and evaluated over repeated cross-validation. Note: It is strongly advised to further evaluate module-disease associations using an independent test set. |
    | shuffled_auroc | As above, but using 99 randomly sampled modules of the same size and same proportions of views. |
    | inter_view_corr | Average correlation between features from different views. |
    | shuffled_inter_view_corr | As above, but using 99 randomly sampled modules of the same size and same proportions of views. |
  5. To evaluate the obtained results, we recommend starting by examining the following:

    • For each pipeline setting - how many modules were found, and what are the module sizes (i.e., number of features included)?
    • What was the AUROC achieved by each module? (see auroc)
    • How does this AUROC compare to the AUROCs of the random modules? (see shuffled_auroc)
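
The checks in step 5 can be sketched in a few lines of R. The nested list below is a hand-made stand-in, not real MintTea output: it assumes one list element per pipeline setting, each holding one element per module with the properties described above, and that shuffled_auroc is a numeric vector of 99 values. The setting names, module names, the helper summarize_modules and all numbers are hypothetical; run str(minttea_results) to check the actual layout before adapting this.

```r
# Mock results for illustration only; names and values are hypothetical.
set.seed(1)
mock_results <- list(
  setting_A = list(
    module1 = list(module_size = 12, auroc = 0.81,
                   shuffled_auroc = runif(99, 0.45, 0.65)),
    module2 = list(module_size = 7, auroc = 0.58,
                   shuffled_auroc = runif(99, 0.45, 0.65))
  ),
  setting_B = list(
    module1 = list(module_size = 20, auroc = 0.77,
                   shuffled_auroc = runif(99, 0.45, 0.65))
  )
)

# One row per module: size, AUROC, and how it compares to the random modules.
summarize_modules <- function(results) {
  rows <- lapply(names(results), function(setting) {
    modules <- results[[setting]]
    data.frame(
      setting = setting,
      module = names(modules),
      module_size = sapply(modules, function(m) m$module_size),
      auroc = sapply(modules, function(m) m$auroc),
      mean_shuffled_auroc = sapply(modules, function(m) mean(m$shuffled_auroc)),
      # Fraction of random modules that reach or exceed the module's AUROC
      frac_shuffled_ge = sapply(modules, function(m) mean(m$shuffled_auroc >= m$auroc)),
      row.names = NULL
    )
  })
  do.call(rbind, rows)
}

module_summary <- summarize_modules(mock_results)
print(module_summary)

# Number of modules found per pipeline setting
print(table(module_summary$setting))
```

A module whose auroc sits well above its shuffled values is a more convincing candidate than one that falls inside the random range. This comparison does not replace evaluation on an independent test set.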

Usage example

library(MintTea)
data('test_data')
minttea_results <- MintTea(test_data, view_prefixes = c('T', 'P', 'M'))
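
To run the same call on your own data, the input has to follow the layout described in the instructions. The sketch below assembles such a data.frame from three simulated omic tables and drops rare features first. The column names sample_id and study_group, the helpers simulate_omic and drop_rare, the 10% prevalence cutoff and all simulated values are assumptions made for illustration. The arguments that tell MintTea which columns hold sample identifiers and study groups are not shown here; see ?MintTea.

```r
set.seed(42)
n_samples <- 20
sample_ids <- paste0("S", seq_len(n_samples))

# Simulate one omic table: samples in rows, features in columns.
# Each feature gets its own probability of being absent (zero),
# ranging from common to very rare.
simulate_omic <- function(prefix, n_features) {
  p_absent <- seq(0.1, 0.99, length.out = n_features)
  values <- sapply(p_absent, function(p) {
    ifelse(runif(n_samples) < p, 0, rexp(n_samples))
  })
  colnames(values) <- paste0(prefix, "__feature", seq_len(n_features))
  as.data.frame(values)
}

taxa <- simulate_omic("T", 8)
pathways <- simulate_omic("P", 6)
metabolites <- simulate_omic("M", 5)

# Drop features that are non-zero in fewer than 10% of samples
# (the cutoff is arbitrary and chosen only for this example).
drop_rare <- function(omic, min_prevalence = 0.1) {
  prevalence <- colMeans(omic > 0)
  omic[, prevalence >= min_prevalence, drop = FALSE]
}

# Single data.frame: sample identifiers, study groups, then all features.
input_data <- data.frame(
  sample_id = sample_ids,
  study_group = rep(c("healthy", "disease"), each = n_samples / 2),
  drop_rare(taxa),
  drop_rare(pathways),
  drop_rare(metabolites)
)

# Sanity checks: one row per sample, every feature column carries an omic prefix
feature_cols <- setdiff(colnames(input_data), c("sample_id", "study_group"))
stopifnot(nrow(input_data) == n_samples)
stopifnot(all(grepl("^(T|P|M)__", feature_cols)))

# Number of retained features per omic
print(table(sub("__.*$", "", feature_cols)))
```

Real data would also need the per-omic preprocessing and the clustering of highly correlated features mentioned in the instructions, which this sketch leaves out.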

For questions about the pipeline, please open an issue (https://github.com/efratmuller/MintTea/issues) or contact Prof. Elhanan Borenstein at elbo@tauex.tau.ac.il.


Backlog:

 * Support parallel running to shorten runtimes.
 * Generalize to support continuous labels.

[1] Gross, Samuel M., and Robert Tibshirani. "Collaborative regression." Biostatistics 16.2 (2015): 326-338.

[2] Singh, Amrit, et al. "DIABLO: an integrative approach for identifying key molecular drivers from multi-omics assays." Bioinformatics 35.17 (2019): 3055-3062.