Papatheodorou-Group / CATD_snakemake

Snakemake pipeline for benchmarking cell-type deconvolution methods and deconvolving real bulk RNA-seq data with the use of scRNA-seq datasets
GNU General Public License v3.0

CATD_snakemake

Critical Assessment of Transcriptomic Deconvolution - snakemake pipeline


                                                    \`*-.                   
                                                     )  _`-.    Hi          
                                                    .  : `. .               
                                  Hey               : _   '  \              
                     .-=-.          .--.            ; *` _.   `*-._         
         __        .'     '.       /  " )           `-.-'          `-.      
 _     .'  '.     /   .-.   \     /  .-'\             ;       `       `.    
( \   / .-.  \   /   /   \   \   /  /    ^            :.       .        \   
 \ `-` /   \  `-'   /     \   `-`  /                  . \  .   :   .-'   .  
  `-.-`     '.____.'       `.____.'                   '  `+.;  ;  '      :  
                                                      :  '  |    ;       ;-.
https://ascii.co.uk/art/snake                         ; '   : :`-:     _.`* ;
                                                   .*' /  .*' ; .*`- +'  `*'
                                                    `*-*   `*-*  `*-*'     

                                                   https://ascii.co.uk/art/cat

Maintainer contact: Anna Vathrakokoili Pournara annavp@ebi.ac.uk

Software authors ✍🏼: Chichao Miao, Anna Vathrakokoili Pournara & Ozgur Beker

Snakemake pipeline implementation 👨‍💻: Ozgur Beker ozgurbeker@sabanciuniv.edu

Many thanks to Nadja Nolte for testing and further development.

If you are interested in our pipeline or would like to include your method in the CATD pipeline, please contact us or create a pull request.

Please use the issue tracker if you encounter a problem.

Table of contents

Overview

The CATD pipeline is a benchmarking pipeline meant to facilitate the assessment of cell-type deconvolution methods (currently 29) across different simulation scenarios in a standardised way. It also allows the deconvolution of real bulk samples with many user-input parameters, so that users can deconvolute their own in-house data following our proposed guidelines. The pipeline includes:

For more details check our preprint:

Vathrakokoili Pournara, A., Miao, Z., Beker, O. Y., Brazma, A. & Papatheodorou, I. Power analysis of cell-type deconvolution methods across tissues. http://biorxiv.org/lookup/doi/10.1101/2023.01.19.523443 (2023) doi:10.1101/2023.01.19.523443.

Repo contents

The github repository includes:

/bin
.
├── basicSetup.sh  # script for the basic setup
├── config.yaml    # config file with all the user-defined parameters
├── getDag.sh      # generates a graphic of the directed acyclic graph (DAG) of jobs, where edges represent dependencies
├── runPip.sh      # script to run the pipeline
└── Snakefile      # driver script for the deconvolution assessment pipeline

Each module consists of an environment file, a Snakefile with the rule(s) used in that module, and the script that runs the method, e.g.:

/Modules/EpiDISH
.
├── env.yaml
├── EpiDISH_run.R
└── Snakefile

System Requirements

Hardware requirements

Except for file format conversion and results generation, it is NOT recommended to run the pipeline on personal computers.

Memory: The file format conversion and results generation parts of the pipeline require relatively little memory (enough to load the dataset into the session, ~4-5 GB for single-cell data of ~30k cells). The memory requirement for pseudo-bulk generation scales with the cores provided, so numThreads*memData memory should be allocated, with ~4-5 GB per allocated thread again being a good baseline. For deconvolution with CDSeq, Bisque, CPM, DWLS, MuSiC, SCDC, TIMER, bseqsc and MOMF, the memory requirement should be about the same as for pseudo-bulk generation. All other intermediate operations need minimal memory. Note that memory requirements will change with the size of the data, so users are recommended to check the logs of specific runs/modules and scale the baseline values given in config.yaml accordingly.

Threads: It is highly recommended to run the pipeline with at least 16 threads to avoid long runtimes, especially for pseudo-bulk generation. For bseqsc and CIBERSORT, 3 is the maximum useful number of threads. For DWLS, the number of threads should ideally be equal to, or smaller than, the number of cell types. Allocating more threads will not speed up these methods.
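The sizing rules above can be sketched as two small helpers. This is an illustrative back-of-the-envelope calculation only; the function names and arguments are hypothetical, not pipeline parameters.

```python
# Illustrative helpers for the memory/thread guidance above; names and
# defaults are hypothetical, not part of the pipeline.
def plan_memory_gb(num_threads: int, mem_per_thread_gb: float = 5.0) -> float:
    """Pseudo-bulk generation memory scales as numThreads * memData."""
    return num_threads * mem_per_thread_gb

def cap_threads(method: str, requested: int, n_cell_types: int) -> int:
    """Apply the per-method thread caps described above."""
    if method in ("bseqsc", "CIBERSORT"):
        return min(requested, 3)             # 3 threads is the useful maximum
    if method == "DWLS":
        return min(requested, n_cell_types)  # ideally <= number of cell types
    return requested

print(plan_memory_gb(16))          # 16 threads * ~5 GB -> 80.0 GB
print(cap_threads("DWLS", 16, 8))  # capped at the 8 cell types
```

For example, a 16-thread pseudo-bulk run on a ~30k-cell dataset would call for roughly 80 GB under the ~5 GB/thread baseline.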

For a general comparison of methods and their memory requirements, users can consult Supplementary Figure 9 which was generated with the default values in config.yaml.

Software requirements

OS:

There are no strict OS requirements; any OS that can run the shell scripts within the pipeline and meet the remaining requirements should work. The pipeline was tested on Ubuntu 20.04 (Linux).

Other software:

Mamba package manager (Mambaforge is recommended) should be installed, with instructions available at the respective repository.

Excluding the initial Snakemake install, the conda environments for each module of the pipeline are handled automatically by Snakemake / Mamba. To enable this, the setup script will prompt for the absolute path of your conda.sh file. The only exception is the CIBERSORT.R file, which is required for CIBERSORT and bseqsc deconvolution. Users should register and download the code manually, then place it under Modules/CIBERSORT and Modules/bseqsc respectively.

Installation guide

Instructions

  1. Download the pipeline to your directory of choice (also install git-lfs on your system before cloning; you can use conda).

    git clone https://github.com/Functional-Genomics/CATD_snakemake.git
  2. Set up the conda environment with Snakemake; the pipeline strictly uses mamba for much faster environment setup.

    mamba create -c conda-forge -c bioconda -n snakemake snakemake=7.14.0

Then downgrade tabulate to 0.8.10:

    mamba install tabulate=0.8.10
  3. Build the pipeline.

    make

If you are running the pipeline for the first time, use only a few samples so that all the environments are installed quickly (each module/method in our pipeline has its own environment).

  4. Place the input file in the Input directory that has been created (follow the link to the example file; use either XinY_4Decon.h5ad or XinY_4Decon_seurat.rds).

  5. Adjust the settings in config.yaml as suggested below (familiarise yourself with the config.yaml file and all the parameters you can choose). Specifically:

    sampleNames:      # enter the part BEFORE the file extension (ex. 'data.h5ad' should be entered as 'data')
      - XinY_4Decon   # in this case
    pbulkParam:
      pbulkMode: 2    # mode for building pbulks
      cellCount: 100  # how many cells should be used to create a bulk sample
      nSamples: 10    # how many different samples should be generated (the first time, generate only one sample)
      propVar: 500    # variance between proportions, check vignette (EFFECTIVE IN MODE 2 ONLY; enter a negative value to switch to min/max prop mode)
      sampleCT: 0     # sampling for cell types allowed or not (1 or 0) (EFFECTIVE IN MODE 2 ONLY)

  6. (Optional) Run getDag.sh to generate the updated DAG after adjusting the config. This will output a dag.png where you can observe the jobs that will be executed.

  7. (Optional) If on a cluster, set up a cluster profile if you haven't already; instructions are available here.

  8. (Optional) Dry run:

    conda activate snakemake

    snakemake -n

  9. Run the pipeline using bsub < runPip.sh (LSF) or through snakemake --use-conda --cores [N] if running locally.
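As a quick sanity check, the pbulkParam values shown above can be mirrored in a plain dict before a run. The checks below are only a sketch of the comments in config.yaml, not part of the pipeline itself.

```python
# Mirror of the pbulkParam settings above; the validation is illustrative.
pbulk_param = {
    "pbulkMode": 2,    # mode used to build the pseudo-bulks
    "cellCount": 100,  # cells used to create each bulk sample
    "nSamples": 10,    # samples to generate (use 1 on a first run)
    "propVar": 500,    # variance between proportions (mode 2 only)
    "sampleCT": 0,     # cell-type sampling allowed or not (mode 2 only)
}

assert pbulk_param["cellCount"] > 0 and pbulk_param["nSamples"] > 0
assert pbulk_param["sampleCT"] in (0, 1)
# A negative propVar switches mode 2 to min/max proportion mode
mode = "min/max prop" if pbulk_param["propVar"] < 0 else "variance-based"
print(mode)  # variance-based
```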

Time

Steps 1-3 should take a few seconds to complete. If all the environments are installed in the first run, it will take around 25-30 minutes; roughly 1 min/environment is a good estimate (step 9). When that job (step 9) finishes, all the environments will have been installed and users are ready to run actual experiments and test the pipeline.

Running the pipeline

IMPORTANT: As mentioned above, if running the pipeline for the first time, use very few samples (e.g. 10), as all the environments are installed throughout the workflow.

Self-reference deconvolution

Description

Uses one single-cell reference to generate both the pseudobulks and the reference for deconvolution benchmarking. Important assumptions:

Inputs:

Input/{sampleName}.h5ad             

OR              

Input/{sampleName}_seurat.rds
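The input naming convention above can be sketched in a few lines; the sample name and derived paths below use the example data from this README, and the snippet itself is illustrative, not pipeline code.

```python
# Sketch of the self-reference input naming convention: sampleNames in
# config.yaml is the file name without its extension, and the pipeline looks
# for either an AnnData (.h5ad) or a Seurat (_seurat.rds) file under Input/.
sample_name = "XinY_4Decon"  # example dataset used in this README
candidates = [
    f"Input/{sample_name}.h5ad",
    f"Input/{sample_name}_seurat.rds",
]
print(candidates)  # ['Input/XinY_4Decon.h5ad', 'Input/XinY_4Decon_seurat.rds']
```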

Example data:

Outputs:

Instructions

Cross-reference deconvolution

Description

Uses two single-cell references to generate the pseudobulks and references for deconvolution benchmarking. Important assumptions are:

Inputs:

Input/Cell_splits/{sampleName}_gen.rds      (Will be used to generate pseudobulks)
Input/Cell_splits/{sampleName}_C0.rds       (Will be used to generate references)

Example data:

(a)

(b)

Outputs:

Same as self-reference.

Instructions

Make sure that the inputs conform to the standards written in the 'Inputs' section above.

Real bulk RNA-Seq deconvolution

Description

Uses one reference single cell matrix with user-defined bulks and known proportions for deconvolution benchmarking. Assumptions are:

The first two options will only use half of the data to generate references. The third will use all of the data to generate the reference. Alongside the reference, you need to input the pseudo-bulks inside the folder specified under those names.

Example data:

    FinotelloB_WilkR_props.rds    # ground-truth proportions table
    FinotelloB_WilkR_pbulks.rds   # real bulk matrix (Finotello et al.)
    FinotelloB_WilkR_C0.rds       # WilkR single-cell expression matrix
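All three real-bulk files share a {sampleName} prefix; the paths below mirror the directories named in the Instructions of this section. The dict and its keys are illustrative, not pipeline code.

```python
# Sketch of the real-bulk file layout for a run named {sampleName}
# (FinotelloB_WilkR in the example data; "Psuedobulks" is the repo's
# directory spelling).
sample_name = "FinotelloB_WilkR"
expected_files = {
    "props":  f"Input/Psuedobulks/{sample_name}_props.rds",   # ground-truth proportions
    "pbulks": f"Input/Psuedobulks/{sample_name}_pbulks.rds",  # real bulk matrix
    "ref":    f"Input/Cell_splits/{sample_name}_C0.rds",      # single-cell reference
}
for path in expected_files.values():
    print(path)
```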

Outputs

Same as self-reference.

Instructions

Same as self-reference, except in the 4th step, note the following directory:

Input/Psuedobulks

This is where the bulks and the ground-truth proportions should go (FinotelloB_WilkR_pbulks.rds & FinotelloB_WilkR_props.rds). Also, enable realBulk in config.yaml. If you wish to use all the data for the reference, go to:

Input/Cell_splits

and place the reference file there under the name {sampleName}_C0.rds (FinotelloB_WilkR_C0.rds).

If you want to run your own real bulk data with a single-cell reference but do not have ground-truth proportions, enable realBulk-noProp in config.yaml. In this case you will not have a {sampleName}_props.rds file; the pipeline will not calculate any metrics and will output only a heatmap.

Outputs (realBulk-noProp)

Same as self-reference.

Appendices

Running CIBERSORT and bseqsc in the pipeline:

Users need to download the CIBERSORT.R code manually and place it within the modules (Modules/CIBERSORT and Modules/bseqsc) to run these two methods. Please register on the website https://cibersortx.stanford.edu/register.php first and access the CS Archive gallery to download CIBERSORT.R.

Sample directed acyclic graph (DAG) of the pipeline

The image below is generated by selecting only a couple of methods. If all methods are selected, snakemake cannot generate a pretty DAG as all deconvolution methods run in parallel.

Comprehensive Joblist

The full list of jobs looks like this:

job                           count    min threads    max threads
--------------------------  -------  -------------  -------------
Autogenes_run                     1              1              1
Bisque_run                        1              1              1
CDSeq_run                         1             16             16
CIBERSORT_run                     1              3              3
CPM_run                           1             16             16
CellmixInstall                    1              1              1
DCQ_run                           1              1              1
DSA_run                           1              1              1
DWLS_run                          1             16             16
DeconRNASeq_run                   1              1              1
EPIC_run                          1              1              1
EpiDISH_run                       1              1              1
FARDEEP_run                       1              1              1
Momf_run                          1              1              1
MuSiC_run                         1              1              1
NNLS_run                          1              1              1
OLS_run                           1              1              1
OmniInstall                       1              1              1
RLR_run                           1              1              1
SCDC_run                          1              1              1
TIMER_run                         1              1              1
all                               1              1              1
bseqsc_run                        1              3              3
convertAnndata                    1              1              1
debCAM_C1                         1              1              1
debCAM_marker                     1              1              1
deconf_run                        1              1              1
dtangle_run                       1              1              1
elasticNET_run                    1              1              1
exploreBenchmarks                 1              1              1
exploreResults                    1              1              1
generatePsuedobulks               1             32             32
generateReferences                1             32             32
lasso_run                         1              1              1
proportionsInAdmixture_run        1              1              1
ridge_run                         1              1              1
scaleTransform_T_C                1             32             32
splitCells                        1              1              1
ssFrobenius_run                   1              1              1
ssKL_run                          1              1              1
sumRun                            1              1              1
vioplotResults                    1              1              1
visualizeResults                  1              1              1
total                            43              1             32

Cleaning all outputs

Use make clean to delete all outputs from a workflow. Keep in mind that this will delete ALL outputs, including results.