ScPCA downstream analyses

Core analysis overview

This repository stores workflows used for performing downstream analyses on quantified single-cell and single-nuclei gene expression data available on the Single-cell Pediatric Cancer Atlas portal. More specifically, the repository currently contains a core workflow that performs initial pre-processing of gene expression data. Future development will include addition of optional workflows for extended data analyses to be applied to datasets after running the core workflow.

The core workflow takes as input the gene expression data for each library being processed and performs the following steps:

  1. Filtering: Each library is filtered to remove low quality cells, using either miQC::filterCells() or a series of manual thresholds (e.g., a minimum number of UMI counts). In addition to removing low quality cells, genes found in a low percentage of cells in a library can optionally be removed.
  2. Normalization and dimensionality reduction: Cells are normalized using the deconvolution method from Lun, Bach, and Marioni (2016), and reduced dimensions are calculated using both principal component analysis (PCA) and uniform manifold approximation and projection (UMAP). Normalized log counts and embeddings from PCA and UMAP are stored in the SingleCellExperiment object returned by the workflow.
  3. Clustering: Cells are assigned to cell clusters using graph-based clustering. By default, Louvain clustering is performed using the bluster::NNGraphParam() function with the default nearest neighbors parameter of 10. Alternatively, Walktrap graph-based clustering can be specified, and the nearest neighbors parameter can be altered if desired. Cluster assignments are stored in the SingleCellExperiment object returned by the workflow. A sketch of these steps in R appears below.
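
The following is a minimal R sketch of these three steps, assuming an input SingleCellExperiment object with raw counts; the input path, UMI cutoff, and number of highly variable genes are illustrative rather than the workflow's exact settings:

library(SingleCellExperiment)
library(scran)
library(scater)
library(bluster)
library(Matrix)

sce <- readRDS("/full/path/to/data/library01_filtered.rds")  # hypothetical path

# 1. Filtering (manual-threshold variant): keep cells with enough UMIs;
#    the workflow can alternatively filter with miQC::filterCells()
sce <- sce[, Matrix::colSums(counts(sce)) >= 500]

# 2. Normalization via the deconvolution method, then PCA and UMAP
qclust <- scran::quickCluster(sce)
sce <- scran::computeSumFactors(sce, clusters = qclust)
sce <- scater::logNormCounts(sce)
hvg <- scran::getTopHVGs(scran::modelGeneVar(sce), n = 2000)
sce <- scater::runPCA(sce, subset_row = hvg)
sce <- scater::runUMAP(sce, dimred = "PCA")

# 3. Graph-based clustering (Louvain with 10 nearest neighbors by default)
sce$cluster <- scran::clusterCells(
  sce,
  use.dimred = "PCA",
  BLUSPARAM = bluster::NNGraphParam(k = 10, cluster.fun = "louvain")
)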

You can read more details about the individual steps of the workflow in the processing documentation linked below:

View Processing Information Documentation

Quick Start Guide

To run the core analysis workflow, complete the following steps in order:

  1. Clone the repository and install Snakemake using the instructions provided in the Snakemake docs.
  2. Install the packages and dependencies that are required to run the workflow.
  3. Ensure that the input single-cell gene expression data are stored as SingleCellExperiment objects in RDS files (see more on this in the "Input data format" section). The workflow can directly take as input the filtered RDS files downloaded from the Single-cell Pediatric Cancer Atlas portal or the output from the scpca-nf workflow, a workflow that can be used to quantify your own single-cell/single-nuclei gene expression data. If working with data from the ScPCA portal, see more information on preparing that data to run the core workflow here.
  4. Create a metadata tab-separated value (TSV) file that defines the sample id, library id, and filepath associated with the pre-processed SingleCellExperiment files to be used as input for the workflow.
  5. Configure the config file to adjust the results_dir and project_metadata parameters to point to the full path to your desired results directory and project metadata file that you created in step 4.
  6. Open a terminal and run the workflow using the following Snakemake command:
snakemake --cores 2 --use-conda

Note that R 4.2 is required for running our pipeline, along with Bioconductor 3.16. Package dependencies for the analysis workflows in this repository are managed using renv, and renv must be installed locally prior to running the workflow. If you are using conda, dependencies can be installed as part of the setup mentioned in step 2 above.
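
If you are installing dependencies without conda, renv itself can be installed from CRAN in an R session, and the project's package library is then typically restored from the repository's lockfile with renv::restore(); the following is a sketch under that assumption, and the independent installation instructions referenced below contain the exact steps:

install.packages("renv")  # install renv from CRAN
renv::restore()           # restore the package versions recorded in renv.lock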

If you did not install dependencies with conda via snakemake in step 2, you will need to remove the --use-conda flag from the command above. See the section on running the workflow for more information.

Output Files

There are two expected output files that will be associated with each provided SingleCellExperiment object and library_id:

  1. A processed SingleCellExperiment object, stored as an RDS file, containing the normalized data and clustering results
  2. A summary HTML report detailing the filtering of low quality cells, dimensionality reduction, and clustering performed within the workflow

See the expected output section for more information on these output files. You can also download a ZIP file with an example of the output from running the core workflow, including the summary HTML report and processed SingleCellExperiment objects stored as RDS files here.

1. How to install the core downstream analyses workflow

a) Clone the repository

First you will want to clone the scpca-downstream-analyses repository from GitHub.

We recommend cloning this repository into a separate folder specifically for git repositories. Open a local Terminal window and use cd to navigate to the desired local directory for storing the repository, and run the following command to clone the repository:

git clone https://github.com/AlexsLemonade/scpca-downstream-analyses.git

More instructions on cloning a GitHub repository can be found here.

Once the repository is successfully cloned, a folder named scpca-downstream-analyses containing a local copy of the contents of the repository will be created.

b) Install Snakemake

What is Snakemake?

The core downstream single-cell analysis pipeline, which includes filtering, normalization, dimensionality reduction, and clustering, is implemented using the Snakemake workflow manager. With Snakemake, users can run the core downstream analysis pipeline with a single call, rather than having to run each analysis script on its own. You will therefore need to install Snakemake before running the pipeline.

How to install Snakemake

Note that the minimum version of Snakemake you will need to have installed is version 7.20.0.

You can install Snakemake by following the instructions provided in Snakemake's docs.

The Snakemake documentation recommends installation via the conda package manager. Here are the instructions to install conda. We recommend the Miniconda installation.

After installing conda, you can follow the steps below to set up the bioconda and conda-forge channels and install Snakemake in an isolated environment:

# Configure the bioconda and conda-forge channels with strict priority
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict

# Install mamba into the base environment, then use it to create an
# isolated environment containing Snakemake
conda install -n base mamba
conda activate base
mamba create -n snakemake snakemake
conda activate snakemake

c) Additional dependencies

To run the Snakemake workflow, you will need R version 4.2 installed, as well as the optparse and renv R packages and pandoc. These can be installed independently if desired, but we recommend using Snakemake's conda integration to set up the R environment and all dependencies that the workflow will use, as described below.

Snakemake/conda installation

Snakemake can handle the required dependencies by creating its own conda environments, which we have provided as an option. To create the necessary environment, which includes an isolated version of R, pandoc, and all required dependencies, run the following command from the base of the repository:

bash setup_envs.sh

This script will use Snakemake to install all necessary components for the workflow in an isolated environment. If you are on an Apple Silicon (M1/M2/Arm) Mac, this should properly handle setting up R to use an Intel-based build for compatibility with Bioconductor packages.

To use the environment you have just created, you will need to run Snakemake with the --use-conda flag each time.

If you would like to perform installation without the conda environments as described above, see the independent installation instructions document.

2. Verify input data format

The expected input for our core single-cell downstream analysis pipeline is a SingleCellExperiment object that has been stored as an RDS file. This SingleCellExperiment object should contain non-normalized gene expression data with barcodes as the column names and gene identifiers as the row names. All barcodes included in the SingleCellExperiment object should correspond to droplets likely to contain cells; empty droplets should already have been removed (e.g., by retaining only droplets with FDR < 0.01 as calculated with DropletUtils::emptyDropsCellRanger()). The full path to each individual RDS file should be defined in the project metadata described in the following "How to run the pipeline" section.
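
As a quick sanity check, you can confirm that an input file matches this expected format in an R session (the path below is hypothetical):

library(SingleCellExperiment)

sce <- readRDS("/full/path/to/data/library01_filtered.rds")
stopifnot(is(sce, "SingleCellExperiment"))

counts(sce)[1:3, 1:3]  # non-normalized counts
head(colnames(sce))    # barcodes as column names
head(rownames(sce))    # gene identifiers as row names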

The pipeline in this repository is set up to process data available on the Single-cell Pediatric Cancer Atlas portal and output from the scpca-nf workflow, where single-cell/single-nuclei gene expression data are mapped and quantified using alevin-fry. For more information on this pre-processing, please see the ScPCA Portal docs. Note, however, that the input for this pipeline is not required to be scpca-nf processed output.

If you are working with data downloaded from the ScPCA portal, see our guide on preparing that data to run the core workflow here.

3. Create metadata file

Now the environment should be all set to run the Snakemake workflow. Before running the workflow, you will need to create a project metadata file as a tab-separated value (TSV) file that contains the relevant information about your input files. The file should contain the following columns:

  - sample_id: the sample identifier associated with the library
  - library_id: the library identifier, used to name the output files
  - filepath: the full path to the RDS file containing the pre-processed SingleCellExperiment object
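
For example, a minimal project metadata file might look like the following, with columns separated by tabs (the paths here are hypothetical):

sample_id	library_id	filepath
sample01	library01	/full/path/to/data/library01_filtered.rds
sample01	library02	/full/path/to/data/library02_filtered.rds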

View Example Metadata File

4. Configure config file

We have provided an example Snakemake configuration file, config/config.yaml, which defines all parameters needed to run the workflow. Learn more about Snakemake configuration files here.

The config file contains two sets of parameters: project-specific parameters that must be updated to point to your data (listed below), and processing parameters that come with default values.

You can modify the relevant parameters by manually updating the config/config.yaml file using a text editor of your choice.

To run the workflow on your data, modify the following parameters in the config/config.yaml file:

| Parameter | Description |
| --- | --- |
| results_dir | Full path to the directory where output files from running the core workflow will be stored |
| project_metadata | Full path to your specific project metadata TSV file |
| mito_file | Full path to a file containing a list of mitochondrial genes specific to the reference genome or transcriptome version used for alignment. By default, the workflow uses the mitochondrial gene list from Ensembl version 104 of the Human transcriptome, found in the reference-files directory. |

By default, these parameters point to the example data. The two example _filtered.rds files were both processed using the scpca-nf workflow. Therefore, if you would like to test this workflow using the example data, you can continue to the next step, running the workflow, without modifying the config file.
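
As a sketch, the project-specific section of a modified config/config.yaml might look like the following (the paths are hypothetical, and mito_file can usually be left at its default value):

results_dir: "/full/path/to/my-results"
project_metadata: "/full/path/to/my-project-metadata.tsv"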

The config file also contains additional processing parameters like cutoffs for minimum genes detected, minimum unique molecular identifiers (UMI) per cell, etc. We have set default values for these parameters. Learn more about the additional processing parameters and how to modify them.

See the processing information documentation for more information on the individual workflow steps and how the parameters are used in each of the steps.

View Config File

5. Running the workflow

After you have successfully modified the required project-specific parameters in the config file, you can run the Snakemake workflow with just the --cores and --use-conda flags, as in the following example:

snakemake --cores 2 --use-conda

It is mandatory to specify the number of CPU cores for snakemake to use by using the --cores flag. If --cores is given without a number, all available cores are used to run the workflow.

Note: If you did not install dependencies with conda via snakemake, you will need to remove the --use-conda flag.

You can also modify the config file parameters at the command line, rather than manually as recommended in step 4. See our command line options documentation for more information.
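
For example, the project-specific parameters from step 4 can be supplied directly with Snakemake's --config flag (the paths are hypothetical):

snakemake --cores 2 --use-conda \
  --config results_dir="/full/path/to/my-results" \
  project_metadata="/full/path/to/my-project-metadata.tsv"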

Note: For Data Lab staff members working on development, new changes should be merged through a pull request to the development branch. Changes will be pushed to the main branch once changes are ready for a new release (per the release checklist document).

6. Expected output

For each SingleCellExperiment and associated library_id used as input, the workflow will return two files: a processed SingleCellExperiment object containing normalized data and clustering results, and a summary HTML report detailing the filtering of low quality cells, dimensionality reduction, and clustering that was performed within the workflow. These files can be found in the example_results folder, as defined in the config.yaml file. Within the example_results folder, output files for each library will be nested within folders labeled with the provided sample_id. Each output filename will be prefixed with the associated library_id. Below is an example of the nested file structure you can expect.

example_results
└── sample_id
     ├── <library_id>_core_analysis_report.html
     └── <library_id>_processed.rds

The <library_id>_core_analysis_report.html file is the HTML file containing the summary report of the filtering, dimensionality reduction, and clustering results associated with the processed SingleCellExperiment object.

The <library_id>_processed.rds file is the RDS file that contains the final processed SingleCellExperiment object (which contains the filtered, normalized data and clustering results).

You can also download a ZIP file with an example of the output from running the core workflow, including the summary HTML report and processed SingleCellExperiment objects stored as RDS files here.

What to expect in the output SingleCellExperiment object

As a result of the normalization step of the workflow, a log-transformed normalized expression matrix can be accessed using logcounts(sce).

In the colData of the output SingleCellExperiment object, you can find per-cell information, including the cluster assignments produced by the clustering step of the workflow.

In the metadata of the output SingleCellExperiment object, which can be accessed using metadata(sce), you can find the following information:

| Metadata Key | Description |
| --- | --- |
| scpca_filter_method | The type of filtering performed (miQC or manual) on the expression data |
| prob_compromised_cutoff | The maximum probability of a cell being compromised; only present when filtering_method is set to miQC |
| miQC_model | The linear mixture model calculated by miQC; only present when filtering_method is set to miQC |
| mito_percent_cutoff | The maximum percent mitochondrial reads per cell; only present when filtering_method is set to manual |
| genes_filtered | Indicates whether or not genes have been filtered |
| detected_gene_cutoff | The minimum number of genes detected per cell; only present when filtering_method is set to manual |
| umi_count_cutoff | The minimum unique molecular identifiers (UMI) per cell; only present when filtering_method is set to manual |
| num_filtered_cells_retained | The number of cells retained after filtering with the specified method (miQC or manual) |
| normalization | The type of normalization performed: deconvolution when pre-clustering of similar cells with scran::quickCluster() prior to normalization with scater::logNormCounts() succeeds, or log-normalization otherwise |
| highly_variable_genes | The subset of the most variable genes, determined using scran::getTopHVGs() |

You can find more information on the above in the processing information documentation.
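
As a brief sketch, you can explore a processed object in an R session like so (the path is hypothetical; the metadata keys are those described in the table above):

library(SingleCellExperiment)

sce <- readRDS("example_results/sample01/library01_processed.rds")

logcounts(sce)[1:5, 1:5]                   # log-transformed normalized expression
head(colData(sce))                         # per-cell information
metadata(sce)$scpca_filter_method          # "miQC" or "manual"
metadata(sce)$num_filtered_cells_retained  # cells kept after filtering
head(metadata(sce)$highly_variable_genes)  # top HVGs from scran::getTopHVGs()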

Additional analysis modules

Clustering analysis

There is an optional clustering analysis workflow stored in the optional-clustering-analysis subdirectory of this repository. This workflow can help users identify the optimal clustering method and parameters for each library in their dataset. Because every library is unique, the optimal clustering is likely to be library-dependent. The provided workflow can be used to explore different clustering methods and test a range of parameter values for each method in order to identify the optimal clustering for each library.
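
As a rough illustration of this kind of exploration (a hedged sketch only, not the module's actual code; the input path and parameter values are hypothetical):

library(SingleCellExperiment)
library(bluster)

sce <- readRDS("example_results/sample01/library01_processed.rds")
pcs <- reducedDim(sce, "PCA")

# Compare Louvain clusterings across a range of nearest-neighbor values,
# using mean approximate silhouette width as one rough quality measure
for (k in c(5, 10, 15, 20)) {
  clusters <- bluster::clusterRows(pcs, bluster::NNGraphParam(k = k, cluster.fun = "louvain"))
  sil <- bluster::approxSilhouette(pcs, clusters)
  message("k = ", k, ": ", nlevels(clusters), " clusters, mean silhouette width = ",
          round(mean(sil$width), 3))
}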

For more on what's in the clustering analysis workflow and how to run the workflow, see the README.md file in the clustering analysis subdirectory.

Genes of interest analysis

There is an optional genes of interest analysis pipeline in the optional-goi-analysis subdirectory of this repository. This workflow can help users evaluate expression of a specific list of genes in their sample dataset and how the expression values compare to the remaining genes in the dataset.

For more on what's in the genes of interest analysis workflow and how to run the workflow, see the README.md file in the genes of interest analysis subdirectory.

Data integration analysis

There is an optional data integration analysis pipeline in the optional-integration-analysis subdirectory of this repository. This workflow can be used to combine data from multiple individual single-cell/single-nuclei libraries into a single integrated object.

For more on what's in the data integration analysis workflow and how to run the workflow, see the README.md file in the integration analysis subdirectory.