WGLab / SCOTCH

Single-Cell Omics for Transcriptome CHaracterization (SCOTCH): isoform-level characterization of gene expression through long-read single-cell RNA sequencing
MIT License
5 stars 0 forks source link

SCOTCH

Single-Cell Omics for Transcriptome CHaracterization (SCOTCH): isoform-level characterization of gene expression through long-read single-cell RNA sequencing. See our pre-print at here. All codes and analysis results in the manuscript can be found at here

Background

Recent development involving long-read single-cell transcriptome sequencing (lr-scRNA-Seq) represents a significant leap forward in single-cell genomics. With the recent introduction of R10 flowcells by Oxford Nanopore, computational methods should now shift focus on harnessing the unique benefits of long reads to analyze transcriptome complexity. In this context, we introduce a comprehensive suite of computational methods named Single-Cell Omics for Transcriptome CHaracterization (SCOTCH). Our method is compatible with the single-cell library preparation platform from 10X Genomics, Pacbio Biosciences, and Parse Biosciences, facilitating the analysis of special cell populations, such as neurons, hepatocytes and developing cardiomyocytes.

SCOTCH offers both a preprocessing and a statistical pipeline. The preprocessing pipeline accepts BAM files with tagged barcodes generated by vendor-supplied pipelines (e.g., wf-single-cell) and aligns reads to both known and novel isoforms. SCOTCH can leverage existing high-quality gene annotations, update these annotations using the data to enhance novel isoform identification, or operate entirely annotation-free. SCOTCH outputs count matrices at both the gene and transcript levels. The statistical pipeline supports the analysis of differential transcript usage (DTU), which is determined by the relative abundances of transcripts within the same gene.

Part1: Preprocessing Pipeline

Installation

To avoid package version conflicts, we strongly recommand to use conda to set up the environment. If you don't have conda installed, you can run codes below in linux to install.

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh 
bash Miniconda3-latest-Linux-x86_64.sh

After installing conda, SCOTCH sources can be downloaded.

git clone https://github.com/WGLab/SCOTCH.git
cd SCOTCH
conda env create --name SCOTCH --file scotch.yml
conda activate SCOTCH

Run SCOTCH preprocessing pipeline

main_preprocessing.py is the primary function for running the SCOTCH preprocessing pipeline. This pipeline consists of four steps: (1) generating gene annotation files (--task annotation), (2) generating the compatibility matrix (--task 'compatible matrix'), (3) summarizing novel gene annotations(--task summary), and (4) generating the count matrix (--task 'count matrix'). We will cover more details below. SCOTCH is currently compatible with several platforms (set: --platform 10x-ont or --platform 10x-pacbio or parse-ont)

Input and Output

In general, SCOTCH accepts BAM file(s) tagged by vendor-supplied pipelines for its preprocessing pipeline (For ONT data, please see here for more details), and output a series of information including read-isoform mappings, gene count matrix, isoform count matrix, and updated gene annotations. The --bam argument is used to assign input file(s), and --target is used to assign the output folder(s).

The SCOTCH pipeline allows users to process one sample at a time or multiple samples simultaneously depending on study designs. When preprocessing multiple samples together, SCOTCH generates a unified gene/isoform annotation based on all BAM files. This approach is particularly beneficial in studies involving multiple samples, as it ensures consistent identification and consolidation of novel isoforms across different samples. Some usage examples are below:

Example1: --bam accepts a single bam file

--bam sample1.bam --target path/to/sample1/output

Example2: --bam accepts multiple bam files, each from one sample

--bam sample1.bam sample2.bam --target path/to/sample1/output path/to/sample2/output

Example3: --bam accepts a folder of bam files from one sample

--bam path/to/sample1/bamfiles/folder --target path/to/sample1/output

Example4: --bam accepts multiple bam folders, each from one sample

--bam path/to/sample1/bamfiles/folder path/to/sample2/bamfiles/folder --target path/to/sample1/output path/to/sample2/output

Below is the directory structure of output under each target folder you can expect:

Step1: prepare annotation file

In this step, SCOTCH will generate annotation files for the reference genome and tagged BAM files. Set --tast annotation to run this step. Note that, users have to use the same gtf file throughout, including during the genome alignment step, which is typically processed by vendor-supplied pipelines.

SCOTCH offers three modes for generating gene annotations:

  1. Annotation-Only Mode: SCOTCH can rely entirely on existing gene annotations. This mode allows for the discovery of novel isoforms defined by combinations of known exons. This mode will fail to identify novel isoforms involving unknown exons, such as intron retention. Set --reference as path to gene annotation .gtf file. To save time, users can also set --reference_pkl as the path to SCOTCH generated annotation based on given gtf file. SCOTCH has pre-computated this file based on this (human hg38) provided by 10X genome. In addition, set --update_gtf_off.
  2. Enhanced-Annotation Mode: SCOTCH can use BAM files from one or multiple samples to update and refine existing gene annotations. This mode allows for the discovery of de novo (sub)exons with more types of novel isoforms than annotation-only mode. Set --reference as path to gene annotation .gtf file. To save time, users can also set --reference_pkl as the path to SCOTCH generated annotation based on given gtf file. SCOTCH has pre-computated this file based on this (human hg38) provided by 10X genome. In addition, set --update_gtf.
  3. Annotation-Free Mode: SCOTCH can generate gene and isoform annotations based solely on BAM files, allowing for the discovery of novel genes and isoforms. Set --reference None and --reference_pkl None.

arguments

Below is an example of generating annotation in the Enhanced-Annotation Mode for two samples simultanuously. See example/annotation.sh for an implementation in slurm.

python3 src/main_preprocessing.py \
--task annotation \
--platform 10x-ont \
--target path/to/output/folder/of/sample1 path/to/output/folder/of/sample2 \
--bam path/to/bam/file/or/bamfolder/sample1 path/to/bam/file/or/bamfolder/sample2 \
--update_gtf \
--reference path/to/reference/genes.gtf \
--reference_pkl data/geneStructureInformation.pkl \
--workers 30

Step2: generate compatible matrix

In this step, SCOTCH aligns reads to existing gene isoforms and identifies novel isoforms based on the annotation files generated in step 1. Set --tast 'compatible matrix' to run this step. A compatibility matrix is generated for each gene, where each row represents a read and each column represents a gene isoform. Detailed information on read-isoform mappings can be found in the auxiliary folder. To speed up this step, job arrays can be submitted in SLURM to process genes in parallel. An example implementation for 100 job arrays is provided in example/compatible.sh.

python3 src/main_preprocessing.py \
--task 'compatible matrix' \
--target path/to/output/folder/of/sample1 path/to/output/folder/of/sample2 \
--bam path/to/bam/file/or/bamfolder/sample1 path/to/bam/file/or/bamfolder/sample2 \
--reference path/to/reference/genes.gtf 

Step3: summary gene annotations

In this step, SCOTCH will generate a new gene annotation file, including all novel isoforms identified. Set --tast summary to run this step. See example/summary.sh for an implementation in SLURM.

python3 src/main_preprocessing.py \
--task 'summary' \
--target path/to/output/folder/of/sample1 path/to/output/folder/of/sample2

Step4: generate count matrix

Finally, SCOTCH will generate gene- and isoform-level copunt matrix. Set --tast 'count matrix' to run this step. See example/count.sh for an implementation in SLURM.

Part2: Statistical Pipeline

Installation

In R, run below codes to install SCOTCH statistical pipeline.

if (!requireNamespace("devtools", quietly = TRUE))
install.packages("devtools")
library("devtools")
install_github("WGLab/SCOTCH")

DTU analysis

Below is sample codes for DTU analysis. Example data can be found in /data folder.

library(SCOTCH)

#----read gene-level count matrix-----#
sample8_CD4_gene=as.matrix(read.csv('gene_count_1.csv',row.names = 'X'))
sample8_CD8_gene=as.matrix(read.csv('gene_count_2.csv',row.names = 'X'))

#----read transcript-level count matrix-----#
sample8_CD4_transcript=as.matrix(read.csv("transcript_count_1.csv",row.names = 'X'))
gene_transcript_CD4_df = data.frame(genes=str_remove(colnames(sample8_CD4_transcript),"\\.(ENST|novel|uncategorized).+"),
                                    transcripts=colnames(sample8_CD4_transcript))

sample8_CD8_transcript=as.matrix(read.csv("transcript_count_2.csv",row.names = 'X'))
gene_transcript_CD8_df = data.frame(genes=str_remove(colnames(sample8_CD8_transcript),"\\.(ENST|novel|uncategorized).+"),
                                    transcripts=colnames(sample8_CD8_transcript))

#----gene-level analysis-----#
df_gene = scotch_gene(sample8_CD4_gene, sample8_CD8_gene), epsilon=0.01,ncores=10)%>%
  filter(pct1>=0.01|pct2>=0.01)

#----transcript-level analysis-----#
df_transcript = scotch_transcript(gene_transcript_CD4_df,gene_transcript_CD8_df, 
                                  sample8_CD4_transcript, sample8_CD8_transcript, ncores=10)
df_scotch = df_gene%>%left_join(df_transcript,by=c('genes'='gene'))

SCOTCH will output the following statistics in results:

genes: gene name
pct1: percent of cells expression the gene in cell population 1
pct2: percent of cells expression the gene in cell population 2
logFC: fold change in log scale
p_gene: p value on the gene level
pct_diff: difference between pct1 and pct2
p_gene_adj: adjusted p value of differential gene expression
isoforms: isoform name
alpha1: estimated parameter of the dirichlet distribution for cell population 1
alpha2: estimated parameter of the dirichlet distribution for cell population 2
TU1: relative transcript usage of cell population 1
TU2: relative transcript usage of cell population 2
TU_diff: difference of TU1 and TU2
TU_var1: variance of TU1
TU_var2: variance of TU2
dispersion1: dispersion parameter for cell population 1
dispersion2: dispersion parameter for cell population 2
isoform_switch: whether there is isoform switching event
isoform_switch_ES: effect size of isoform switching event
p_DTU_gene: p value for differential transcript usage analysis for the whole gene
p_transcript: p value for differential transcript usage analysis for the specific transcript
p_transcript_adj: adjusted p value for differential transcript usage analysis for the specific transcript
p_DTU_gene_adj: adjusted p value for differential transcript usage analysis for the whole gene