LuChenLab / SCAPE

Single Cell Alternative Polyadenylation using Expectation-maximization
MIT License
6 stars 2 forks source link

Single Cell Alternative Polyadenylation using Expectation-maximization (SCAPE)

what is SCAPE?

SCAPE is a tool for estimating alternative polyadenylation (APA) events from single-end or pair-end scRNA-seq data (10x, Microwell-seq). The scRNA-seq must be generated by protocols based on polyT enrichment. SCAPE utilizes the insert size information of cDNA fragments to estimate polyadenylation sites (pA sites).

SCAPE models each pA sites as a Gaussian distribution, the corresponding isoform of which also has a weight (proportion). The final estimations are the mean and standard deviation of all pA sites, as well as their weights. The following figure shows the basic idea of SCAPE.

model_graph

SCAPE-APA (A update version, 2024.3.15)

To accommodate the analysis of extensive scRNA-seq datasets, SCAPE-APA has been introduced as a revamped version of SCAPE, incorporating significant modifications. The updates to the package include:

  1. Grouping similar reads to expedite the estimation process.
  2. A revised mixture model specifically adapted for binned reads.
  3. Implementation of a Taichi framework for enhanced computational speed.
  4. Detection of erroneous alternative polyadenylation sites arising from junction reads.
  5. Support for installation via PyPI.

ref: SCAPE-APA: a package for estimating alternative polyadenylation events from scRNA-seq data

Input

scRNA-seq data (bam file)

This can be single-end or pair-end sequencing scRNA-seq data. We only tested on data sequenced on Illumina platform. If special treatments have been taken regarding insert size in the library preparation, it is recommended to adjust the insert size distribution in the source code.

Gene annotation file (GTF file)

The Gene transfer format (GTF) is a file format used to hold information about gene structure. Since APA is an splicing event in the 3'-UTR, only reads falls in the 3'-UTR is relevant. GTF file is needed to pull out reads that fall into 3'-UTR, which is then used for the analysis.

GTF files for different species can be downloaded from Ensembl database.

Output

The output will be a csv file correpsonding to a pA count matrix. Each row is a pA site and each column is a cell.

Installation

There is no need to install anything other than python3. Just download this repository and run the python scripts.

Usage

SCAPE is written in Python. It is a pre-requisite to have python3 installed before running SCAPE. SCAPE has two general functions (1) data preprocessing and (2) APA event inference.

$ git clone https://github.com/LuChenLab/SCAPE

$ conda create -n scape python=3.7.3

$ conda source activate scape

$ pip install -r SCAPE/requirements.txt

$ python SCAPE/main.py --help

Usage: python main.py [OPTIONS] COMMAND [ARGS]...

  An analysis framework for analysing alternative polyadenylation at single
  cell levels. Current version: 1.0.0

Options:
  --help  Show this message and exit.

Commands:
  apamix
  prepare

prepare

Preprocessing utr and intron region for inferring alternative polyadenylation events

Usage: python main.py prepare [OPTIONS]

Options:
  --gtf TEXT     The gtf (gz or not, but must be sorted) file for preparing utr and intron region.
                 [required]

  --prefix TEXT  The prefix of output bed file.
  --help         Show this message and exit.

run with bedtools version 2.26.0

# bedtools v2.26.0
bedtools sort -i GRCh38.p12.cr.gtf  | bgzip > GRCh38.p12.cr.gtf.gz
python main.py prepare --gtf GRCh38.p12.cr.gtf.gz --prefix GRCh38

apamix

Perform the actual alternative polyadenylation events inference. If one wants to change the default paramters such as insert size distribution, maximum number of pA sites etc, it can be done by modifying the file "apamix.py" under folder "apamix".

Usage: python main.py apamix [OPTIONS]

Options:
  --bed TEXT             The target regions (bed format) used for mix
                         inference  [required]
  --bam TEXT             The bam file (sorted and indexed)  [required]
  -o, --out TEXT         The output path  [required]
  --cores INTEGER        Num (cores) of region are infering at once
  --cb TEXT              The cell barcode file, one cb for one line.
                         [required]
  --tag TEXT             The cell barcode and UMI tag, for 10X: CB,UB.
                         [required]
  --n_max_apa INTEGER    The maximum number of pA sites. Default value is 5.
                         [required]
  --n_min_apa INTEGER    The minimum number of pA sites. Default value is 1.
                         [required]
  --max_utr_len INTEGER  The maximum length of UTR. Default value is 6000.
  --la_dis_arr TEXT      The distinct lengths of polyA lengths in the dataset.
                         Default: np.arange(self.min_LA, self.max_LA, 10).
                         User counld pass '[10, 30, 50, 70, 90, 110, 130]'.
  --pmf_la_dis_arr TEXT  The the number of reads for each distinct polyA
                         length. Default: Unif(min_LA, max_LA). '[309912,
                         4107929, 802856, 518229, 188316, 263208, 101]'.
  --mu_f INTEGER         The mean of insert size for illumina library. Default: 300
  --sigma_f INTEGER      The std of insert size for illumina library. Default: 50
  -v, --verbose          Verbose mode
  --help                 Show this message and exit.

run


python main.py apamix \
--bed GRCh38_utr.bed \
--bam input.bam \
--out data/ \
--cores 12 \
--cb cellbarcode.tsv

Downstream analysis

The output of SCAPE can be loaded into a Seurat object in R, which could be easily use for downstream single cell analyses offered by Seurat.

Examples

Visit wiki for examples of running SCAPE.

Methods comparison on simulated data

The code for method comparison in our paper can be found here.

Citation

If you use SCAPE in your publication, please cite SCAPE by

Zhou et al. SCAPE: A mixture model revealing single-cell polyadenylation diversity and cellular dynamic during cell differentiation and reprogramming. Nucleic Acids Research. 2022, gkac167, https://doi.org/10.1093/nar/gkac167