gaolabtools / scNanoGPS

Single cell Nanopore sequencing data for Genotype and Phenotype

scNanoGPS: Single cell Nanopore sequencing data for Genotype and Phenotype

scNanoGPS is a computational toolkit for analyzing high-throughput single cell nanopore sequencing data to detect Genotypes and Phenotypes Simultaneously from the same cells. scNanoGPS includes 5 major steps:

1) NanoQC performs quality control of the raw sequencing data.
2) Scanner scans reads and filters out those that lack the expected adapter sequence patterns, i.e., the TruSeq Read 1 adapter sequence, TSO adapter sequence, poly(A/T)n block sequence, cell barcode (CB), and unique molecular identifier (UMI) sequence blocks.
3) Assigner detects the list of true cell barcodes, merges cell barcodes that differ by sequencing errors, and assigns raw reads to single cells.
4) Curator detects reads with true UMIs and collapses them into consensus sequences of individual molecules to correct sequencing errors on gene bodies.
5) Reporter detects single cell transcriptomes, single cell gene isoforms, and single cell mutations from the consensus single cell long-read data.

Keywords

Single cell, Nanopore, RNA sequencing, long read, cell barcode demultiplex, UMI curation, gene expression, isoform, single nucleotide variation

Citing scNanoGPS

Shiau, CK., Lu, L., Kieser, R. et al. High throughput single cell long-read sequencing analyses of same-cell genotypes and phenotypes in human tumors. Nat Commun 14, 4124 (2023). https://doi.org/10.1038/s41467-023-39813-7

Installation

The scNanoGPS pipeline is built with Python 3. We recommend installing it in an Anaconda/Miniconda virtual environment. Refer to the Anaconda tutorial for how to build an environment.

Build python3 virtual environment

Install scNanoGPS and dependencies

Install other essential tools

scNanoGPS uses the following third-party tools for mapping against the reference genome, collapsing reads with the same UMIs, and summarizing single cell gene expression, isoform, and SNV profiles.

Prepare reference genome and annotations

Step 1: NanoQC

Read length distribution

scNanoGPS contains a script named “read_length_profiler.py” to compute the raw read lengths of all reads. The script can either read through individual FastQ/Fast5 files or all FastQ/Fast5 files under a given folder. The raw read length histogram is drawn accordingly.
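The idea of a read length profile can be illustrated with a minimal sketch. This is not the code of read_length_profiler.py; the function names `read_lengths` and `length_histogram` and the `bin_size` parameter are hypothetical, and only FastQ input (plain or gzipped, four lines per record) is handled.

```python
import gzip
from collections import Counter


def read_lengths(fastq_path):
    """Yield the sequence length of every record in a plain or gzipped FastQ file."""
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    with opener(fastq_path, "rt") as handle:
        for line_no, line in enumerate(handle):
            # In a four-line FastQ record the sequence is the second line.
            if line_no % 4 == 1:
                yield len(line.strip())


def length_histogram(lengths, bin_size=100):
    """Count reads per length bin; each key is the lower bound of its bin."""
    return Counter((n // bin_size) * bin_size for n in lengths)
```

The resulting counter can be passed to any plotting library to draw the histogram.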

FastQC (optional)

Per the experimental design of the read architecture, the TruSeq Read 1, cell barcode (CB), unique molecular identifier (UMI), and polyA tail are expected to lie within either the first or the last 100 nucleotides of each Nanopore read. Users can run FastQC on the first and last 100 nucleotides of individual Nanopore reads to draw the per-base quality score boxplot.

You can run FastQC to check first_tail.fastq.gz and last_tail.fastq.gz for quality score distribution.
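As an illustration of what these two files contain, the sketch below trims each read to its first and last 100 nucleotides and writes them to two gzipped FastQ files. How scNanoGPS itself produces first_tail.fastq.gz and last_tail.fastq.gz is not shown here; the helper names `iter_fastq` and `write_tails` and the `tail_len` parameter are hypothetical.

```python
import gzip


def iter_fastq(path):
    """Yield (name, sequence, quality) for each four-line FastQ record."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            name = handle.readline().rstrip("\n")
            if not name:
                break
            seq = handle.readline().rstrip("\n")
            handle.readline()  # separator line ("+")
            qual = handle.readline().rstrip("\n")
            yield name, seq, qual


def write_tails(fastq_path, first_out="first_tail.fastq.gz",
                last_out="last_tail.fastq.gz", tail_len=100):
    """Write the first and last tail_len bases of every read; return the read count."""
    n_reads = 0
    with gzip.open(first_out, "wt") as first, gzip.open(last_out, "wt") as last:
        for name, seq, qual in iter_fastq(fastq_path):
            first.write(f"{name}\n{seq[:tail_len]}\n+\n{qual[:tail_len]}\n")
            last.write(f"{name}\n{seq[-tail_len:]}\n+\n{qual[-tail_len:]}\n")
            n_reads += 1
    return n_reads
```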

Step 2: Scanner

This step of the scNanoGPS pipeline is executed by a Python script called “scanner.py”. The script scans each read for both the TruSeq Read 1 adapter and the polyA tail; scanning for other sequence modules is optional. To speed up scanning, only the first and last 100 nucleotides of each read are searched for TruSeq Read 1 and polyA. After recognition, Scanner extracts the CBs and UMIs, which are flanked by the TruSeq Read 1 and poly(A/T)n sequence blocks. Scanner then outputs two files. One is a processed FastQ file holding the insert sequences without the TruSeq Read 1, CB, and UMI sequences; the CBs of individual reads are moved to the read names as tags. The other is a table named “barcode_list.tsv”, which stores read information including read names, CBs, UMIs, and other fields.
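The extraction logic described above can be sketched as follows. This is a simplified illustration, not the scanner.py implementation: it uses exact string matching where a real scanner would need to tolerate sequencing errors, and the function names, the `cb_len=16` and `umi_len=12` defaults, and the `min_poly_t` threshold are all assumptions. The adapter sequence is supplied by the caller.

```python
import re

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_cb_umi(read, adapter, cb_len=16, umi_len=12,
                   window=100, min_poly_t=10):
    """Return (cb, umi, insert), or None if the expected layout is absent.

    Expected layout: adapter, CB, UMI, poly(T) run, insert.
    """
    start = read[:window].find(adapter)  # exact match only (simplification)
    if start == -1:
        return None
    after_adapter = start + len(adapter)
    barcode_end = after_adapter + cb_len + umi_len
    # Require a poly(T) run immediately after the UMI.
    if re.match("T{%d,}" % min_poly_t, read[barcode_end:]) is None:
        return None
    cb = read[after_adapter:after_adapter + cb_len]
    umi = read[after_adapter + cb_len:barcode_end]
    # The poly(T) run is left in the insert in this sketch.
    return cb, umi, read[barcode_end:]


def scan_read(read, adapter, **kwargs):
    """Try the read as given, then its reverse complement."""
    hit = extract_cb_umi(read, adapter, **kwargs)
    if hit is None:
        hit = extract_cb_umi(reverse_complement(read), adapter, **kwargs)
    return hit
```

Trying the reverse complement covers reads sequenced from the opposite strand, where the adapter and barcode sit at the end of the read rather than the start.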

Step 3: Assigner

This step of the scNanoGPS pipeline is executed by a Python script called “assigner.py”. The script is designed for CB collapsing and for estimating the optimal number of CBs without guidance from 10X short-read sequencing data or any CB whitelist. To estimate the optimal number of CBs, we use an edge detection strategy to find the point where the signal drops dramatically (Fig. 1b). In detail, Assigner first counts the supporting UMIs for every CB and sorts the CB list by UMI count in decreasing order. It then computes the partial derivatives (slopes) per CB on the log10 scale and takes the median slope change per 0.001 log10 tick. The position of the maximal median log10 slope change is selected as the crude anchor for the following steps. To capture the full extent of the signal drop and include more useful CBs, we allow 10% more signal on the log10 scale.

Next, the script collapses CBs that have similar sequences. A previous study showed that the most accurate criteria for CB and UMI collapsing in Illumina samples are Levenshtein distances (LD) of three and two, respectively. Here we use an LD of two to merge similar CBs, as a Refinery Local Optimization. A list of representative CBs with sufficient supporting reads is then generated. Alternatively, you can force the number of cell barcodes by using the "forced_no" parameter.
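The edge detection idea can be illustrated with a deliberately simplified sketch. It is not the assigner.py algorithm: instead of per-CB slopes with medians per 0.001 log10 tick, it bins ranked barcodes on a coarse log10(rank) grid, takes the median log10 UMI count per bin, and returns the last rank before the largest drop between neighboring bins. The 10% allowance and the CB merging step are omitted. The function name `estimate_cell_number` and the `tick` default are hypothetical.

```python
import math
from statistics import median


def estimate_cell_number(umi_counts, tick=0.1):
    """Estimate how many barcodes precede the sharpest drop in UMI support."""
    ranked = sorted((c for c in umi_counts if c > 0), reverse=True)
    # Group barcodes into bins of equal width on the log10(rank) axis.
    bins = {}
    for rank, count in enumerate(ranked, start=1):
        key = int(math.log10(rank) / tick)
        bins.setdefault(key, []).append((rank, math.log10(count)))
    keys = sorted(bins)
    best_drop, best_rank = 0.0, len(ranked)
    for left, right in zip(keys, keys[1:]):
        left_median = median(value for _, value in bins[left])
        right_median = median(value for _, value in bins[right])
        drop = left_median - right_median
        if drop > best_drop:
            best_drop = drop
            best_rank = bins[left][-1][0]  # last rank before the drop
    return best_rank
```

The resolution of the estimate is limited by the bin width, so a coarse `tick` places the cut only approximately; a finer tick gives a sharper cut but is more sensitive to noise in the counts.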

Step 4: Curator

This step of the scNanoGPS pipeline is executed by a Python script called “curator.py”. The script performs demultiplexing, filtering, reference genome mapping and re-mapping, and UMI collapsing.

The master FastQ file of all cells is demultiplexed, according to the true CB list determined by Assigner, into single cell FastQ files each representing one cell. Curator then maps the individual FastQ files onto the given reference genome with Minimap2 in splice mode. Chimeric reads spanning different chromosomes are filtered out in this step (fusion gene detection functionality is under development). Next, Curator scans the UMIs of the full-length reads in order of their mapped genomic positions. UMIs that are within 2 LD of each other and map to the same genomic coordinates are considered the same UMI barcode. To accommodate possible small indels (<5bp) that cause minor drift in mapping coordinates, Curator allows differences of up to 5bp to buffer these sequencing errors. To enable parallel computing, reads are scanned in batches based on their genomic coordinates. Reads that share the same UMI are collapsed into consensus sequences of individual molecules using the SPOA software. Finally, the consensus sequences of single cells are re-mapped onto the reference genome with Minimap2 in splice mode. A portion of reads are singletons, i.e., the only read carrying their UMI, and these have already been mapped in the earlier step. The singleton BAMs are merged with the consensus BAMs to form the final BAMs as curated data.
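The UMI grouping rule described above (UMIs within 2 LD, mapping positions within 5bp) can be sketched as below. This is an illustration under stated assumptions, not the curator.py code: each read is reduced to a hypothetical (name, chromosome, start, UMI) tuple, only the mapping start is compared, and grouping is greedy against the first read of each group. Consensus generation with SPOA is not shown.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def group_reads_by_umi(reads, max_ld=2, max_shift=5):
    """Group (name, chrom, start, umi) tuples that likely share one molecule."""
    groups = []  # each group keeps the chrom, start, and UMI of its first read
    for name, chrom, start, umi in sorted(reads, key=lambda r: (r[1], r[2])):
        for group in groups:
            if (group["chrom"] == chrom
                    and abs(group["start"] - start) <= max_shift
                    and levenshtein(group["umi"], umi) <= max_ld):
                group["names"].append(name)
                break
        else:
            # No compatible group found: this read starts a new one.
            groups.append({"chrom": chrom, "start": start,
                           "umi": umi, "names": [name]})
    return [group["names"] for group in groups]
```

Groups with more than one read correspond to the molecules that would be collapsed into a consensus sequence, while single-read groups correspond to the singletons mentioned above.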

Step 5: Reporter

Lastly, scNanoGPS contains a set of reporter scripts for generating multi-omics profiles from the same single cells with Nanopore long-read sequencing data. This version of scNanoGPS detects gene expression, isoform, and single nucleotide variation (SNV) profiles by using featureCounts, LIQA, and Longshot, respectively.

5.1 Single cell gene expression profile

5.2 Single cell isoform profile

5.3 Single cell SNV profile

5.4 Generate final summary table