sjackman commented 3 years ago

To Do

[x] Literature review of spliced short read alignment methods (is anyone using Minimap2?)
[x] Review Rust libraries for alignment and minimizers
[x] Make stand-alone command line Rust aligner
[x] Index the genome using an FMD index
[x] Output the chromosome and position in PAF or SAM format
[x] Align reads to each overlapping transcript using bio::alignment::pairwise::banded
[x] Support mismatches by seed-and-extend alignment for transcriptome
[x] Extend alignment of all seed matches and select best alignment
[x] Lift transcriptome alignment coordinates back to genome coordinates
[x] Integrate Thermite into Cell Ranger (aligner = "thermite")
[x] Support mismatches by seed-and-extend alignment for genome
[ ] Reduce alignment differences between Thermite and STAR
[ ] Count the number of occurrence (Occ) array lookups and cache misses
[ ] Benchmark and measure time spent finding seeds, DP alignment, I/O, everything else
[x] Reduce memory usage of mapping/aligning
[ ] Reduce memory usage of indexing (maybe)
[ ] Implement chaining of alignments over intron gaps to discover novel splicing, for cellranger count include-introns (maybe)
[ ] Use minimizers rather than SMEM for seeds (maybe)
- Use ntHash or a similarly fast hashing algorithm. The author of ntHash is a former colleague, Hamid, and new 10x employee Luiz Irber ported it to Rust. https://github.com/luizirber/nthash

Testing

[x] Create a test data set using the human mitochondrion
[x] Create a test data set using the human chr21
[x] Create a test data set using the complete human genome (using Cell Ranger)
[x] Compute fraction of Thermite and STAR alignments that are identical
[x] Compute fraction of Thermite and STAR alignments that intersect
[ ] Compare the Thermite and STAR gene counts
[ ] Run Thermite on a test battery using Plygo
[ ] Run Thermite and Cell Ranger on the synthetic RNA test data set
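
The "identical" and "intersect" fractions in the checklist above could be computed along these lines. This is only a sketch: the `Aln` record, the half-open coordinate convention, and the function names are assumptions made for illustration, not Thermite's or STAR's actual data structures. It assumes the two aligners' primary alignments have already been paired up by read name.

```rust
/// A primary alignment reduced to the fields needed for comparison.
/// Hypothetical record type, not taken from Thermite or STAR.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Aln {
    pub chrom: String,
    pub start: u64, // 0-based, inclusive
    pub end: u64,   // 0-based, exclusive
    pub cigar: String,
}

/// Two alignments are identical when chromosome, span, and CIGAR all agree.
pub fn is_identical(a: &Aln, b: &Aln) -> bool {
    a == b
}

/// Two alignments intersect when they share at least one reference base.
pub fn intersects(a: &Aln, b: &Aln) -> bool {
    a.chrom == b.chrom && a.start < b.end && b.start < a.end
}

/// Returns (fraction identical, fraction intersecting) over paired alignments.
/// Each pair holds the two aligners' alignments of the same read.
pub fn concordance(pairs: &[(Aln, Aln)]) -> (f64, f64) {
    if pairs.is_empty() {
        return (0.0, 0.0);
    }
    let mut identical = 0usize;
    let mut intersecting = 0usize;
    for (a, b) in pairs {
        if is_identical(a, b) {
            identical += 1;
        }
        if intersects(a, b) {
            intersecting += 1;
        }
    }
    let n = pairs.len() as f64;
    (identical as f64 / n, intersecting as f64 / n)
}
```

Reads aligned by only one of the two aligners are not covered here and would need to be counted separately.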

Purpose

Alignment for quantification, not discovery. Don't need to find novel splice junctions. Transcriptome is an input.

Caveats

Customers that use the BAM file for purposes other than quantification. Could run a different aligner.
Customers with species with poorly annotated transcriptomes.

Requirements

Align spliced short reads (~90 to 150 bp) to a reference genome and transcriptome
Assign reads to genes (GX:Z BAM tag)
Produce a BAM file
Exceed sensitivity, specificity, and speed of STAR
Assign reads to transcripts (TX:Z BAM tag)

Stretch Goals

Align spliced long reads to a reference genome and transcriptome (keep it in mind when making design choices)
EM to redistribute multimapped reads (well beyond scope, but keep it in mind when making design choices, particularly regarding multimapped reads)

Implementation

Use a minimizer index
Index the reference genome
Index the reference transcriptome to improve sensitivity of spliced reads, particularly with short exons
Indexing is ideally fast enough to index on the fly, to support custom references more conveniently, but able to make use of a pre-computed index if available for supported species like human and mouse
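
For reference, a minimal sketch of (w, k)-minimizer selection, the building block of a minimizer index. Everything here is an assumption made for illustration: the function names, the 2-bit encoding, and the hash, which is a generic 64-bit mixer standing in for ntHash rather than ntHash itself. It also considers the forward strand only, whereas a real index would probably use canonical k-mers.

```rust
/// Encode a nucleotide as 2 bits; returns None for anything other than ACGT.
fn encode(b: u8) -> Option<u64> {
    match b.to_ascii_uppercase() {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Stand-in 64-bit mixer (splitmix64 finalizer). This is NOT ntHash.
fn mix(mut x: u64) -> u64 {
    x = (x ^ (x >> 30)).wrapping_mul(0xbf58476d1ce4e5b9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94d049bb133111eb);
    x ^ (x >> 31)
}

/// Returns (position, hash) of the minimizer of every window of `w`
/// consecutive k-mers, reporting each selected position once.
/// K-mers containing non-ACGT bases are never selected.
pub fn minimizers(seq: &[u8], k: usize, w: usize) -> Vec<(usize, u64)> {
    assert!(k >= 1 && k <= 32 && w >= 1);
    if seq.len() < k {
        return Vec::new();
    }
    // Hash every k-mer; None marks a k-mer with an ambiguous base.
    let hashes: Vec<Option<u64>> = seq
        .windows(k)
        .map(|kmer| {
            kmer.iter()
                .try_fold(0u64, |acc, &b| encode(b).map(|c| (acc << 2) | c))
                .map(mix)
        })
        .collect();
    let mut out: Vec<(usize, u64)> = Vec::new();
    // Sequences with fewer than `w` k-mers produce no windows and no minimizers.
    for (start, window) in hashes.windows(w).enumerate() {
        // Smallest hash wins; ties go to the leftmost position.
        let best = window
            .iter()
            .enumerate()
            .filter_map(|(i, h)| h.map(|h| (h, start + i)))
            .min();
        if let Some((h, pos)) = best {
            if out.last().map_or(true, |&(p, _)| p != pos) {
                out.push((pos, h));
            }
        }
    }
    out
}
```

The useful property for seeding is that a read and the reference select the same minimizer from any window of w consecutive k-mers that they share, so only the selected k-mers need to be stored in the index.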

Benchmarking

Identify benchmark RNA data sets: simulated data, synthetic data, and real data
Measure sensitivity and specificity
Measure performance and resource requirements
The vast majority of alignments ought to be identical to STAR
Differences from STAR should have an explanation
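
One way the sensitivity and specificity measurements could be defined on simulated data is sketched below. The definitions are assumptions, not something settled in this plan: sensitivity is the fraction of reads simulated from the reference that are placed within a tolerance of their true position, and specificity is the fraction of decoy reads (reads that should not align at all) that are left unaligned. The `Locus` and `Outcome` types are hypothetical.

```rust
/// A position on the reference. Hypothetical type for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Locus {
    pub chrom_id: u32,
    pub pos: u64,
}

/// Truth and aligner call for one simulated read.
/// `truth == None` marks a decoy read that should stay unaligned.
pub struct Outcome {
    pub truth: Option<Locus>,
    pub called: Option<Locus>,
}

/// Returns (sensitivity, specificity); each is None when its denominator is zero.
pub fn sensitivity_specificity(
    outcomes: &[Outcome],
    tolerance: u64,
) -> (Option<f64>, Option<f64>) {
    let (mut tp, mut positives, mut tn, mut negatives) = (0u64, 0u64, 0u64, 0u64);
    for o in outcomes {
        match (o.truth, o.called) {
            (Some(t), called) => {
                positives += 1;
                // A true positive is a call on the right chromosome within tolerance.
                if let Some(c) = called {
                    if c.chrom_id == t.chrom_id && c.pos.abs_diff(t.pos) <= tolerance {
                        tp += 1;
                    }
                }
            }
            (None, called) => {
                negatives += 1;
                // A true negative is a decoy read that was left unaligned.
                if called.is_none() {
                    tn += 1;
                }
            }
        }
    }
    let ratio = |num: u64, den: u64| {
        if den == 0 {
            None
        } else {
            Some(num as f64 / den as f64)
        }
    };
    (ratio(tp, positives), ratio(tn, negatives))
}
```

Under these definitions a misplaced read lowers sensitivity in the same way as an unaligned read; a benchmark that wants to separate the two cases would need an extra counter.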

sjackman commented 3 years ago

Algorithm and Implementation Proposal

Identify all SMEMs between the read and reference genome (using Rust-bio FMD index)
Identify the annotated transcripts that overlap those SMEMs
Align the read to those transcripts (using Rust-bio pairwise alignment)
Extend the genomic alignment of SMEMs that are non-transcriptomic (using Rust-bio pairwise alignment)
Retain the alignments with the maximal alignment score, discard the rest
Assign the read to the transcripts with the maximal alignment score, possibly none if the best-scoring alignment was non-transcriptomic
Stop if BAM file is not required (cellranger count --no-bam)
Lift over transcript alignment coordinates to spliced genomic alignment coordinates to produce the BAM file
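
The last step, lifting a transcript alignment over to a spliced genomic alignment, can be sketched as follows. This is a simplified illustration under stated assumptions, not Thermite's implementation: the transcript is on the plus strand, its exons are sorted, non-overlapping, and non-adjacent, and the alignment to the transcript is gapless. The `Exon` type and `lift_to_genome` function are hypothetical names.

```rust
/// One exon in genomic coordinates, 0-based and half-open.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Exon {
    pub start: u64,
    pub end: u64,
}

/// Lifts the transcript interval [t_start, t_end) to the genome.
/// Returns the 0-based genomic start and a CIGAR string with M blocks
/// joined by N (skipped region) operations, or None if the interval is
/// empty or runs past the end of the transcript.
pub fn lift_to_genome(exons: &[Exon], t_start: u64, t_end: u64) -> Option<(u64, String)> {
    if t_start >= t_end {
        return None;
    }
    let mut offset = 0u64; // transcript coordinate of the current exon's first base
    let mut genome_start: Option<u64> = None;
    let mut prev_block_end = 0u64;
    let mut cigar = String::new();
    for exon in exons {
        let ex_t_start = offset;
        let ex_t_end = offset + (exon.end - exon.start);
        offset = ex_t_end;
        // Part of the alignment that falls inside this exon, in transcript coordinates.
        let lo = t_start.max(ex_t_start);
        let hi = t_end.min(ex_t_end);
        if lo >= hi {
            continue;
        }
        let g_lo = exon.start + (lo - ex_t_start);
        let g_hi = exon.start + (hi - ex_t_start);
        if genome_start.is_none() {
            genome_start = Some(g_lo);
        } else {
            // The intron between the previous block and this one.
            cigar.push_str(&format!("{}N", g_lo - prev_block_end));
        }
        cigar.push_str(&format!("{}M", g_hi - g_lo));
        prev_block_end = g_hi;
    }
    if t_end > offset {
        return None; // alignment extends past the transcript end
    }
    genome_start.map(|s| (s, cigar))
}
```

For example, with exons at [100, 110), [200, 220), and [300, 305), the transcript interval [5, 25) lifts to genomic start 105 with CIGAR 5M90N15M. SAM positions are 1-based, so the returned start would need 1 added before being written out. Mismatches and indels in the transcript alignment would need their CIGAR operations carried through, which this sketch omits.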

Some interesting possible variations on this proposal are…

including transcript sequences in the FMD index
using minimizers rather than FMD index to identify the set of candidate transcripts

Daniel-Liu-c0deb0t commented 3 years ago

Future Work

Allow antisense alignments to transcripts in intron mode
Replace TranscriptAnnotator in Cell Ranger so that it uses the annotation info Thermite produces, and allow BAM output to be disabled
Take into account the TSO and polyA trimming that Cell Ranger does when producing alignments used in CI (Thermite directly aligns untrimmed reads)
Implement DP-based chaining to discover novel splicing
Optimize for when intron mode is turned off. With better Cell Ranger integration in no-BAM mode, it may be possible to compute only gene mappings for each read.
Only output one alignment per transcript (no need to output more for feature barcode counts).
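
As a rough illustration of what DP-based chaining over intron gaps might look like, here is a quadratic-time sketch. The `Anchor` type, the scoring scheme, and the `INTRON_COST` constant are all hypothetical choices made for this example; a real implementation would need tuned penalties, splice-site awareness, and a faster algorithm.

```rust
/// An exact match between read and reference. Hypothetical type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Anchor {
    pub read_pos: u64,
    pub ref_pos: u64,
    pub len: u64,
}

/// Flat cost for opening an intron (assumed value, not tuned).
const INTRON_COST: i64 = 1;

/// Returns the indices (into `anchors`) of the best-scoring colinear chain,
/// in reference order. Reference gaps up to `max_intron` longer than the
/// read gap are treated as introns and charged a flat cost.
pub fn chain(anchors: &[Anchor], max_intron: u64) -> Vec<usize> {
    if anchors.is_empty() {
        return Vec::new();
    }
    let mut order: Vec<usize> = (0..anchors.len()).collect();
    order.sort_by_key(|&i| (anchors[i].ref_pos, anchors[i].read_pos));
    let n = order.len();
    let mut score = vec![0i64; n];
    let mut prev: Vec<Option<usize>> = vec![None; n];
    for j in 0..n {
        let b = anchors[order[j]];
        score[j] = b.len as i64; // chain consisting of this anchor alone
        for i in 0..j {
            let a = anchors[order[i]];
            let (a_read_end, a_ref_end) = (a.read_pos + a.len, a.ref_pos + a.len);
            // Anchors must be colinear and non-overlapping on both sequences.
            if b.read_pos < a_read_end || b.ref_pos < a_ref_end {
                continue;
            }
            let read_gap = b.read_pos - a_read_end;
            let ref_gap = b.ref_pos - a_ref_end;
            let intron = ref_gap.saturating_sub(read_gap);
            if intron > max_intron {
                continue;
            }
            // Unaligned read bases cost 1 each; an intron costs a flat amount.
            let penalty = read_gap as i64 + if intron > 0 { INTRON_COST } else { 0 };
            let candidate = score[i] + b.len as i64 - penalty;
            if candidate > score[j] {
                score[j] = candidate;
                prev[j] = Some(i);
            }
        }
    }
    // Backtrack from the highest-scoring chain end.
    let mut best = 0;
    for j in 1..n {
        if score[j] > score[best] {
            best = j;
        }
    }
    let mut result = Vec::new();
    let mut cur = Some(best);
    while let Some(j) = cur {
        result.push(order[j]);
        cur = prev[j];
    }
    result.reverse();
    result
}
```

The anchors would come from seed matches such as SMEMs or minimizer hits; a chain that joins two anchors across a large reference gap is a candidate novel splice junction.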

10XGenomics / thermite

Plan #1