To Do
[x] Literature review of spliced short read alignment methods (is anyone using Minimap2?)
[x] Review Rust libraries for alignment and minimizers
[x] Make stand-alone command line Rust aligner
[x] Index the genome using an FMD index
[x] Output the chromosome and position in PAF or SAM format
[x] Align reads to each overlapping transcript using bio::alignment::pairwise::banded
[x] Support mismatches by seed-and-extend alignment for transcriptome
[x] Extend alignment of all seed matches and select best alignment
[x] Lift transcriptome alignment coordinates back to genome coordinates
[x] Integrate Thermite into Cell Ranger (aligner = "thermite")
[x] Support mismatches by seed-and-extend alignment for genome
[ ] Reduce alignment differences between Thermite and STAR
[ ] Count the number of occurrence (Occ) array lookups and cache misses
[ ] Benchmark and measure time spent finding seeds, DP alignment, I/O, everything else
[x] Reduce memory usage of mapping/aligning
[ ] Reduce memory usage of indexing (maybe)
[ ] Implement chaining of alignments over intron gaps to discover novel splicing, for cellranger count include-introns (maybe)
[ ] Use minimizers rather than SMEM for seeds (maybe)
Use ntHash or a similarly fast hashing algorithm. The author of ntHash is a former colleague, Hamid, and new 10x employee Luiz Irber ported it to Rust. https://github.com/luizirber/nthash
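For reference, minimizer seeding could look roughly like the sketch below. It is not Thermite code. The function names, the 2-bit encoding, and the choice to skip windows containing non-ACGT bases are assumptions made for illustration, and the hash is a generic 64-bit mixer standing in for ntHash. A real implementation would use a rolling hash and a monotone queue rather than rescanning every window.

```rust
/// Stand-in 64-bit mixer (the splitmix64 finalizer). This is NOT ntHash;
/// it only gives k-mers a pseudo-random order for the sketch.
fn mix64(mut x: u64) -> u64 {
    x ^= x >> 30;
    x = x.wrapping_mul(0xbf58476d1ce4e5b9);
    x ^= x >> 27;
    x = x.wrapping_mul(0x94d049bb133111eb);
    x ^ (x >> 31)
}

/// 2-bit encode a base; returns None for anything other than A, C, G, T.
fn encode(base: u8) -> Option<u64> {
    match base {
        b'A' | b'a' => Some(0),
        b'C' | b'c' => Some(1),
        b'G' | b'g' => Some(2),
        b'T' | b't' => Some(3),
        _ => None,
    }
}

/// Return the (position, hash) of the minimizer of every window of `w`
/// consecutive k-mers, reporting each minimizer once. Ties are broken
/// by taking the leftmost k-mer. Windows containing a k-mer with a
/// non-ACGT base are skipped (an assumption of this sketch).
pub fn minimizers(seq: &[u8], k: usize, w: usize) -> Vec<(usize, u64)> {
    assert!(k >= 1 && k <= 32 && w >= 1);
    // Hash every k-mer; None marks k-mers that contain a non-ACGT base.
    let hashes: Vec<Option<u64>> = seq
        .windows(k)
        .map(|kmer| {
            let mut code = 0u64;
            for &b in kmer {
                code = (code << 2) | encode(b)?;
            }
            Some(mix64(code))
        })
        .collect();

    let mut out: Vec<(usize, u64)> = Vec::new();
    if hashes.len() < w {
        return out;
    }
    // Naive O(n * w) scan over all windows of w k-mers.
    for start in 0..=hashes.len() - w {
        let window = &hashes[start..start + w];
        if window.iter().any(|h| h.is_none()) {
            continue;
        }
        let (offset, hash) = window
            .iter()
            .enumerate()
            .map(|(i, h)| (i, h.unwrap()))
            .min_by_key(|&(i, h)| (h, i))
            .unwrap();
        let hit = (start + offset, hash);
        // Adjacent windows often share a minimizer; report it only once.
        if out.last() != Some(&hit) {
            out.push(hit);
        }
    }
    out
}
```

Calling `minimizers(read, 15, 10)` on a read would yield the seed positions to look up in a minimizer index. The values 15 and 10 are placeholders, not tuned parameters.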
Testing
[x] Create a test data set using the human mitochondrion
[x] Create a test data set using the human chr21
[x] Create a test data set using the complete human genome (using Cell Ranger)
[x] Compute fraction of Thermite and STAR alignments that are identical
[x] Compute fraction of Thermite and STAR alignments that intersect
[ ] Compare the Thermite and STAR gene counts
[ ] Run Thermite on a test battery using Plygo
[ ] Run Thermite and Cell Ranger on the synthetic RNA test data set
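The identical and intersecting fractions above could be computed along these lines. This is a minimal sketch, not the actual comparison script. The `Aln` record, its fields, and the rule that "identical" means every field matches are assumptions, and the input is assumed to be pairs of (Thermite, STAR) primary alignments for reads that both aligners placed.

```rust
/// Hypothetical minimal alignment record; coordinates are 0-based, half-open.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Aln {
    pub chrom: String,
    pub start: u64,
    pub end: u64,
    pub reverse: bool,
    pub cigar: String,
}

impl Aln {
    pub fn new(chrom: &str, start: u64, end: u64, reverse: bool, cigar: &str) -> Self {
        Aln {
            chrom: chrom.to_string(),
            start,
            end,
            reverse,
            cigar: cigar.to_string(),
        }
    }
}

/// Two alignments intersect if they share at least one reference base.
pub fn intersects(a: &Aln, b: &Aln) -> bool {
    a.chrom == b.chrom && a.start < b.end && b.start < a.end
}

/// Given (Thermite, STAR) pairs for reads aligned by both tools, return
/// (fraction identical, fraction intersecting), or None for empty input.
pub fn concordance(pairs: &[(Aln, Aln)]) -> Option<(f64, f64)> {
    if pairs.is_empty() {
        return None;
    }
    let n = pairs.len() as f64;
    let identical = pairs.iter().filter(|(t, s)| t == s).count() as f64;
    let intersecting = pairs.iter().filter(|(t, s)| intersects(t, s)).count() as f64;
    Some((identical / n, intersecting / n))
}
```

Reads aligned by only one of the two tools are outside this sketch and would need to be counted separately.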
Purpose
Alignment for quantification, not discovery. Don't need to find novel splice junctions. Transcriptome is an input.
Caveats
Customers who use the BAM file for purposes other than quantification (they could run a different aligner).
Customers with species with poorly annotated transcriptomes.
Requirements
Align spliced short reads (~90 to 150 bp) to a reference genome and transcriptome
Assign reads to genes (GX:Z BAM tag)
Produce a BAM file
Exceed sensitivity, specificity, and speed of STAR
Assign reads to transcripts (TX:Z BAM tag)
Stretch Goals
Align spliced long reads to a reference genome and transcriptome (keep it in mind when making design choices)
EM to redistribute multimapped reads (well beyond scope, but keep it in mind when making design choices, particularly regarding multimapped reads)
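To keep the EM stretch goal concrete when making design choices, here is a rough sketch of the idea. It is a simplification under stated assumptions: each read is reduced to the list of transcript indices it is compatible with, transcript length and alignment scores are ignored, and the function name and signature are hypothetical. What it implies for the design is that the aligner would need to retain the full set of compatible transcripts for each multimapped read.

```rust
/// Estimate relative transcript abundances by EM.
/// `reads[i]` lists the transcript indices read i is compatible with
/// (every index must be < n_tx). Returns one abundance per transcript,
/// summing to 1 when at least one read is assignable.
pub fn em_abundances(reads: &[Vec<usize>], n_tx: usize, iterations: usize) -> Vec<f64> {
    // Start from a uniform abundance estimate.
    let mut theta = vec![1.0 / n_tx as f64; n_tx];
    for _ in 0..iterations {
        // E-step: split each read across its compatible transcripts
        // in proportion to the current abundance estimates.
        let mut counts = vec![0.0f64; n_tx];
        for compat in reads {
            let total: f64 = compat.iter().map(|&t| theta[t]).sum();
            if total <= 0.0 {
                continue; // unassignable read
            }
            for &t in compat {
                counts[t] += theta[t] / total;
            }
        }
        // M-step: renormalize the fractional counts into abundances.
        let assigned: f64 = counts.iter().sum();
        if assigned <= 0.0 {
            break;
        }
        for (th, c) in theta.iter_mut().zip(&counts) {
            *th = c / assigned;
        }
    }
    theta
}
```

With two reads unique to transcript 0 and one read shared between transcripts 0 and 1, the shared read migrates toward transcript 0 over the iterations.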
Implementation
Use a minimizer index
Index the reference genome
Index the reference transcriptome to improve sensitivity of spliced reads, particularly with short exons
Indexing is ideally fast enough to index on the fly, to support custom references more conveniently, but able to make use of a pre-computed index if available for supported species like human and mouse
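Aligning against a transcriptome index means each hit has to be lifted back to genome coordinates. A minimal sketch of that step follows, assuming a forward-strand transcript, exons given as sorted, non-overlapping, 0-based half-open genomic intervals, and an ungapped alignment. The function names are hypothetical, and reverse-strand transcripts would need the offsets mirrored.

```rust
/// Lift a transcript interval [tx_start, tx_end) to genomic blocks.
/// `exons` are (start, end) genomic intervals in transcript order.
/// Returns None if the interval is empty or runs past the transcript end.
pub fn lift_to_genome(
    exons: &[(u64, u64)],
    tx_start: u64,
    tx_end: u64,
) -> Option<Vec<(u64, u64)>> {
    if tx_start >= tx_end {
        return None;
    }
    let mut blocks = Vec::new();
    let mut offset = 0u64; // transcript coordinate of the current exon's first base
    for &(ex_start, ex_end) in exons {
        let len = ex_end - ex_start;
        // Clip the transcript interval to this exon.
        let lo = tx_start.max(offset);
        let hi = tx_end.min(offset + len);
        if lo < hi {
            blocks.push((ex_start + (lo - offset), ex_start + (hi - offset)));
        }
        offset += len;
    }
    if tx_end > offset {
        return None; // interval extends beyond the transcript
    }
    Some(blocks)
}

/// Render genomic blocks as a SAM CIGAR string, using N for introns.
/// Assumes an ungapped alignment (no insertions, deletions, or clipping).
pub fn spliced_cigar(blocks: &[(u64, u64)]) -> String {
    let mut cigar = String::new();
    for (i, &(start, end)) in blocks.iter().enumerate() {
        if i > 0 {
            cigar.push_str(&format!("{}N", start - blocks[i - 1].1));
        }
        cigar.push_str(&format!("{}M", end - start));
    }
    cigar
}
```

For exons `[(100, 150), (200, 230), (300, 400)]`, the transcript interval 40..90 lifts to three blocks whose CIGAR is `10M50N30M70N10M`, with the alignment starting at genomic position 140.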
Benchmarking
Identify benchmark RNA data sets: simulated data, synthetic data, and real data
Measure sensitivity and specificity
Measure performance and resource requirements
The vast majority of alignments ought to be identical to STAR
Allow antisense alignments to transcripts in intron mode
Replace TranscriptAnnotator in Cell Ranger to use the annotation info that Thermite produces, and allow BAM output to be disabled
Take into account the TSO and polyA trimming that Cell Ranger does when producing the alignments used in CI (Thermite aligns untrimmed reads directly)
Implement DP-based chaining to discover novel splicing
Optimize for when intron mode is turned off. With better Cell Ranger integration in no-BAM mode, it may be possible to compute only the gene mappings for each read.
Output only one alignment per transcript (no need to output more for feature barcode counts).