alexdobin / STAR

RNA-seq aligner
MIT License
1.87k stars 506 forks source link

(STARsolo's) UR and UB SAM flags differ in length #1909

Open johnmoll opened 1 year ago

johnmoll commented 1 year ago

Hi,

First of all, thanks for your amazing tool(s), I find STAR and STARsolo extraordinarily useful. This is to report that STARsolo's UB and UR don't have the same length when processing BDRhapsody data. The manual reads (page 13):

CR CY UR UY : sequences and quality scores of cell barcodes and UMIs for the solo* demulti-
plexing, not error corrected.
CB UB : error-corrected cell barcodes and UMIs for solo* demultiplexing.

so I expected UR and UB to share length, perhaps I misunderstood. If they were intended to share length, I might have found a bug. The snippet below shows how to generate an output where their lengths don't match.

STAR 2.7.10b, compiled from source, running on Ubuntu 18.04.6 LTS.

#!/bin/bash

NTHREADS=5

mkdir sandbox; cd $_

## these are whitelists (BDRhapsody)
wget https://teichlab.github.io/scg_lib_structs/data/BD_CLS1.txt
wget https://teichlab.github.io/scg_lib_structs/data/BD_CLS2.txt
wget https://teichlab.github.io/scg_lib_structs/data/BD_CLS3.txt

## fake, short genome
cat << EOF > genome.fa
>ERCC-00002
TCCAGATTACTTCCATTTCCGCCCAAGCTGCTCACAGTATACGGGCGTCG
GCATCCAGACCGTCGGCTGATCGTGGTTTTACTAGGCTAGACTAGCGTAC
GAGCACTATGGTCAGTAATTCCTGGAGGAATAGGTACCAAGAAAAAAACG
AACCTTTGGGTTCCAGAGCTGTACGGTCGCACTGAACTCGGATAGGTCTC
AGAAAAACGAAATATAGGCTTACGGTAGGTCCGAATGGCACAAAGCTTGT
TCCGTTAGCTGGCATAAGATTCCATGCCTAGATGTGATACACGTTTCTGG
AAACTGCCTCGTCATGCGACTGTTCCCCGGGGTCAGGGCCGCTGGTATTT
GCTGTAAAGAGGGGCGTTGAGTCCGTCCGACTTCACTGCCCCCTTTCAGC
CTTTTGGGTCCTGTATCCCAATTCTCAGAGGTCCCGCCGTACGCTGAGGA
CCACCTGAAACGGGCATCGTCGCTCTTCGTTGTTCGTCGACTTCTAGTGT
GGAGACGAATTGCCAGAATTATTAACTGCGCAGTTAGGGCAGCGTCTGAG
GAAGTTTGCTGCGGTTTCGCCTTGACCGCGGGAAGGAGACATAACGATAG
CGACTCTGTCTCAGGGGATCTGCATATGTTTGCAGCATACTTTAGGTGGG
CCTTGGCTTCCTTCCGCAGTCAAAACCGCGCAATTATCCCCGTCCTGATT
TACTGGACTCGCAACGTGGGTCCATCAGTTGTCCGTATACCAAGACGTCT
AAGGGCGGTGTACACCCTTTTGAGCAATGATTGCACAACCTGCGATCACC
TTATACAGAATTATCAATCAAGCTCCCCGAGGAGCGGACTTGTAAGGACC
GCCGCTTTCGCTCGGGTCTGCGGGTTATAGCTTTTCAGTCTCGACGGGCT
AGCACACATCTGGTTGACTAGGCGCATAGTCGCCATTCACAGATTTGCTC
GGCAATCAGTACTGGTAGGCGTTAGACCCCGTGACTCGTGGCTGAACGGC
CGTACAACTCGACAGCCGGTGCTTGCGTTTTACCCTTAAAAAAAAAAAAA
AAAAAAAAAAA
>ERCC-00003
CAGCAGCGATTAAGGCAGAGGCGTTTGTATCTGCCATTATAAAGAAGTTT
CCTCCAGCAACTCCTTTCTTAATTCCAAACTTAGCTTCAGTTATAAATTC
CCCTCCCATGATTGGGATTTTATAAACTTTTCTTCCATATAATTCATCTT
TCTTCTCATAACCGTCTCCGAAAAACTTCAACTTAAATCCAACCTTTAAC
TGCTCATCAGCCATGTCTCCCACAGCATCAAAAATAGCAGTTGTTGGACA
TGTTAAGACACACTGCCCCAATCTCTCTAACATTTGATGCTCTAACTCTG
ACTTTTTAGGGTGGCATATCTGTATTATAAATCCTGGTCTTCCATCTGGT
GTTTTTGATGGAGGGACATATTTCTCAATTCCTGCTTCTGCTGGACACAT
TATAACTGAACAACCAAAACCTGTTGCCTCTGTAGCTGCAATCTTAGCCC
ACTTCTTTGTAGCTGCTGTTATTAAAACTCTTGAAACCCATATTGGGAAT
GCTTCTGCAAATGTATCTTCAATATATACTCCATTTATTTCCATAGTTTC
CCTCCATTAAGATTTTAACAATTATAGTTTATCTTAGGGGCTATTAATAT
CTTATCATTTGGTTTTTAATATTCGATAAATCCATAAATAAAAATATATC
AACAATAATTTTAAATAATCTAAGTATAGGTAATATAACAATTAAAAAGA
TTTAGAGGGATAGAATTGAACGGCATTAGGAGAATTGTTTTAGATATATT
GAAGCCGCATGAGCCAAAAATAACAGATATGGCATTAAAATTAACATCAT
TATCAAACATTGATGGGGTTAATATTACAGTCTATGAAATAGATAAAGAG
ACTGAGAATGTTAAAGTTACAATTGAAGGGAATAATTTAGATTTTGATGA
GATTCAGGAAATTATTGAAAGTTTGGGAGGGACTATTCACAGTATAGATG
AGGTTGTTGCAGGTAAAAAGATTATTGAAGAGTTAGAACACCACAAGATA
AAAAAAAAAAAAAAAAAAAAAAA
>ERCC-00004
TCTTGCTTCAACAATAACGTCTCTTTCAGAAGGCATTGGTATCTTTTCCC
CACTTCCAAGCATTTTTTCAACTAATCTTATGTTATTAACCATTTCCTTA
AATTCTTCTGGGTCTGCTGACAAAGCATGATCAGGACCTTCCATATTTTT
ATCTAAGGTAAAGTGCTTCTCAATAACATCCGCTCCTAAGGCAACAGAAA
CTACTGGGGCGAGTATTCCCAATGTATGGTCAGAATATCCCACAGGGATA
TTGAATATACTTTTCAAGGTTTTAATAGCGTTTAAATTGACATCTTCATA
AGGGGTTGGGTAAGATGAAATACAATGCAATAAAATAATATCCCTGCATC
CATTATTTTCTAAAACTTTAACTGCTTCCCAAATTTCCCCAATATCAGAC
ATTCCTGTAGATAAAATCACCGGCTTGCCTGTTTTTGCCACTTTTTCTAA
TAAGGGATAAAAGGTTAAATCACCAGAGGCAATTTTAAATCAGGCACATA
AAAAAAAAAAAAAAAAAAAAAAA
>ERCC-00009
CAATGATAGGCTAGTCTCGCGCAGTACATGGTAGTTCAGCCAATAGATGC
CTAGTACGCTGACGGCATTCAGAGTACGCTGATCGGCTTATGACGTATGT
GACGCAGCTCTTAGCGCAATGTATGTGCTGTTATCGAAGCCTATGGCTGA
GTATGTAACGCTATGGCGTGCTAGTCGTCTCATATACGTCTGATGACCTC
GTATCATGTTATAGGGCTGCGAACTGTCGATGATGGTCACGACTCTGTCG
ATAGCTGTGTGACTCATTCAGAAGGTGTGCAGCCTATATGATACGCAGTC
GCATCCTATCTTACGTGTCAGTACTATGTGTGAGTGCTCCGCCCTAGTGC
TGATGTATGCCCCATAGTGCTCAGTGGAGTCTCTCTTAGCATAGTGTCCG
CTCATACATTAGATGGACGGCTCATTAGTATCATCGTCGGCTGATATAGG
TCGTGGCTCCCTGTATATCGAGGTGAGTCTATCTGGATCAACGTCGCACT
ATGATGTGCAAAGTGTCGTCCATGTATAGACAGTGCGCGTATCATATAGG
ATGCGGCGATCTCATACAGCGTTACGGTCGCTGCGTACTGTATAAGGATG
CTCTGTGAACTGTCATCGGTCCGATCAATTAGTCTAGTGTGCGTTATTCA
GATCGAGTGAGTACATGATTCGTCAGTGTGGATCAATTACAGTTAGGCCG
CTGACACATTAGTAACGTCGGCAAGCACTTAGTCGTGTCGTAAGCCAGTG
TGTCGTGTCTTAGACGACTGTGTGTGATTCTCGAGCGATTTATACATCCG
TGACAGCGCTTATAGTGTGCTGACAGACTGGTTGGTTATCCAATGATCGA
CCTGGAGTCTAATATCTGACCACGCCTTGTAATCGTATGACACGCGCTTG
ACACGACTGAATCCAGCTTAAGAGCCCTGCAACGCGATATACAGGCGCTG
CTACCGATATAAAAAAAAAAAAAAAAAAAAAAAA
EOF

## fake, short canonical transcripts
echo -e 'ERCC-00002\tERCC\texon\t1\t1061\t.\t+\t.\tgene_id "ERCC-00002"; transcript_id "DQ459430";
ERCC-00003\tERCC\texon\t1\t1023\t.\t+\t.\tgene_id "ERCC-00003"; transcript_id "DQ516784";
ERCC-00004\tERCC\texon\t1\t523\t.\t+\t.\tgene_id "ERCC-00004"; transcript_id "DQ516752";
ERCC-00009\tERCC\texon\t1\t984\t.\t+\t.\tgene_id "ERCC-00009"; transcript_id "DQ668364";' > genome.gtf

## index
STAR --runThreadN $NTHREADS \
     --runMode genomeGenerate \
     --sjdbGTFfile genome.gtf \
     --genomeDir a_genome \
     --genomeSAindexNbases 4 \
     --sjdbOverhang 100 \
     --genomeFastaFiles genome.fa

## generate 100 R1s (cDNA) and R2s (CB + UMI)
seq 1 100 | awk ' {print "@"$0"\nTCTTGCTTCAACAATAACGTCTCTTTCAGAAGGCATTGGTATCTTTTCCCCACTTCCAAGCATTTTTTCAACTAATCTTATGTTATTAACCATTTCCTTAAATTCTTCTGGGTCTGCTGACAAAGCATGATCAGGACCTTCCATATTTTT\n+\nFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF";}' | gzip -c > r1.fq.gz

seq 1 100 | awk ' {print "@"$0"\nCTTGTACTAGTGATGTTCTCCAGACAGGCTACAGATTTGATGGTTTTTTTTTTTTTTTTT\n+\nFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF";}' | gzip -c > r2.fq.gz

## run starsolo; UB and UR reported explicitly, among other flags
STAR --runThreadN 5 \
     --genomeDir a_genome \
     --readFilesCommand zcat \
     --outFileNamePrefix mapping_test/ \
     --readFilesIn  r1.fq.gz r2.fq.gz  \
     --soloType CB_UMI_Complex \
     --soloAdapterSequence GTGANNNNNNNNNGACA \
     --soloCBposition 2_-9_2_-1 2_4_2_12 2_17_2_25 \
     --soloUMIposition 3_10_3_17 \
     --soloCBwhitelist BD_CLS1.txt BD_CLS2.txt BD_CLS3.txt \
     --soloCBmatchWLtype 1MM \
     --soloCellFilter None \
     --soloCellReadStats Standard \
     --sjdbOverhang 100 \
     --outSAMattributes sS sM CB UB CR CY UR UY \
     --outSAMtype BAM SortedByCoordinate \
     --quantMode GeneCounts \
     --sjdbGTFfile genome.gtf

## UB looks like UR, but with a prepended AA dinucleotide
samtools view mapping_test/Al*bam | cut -f16- | head
# UR:Z:TTTGATGG   UY:Z:FFFFFFFF   CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA      UB:Z:AATTTGATGG
# UR:Z:TTTGATGG   UY:Z:FFFFFFFF   CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA      UB:Z:AATTTGATGG
# UR:Z:TTTGATGG   UY:Z:FFFFFFFF   CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA      UB:Z:AATTTGATGG
# UR:Z:TTTGATGG   UY:Z:FFFFFFFF   CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA      UB:Z:AATTTGATGG
# UR:Z:TTTGATGG   UY:Z:FFFFFFFF   CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA      UB:Z:AATTTGATGG
# UR:Z:TTTGATGG   UY:Z:FFFFFFFF   CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA      UB:Z:AATTTGATGG
# UR:Z:TTTGATGG   UY:Z:FFFFFFFF   CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA      UB:Z:AATTTGATGG
# UR:Z:TTTGATGG   UY:Z:FFFFFFFF   CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA      UB:Z:AATTTGATGG
# UR:Z:TTTGATGG   UY:Z:FFFFFFFF   CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA      UB:Z:AATTTGATGG
# UR:Z:TTTGATGG   UY:Z:FFFFFFFF   CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA      UB:Z:AATTTGATGG

Thanks again for your time,

John

alexdobin commented 1 year ago

Hi John,

This is indeed a bug; thanks a lot for the detailed test example. The length of the UB tag for complex barcodes is mistakenly taken from --soloUMIlen (default = 10). I will fix it in the next release, for now, as a workaround, please specify the correct length --soloUMIlen 8.