COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0
777 stars 165 forks source link

CIGAR in sam file not showing mismatches #491

Open mej54 opened 4 years ago

mej54 commented 4 years ago

Hi,

I'm looking at RNAseq data and am trying to identify the number of reads that map to an expressed transgene by using salmon in the quantification mapping mode. I built the reference using a kmer of 15 on a fasta that contains ~1 kb of the target transgene region as the reference sequence.

I started looking at reads that have only one hit to this gene/barcode region (NH:i:1) and I noticed that a lot of these reads have a CIGAR string of 151M, but when I look at them in IGV aligned to my reference sequence, they only match a portion of the reference and the rest of the read contains a lot of mismatches. I'm wondering why in this case the read would say 151M when it contains these mismatches in alignment? Are the CIGAR strings in the salmon *.sam files accurate? I have attached an IGV screenshot for reference.

I ran salmon (v0.9.1) using the following command: salmon quant -i $REF_INDEX -l U -r <(gunzip -c $R2_FASTQ) --writeMapping=$OUT_SAM -o $OUT_QUANT --threads 28

Thank you for any help you may be able to provide!

Screen Shot 2020-03-10 at 3 10 30 PM
rob-p commented 4 years ago

Hi @mej54,

First, thanks for using salmon and for providing detailed feedback! There are two main points I'd like to make in response to the points you raise.

First, v0.9.1 is very old, and there have been a large number of bug fixes and substantial improvements to salmon since that version (though it's much better than people who are still using 0.8.2 from, like, 3 years ago!). Specifically, I'd highly recommend upgrading to the latest version (1.1.0, with 1.2.0 coming out shortly). We've added (and made standard) selective-alignment, which is a procedure that provides alignment scoring for the assigned reads to avoid spurious mappings that arise with fast lightweight mapping procedures.

Second, the observation of mismatching bases at the provided alignment location is the expected behavior with the mappings written by salmon with the --writeMappings option. Specifically, while newer versions of salmon (0.15.0 and greater) will do alignment scoring and removal of low score alignments by default, salmon still does not compute or write out a full CIGAR string for its alignments. Instead, it uses a score-only dynamic program to compute the optimal alignment score at the given location, but it "spoofs" the CIGAR string. Thus, if there is e.g. a small indel in the read, this will show up in an IGV visualization as a large number of mismatches after that indeed location. I'm not sure that is what is happening in the screenshot you show above, and, in fact, may of these mappings may disappear with selective-alignment. However, it will definitely still be possible to see a cigar string showing full matches, where there are mismatches in IGV. This is intended behavior due to score-only alignment. However, it's also true that newer versions of salmon will report the alignment score in an AS tag, so that you can see how high the alignment quality was at the particular location.

mej54 commented 4 years ago

Hi @rob-p,

Thank you for your response/explanation about the CIGAR strings. I had a feeling this was the case based on previous issues I had been reading, but I just wanted to make sure I was understanding correctly about the CIGAR output. I will update to the latest version of Salmon and will check if the selective-alignment helps with these false mappings.

Thanks again!

epruesse commented 2 years ago

@rob-p - How much effort would it be to run a DP with tracing of the alignment? I remember reading that Brian Bushnell managed to fit everything into the CPU cache for BBMap's alignment algorithm, at least for regular read lengths, so the performance impact should be acceptable if done right. Not sure about licenses, but I think there was an optional native C implementation for it in his code base. This would be great to have so the SAM can be used for e.g. basic genotyping for QC purposes.

Alternatively, maybe add a line to the docs of --writeMappings to make sure everyone understands the read alignments will have a score and position, but lack actual alignment and appear as if they were perfect matches.