COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0

FASTQ position usage by Salmon #101

Open ZachHunter opened 8 years ago

ZachHunter commented 8 years ago

This is not exactly a bug, but a comment and a question regarding how Salmon uses the positioning data in FASTQ files. We had a series of RNA-seq samples where the majority of the reads were listed at 0:0 in the FASTQ file. We think this is some obscure issue with one of the trimming/demultiplexing pipelines. No one noticed, as this data is not generally used, but it did throw an error with RSEM. Luckily, this error had been previously reported.

Notably, Salmon in quasi-mapping mode was fine. It was only when I tried again using STAR-aligned BAM files that I noticed that Salmon used only the reads not listed at 0:0 (STAR does not seem to care one way or the other). Obviously, badly formatted FASTQ files do not constitute a bug, and we are working on fixing them, but we were curious why the positioning data was being used in alignment mode but not in quasi-mapping mode. Moreover, why is it being used at all? Is it used to weed out potential artifacts?

Many thanks, and happy to share an example file if you are interested.
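
For anyone wanting to check their own files for the same symptom, here is a minimal sketch that counts how many reads carry 0:0 coordinates. It assumes "listed at 0:0" refers to the x:y fields of an Illumina-style read name (`@instrument:run:flowcell:lane:tile:x:y`), and the function name `count_zero_position_reads` is made up for illustration. Headers in other formats are simply counted as non-zero.

```python
from typing import Iterable, Tuple


def count_zero_position_reads(fastq_lines: Iterable[str]) -> Tuple[int, int]:
    """Return (reads_at_0_0, total_reads) for a 4-line-per-record FASTQ.

    Assumes Illumina-style names: @instrument:run:flowcell:lane:tile:x:y
    """
    zero = 0
    total = 0
    for i, line in enumerate(fastq_lines):
        if i % 4 != 0:  # only the first line of each record is a header
            continue
        parts = line.split()
        if not parts:  # tolerate trailing blank lines
            continue
        total += 1
        # Drop the optional comment after the first space, then split fields.
        fields = parts[0].split(":")
        if len(fields) >= 7 and fields[5] == "0" and fields[6] == "0":
            zero += 1
    return zero, total


if __name__ == "__main__":
    import sys

    # Usage (uncompressed FASTQ on stdin): python check_fastq.py < reads.fastq
    zero_reads, all_reads = count_zero_position_reads(sys.stdin)
    print(f"{zero_reads} of {all_reads} reads are listed at 0:0")
```
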

rob-p commented 7 years ago

Hi Zach,

I apologize for taking so long to get back to you on this. It fell off my radar, and somehow I'd forgotten I hadn't responded yet. This is not actually intentional behavior, as alignment-based Salmon doesn't explicitly make use of this positional information. However, before using alignments, it does check that they don't have certain flags set (e.g. the duplicate or QC-fail flags). Perhaps that is the problem? I'd be happy to take a look at some sample data and figure out what's going on. I can probably get to that early next week. Sorry again for the slow reply!

--Rob
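
One way to test the flag hypothesis above is to tally the relevant bits in the aligner's output. The sketch below reads SAM text and counts records with the QC-fail (`0x200`) or duplicate (`0x400`) bit set in the FLAG column; those bit values come from the SAM specification. The helper name `summarize_flags` is hypothetical, and this is only a diagnostic, not a reproduction of Salmon's internal filtering.

```python
from typing import Dict, Iterable

QC_FAIL = 0x200    # SAM FLAG bit: not passing quality controls
DUPLICATE = 0x400  # SAM FLAG bit: PCR or optical duplicate


def summarize_flags(sam_lines: Iterable[str]) -> Dict[str, int]:
    """Count SAM records by QC-fail / duplicate status."""
    counts = {"total": 0, "qc_fail": 0, "duplicate": 0, "clean": 0}
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])  # FLAG is the second column
        counts["total"] += 1
        if flag & QC_FAIL:
            counts["qc_fail"] += 1
        if flag & DUPLICATE:
            counts["duplicate"] += 1
        if not flag & (QC_FAIL | DUPLICATE):
            counts["clean"] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: samtools view -h aligned.bam | python flag_summary.py
    print(summarize_flags(sys.stdin))
```

If the number of "clean" records roughly matches the number of reads Salmon reports using, the flags are a likely explanation. `samtools flagstat` gives a similar QC-passed/QC-failed breakdown without any scripting.
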