COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0

wrong quantifications on some transcripts due to transposon fragments #217

Open dominik-handler opened 6 years ago

dominik-handler commented 6 years ago

Hi,

First of all, I really like using Salmon for RNA-seq quantification. I have a rather specific problem. We work with experimental conditions in which transposons get activated and become highly expressed. Initially this made my RNA-seq quantification wrong, because transposon reads were assigned to genic transcripts. I can mostly circumvent the problem by adding the transposon transcripts to the index. However, some regions are 100% identical between transposons and genes, and these still cause problems. The gene in the example is not expressed in the cells I use, which is more or less what Salmon reports for WT conditions. In my experimental conditions, however, I get TPMs of >12 because of the heavy transposon deregulation. You can see this in the chart here, where I overlay the Salmon mappings for WT and for the experimental knockout (KO): [image]

The only region covered is the transposon piece and the rest of the transcript is uncovered.

Is there a setting that avoids this kind of wrong quantification, or is it possible to mask such regions upstream?

Any help would be appreciated.

All the best and thank you, Dominik

mdshw5 commented 6 years ago

I’ve noticed the same thing, and have been hard-masking any repetitive sequences in my pipeline (which I’ve been late to open-source and should be available soon): http://mattshirley.com/uploads/2017/11/2017-11-01_Genome_Informatics.pdf

dominik-handler commented 6 years ago

Yes, that is what I started to do as well. I generate reads from the transposon sequences and map them to the transcriptome with bowtie, allowing no mismatches. Regions that get covered are masked. Currently I use a read length of 30 for the transposon masking. How are you doing it?
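
For anyone who wants to try the masking idea described above, here is a minimal sketch in Python. It is not the pipeline used by either commenter: instead of simulating reads and aligning them with bowtie, it uses plain exact substring matching on k-mers, which should behave much like a zero-mismatch alignment of tiled reads. The function names (`mask_shared_regions`, `reverse_complement`, `kmers`) are hypothetical, and matching both strands is an assumption.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def kmers(seq, k):
    """Yield every window of length k, i.e. simulated reads tiled at 1 bp steps."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def mask_shared_regions(transcripts, transposons, k=30):
    """Hard-mask (with N) transcript regions that exactly match a transposon k-mer.

    transcripts, transposons: dicts mapping sequence name -> sequence string.
    k: simulated read length (30 in the comment above).
    Assumption: both strands of each transposon are used as probes.
    """
    # Collect all transposon k-mers from both strands.
    probes = set()
    for seq in transposons.values():
        seq = seq.upper()
        for strand in (seq, reverse_complement(seq)):
            probes.update(kmers(strand, k))

    masked = {}
    for name, seq in transcripts.items():
        upper = seq.upper()
        out = list(seq)
        # Any window that exactly matches a probe is replaced by N.
        for i in range(len(upper) - k + 1):
            if upper[i:i + k] in probes:
                out[i:i + k] = "N" * k
        masked[name] = "".join(out)
    return masked


if __name__ == "__main__":
    # Toy example with a short k so the effect is visible.
    transposons = {"TE1": "ACGGTCAT"}
    transcripts = {
        "geneA": "TTTTTACGGTCATCCCCC",  # contains the transposon fragment
        "geneB": "GGGGGGGGGG",          # shares nothing with TE1
    }
    for name, seq in mask_shared_regions(transcripts, transposons, k=5).items():
        print(name, seq)
```

The masked transcriptome would then be indexed together with the transposon sequences, so reads from the shared region can only be assigned to the transposon. Note that masking shortens the usable part of the affected transcripts, which may itself influence their estimated abundance.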