kvittingseerup opened 7 years ago

One problem I often run into is that the sequence quality of the reads is problematic, so I would like to trim the files based on sequence quality first. When I do that, e.g. with Trimmomatic, the result is 4 fastq files: two where both mates of a pair survived, and two where only one mate of the pair survived.

Would it be possible to enable the bias corrections when mapping both the paired data (with -1 and -2) and the unpaired data (--unmatedReads)? Currently when I try this I just get the following error: "Cannot combine distributions that live in a different space".
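To make it concrete, this is roughly the kind of thing I'm running (the jar path, index path, adapter fasta and file names below are just placeholders for my actual files):

```
# Quality/adapter trimming with Trimmomatic in paired-end mode gives 4 files:
# two where both mates survived, two holding the orphaned mates.
java -jar trimmomatic.jar PE \
  sample_R1.fastq.gz sample_R2.fastq.gz \
  trimmed_R1_paired.fastq.gz trimmed_R1_unpaired.fastq.gz \
  trimmed_R2_paired.fastq.gz trimmed_R2_unpaired.fastq.gz \
  ILLUMINACLIP:adapters.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:36

# What I would like: quantify the surviving pairs and the orphans together,
# with bias correction. This is the kind of command that currently fails with
# "Cannot combine distributions that live in a different space".
salmon quant -i transcripts_index -l A \
  -1 trimmed_R1_paired.fastq.gz -2 trimmed_R2_paired.fastq.gz \
  --unmatedReads trimmed_R1_unpaired.fastq.gz trimmed_R2_unpaired.fastq.gz \
  --seqBias --gcBias -o salmon_out
```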
Hi @kvittingseerup,
Sorry for letting this sit for so long without responding. Currently, Salmon does not support mixed paired-end and single-end library types, so this is presumably what is causing the error (granted, the error message here could be considerably better). Practically, I'd be curious what the difference is between allowing this and simply running Salmon on the non-quality-trimmed paired-end reads. Specifically, if Salmon is not able to map a pair concordantly but it can map one end of the pair, it will already use that single end.
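For reference, the comparison I have in mind is just the standard paired-end run on the untrimmed reads; no separate single-end input is needed, since orphaned ends are handled internally (the index path and file names below are placeholders):

```
# Standard paired-end quantification on the original (untrimmed) reads.
# If a pair can't be mapped concordantly but one end does map, Salmon
# already uses that orphaned end on its own.
salmon quant -i transcripts_index -l A \
  -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz \
  --seqBias --gcBias -o salmon_untrimmed_out
```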
However, in the case that there's a really compelling reason to want to quality trim the reads prior to quantification (and to include the reads whose mate has been completely quality-trimmed away), we would be able to support this. It will require a bit of modification to allow different library types to be processed back to back and to contribute to the same quantification estimates. In this case, I imagine what we would want to do for the orphans is essentially what Salmon would do internally if it can't map the mate. That is, we would learn essentially all of the parameters and biases from the pairs that do map concordantly, and then include each orphaned read as indicating an entire fragment, but one of unknown length.
Let me know if you have any thoughts about the above, and sorry again for the delay!
--Rob
No worries - and that is exactly what I thought could be possible :-)
Just out of curiosity - how would Salmon currently handle a read where one half could be quasi-mapped to a transcript but the other half did not fit anywhere (due to it being very low quality, or to sequencing adapter contamination)?
Right now, the way that this is handled is as follows. If you have a read in a paired-end library where only one of the reads maps, then Salmon will assign that fragment (ambiguously and proportionally) to the transcripts where the orphaned read maps. However, in a paired-end library, only properly-paired and concordantly-mapped reads contribute to the estimation of library-specific parameters and biases. Thus, assuming that the low-quality reads that your process would discard are not being mapped anyway, Salmon's behavior is pretty much what I described above.
That sounds like a very good way of doing it :-).
I'm sorry I was not clear enough - my question was actually meant for a single sequence - let me try again: let's say we have a read pair where one mate maps fine, but the other mate has a problem - half of it is an adapter (or low-quality sequence with too many errors). How would Salmon currently handle this situation, where the first half of a sequence (e.g. nt 1-50) could be quasi-mapped to a transcript but the second half (nt 51-100) did not match anywhere? Would the second half cause the whole sequence to be discarded, or would it be enough that the first half matched for it to be considered/counted?
Hi @kvittingseerup,
No need to apologize, I think it was I who was not clear. What I am saying is that this is already the way that Salmon handles such a case. That is, if you have a paired-end read, and one of the reads maps but the other doesn't (due to e.g., adapter contamination or just very low quality), then Salmon will consider the remaining (mapping) end of the read as representative of an entire fragment, and will resolve the fragment origin accordingly during optimization. Generally, not having both ends of a paired-end read leads to increased ambiguity, but this isn't a particularly big problem if it only happens to a generally small fraction of the reads. Further, since you cannot reliably infer the implied fragment length on a transcript from only a single-end read, such mappings will not contribute to the bias model. Again, however, as long as this doesn't happen to the vast majority of fragments, it should have only a negligible effect on quantification and bias correction. Please let me know if this description makes sense.
Best, Rob
Hi @rob-p
Thanks for the elaborate answer - makes a lot of sense.
The problem is that adapter contamination typically occurs when the fragments are shorter than the read length, so that we sequence into the adapters - and it can affect a large fraction of the reads (I've seen up to 50% of reads affected at the 3' end), making it non-negligible. That is why I suggested the extension in the first place.
I think it makes a lot of sense to trim the adapters away - both because leftover adapter sequence reduces the number of compatible reads and, mostly, because failing to trim it will result in an overestimation of the fragment length.
Now that I think about it, I don't think we should trim reads based on quality, as that will lead to an underestimation of the read length - or what do you think?
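Concretely, something along these lines is what I'm leaning towards now - adapter trimming only, with no quality-based steps, so nearly all pairs stay intact (again, the jar path, adapter fasta and file names are just placeholders):

```
# Adapter-only trimming: ILLUMINACLIP removes read-through adapter sequence,
# but there are no quality-based trimming steps, so nearly all pairs survive.
java -jar trimmomatic.jar PE \
  sample_R1.fastq.gz sample_R2.fastq.gz \
  adapterclipped_R1_paired.fastq.gz adapterclipped_R1_unpaired.fastq.gz \
  adapterclipped_R2_paired.fastq.gz adapterclipped_R2_unpaired.fastq.gz \
  ILLUMINACLIP:adapters.fa:2:30:10 MINLEN:25

# ...followed by a normal paired-end Salmon run on the surviving pairs.
salmon quant -i transcripts_index -l A \
  -1 adapterclipped_R1_paired.fastq.gz -2 adapterclipped_R2_paired.fastq.gz \
  --seqBias --gcBias -o salmon_adapterclipped_out
```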