Library type

biounix commented 8 years ago

Hi Rob,

I'm testing Salmon 0.6.0 in read-based mode and it seems awesome, fast and accurate. Moreover, it's very well documented, which I really appreciate.

My question is about the library type specification. In the libFormatCounts.txt file, the library type as well as the consistent and inconsistent mappings are correctly reported. However, the transcript TPM and NumReads values in the quant.sf file are the same regardless of the library type specified, even when I try different library type combinations. I would expect Salmon to take into account only the consistent mappings for the quantification. I'm specifying the -l parameter before the -1 and -2 parameters.

I'm pretty sure that I'm missing something. I would really appreciate it if you could shed light on this issue.

Thanks in advance.

rob-p commented 8 years ago

Hi @gresteban,

Thanks for the kind words. I'm working on improving the documentation even more for v0.7.0, which should land soon.

Regarding your question, what you're seeing is expected behavior. That is, for the vast majority of transcripts, Salmon will simply do the "right thing" regardless of the library type. This is because the library type is used as a "soft" rather than a "hard" filter when determining where a read may originate from (i.e., orientations other than the expected type have a probability orders of magnitude smaller than the expected type, but still non-zero). Thus, if the only mapping for a read disagrees with the expected type, it will still be used. There is a way to modify this behavior, but since stranded library prep is imperfect, the default behavior is the most reasonable for most situations.
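To make the "soft" versus "hard" distinction concrete, here is a rough Python sketch. It is only an illustration, not Salmon's actual code: the function name `mapping_weights`, the `(transcript, compatible)` representation, and the placeholder value `1e-5` for the incompatible-orientation probability are all assumptions made up for this example.

```python
def mapping_weights(mappings, incompat_prior=1e-5):
    """Return normalized weights for one read's candidate mappings.

    mappings: list of (transcript, compatible) pairs, where `compatible`
    says whether the mapping orientation agrees with the library type.
    incompat_prior: relative weight given to an incompatible mapping.
    The default here is an arbitrary placeholder, not Salmon's value.
    """
    raw = [1.0 if compatible else incompat_prior for _, compatible in mappings]
    total = sum(raw)
    if total == 0.0:
        # Hard filter: every mapping was incompatible, so the read is dropped.
        return []
    return [(t, w / total) for (t, _), w in zip(mappings, raw)]


# Soft filter: a lone incompatible mapping is still used.
print(mapping_weights([("t1", False)]))
# Hard filter: the same read is discarded.
print(mapping_weights([("t1", False)], incompat_prior=0.0))
# With a compatible alternative, the incompatible mapping gets a tiny weight.
print(mapping_weights([("t1", False), ("t2", True)]))
```

With a small but non-zero weight, a read whose only mapping is incompatible keeps that mapping after normalization; with a weight of zero it disappears.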

The reason that you'll see consistency in most cases, regardless of the library type, is as follows. Imagine that I have a read that maps to transcript 1 in the forward orientation and transcript 2 in the reverse orientation. Further, imagine I have a stranded library, and I expect all reads to map in the reverse orientation. If the mapping to transcript 1 is "spurious", there are unlikely to be many other reads mapping to that transcript in this manner, while we would expect other reads to map to transcript 2 in the prescribed manner. Since Salmon considers all of the reads in its probabilistic model when deciding how each read should be allocated, the fact that many reads map to transcript 2 will increase its abundance and, likewise, increase the probability that we assign this read to transcript 2. That is, the other mappings will help us make the right choice, regardless of the fact that we neglected to assign a stranded library type.

That said, there are situations where the library type makes a difference. This is most often the case for a few transcripts that are very similar in sequence (e.g., paralogs that happen to be on opposite strands). Here, most of the reads that map to one transcript will map to the other as well, and the much larger conditional probability of agreeing with the prescribed library type will cause these reads to be allocated to the transcript to which they map in the expected orientation. However, such transcripts are usually a small proportion of all expressed transcripts in an experiment, which is why, even if you do have a stranded library and some strand-specific expression, you'd expect the overall concordance to be very high between runs with different provided library types. Let me know if this answers your question, and if you have any others.
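The two scenarios above (a spurious mapping rescued by other reads, and near-identical transcripts on opposite strands) can be played out with a toy EM loop in Python. This is a deliberately simplified sketch and not Salmon's model: it assumes equal effective transcript lengths, ignores every other term Salmon uses, and the names `em_read_counts`, `spurious`, and `paralogs` are invented for the example. Setting `incompat_prior=1.0` is used here as a stand-in for "orientation is ignored".

```python
def em_read_counts(reads, transcripts, incompat_prior, n_iter=100):
    """Toy EM: estimate how many reads are allocated to each transcript.

    reads: list of reads; each read is a list of (transcript, compatible)
    pairs. Equal effective lengths are assumed (a simplification).
    """
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    counts = dict.fromkeys(transcripts, 0.0)
    for _ in range(n_iter):
        counts = dict.fromkeys(transcripts, 0.0)
        # E-step: split each read across its mappings, weighting by the
        # current abundance and by orientation compatibility.
        for mappings in reads:
            weights = {
                t: theta[t] * (1.0 if compatible else incompat_prior)
                for t, compatible in mappings
            }
            total = sum(weights.values())
            if total == 0.0:
                continue  # no usable mapping for this read
            for t, w in weights.items():
                counts[t] += w / total
        # M-step: abundances are proportional to the allocated reads.
        assigned = sum(counts.values())
        if assigned == 0.0:
            break
        theta = {t: counts[t] / assigned for t in transcripts}
    return counts


transcripts = ["t1", "t2"]

# Scenario 1: one ambiguous read plus 20 reads that map only to t2.
spurious = [[("t1", False), ("t2", True)]] + [[("t2", True)]] * 20
print(em_read_counts(spurious, transcripts, incompat_prior=1.0))
print(em_read_counts(spurious, transcripts, incompat_prior=1e-5))

# Scenario 2: every read maps to both transcripts, on opposite strands.
paralogs = [[("t1", False), ("t2", True)]] * 10
print(em_read_counts(paralogs, transcripts, incompat_prior=1.0))
print(em_read_counts(paralogs, transcripts, incompat_prior=1e-5))
```

In the first scenario the ambiguous read ends up on t2 whether or not orientation is taken into account, because the other reads drive t2's abundance up. In the second scenario nothing but orientation separates the two transcripts, so the reads split evenly when orientation is ignored and move to t2 when it is used.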

Best, Rob

biounix commented 8 years ago

Sure! That totally answers my question (and more). It seems that Salmon does even better than I could guess.

Many thanks for your detailed explanations.

AK443 commented 6 years ago

Hi Rob, I'd like to ask a follow-up question to this thread:

In your reply above, you said:

the library type is used as a "soft" rather than a "hard" filter when determining where a read may originate from (i.e., orientations other than the expected type have a probability orders of magnitude smaller than the expected type, but still non-zero). Thus, if the only mapping for a read disagrees with the expected type, it will still be used. There is a way to modify this behavior, but since stranded library prep is imperfect, the default behavior is the most reasonable for most situations.

I wonder how I can enforce the "correct" usage of the strand information by Salmon. I am testing Salmon on some data, and there seem to be cases of overlap between two genes (on opposite strands) where the values produced by Salmon seem suspicious.

Best, Alex

rob-p commented 6 years ago

Hi Alex,

The appropriate way to force salmon to use the library type as a hard constraint is to pass the option --incompatPrior 0.0 on the command line. This will tell salmon that it should consider a fragment mapping different than the library type to be impossible (i.e. this mapping should simply be discarded). This will actually be the default behavior starting from the next release anyway, as the current behavior seems to confuse more people than not.
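For anyone scripting this, a minimal Python sketch of assembling such a command line is shown below. The `--incompatPrior 0.0` option comes from the reply above, and `-l`, `-1` and `-2` are the parameters mentioned earlier in the thread; the helper name `salmon_quant_command`, the file paths, the `ISR` library type, and the use of `-i` and `-o` for the index and output directory are assumptions for illustration, so check `salmon quant --help` for your version before relying on them. The sketch only builds and prints the command; it does not run Salmon.

```python
import shlex


def salmon_quant_command(index, lib_type, mates1, mates2, out_dir,
                         incompat_prior=0.0):
    """Build a `salmon quant` argument list with a hard library-type filter.

    All paths are placeholders supplied by the caller. The library type is
    placed before the read files, as in the original question.
    """
    return [
        "salmon", "quant",
        "-i", index,          # assumed option for the index directory
        "-l", lib_type,       # library type, given before -1 / -2
        "-1", mates1,
        "-2", mates2,
        "--incompatPrior", str(incompat_prior),
        "-o", out_dir,        # assumed option for the output directory
    ]


cmd = salmon_quant_command(
    index="transcripts_index",   # hypothetical path
    lib_type="ISR",              # hypothetical library type for this data
    mates1="reads_1.fastq",      # hypothetical path
    mates2="reads_2.fastq",      # hypothetical path
    out_dir="quant_out",         # hypothetical path
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.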

Best, Rob

AK443 commented 6 years ago

Thanks, it worked.

COMBINE-lab / salmon

Library type #67