COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0

Perform salmon on only UTR region #248

Closed: obenno closed this issue 6 years ago

obenno commented 6 years ago

Hi,

I'm working on alternative polyadenylation analysis of transcripts. We would like to estimate expression for different alternative UTR sequences of the same gene set. As some UTRs are very short (even shorter than the average RNA-seq fragment length), many reads may fail to align concordantly to the reference sequences. Which options should I use to rescue orphan reads when calculating TPM?

Thanks a lot. obenno

obenno commented 6 years ago

Thank you in advance for developing this amazing thing.

I have a few small suggestions. I have little background in the algorithms used in salmon and need to run a quick test on my data. I read through the documentation, and it is very difficult for me to understand the rationale behind some options without reading the series of original algorithm papers. I think some important details are missing from the documentation, for example, the difference between the quasi-mapping model and the lightweight-alignment-based model, and how salmon deals with paired-end reads that are not concordantly aligned. To be honest, this is not very friendly to end users of the tool, and it prevents some of us from using it extensively.

obenno

rob-p commented 6 years ago

Hi @obenno,

Sorry for the delayed response. Regarding your first question, salmon automatically allows discordant mappings of a read if there are no concordant mappings. That is, if there is no mapping that accounts for both ends of a paired-end read, then salmon will automatically accept the orphaned mapping. This behavior can be turned off with the --discardOrphansQuasi flag, which tells salmon not to allow orphan mappings.
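
As a rough illustration of how that flag might be toggled from a script, the sketch below builds a `salmon quant` argument list in Python without running it. The helper name `build_salmon_cmd`, the index name, and the file paths are hypothetical placeholders. `-i`, `-l A`, `-1`, `-2`, and `-o` are the usual `salmon quant` options, and `--discardOrphansQuasi` is the flag mentioned above. Flag availability may differ between salmon versions, so check the help output of your installed version before relying on it.

```python
from typing import List


def build_salmon_cmd(index: str, reads1: str, reads2: str, out_dir: str,
                     discard_orphans: bool = False) -> List[str]:
    """Build (but do not run) a `salmon quant` command for paired-end reads.

    Hypothetical helper for illustration only; all paths are placeholders.
    """
    cmd = [
        "salmon", "quant",
        "-i", index,    # salmon index built from the UTR sequences
        "-l", "A",      # let salmon infer the library type
        "-1", reads1,   # first mates
        "-2", reads2,   # second mates
        "-o", out_dir,  # output directory
    ]
    if discard_orphans:
        # Opt out of the default behavior: do not allow orphan mappings.
        cmd.append("--discardOrphansQuasi")
    return cmd


# Default run: orphan mappings are accepted when no concordant mapping exists.
default_cmd = build_salmon_cmd("utr_index", "r1.fq.gz", "r2.fq.gz", "quant_default")
# Strict run: orphan mappings are discarded.
strict_cmd = build_salmon_cmd("utr_index", "r1.fq.gz", "r2.fq.gz",
                              "quant_no_orphans", discard_orphans=True)
print(" ".join(default_cmd))
print(" ".join(strict_cmd))
```

The list form can be passed to `subprocess.run` if salmon is installed. Comparing the two output directories is one way to see how much orphan mappings contribute for short targets.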

Regarding your second question, I'm sorry that you find the relevant options difficult to understand. You're absolutely correct that there is a lot going on in the software, and that the array of different flags and potential parameters can be quite confusing to a newcomer. In our FAQ we try to answer some of the most frequently asked questions. Also, to help explain what some of the most important options are, and how you might want to set them (if not using the defaults), the ReadTheDocs page has a section describing important options. If there's a particular question you have that's not answered in one of these places, I'd be happy to answer it (and add it to the docs for future reference).

obenno commented 6 years ago

Hi @rob-p,

Thanks for your reply. I read the documentation's description of --discardOrphansQuasi. I need to count all mappings, including concordant pairs as well as orphans, as some of the target regions may be shorter than the sequencing fragments. I guess this cannot be achieved with salmon right now, can it? I opened this issue because I could not be sure about this after reading the documentation. Sorry for the inconvenience. Thanks.

rob-p commented 6 years ago

Hi @obenno,

Maybe I am misunderstanding your desired behavior. The default behavior of salmon is to report and account for orphan mappings. That is, orphans are not discarded by default. However, orphan mappings will only be produced if there are no concordant mappings for a read. The idea is something like this: search for concordant mappings; if we find some, report them; otherwise search for orphan mappings; if we find those then report them; if we find neither, leave the read unmapped. There is not, so far as I know, an alignment tool that will report orphan alignments for a read if there also exist concordant alignments for the same read. Is that the behavior you are looking for?
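
The search order described above can be sketched in Python as follows. This only illustrates the policy and is not salmon's actual implementation. The `Mapping` type, the `select_mappings` function, the `discard_orphans` parameter (standing in for --discardOrphansQuasi), and the target names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Mapping:
    target: str       # name of the reference sequence
    concordant: bool  # True if both mates map consistently to the target


def select_mappings(candidates: List[Mapping],
                    discard_orphans: bool = False) -> Tuple[str, List[Mapping]]:
    """Pick the mappings to report for one fragment.

    Concordant mappings win; orphans are a fallback; otherwise unmapped.
    """
    concordant = [m for m in candidates if m.concordant]
    if concordant:
        # Orphan candidates are ignored once any concordant mapping exists.
        return "concordant", concordant
    orphans = [m for m in candidates if not m.concordant]
    if orphans and not discard_orphans:
        return "orphan", orphans
    return "unmapped", []


# One fragment with a concordant hit plus an orphan hit on another target.
frag_with_pair = [Mapping("utr1", True), Mapping("utr2", False)]
# One fragment where only a single mate maps.
frag_orphan_only = [Mapping("utr1", False)]

print(select_mappings(frag_with_pair))
print(select_mappings(frag_orphan_only))
print(select_mappings(frag_orphan_only, discard_orphans=True))
print(select_mappings([]))
```

The decision is made per fragment: an orphan-only fragment is still reported even if other fragments map concordantly to the same target.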

Best, Rob

obenno commented 6 years ago

Hi @rob-p,

Yes, that is exactly what I need. I'm studying the expression of UTR regions only (potential UTR regions were extracted from the genome sequence according to their coordinates).

For example (attached image: img_20180730_105251):

Now salmon will only report 2 reads from a fragment, but if the target regions are very short, a lot of reads will be discarded as they are orphans, and there will be a bias in TPM estimation for short targets. I did some searching and found that kallisto's --single-overhang option can report those orphans, if I haven't misunderstood the description. Haha, anyway, thanks a lot for replying.
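
To make the concern concrete, here is a toy calculation using the standard TPM definition: each target's count is divided by its length, and the resulting rates are scaled to sum to one million. The counts and lengths are made up, and plain lengths are used instead of the effective lengths that salmon estimates, so this only sketches the direction of the effect if orphan fragments were dropped.

```python
from typing import Dict


def tpm(counts: Dict[str, float], lengths: Dict[str, float]) -> Dict[str, float]:
    """Compute TPM from per-target fragment counts and target lengths."""
    rates = {t: counts[t] / lengths[t] for t in counts}  # fragments per base
    total = sum(rates.values())
    return {t: 1e6 * r / total for t, r in rates.items()}


# Hypothetical targets: one shorter than a typical fragment, one much longer.
lengths = {"short_utr": 100.0, "long_utr": 1000.0}

# Made-up counts when orphan fragments are kept.
counts_with_orphans = {"short_utr": 50.0, "long_utr": 500.0}
# Made-up counts when orphans are dropped: the short target loses most of
# its fragments, because few fragments fit both mates inside it.
counts_without_orphans = {"short_utr": 10.0, "long_utr": 480.0}

tpm_with = tpm(counts_with_orphans, lengths)
tpm_without = tpm(counts_without_orphans, lengths)
print(tpm_with)
print(tpm_without)
```

In this toy setup the short target's TPM falls sharply once orphans are removed, while the long target's TPM rises to compensate, since TPM values always sum to one million.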

Regards, obenno

rob-p commented 6 years ago

Hi @obenno,

In your figure, are a, b, c, and d different fragments? If so, salmon will report 4 fragments mapping to this reference sequence as long as there is no other reference sequence accounting for both ends of b, c, and d. It would be possible, of course, to always report all orphans, even when concordant mappings exist, but one would expect this to lead to a lot of spurious mappings and artifactual expression.

obenno commented 6 years ago

Hi @rob-p,

Do you mean that b, c, and d will also be counted when a exists? If this is the default behavior of salmon, then I may have misunderstood "orphans", sorry for that. But this seems a little different from your previous description, which said that orphans will only be searched for and reported when no concordant mapping exists... Sorry for being so verbose.

Bests, obenno