COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0

Gene fusions #52

Open schelhorn opened 8 years ago

schelhorn commented 8 years ago

Wicked fast indeed! Are there any plans to extend salmon to also detect gene fusion events? There isn't a fast and accurate way to do that yet, only approaches requiring full alignments. Most often a base-perfect breakpoint isn't required; an estimate within a hash length is fine. We are heavy users of bcbio and are also running the full STAR alignment just for gene fusions, which really sucks. Any ideas would be much appreciated.

rob-p commented 8 years ago

Hi @schelhorn,

Yes; we are actively looking at fusion prediction based on quasi-mapping. The initial results are promising, but we're still working on improving and refining the method. I'll be sure to let you know when we have something that is ready to test :).

Best, Rob

schelhorn commented 8 years ago

Excellent. May I point out that tools such as Oncofuse https://github.com/mikessh/oncofuse/ and Pegasus https://github.com/RabadanLab/Pegasus have a particular, additional value since they provide functional annotation of fusion events identified by other approaches? Also, these resources may prove helpful wrt validation data: https://github.com/chapmanb/bcbio-nextgen/issues/210 and http://m.genome.cshlp.org/content/early/2015/11/10/gr.186114.114 Adding @roryk here for highlighting this feature request in bcbio.

rob-p commented 8 years ago

Awesome; thanks for the pointers! We'll definitely take a look at these.

schelhorn commented 7 years ago

Hello @rob-p, may I ask whether there is any news concerning gene fusion detection in Salmon?

rob-p commented 7 years ago

Hi @schelhorn,

Yes, we have built a pipeline atop salmon and quasi-mapping. At this point, what we see is that it is very fast with high sensitivity. Our main focus has been on improving the specificity, which is currently better than some, but not all, methods. I realize, of course, that false positives are a very difficult (and key) problem in this domain, so I'd really like to make sure they are well-handled.

schelhorn commented 7 years ago

Great; would you like help testing the pipeline, and integrating it into bcbio? We could help with both :)

schelhorn commented 7 years ago

Also, do you know if the Salmon pseudo-BAM is suitable for fusion calling by standard (alignment-based) fusion calling tools, i.e., does the BAM include information on mate pairs mapped across transcripts, or reads spanning breakpoints?

rob-p commented 7 years ago

Hi @schelhorn,

Sorry for the uncharacteristically slow response on this. We're going full steam ahead for the RECOMB deadline, so I've been less responsive than usual. Anyway, I've invited you to the repository for the fusion project (it's currently private). Feel free to poke around, but it's probably not useful until we can send you a short writeup describing the current pipeline (since things are still very "alpha"). Regarding calling fusions from the SAM output of Salmon, one can't do this directly because there are, by default, no encompassing reads (i.e. individual reads split between transcripts) and, to improve abundance estimation, salmon is conservative with its use of spanning reads. However, we can get at this information from quasi-mapping, so I can definitely consider adding some flags to provide this info (this is the type of thing we output in the fusion pipeline currently, and then we have to postprocess it).

schelhorn commented 7 years ago

Excellent; thank you. We'll have a look and see what we can contribute.

schelhorn commented 7 years ago

Hello @rob-p, could you please invite @tetianakh to the repo as well? She'll do the development on our end. Thanks!

rob-p commented 7 years ago

Hi @schelhorn,

Sure, I'll add her now. We'll send you a small write-up about the state of the codebase and how to run the current pipeline next week (once my student is back from the current CSHL meeting with all of the cool kids ;P).

schelhorn commented 7 years ago

Sweet!

roryk commented 7 years ago

Hi Rob,

Could I get in on this? We have a couple of projects needing to call fusions on a large number of samples, and it would be great to have something speedy to iterate on.

schelhorn commented 7 years ago

FYI, I also asked in the kallisto project: https://github.com/pachterlab/kallisto/issues/122

tetianakh commented 7 years ago

Hi @rob-p, I haven't received an invitation to the private repo. Could you please invite me? Thanks!

rob-p commented 7 years ago

Hi @tetianakh, I've re-sent the invitation. If you don't get it, please send me an e-mail, and I'll reply with the link to join directly.

tetianakh commented 7 years ago

Thanks, I've received it now.

rob-p commented 7 years ago

Great :). I'll have @hiraksarkar write up a brief overview of the current state of the codebase (including which branch contains the latest stuff) this week. We can either share that information in the issues over at that repo, or we can e-mail you the write-up @schelhorn, @tetianakh and @roryk. Let me know if one method is preferable to the other.

schelhorn commented 7 years ago

Great; directly in the repo is preferred.

kellrott commented 7 years ago

This sounds cool. Have you looked at submitting your method for the DREAM RNA-Seq analysis challenge (https://synapse.org/SMC_RNA)?

nellore commented 7 years ago

And any status updates? I'd be interested to test drive a quasi-mapping-based fusion caller!

schelhorn commented 7 years ago

One fast way using pseudo-alignments should be Kallisto+[Manta|Pizzly], but I haven't tried that myself. We decided to go with full transcriptome alignments instead and integrated EricScript into bcbio. We'd still be interested in something more modern, though.

rob-p commented 7 years ago

If you have a downstream fusion pipeline that uses transcriptome mappings, you have been able to get those from the -z= option for a while now. The real challenge is how to properly control the false positive rate. That's the main thing special-purpose downstream software must solve.
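
As a rough illustration of what a downstream step over such transcriptome mappings might look like, here is a minimal Python sketch that scans SAM-formatted text for read pairs whose mates are assigned to different transcripts. It relies only on standard SAM columns (QNAME, FLAG, RNAME, RNEXT). The function name `cross_transcript_pairs` and the `EXAMPLE_SAM` records are hypothetical, made up for this example, and whether salmon's mapping output actually contains such pairs for a given version and set of options is an assumption, since salmon is described above as conservative with spanning reads.

```python
from collections import Counter

# Hypothetical SAM records for illustration only (not real salmon output).
EXAMPLE_SAM = [
    "@SQ\tSN:txA\tLN:1000",
    # Concordant pair on txA: RNEXT is "=", so it is ignored.
    "r1\t99\ttxA\t100\t255\t4M\t=\t300\t204\tACGT\tIIII",
    # Pair split across txA and txB, reported once per mate.
    "r2\t65\ttxA\t100\t255\t4M\ttxB\t40\t0\tACGT\tIIII",
    "r2\t129\ttxB\t40\t255\t4M\ttxA\t100\t0\tACGT\tIIII",
    # Mate unmapped (flag bit 0x8): ignored.
    "r3\t73\ttxC\t10\t255\t4M\t*\t0\t0\tACGT\tIIII",
]


def cross_transcript_pairs(sam_lines):
    """Count read pairs whose two mates map to different reference sequences.

    Returns a Counter keyed by the sorted (transcript, transcript) tuple.
    """
    counts = Counter()
    seen = set()  # (read name, transcript pair) combinations already counted
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete SAM alignment record
        qname, flag = fields[0], int(fields[1])
        rname, rnext = fields[2], fields[6]
        # Require: paired (0x1), this read mapped (0x4 unset), mate mapped (0x8 unset).
        if not flag & 0x1 or flag & 0x4 or flag & 0x8:
            continue
        # "=" means the mate is on the same reference; "*" means unknown.
        if rname == "*" or rnext in ("=", "*") or rnext == rname:
            continue
        pair = tuple(sorted((rname, rnext)))
        if (qname, pair) in seen:
            continue  # the other mate of this pair was already counted
        seen.add((qname, pair))
        counts[pair] += 1
    return counts


if __name__ == "__main__":
    for pair, n in cross_transcript_pairs(EXAMPLE_SAM).items():
        print(pair, n)
```

Counting candidate pairs like this is only the easy part. It does nothing to control the false positive rate, for example pairs that multi-map across paralogous transcripts, which is the hard problem noted above.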

nellore commented 7 years ago

Thanks for the tips; I'll experiment.

erprateek commented 5 years ago

Hi @rob-p, we are working towards creating a fusion-calling pipeline based on Salmon/Pizzly. It would be helpful to see the current state of the repository and try to replicate some of the experiments we have done with it. We seem to be hitting good specificity but falling a bit short on sensitivity. Thanks, Prateek

taylorreiter commented 2 years ago

Hello @rob-p! I was wondering if there have been any updates on the fusion/detection of spanning reads problem. I'm about to embark on a project to process many bacterial transcriptomes from many different genomes/species and plan to use salmon. I would love to be able to detect polycistronic transcripts through the identification of spanning reads.