bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

Adding additional RNA-Seq pipelines or components to bcbio #2468

Closed bioinfonerd closed 5 years ago

bioinfonerd commented 6 years ago

I would like to add additional RNA-Seq pipelines and alternative components, such as aligners, quantifiers, trimmers, or QC statistics, for use in bcbio.

Is there a wiki or section on how to contribute to the development of bcbio?

I'm with LSP at HMS, so this may be better to discuss in person.

roryk commented 6 years ago

Hi @bioinfonerd,

We accept pull requests for new features, though it would be good to flesh out what you were thinking of adding before you roll up your sleeves and get cracking. We already have support for trimming, for example, so it would be good to know what the current trimming is missing that you'd like to add. The same goes for the aligners and quantifiers; we'd want to make sure we are implementing support for something that is an improvement over the existing methods. That said, we're constantly trying to improve.

bioinfonerd commented 6 years ago

Thanks for the response. What I am thinking is the following. There is a constant wave of new aligners and quantifiers, and which one is considered the 'best' varies, which could largely be attributed to experimental design. So instead of picking one, I would like to run a suite of aligners and quantifiers to automate working out which DEGs can be relied on, and to flag a category of DEGs that vary between pipelines as a warning to researchers.
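To make the idea concrete, here is a minimal sketch of the concordance step, assuming each pipeline has already produced a set of gene IDs called differentially expressed. The function name `categorize_degs`, the category labels, and the pipeline names are all hypothetical and not part of bcbio.

```python
from collections import Counter


def categorize_degs(calls_by_pipeline):
    """Label each gene by how many pipelines call it differentially expressed.

    calls_by_pipeline: dict mapping a pipeline name to a set of gene IDs.
    Returns a dict mapping gene ID to one of three hypothetical labels:
    "consistent" (called by every pipeline), "pipeline-specific" (called by
    exactly one of several pipelines), or "variable" (anything in between).
    """
    n_pipelines = len(calls_by_pipeline)
    if n_pipelines == 0:
        raise ValueError("need at least one pipeline")

    # Count, for each gene, the number of pipelines that called it.
    counts = Counter(
        gene for genes in calls_by_pipeline.values() for gene in set(genes)
    )

    categories = {}
    for gene, n_calls in counts.items():
        if n_calls == n_pipelines:
            categories[gene] = "consistent"
        elif n_calls == 1:
            categories[gene] = "pipeline-specific"
        else:
            categories[gene] = "variable"
    return categories


# Illustrative input only; these are made-up gene IDs, not real results.
example_calls = {
    "star+featurecounts": {"G1", "G2", "G3"},
    "hisat2+featurecounts": {"G1", "G2"},
    "salmon": {"G1", "G4"},
}
for gene, label in sorted(categorize_degs(example_calls).items()):
    print(gene, label)
```

Genes labelled "variable" or "pipeline-specific" would be the ones to warn researchers about. A real version would likely also compare effect sizes and adjusted p-values rather than only set membership.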

bcbio seems to have an ideal setup for this purpose, as one can mix and match aligners and quantifiers. However, based on one of the most recent RNA-Seq benchmarking papers (Baruzzo 2017/18), there are a few aligners/quantifiers that should probably be included, namely Novoalign, GSNAP, and Mapsplice2. I apologize if they are already included, but I was unable to find them.

What I was hoping for is a relatively straightforward way of adding new quantifiers and aligners to bcbio, as they should differ only in a few details of the input they expect and in the patterns of output they produce.

Thought I would ask in case any of this has already been done.

roryk commented 6 years ago

Hi @bioinfonerd,

Thanks for the discussion, we're always looking to improve.

That said, we've been moving away from doing full alignments for RNA-seq as part of an align-and-count method. The kmer/pseudo/quasialignment type methods seem to do a better job with downstream quantification for DEG/DTU, are much faster, and allow for bootstrapping type methods, which improve transcript-level differential expression with sleuth https://paperpile.com/app/p/c98c76a0-dcbd-0c95-b6c2-8e7d138a8186. DESeq2 will soon be able to incorporate the bootstraps in its statistical model as well.

We do use full genome alignments for variant calling, quality control and transcriptome reconstruction but not for anything else; improvements in alignments are most likely to help out with variant calling. For DEG, see https://paperpile.com/app/p/92906d63-7d97-05f3-b410-721708de566f for an example of pseudoaligners doing a better job than a few different align-and-count methods for expression quantification. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1526-y has an RSEM comparison, and compares two quasialignment methods.

It's the count part of the align-and-count methods that is the issue, and swapping in different aligners won't help with that all that much. In the future we'll be dropping full alignments in favor of quasialignments + projection onto genome coordinates for quality control, and turning off full alignments for everything except variant calling and transcriptome reconstruction.

Let me know what you think, thanks again for opening up the discussion.

PatrickJReed commented 6 years ago

Hi @bioinfonerd, are you proposing the development of an "ensemble"-like approach for the quantification of DEGs, where the ensemble set is the aligner/pseudo/quasialignment algorithm(s), or more broadly, how counts are generated? Would this require the calling of DEGs to become built-in functionality for bcbio, rather than being done externally using DESeq2, etc.?

roryk commented 5 years ago

Thanks for the comments. I added support to bcbio for doing Salmon quantification with the whole genome alignments with the `quantify_genome_alignments` flag, which falls back to Salmon's SA mode with decoys if the Salmon alignments do not exist.
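As a rough illustration of where that flag would go, the sketch below builds a sample entry for a bcbio sample configuration in Python and prints it as JSON, which YAML parsers generally accept. Only `quantify_genome_alignments` comes from the comment above; the helper name `rnaseq_sample`, the file names, the genome build, and the choice of `star` as the aligner are assumptions for illustration, and the surrounding keys follow my understanding of the bcbio sample configuration layout, so check them against the bcbio documentation before use.

```python
import json


def rnaseq_sample(description, fastq_files, genome_build="hg38"):
    """Build one hypothetical RNA-seq sample entry with the flag enabled."""
    return {
        "description": description,
        "files": list(fastq_files),
        "analysis": "RNA-seq",
        "genome_build": genome_build,  # assumed build, adjust as needed
        "algorithm": {
            "aligner": "star",  # assumed aligner for the genome alignments
            "quantify_genome_alignments": True,  # flag named in the comment
        },
    }


# Placeholder file names; replace with real FASTQ paths.
config = {
    "details": [
        rnaseq_sample("sample1", ["sample1_R1.fastq.gz", "sample1_R2.fastq.gz"])
    ]
}
print(json.dumps(config, indent=2))
```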