bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
994 stars 353 forks source link

Sailfish and disambiguation #1132

Closed schelhorn closed 8 years ago

schelhorn commented 8 years ago

Thanks, @roryk, for setting up sailfish. However, it appeared to me that it isn't a complete replacement for express yet since you are taking the input fastqs rather than the transcriptome-aligned+disambiguated ones that express took. Since the disambiguation functionality is already in place for express, would it make sense to trigger transcriptome alignment and disambiguation also for sailfish iff disambiguation is configured? I know that doing a full alignment before is rather pointless for express (since it is alignment free), but our disambiguation approach is still based on alignment rather than pseudo-alignment so that is the way we would have to go until someone comes up with a better species filter.

roryk commented 8 years ago

Hi @schelhorn,

You're right, as usual. I was thinking a way to do disambiguation for Sailfish would be to create an index of all of the species1 and species2 transcripts and quantitate them all together and then separate them into the species1/species2 counts after. What do you think about that?

mjafin commented 8 years ago

Hi both, I've seen others use a combined-reference approach but what isn't clear to me is how it resolves multimapping reads (that align to both). The reason we decided to stick to aligning to both separately was partially because we didn't want to bear the burden of maintaining multiple combined pseudo references (we've had rat explants too) but if you build them on the fly it's not that big an issue.

roryk commented 8 years ago

Thanks @miika. Yeah, we build the indices on the fly so it would be simple to do. I was hoping that it will just end up using the unique kmers for each transcript and figure it out automatically.

mjafin commented 8 years ago

Well the Xenome tool is kind of based on a similar idea so might work.

roryk commented 8 years ago

Lior just tweeted an update to his kallisto for metagenomics paper too, which uses a similar approach I think: http://arxiv.org/abs/1510.07371

schelhorn commented 8 years ago

This biostars post, especially the second to last comment, may be of value here.

schelhorn commented 8 years ago

Thanks for the implementation, Rory. I'll run both express and sailfish on a couple of PDX samples to test the difference in disambiguation techniques (STAR genome/alignment-based disambiguation for express versus combined index for sailfish). I'll report what I find here. If the correlation is sufficiently high we should be fine with using sailfish. We'll still run STAR, though, for gene fusions.

mjafin commented 8 years ago

We might be able to combine hisat2 output with Manta for calling fusions in RNA-seq, looking into it.

roryk commented 8 years ago

Thanks, that would be great. Looking at the SEQC data there are tons of differences comparing STAR on hg38-noalt and hisat2 on hg38. Sailfish quantitation is almost perfectly correlated to each other on both of those builds; I'm not sure how much of that is due to the swap in aligners or the annotation so I'm rerunning hg38-noalt with hisat2 to compare the aligner differences.

schelhorn commented 8 years ago

@mjafin: sounds pretty wild, I love it. Using the same caller for transcriptomic and genomic SVs would be sweet, since one assay could confirm the other and we should be able to see a more complete picture of the underlying aberration. Also, having a transcriptomic biomarker for a genomic event might be useful for clinical applications. So go for it, I'd say.

mjafin commented 8 years ago

@schelhorn yep we've got a sample with both RNA and DNA-seq on FGFR3-TACC3. Manta calls the event in DNA no problem, now just need to understand if it's possible to tune Manta for Hisat2. @chapmanb mentioned it works for STAR, but then again STAR produces separate files for ordinary and fusion reads. In the Hisat2 alignments I can easily see the discordant alignments with soft clipping around the exon end where the fusion is.

chapmanb commented 8 years ago

Bringing @ctsa into the conversation. Chris, is the RNA-seq fusion detection in Manta specific to STAR alignments, or could it work with HISAT2 alignments? We're trying to get it working with build 38 and HISAT2 handles all of the alts correctly so we're starting to migrate over to it, and would love to be able to confirm fusions with RNA data.

ctsa commented 8 years ago

Thanks for pulling us in -- @felixschlesinger is handling the RNA fusion capability in Manta and should be able to comment.

felixschlesinger commented 8 years ago

I have only tested Manta for RNA fusions with STAR alignments, but it should work with other aligners that produce split read alignments following the BAM 'supplementary alignment' standard, similar to bwa-mem. I.e. for STAR we are using the output with chimeric reads in the main BAM.

If you only have softclipped reads, but not actual split alignments, the issue will be candidate generation. I.e. Manta can realign the softclipped reads to the fusion later, but only once it has a candidate fusion region to work with. In the absence of split reads those candidates could come from discordant pairs, but I have never tested that (since STAR it pretty good at finding split reads).

Also note that for our real fusion calling pipeline we are doing more scoring and filtering downstream of Manta (the typical 'ad-hoc' filters for pseudogene problems etc.) Dealing with some of those things in Manta itself is something I would like to do eventually, but probably not very soon.

So yes, I think it should be possible, but it will likely involve some work.

mjafin commented 8 years ago

@ctsa @felixschlesinger Thanks for checking in, much appreciated. AFAIK HISAT2 doesn't do split reads at the moment so we'd rely on discordants only for the moment. It's been brought up previously that STAR isn't very sensitive for fusions actually and when I tried it on the spike-in dataset associated with this paper http://www.biomedcentral.com/1471-2164/15/824 it only pulled 3 out of 9 or so in the strongest dilution.

Have you tried STAR + Manta on the spike-in data set?

felixschlesinger commented 8 years ago

We have used the same spike-ins for testing, but in different samples / libraries. We can call all 9. We are using the assembly and realignment logic of Manta, but the process relies on the aligner generating some evidence of a fusion first, so that Manta has a candidate to start from. That evidence can be discordant pairs or split reads. In my testing split reads from STAR have worked well. But discordant pairs also should in principle work on their own as well.

The RNA features of Manta are still under active development and not everything is merged into the release versions, but if you want to give it a first try, all you need is to run Manta with --rna and set

minDiploidVariantScore = 0 minPassGTScore = 0

in the config.ini file, since the scoring of RNA variants (in Manta itself) is still unreliable.

felixschlesinger commented 8 years ago

Btw. I am using these STAR options for 'chimeric' (i.e. split) reads (for 2x76bp data mostly): options["chimSegmentMin"] = "12"; options["chimJunctionOverhangMin"] = "12"; options["chimScoreDropMax"] = "30"; options["chimSegmentReadGapMax"] = "5"; options["chimScoreSeparation"] = "5"; options["chimOutType"] = "WithinBAM";

mjafin commented 8 years ago

@felixschlesinger Very good to hear you're not losing sensitivity with STAR. My experiments were with STAR + OncoFuse a little while ago already. Are the settings you refer to STAR-specific or do you reckon we could use these with HISAT2 too? Edit. I just reread and obviously your latter comment is STAR-specific

roryk commented 8 years ago

Thanks @felixschlesinger, that is super helpful. We've just been setting chimSegmentMin and OverhangMin to 15 and not using any of the other options. I'll swap our settings to use those. STAR 2.4.5 added two new chimeric settings:

Implemented --chimSegmentReadGapMax parameter which defines the maximum gap in the read sequence between chimeric segments. By default it is set to 0 to replicate the behavior of the previous STAR versions.
Implemented --chimFilter banGenomicN | None options to prohibit or allow the N characters in the vicinity of the chimeric junctions. By default, they are prohibited - the same behavior as in the previous versions.

do you think either of those would be useful to set?

roryk commented 8 years ago

Oh nevermind, I see you have chimSegmentReadGapMax in there.

schelhorn commented 8 years ago

I'd be happy to close this issue since its main objective (disambiguation with sailfish) has been reached. However, we have transgressed and moved over to the manta fusion topic and the hisat2 vs star aligner questions on hg38, which seem to be of interest to many people including myself. So I'll keep this issue open for a bit longer. Feel free to close it later, @roryk.

schelhorn commented 8 years ago

Regarding calling gene fusion events on both RNA and DNA data, there seems to be a new paper from WashU that introduces a novel method for that approach (INTEGRATE) as well as experimentally validated ground truth data set with DNA/RNA fusion events for HCC1395. Perhaps this is of use for @mjafin or @felixschlesinger for validating (hisat2, star) +manta combos.

From the abstract of the linked paper:

Currently, there are many computational tools that predict structural variations (SV) and gene fusions using whole genome (WGS) and transcriptome sequencing (RNA-seq) data separately. However, as both WGS and RNA-seq have their limitations when used independently we hypothesize that the orthogonal validation from integrating WGS and RNA-seq could generate a sensitive and specific approach for detecting high confidence gene fusion predictions. Fortunately, decreasing NGS costs have resulted in a growing quantity of patients with available genome and transcriptome sequencing data. Therefore, we developed a gene fusion discovery tool, INTEGRATE, that leverages both RNA-seq and WGS data to reconstruct gene fusion junctions and genomic breakpoints by split-read mapping. To evaluate INTEGRATE we compared it with eight additional gene fusion discovery tools using the well-characterized breast cell line HCC1395 and peripheral blood lymphocytes derived from the same patient (HCC1395BL). The predictions subsequently underwent a targeted validation leading to the discovery of 131 novel fusions in addition to the seven previously reported fusions. Overall, INTEGRATE only missed 6 out of the 138 validated gene fusions and had the highest accuracy of the nine tools evaluated. Additionally, we applied INTEGRATE to 62 breast cancer patients from the TCGA and found multiple recurrent gene fusions including a subset involving estrogen receptor. Taken together, INTEGRATE is a highly sensitive and accurate tool that is freely available for academic use.

mjafin commented 8 years ago

@schelhorn sounded very interesting until the "freely available for academic use" disclaimer. Bummer.

schelhorn commented 8 years ago

Yes, it's always the same game. In the beginning people are thinking about monetizing their wacky (I'm assuming) prototype research software and later on they see how much work that is and that there's no community supporting it. Still, if the experimental data is available (and there isn't much value to the paper if it isn't) then that would make a nice ground truth data set for the manta RNA mode. And who knows - perhaps commercial use will be free as well (they just didn't say).

schelhorn commented 8 years ago

Given that comments have stopped I am now closing this issue; I suggest that discussion of the fusion topic could be moved to a new issue instead that is linked to this one.

roryk commented 8 years ago

Thanks, let us know if the Sailfish-style disambiguation is not good.

schelhorn commented 8 years ago

By the way, sailfish index requires 16GB of real memory (top RES) for building a human-mouse index. That is significantly more then for the human genome along (~8GB as per the sailfish paper).

Also, currently sailfish sits in the cufflinks + samtools parallel environment in bcbio. Given that cufflinks is not run per default (I believe), this should change. I have changed the cufflinks resources in my bcbio_system.yaml to get more memory for sailfish, but that is not very transparent for other users...

roryk commented 8 years ago

Thanks @schelhorn, fixed it.