Closed schelhorn closed 8 years ago
Hi @schelhorn,
You're right, as usual. I was thinking a way to do disambiguation for Sailfish would be to create an index of all of the species1 and species2 transcripts and quantitate them all together and then separate them into the species1/species2 counts after. What do you think about that?
Hi both, I've seen others use a combined-reference approach but what isn't clear to me is how it resolves multimapping reads (that align to both). The reason we decided to stick to aligning to both separately was partially because we didn't want to bear the burden of maintaining multiple combined pseudo references (we've had rat explants too) but if you build them on the fly it's not that big an issue.
Thanks @miika. Yeah, we build the indices on the fly so it would be simple to do. I was hoping that it would just end up using the unique kmers for each transcript and figure it out automatically.
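For illustration, the "quantitate together, then separate" step could look roughly like the sketch below. It assumes the combined FASTA was built with transcript names prefixed by a species tag (for example `species1|T1`); that naming scheme and the `split_by_species` helper are hypothetical, not what bcbio or Sailfish actually does.

```python
from collections import defaultdict


def split_by_species(quant, sep="|"):
    """Split combined-index estimates into one table per species.

    `quant` maps transcript names such as "species1|T1" to estimated counts.
    The "<species><sep><transcript>" naming is an assumption made when the
    combined reference was built; it is not a Sailfish convention.
    """
    per_species = defaultdict(dict)
    for name, count in quant.items():
        species, _, transcript = name.partition(sep)
        if not transcript:
            # No separator found, so the species cannot be determined.
            raise ValueError(f"no species prefix in {name!r}")
        per_species[species][transcript] = count
    return dict(per_species)


# Toy estimates from a hypothetical combined species1/species2 index.
combined = {"species1|T1": 10.0, "species1|T2": 0.5, "species2|T1": 3.0}
tables = split_by_species(combined)
```

How reads shared between the two species get apportioned is decided inside the quantifier; this step only partitions the resulting table.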
Well the Xenome tool is kind of based on a similar idea so might work.
Lior just tweeted an update to his kallisto for metagenomics paper too, which uses a similar approach I think: http://arxiv.org/abs/1510.07371
This biostars post, especially the second to last comment, may be of value here.
Thanks for the implementation, Rory. I'll run both express and sailfish on a couple of PDX samples to test the difference in disambiguation techniques (STAR genome/alignment-based disambiguation for express versus combined index for sailfish). I'll report what I find here. If the correlation is sufficiently high we should be fine with using sailfish. We'll still run STAR, though, for gene fusions.
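A minimal sketch of the kind of comparison meant here, assuming both tools' estimates have already been matched up per transcript into two equal-length lists. The log2 transform with a pseudocount is one common choice for expression data, not something prescribed in this thread, and `log_pearson` is a hypothetical helper.

```python
import math


def log_pearson(x, y, pseudocount=1.0):
    """Pearson correlation of log2(value + pseudocount) for two estimate lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    lx = [math.log2(v + pseudocount) for v in x]
    ly = [math.log2(v + pseudocount) for v in y]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    vx = sum((a - mx) ** 2 for a in lx)
    vy = sum((b - my) ** 2 for b in ly)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


# Toy per-transcript estimates from two quantifiers (made-up numbers).
r = log_pearson([0, 1, 3, 7], [0, 1, 3, 7])
```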
We might be able to combine hisat2 output with Manta for calling fusions in RNA-seq, looking into it.
Thanks, that would be great. Looking at the SEQC data there are tons of differences comparing STAR on hg38-noalt and hisat2 on hg38. Sailfish quantitation is almost perfectly correlated to each other on both of those builds; I'm not sure how much of that is due to the swap in aligners or the annotation so I'm rerunning hg38-noalt with hisat2 to compare the aligner differences.
@mjafin: sounds pretty wild, I love it. Using the same caller for transcriptomic and genomic SVs would be sweet, since one assay could confirm the other and we should be able to see a more complete picture of the underlying aberration. Also, having a transcriptomic biomarker for a genomic event might be useful for clinical applications. So go for it, I'd say.
@schelhorn yep we've got a sample with both RNA and DNA-seq on FGFR3-TACC3. Manta calls the event in DNA no problem, now just need to understand if it's possible to tune Manta for Hisat2. @chapmanb mentioned it works for STAR, but then again STAR produces separate files for ordinary and fusion reads. In the Hisat2 alignments I can easily see the discordant alignments with soft clipping around the exon end where the fusion is.
Bringing @ctsa into the conversation. Chris, is the RNA-seq fusion detection in Manta specific to STAR alignments, or could it work with HISAT2 alignments? We're trying to get it working with build 38 and HISAT2 handles all of the alts correctly so we're starting to migrate over to it, and would love to be able to confirm fusions with RNA data.
Thanks for pulling us in -- @felixschlesinger is handling the RNA fusion capability in Manta and should be able to comment.
I have only tested Manta for RNA fusions with STAR alignments, but it should work with other aligners that produce split read alignments following the BAM 'supplementary alignment' standard, similar to bwa-mem. I.e. for STAR we are using the output with chimeric reads in the main BAM.
If you only have softclipped reads, but not actual split alignments, the issue will be candidate generation. I.e. Manta can realign the softclipped reads to the fusion later, but only once it has a candidate fusion region to work with. In the absence of split reads those candidates could come from discordant pairs, but I have never tested that (since STAR is pretty good at finding split reads).
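To check which kind of evidence an aligner actually emits, one could classify SAM records along these lines. This is only a sketch: it reads plain SAM text, treats the supplementary flag (0x800) or an `SA:Z:` tag as a split alignment, and says nothing about how Manta itself generates candidates. The `evidence_type` helper is hypothetical.

```python
SUPPLEMENTARY = 0x800  # SAM flag bit for a supplementary alignment


def evidence_type(sam_line):
    """Classify one SAM record as 'split', 'softclip_only' or 'other'."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    cigar = fields[5]
    # Optional tags start after the 11 mandatory SAM columns.
    has_sa = any(tag.startswith("SA:Z:") for tag in fields[11:])
    if flag & SUPPLEMENTARY or has_sa:
        return "split"
    if "S" in cigar:
        return "softclip_only"
    return "other"


# Made-up records: one with an SA tag, one that is only soft-clipped.
split_read = "r1\t65\tchr4\t100\t60\t50M26S\tchr4\t300\t0\tACGT\tFFFF\tSA:Z:chr4,500,+,50S26M,60,0;"
clipped_read = "r2\t99\tchr4\t100\t60\t50M26S\t=\t300\t276\tACGT\tFFFF\tNM:i:0"
```

If nearly everything around a known fusion comes back as `softclip_only`, candidates would have to come from discordant pairs, as described above.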
Also note that for our real fusion calling pipeline we are doing more scoring and filtering downstream of Manta (the typical 'ad-hoc' filters for pseudogene problems etc.) Dealing with some of those things in Manta itself is something I would like to do eventually, but probably not very soon.
So yes, I think it should be possible, but it will likely involve some work.
@ctsa @felixschlesinger Thanks for checking in, much appreciated. AFAIK HISAT2 doesn't do split reads at the moment so we'd rely on discordants only for the moment. It's been brought up previously that STAR isn't very sensitive for fusions actually and when I tried it on the spike-in dataset associated with this paper http://www.biomedcentral.com/1471-2164/15/824 it only pulled 3 out of 9 or so in the strongest dilution.
Have you tried STAR + Manta on the spike-in data set?
We have used the same spike-ins for testing, but in different samples / libraries. We can call all 9. We are using the assembly and realignment logic of Manta, but the process relies on the aligner generating some evidence of a fusion first, so that Manta has a candidate to start from. That evidence can be discordant pairs or split reads. In my testing split reads from STAR have worked well, but discordant pairs should in principle also work on their own.
The RNA features of Manta are still under active development and not everything is merged into the release versions, but if you want to give it a first try, all you need is to run Manta with --rna and set
minDiploidVariantScore = 0
minPassGTScore = 0
in the config.ini file, since the scoring of RNA variants (in Manta itself) is still unreliable.
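If the config is generated on the fly, the two overrides could be applied with a small script like the sketch below. The section name `manta` is an assumption about the ini layout, and `relax_rna_scores` is a hypothetical helper; only the two key names and values come from the comment above.

```python
import configparser
import io


def relax_rna_scores(ini_text, section="manta"):
    """Return ini text with the two score thresholds set to 0.

    The section name is an assumption; adjust it to match the real file.
    """
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep camelCase keys instead of lowercasing them
    parser.read_string(ini_text)
    if not parser.has_section(section):
        parser.add_section(section)
    parser[section]["minDiploidVariantScore"] = "0"
    parser[section]["minPassGTScore"] = "0"
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Toy input standing in for an existing config.ini.
patched = relax_rna_scores("[manta]\nminDiploidVariantScore = 10\n")
```

Note that `configparser` drops comments when it rewrites a file, so editing the real config by hand may be preferable if its comments matter.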
Btw. I am using these STAR options for 'chimeric' (i.e. split) reads (for 2x76bp data mostly):
options["chimSegmentMin"] = "12";
options["chimJunctionOverhangMin"] = "12";
options["chimScoreDropMax"] = "30";
options["chimSegmentReadGapMax"] = "5";
options["chimScoreSeparation"] = "5";
options["chimOutType"] = "WithinBAM";
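For anyone wiring these into a pipeline, a sketch of turning the settings above into STAR command-line arguments follows. The values are copied from the comment; the `star_chimeric_args` helper is hypothetical, and the remaining STAR arguments (genome directory, input reads and so on) are deliberately left out.

```python
# Chimeric-read settings quoted above (tuned mostly for 2x76bp data).
options = {
    "chimSegmentMin": "12",
    "chimJunctionOverhangMin": "12",
    "chimScoreDropMax": "30",
    "chimSegmentReadGapMax": "5",
    "chimScoreSeparation": "5",
    "chimOutType": "WithinBAM",
}


def star_chimeric_args(opts):
    """Flatten an options dict into ['--name', 'value', ...] for a STAR call."""
    args = []
    for name, value in opts.items():
        args.extend([f"--{name}", str(value)])
    return args


chimeric_args = star_chimeric_args(options)
```

The resulting list can be appended to the rest of the STAR command before passing it to `subprocess.run`.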
@felixschlesinger Very good to hear you're not losing sensitivity with STAR. My experiments were with STAR + OncoFuse a little while ago already. Are the settings you refer to STAR-specific or do you reckon we could use these with HISAT2 too? Edit. I just reread and obviously your latter comment is STAR-specific
Thanks @felixschlesinger, that is super helpful. We've just been setting chimSegmentMin and OverhangMin to 15 and not using any of the other options. I'll swap our settings to use those. STAR 2.4.5 added two new chimeric settings:
Implemented --chimSegmentReadGapMax parameter which defines the maximum gap in the read sequence between chimeric segments. By default it is set to 0 to replicate the behavior of the previous STAR versions.
Implemented --chimFilter banGenomicN | None options to prohibit or allow the N characters in the vicinity of the chimeric junctions. By default, they are prohibited - the same behavior as in the previous versions.
do you think either of those would be useful to set?
Oh nevermind, I see you have chimSegmentReadGapMax in there.
I'd be happy to close this issue since its main objective (disambiguation with sailfish) has been reached. However, we have digressed and moved over to the manta fusion topic and the hisat2 vs star aligner questions on hg38, which seem to be of interest to many people including myself. So I'll keep this issue open for a bit longer. Feel free to close it later, @roryk.
Regarding calling gene fusion events on both RNA and DNA data, there seems to be a new paper from WashU that introduces a novel method for that approach (INTEGRATE) as well as an experimentally validated ground truth data set with DNA/RNA fusion events for HCC1395. Perhaps this is of use for @mjafin or @felixschlesinger for validating (hisat2, star) + manta combos.
From the abstract of the linked paper:
Currently, there are many computational tools that predict structural variations (SV) and gene fusions using whole genome (WGS) and transcriptome sequencing (RNA-seq) data separately. However, as both WGS and RNA-seq have their limitations when used independently we hypothesize that the orthogonal validation from integrating WGS and RNA-seq could generate a sensitive and specific approach for detecting high confidence gene fusion predictions. Fortunately, decreasing NGS costs have resulted in a growing quantity of patients with available genome and transcriptome sequencing data. Therefore, we developed a gene fusion discovery tool, INTEGRATE, that leverages both RNA-seq and WGS data to reconstruct gene fusion junctions and genomic breakpoints by split-read mapping. To evaluate INTEGRATE we compared it with eight additional gene fusion discovery tools using the well-characterized breast cell line HCC1395 and peripheral blood lymphocytes derived from the same patient (HCC1395BL). The predictions subsequently underwent a targeted validation leading to the discovery of 131 novel fusions in addition to the seven previously reported fusions. Overall, INTEGRATE only missed 6 out of the 138 validated gene fusions and had the highest accuracy of the nine tools evaluated. Additionally, we applied INTEGRATE to 62 breast cancer patients from the TCGA and found multiple recurrent gene fusions including a subset involving estrogen receptor. Taken together, INTEGRATE is a highly sensitive and accurate tool that is freely available for academic use.
@schelhorn sounded very interesting until the "freely available for academic use" disclaimer. Bummer.
Yes, it's always the same game. In the beginning people are thinking about monetizing their wacky (I'm assuming) prototype research software and later on they see how much work that is and that there's no community supporting it. Still, if the experimental data is available (and there isn't much value to the paper if it isn't) then that would make a nice ground truth data set for the manta RNA mode. And who knows - perhaps commercial use will be free as well (they just didn't say).
Given that comments have stopped I am now closing this issue; I suggest that discussion of the fusion topic could be moved to a new issue instead that is linked to this one.
Thanks, let us know if the Sailfish-style disambiguation is not good.
By the way, sailfish index requires 16GB of real memory (top RES) for building a human-mouse index. That is significantly more than for the human genome alone (~8GB as per the sailfish paper).
Also, currently sailfish sits in the cufflinks + samtools parallel environment in bcbio. Given that cufflinks is not run by default (I believe), this should change. I have changed the cufflinks resources in my bcbio_system.yaml to get more memory for sailfish, but that is not very transparent for other users...
Thanks @schelhorn, fixed it.
Thanks, @roryk, for setting up sailfish. However, it appeared to me that it isn't a complete replacement for express yet since you are taking the input fastqs rather than the transcriptome-aligned+disambiguated ones that express took. Since the disambiguation functionality is already in place for express, would it make sense to trigger transcriptome alignment and disambiguation also for sailfish iff disambiguation is configured? I know that doing a full alignment beforehand is rather pointless for sailfish (since it is alignment free), but our disambiguation approach is still based on alignment rather than pseudo-alignment so that is the way we would have to go until someone comes up with a better species filter.