AstraZeneca-NGS / VarDictJava

VarDict Java port
MIT License
127 stars 56 forks source link

VardictJava having extensive runtime on very highly expressed RNA regions #39

Closed smboyle closed 8 years ago

smboyle commented 8 years ago

We are running into cases where VardictJava is taking days to finish running on regions with very high coverage in RNA (highly expressed genes, coverage >700,000x). Have you worked through similar problems in the past? Do you have suggestions on how to improve performance on these very deep regions? Mutect, for example, has an option (dcov) where you can limit the max depth considered by the caller, greatly speeding up variant calling on deep regions.

mjafin commented 8 years ago

Hi @smboyle sorry about the continued problems. We've seen this in some data sets where expression is super high (like insulin in pancreatic islets) but haven't got a direct solution in place right now, apologies for that.

VarDict essentially taps into the samtools API (java htslib if you're using Java VarDict) so we're limited to what samtools view offers. Samtools has a downsampling option but it downsamples all regions, not just the ones with high coverage. This is the -Z option in VarDict. It would in theory be possibly to build logic externally to first quantify expression and then adaptively downsample in regions of high expresion but it would be outside the core scope of VarDict itself.

You could also experiment with turning realignment off to speed things up with the obvious caveat the it would be off for all the other regions too. Another option might be to slit the regions very narrow around genes you expect to be expressed highly or possibly use marking of duplicates (VarDict can use BAMs that have duplicates marked or mark duplicates itself) to see if it lowers the number of reads in high coverage regions.

Sorry about not having a more direct solution for your problem. If you have any ideas or suggestions on this we welcome pull requests.

As an aside, by default MuTect downsamples all real life data sets and if turned off its performance is pretty horrible.

mjafin commented 8 years ago

@ohofmann just pointed out VariantBam (http://bioinformatics.oxfordjournals.org/content/early/2016/03/15/bioinformatics.btw111) that could be used to cap maximum coverage. @chapmanb mentioned that the code is still very much developmental and segfaults every now and then. Something to look into for us.

smboyle commented 8 years ago

@mjafin: We have implemented an erosion script upstream of the vardict pipeline which reduces the coverage in these very highly covered regions, as you have suggested. This fixed the run time and drastically sped up analysis on RNA samples! Thanks for the assistance, Sean

mjafin commented 8 years ago

@smboyle: nice, thanks for letting us know. If you wanted to share your approach with the community we could blog about it at some point. Sounds like something the RNA-seq variant calling community could use

roryk commented 8 years ago

This would be great to include in the bcbio-nextgen RNA-seq variant calling step if the process is available.

mjafin commented 7 years ago

@smboyle did you end up "productionising" your downsampling approach and if so, would you still consider sharing with the community?