bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
985 stars 353 forks source link

RNA-seq variant calling on cancer samples #3023

Closed roryk closed 4 years ago

roryk commented 4 years ago

Via @mjsteinbaugh:

@naumenko-sa I'm looking into using bcbio for RNA-seq variant calling at the moment. Are the current recommendations still VarDict and MuTect2 for cancer cells? I'd like to add some additional documentation on how to properly do variant calling from the RNA rather than the DNA with bcbio. Rory added some notes in the HISTORY.md file, but it's still not totally clear to me how to set up the run YAML to do this.

See also: https://bcbio-nextgen.readthedocs.io/en/latest/contents/pipelines.html#cancer-variant-calling

roryk commented 4 years ago

Hi Mike,

We don't really have a great way of calling on cancer cells from RNA-seq in bcbio; folks have been doing it with VarDict, but I'm not super sure about the results. MuTect2 doesn't support RNA last time I checked. There's a workflow for MuTect1 (https://science.sciencemag.org/content/364/6444/eaaw0726) but it is based on GATK3 so you won't be able to use it unless you pay up. Part of the work we are going to do for the CZI grant is to 1) get RNA-MuTect in a form that other folks can use it (make a bioconda package, etc) and 2) see if we can support GATK4, so industry folks can actually use it.

Do you have normal samples or just the cancer samples? Is it from a primary tumor?

mjsteinbaugh commented 4 years ago

@roryk we're starting to look at this for cell lines and primary samples, so I'd be interested in helping script this out and adding some of this functionality into bcbio. Keep me posted and I'll see which samples we could potentially use for testing.

roryk commented 4 years ago

Thanks-- if you have DNA it would be a good first step to do the calling on the DNA and call with VarDict on the RNA and see where you are at as a baseline, using the DNA calls as the true variants.

mjsteinbaugh commented 4 years ago

Yeah we may generate some samples to do this, sequencing both DNA and RNA. That makes total sense.

lbeltrame commented 4 years ago

I've done that on RNA-seq using very non-hortodox methods, a while ago. MuTect 2 will happily digest the data anyway (if you do the CIGAR string dances at splice junctions), however it will report a very high number of indels, most of these due to actual splice events. That's where I got stuck, as it's very hard to determine what's "real" and what's not.

roryk commented 4 years ago

Thanks Luca for the feedback-- in bcbio we basically give up calling variants around splice junctions for RNA-seq variant calling, and we feed in the novel junctions identified by STAR/hisat2 when doing it and then ignore everything within 10 bases of a junction, as we found those to be 99% false positives. Do you think this would solve the problem with Mutect2 calling?

lbeltrame commented 4 years ago

I copied verbatim from there and it improved the situation but it still overcalled as far as I can tell. The lack of reliable datasets to test it was also problematic (after all, I only had my own data).

roryk commented 4 years ago

Hi Luca, thanks, that's super helpful.

lbeltrame commented 4 years ago

I forgot to mention one thing: according to a paper I can't find at the moment, using aligners like STAR and the like (but it did not test hisat2, IIRC) lead to mutated loci being missed. I tried to "fix" it by re-aligning with BWA all the unmapped reads from the first pass. It increased sensitivity somewhat, at the price of large operational complexity.

lbeltrame commented 4 years ago

Digging around I found this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5827345/ - I'm not sure if it's still applicable (I see things like UnifiedGenotyper there), but might be a good point for finding ideas on how to approach this.