RNA-seq variant calling best practices

mjafin commented 10 years ago

Hi All, Collecting some information and opinions around RNA-seq variant calling here. The Broad institute have released best practices for germline variant calling in RNA-seq, http://gatkforums.broadinstitute.org/discussion/3891/best-practices-for-variant-calling-on-rnaseq.

What they're essentially doing is using a new restricted walker, SplitNCigarReads, to split splice junction reads (N in CIGAR) into multiple reads, as most variant callers (Gatk, FreeBayes too) seem to choke on N-reads. Further, they're doing some trimming of intronic tails, saying they're resulting from poor alignments, but in my opinion they could also be pre-mRNAs. Furthermore in variant calling they're generally lowering the quality thresholds.

I took a look at the output of SplitNCigarReads and it's actually producing improper BAM files but the Broad commented that these BAM files must only be used for HaplotypeCaller and not for anything else downstream: http://gatkforums.broadinstitute.org/discussion/comment/12950/#Comment_12950

For a more generic approach, I could maybe look into implementing a proper N-read splitter that would also set the FLAG correctly and fix some of the tags (like NM) for the split reads, if read splitting was seen as the way to go.

Any thoughts anyone?

chapmanb commented 10 years ago

Miika; It would be great to have RNA-seq variant calling support, both with the GATK practice and an unrestricted custom version. The trickiest parts are going to be validating and setting filtering parameters to avoid RNA-seq artifacts but that is something we can tackle after we have current best-practices in place. Thanks for starting the discussion.

mjafin commented 10 years ago

I've seen some 10% of SNPs in RNA-seq data being in putative RNA editing loci (as per http://rnaedit.com/). The Broad seem to have completely ignored things like these in their best practices (with focus only on splitting N-reads and lowering quality thresholds). To be continued..

roryk commented 10 years ago

Hi Miika,

Variant support for RNA-seq would be awesome, thanks for getting the discussion started. We've had a couple pings about it.

Great-- I am agreed that editing loci need to be filtered out of the SNP analysis, I'm surprised the Broad tools haven't implemented that. That might be something to add to the GEMINI database as a user-supplied annotation at the end so you can do selects by not being in the putative editing loci. There are some other gotcha areas of calling SNPs from RNA-seq data that probably need to get filtered out too, calling at junctions is one place, calling in regions with high self-chaining too.

roryk commented 10 years ago

At any rate having something set and then tweaking it like Brad said sounds like a good plan.

mjafin commented 10 years ago

Hi Rory, I got started on the N-read splitting here https://github.com/mjafin/splitNreads. The read splitting works and tries to produce actually valid BAM files, but doesn't do any of the intron tail trimming the Broad tool does.

I'm happy to look into introducing the required steps in bcbio. As far as I can see, one easy route would be to check within the variant2 pipeline if star is selected as the aligner and then run the alignments through the N-read splitting prior to sorting.

roryk commented 10 years ago

Thanks for getting going with this MIika, looks great. @chapmanb, what do you think? I remember we talked about how to slip in the RNA-seq variant calling but I forget if we came to any plan about it.

chapmanb commented 10 years ago

Rory and Miika; Sorry for the delay, trying to catch up. I think the best approach would be to hook up the RNA-seq pipeline to have optional variantcalling steps using the same configuration keywords as the standard variantcalling pipeline. This way splitting reads and other RNA-seq specific steps could be handled specifically within that pipeline, and germline/tumor specific things happen within the variant pipeline. We'll try to clear up some time to add the framework so the gaps can get filled in.

jmerkin commented 10 years ago

Hi Miika,

I've seen you posting around a bit (elsewhere, now here) regarding variant calling from RNA-seq, something I'm trying to do myself. If the thread is too old to revive, I can start a new one. I'm trying out your splitnreads and wanted to chime in on an issue that I can see regarding such an approach in general.

When converting rna-seq reads to pseudo-dna-seq, the splice site will very frequently be a new read end. I expect this will lead to a bias against calling variants or indels near the splice junction, since you won't have a window of good mapping around the putative variant after the reads are split since they were mapped as spliced RNA, and thus the window around the variant is in a different exon. As an atlernative to just splitting at Ns, what about something like a heuristic "tweaking" of the splitting? We could take the minimum of the actual overhang and maybe 5, 6, 10, or 20 bases and add reference sequence to the end, pulling the mapping and base quality from the actual mapping to the other exon. This shouldn't affect intronic variants since they'd be missed anyway coming from RNA-seq, though I think it would maybe hurt sensitivity to variants in exons containing an alternative splice site. Still, there are more exons than alternative splice sites, so I think it would be a net gain in sensitivity (could even test this).

Jason

mjafin commented 10 years ago

Hi Jason, Thanks for chiming in, good to see others interested in RNA-seq variant calling! I like your idea of tuning the pseudo-DNA-seq-ification. What I implemented was something very simple based on the ideas by the Broad institute for their best practice RNA-seq variant calling (which were hugely underwhelming). I believe in the Broad implementation they do something on top of what I do but probably nothing as elegant as what you suggest. I would be happy to work on this together and welcome any pull requests to: https://github.com/mjafin/splitNreads

However, the only initial reasons for implementing splitNreads were the choking of FreeBayes and Gatk on RNA-seq data (where they see complex spliced reads). Ideally, FreeBayes should support RNA-seq data natively without these hacks I think.

I see a few options:

Use the Broad best practice tool (you need a license from Appistry obviously for non-academic use)
Improve splitNreads
Use variant callers that support spliced reads. One such is 'VarDict' (https://github.com/AstraZeneca-NGS/VarDict/), our in-house open source variant caller now incorporated in bcbio. It is possible to run bcbio in RNA-seq mode first, to produce STAR aligned bam files (and all the RNA-seq quantification), then launch another run of bcbio in DNA-seq mode with vardict as the caller (and disabling realignment/recalibration).

Regardless, there will be loads of false positives so some filtering will be in order. I'd suggest filtering RNA-editing sites (http://rnaedit.com/) to begin with. A reference standard evaluation wouldn't hurt either (something like NA12878).

schelhorn commented 9 years ago

I'd also be interested in having RNA Seq variant calling implemented. Is this topic still of interest to the community, and may a single-run implementation of the vardict approach as a free alternative to the existing GATK implementation make sense?

mjafin commented 9 years ago

Hi @schelhorn, you can currently use STAR (or Tophat) aligned bam files from the RNA-seq run in a second pass of bcbio, selecting the variant calling pipeline and vardict as the variant caller, to call variants in RNA-seq. All Gatk steps must be disabled, including realignment and recalibration. It works but YMMV.

roryk commented 9 years ago

Hi Sven-Eric,

Another option is to stick variantcall: gatk under the algorithm section for your RNA-seq run, and it will call variants as part of your RNA-seq run using GATK. We haven't done very much work with RNA-seq variant calling, so if you have any suggestions about where we can improve, especially at filtering out the false positives, that would be super sweet.

schelhorn commented 9 years ago

@mjafin: Thanks, but that two-run approach was something I'd like to avoid, if possible. That's why I was asking if it would make sense to promote Vardict to a first class citizen for RNA Seq variant calling, similar to GATK. Perhaps the intention of my question wasn't clear. @roryk: True, but the GATK walker used in the current implementation is non-free, AFAIK. I would currently like to avoid shelling out $32k for the GATK commercial license.

mjafin commented 9 years ago

I'm not familiar with how @roryk implemented the direct Gatk based RNA-seq variant calling, but if it's possible to skip the Gatk spliced read splitting then VarDict should be able to produce variants fairly easily (it handles spliced reads intrinsically).

inti commented 9 years ago

Hi, I am interested on this as well. I am currently working as:

1 - HISAT->freebayes->VCF1-> re map with GSNAP with known variation
        '->StringTie->GTF1
2- GSNAP using VCF1 and GTF1->freebayes->VCF2
        '->StringTie->GTF2

I am now changing freebayes from the first step for VarDict. The first round of variant calling is done sample-wise and the second round is done using multisample calling on all position identified as likely variation on the first round.

I choose HISAT as first mapper because it is pretty fast (any comments on this welcomed), but could be replace by something else fast and good like SNAP, and GSNAP as second mapper because it is very sensitive and comes out well on evaluation. This was actually somehow inspired by the description of GSTRUCT,a GSNAP/GMAP software family not released yet but described on sup mat of Engström et al (2013 www.nature.com/articles/nmeth.2722). I actually have a cuff merge step in the middle to merge the StringTie assemblies with a known set of gene annotation, which I may (or may not) replace for PASA pipeline later (need to evaluate computational demand and actual improvements)

What option are you using for VarDict on RNA-seq @mjafin (if I can pick your brain)?

roryk commented 9 years ago

I kind of forgot about this-- @mjafin have you run VarDict on RNA-seq data? It would be sweet to have a free version implemented instead of having to use GATK.

schelhorn commented 9 years ago

Bcbio VarDict support for RNA Seq variant calling would be great, +1!

mjafin commented 9 years ago

Ugh, me too.. Yes VarDict runs on Star aligned BAMs out of the box. No Broad tools should be run on the bams between alignment and VarDict. I think @roryk if you can skip the bam splice splitting in the rnaseq pipeline then we could produce VarDict variants readily. We should use a 15 to 20 percent AF cutoff to get rid of some of the noise plus I should really look into annotating RNA editing sites after calling

schelhorn commented 8 years ago

Han et al., Cancer Cell. (2015) have recently published a RNA-editing study on most of the TCGA data (5,672 RNA-Seq samples, that is a little less than number of samples that were screened for genomic mutations in TCGA I believe). Their approach may be a little ad-hoc, but they also use the RADAR/rnaedit.com database. In any way, having RNA-Seq calling with vardict would be awesome, and annotation with (but not filtering of) RNA editing events would be a nice plus.

From the publication:

We obtained a comprehensive collection of ∼1.4 million A-to-I RNA editing sites from RADAR (http://rnaedit.com) (Ramaswami and Li, 2014). Note that these RNA editing sites were directly called from RNA-seq data from normal tissues and tumor samples, not from the comparison of editing profiles upon ADAR perturbation. We re-annotated them by ANNOVAR (Wang et al., 2010) and then filtered ∼4,000 sites annotated in dbSNP (version 137), COSMIC, and TCGA somatic mutations. On the basis of the RNA-seq reads mapped to the human reference genome (hg19), the editing level at a specific site in a given sample was calculated as the number of edited reads divided by the total number of reads (Ramaswami et al., 2013), and only the nucleotides with base quality ≥20 were used. Those editing sites with at least three edited reads in at least 3 samples per tissue type were considered to be detected RNA editing sites. To ensure adequate statistical power, we further identified the informative RNA editing sites among the detected RNA editing sites by requiring at least 30 samples (including normal samples if available) with coverage ≥ 10 in a tissue/tumor type. Thus, given a cancer type, the tumor samples and their related normal samples had the same set of informative RNA editing sites in our analysis. To further rule out the possibility of potential contamination due to undetected SNPs or somatic mutations, we obtained whole-genome sequencing data from International Cancer Genome Consortium and whole-exome sequencing data from TCGA for the cancer types we surveyed and assessed if there were some potential mutational signals at informative RNA editing sites. We found potential mutational signals at only 310 sites out of 112,572 across the 17 cancer types (0.28%) and excluded them from our analysis.

roryk commented 8 years ago

Thanks @schelhorn,

Totally agree, thanks for putting this on our RADAR. I'm slowly working through my list of stuff to implement for bcbio-nextgen and this is on the list.

schelhorn commented 8 years ago

Great; take your time, your work is much appreciated.

pwaltman commented 8 years ago

@mjafin, I hope you don't mind me re-opening an old thread, but when you said that VarDict could/would handle STAR aligned bams out of the box, can you explain what pre-processing - if any - VarDict requires be applied? Note, I have applied STAR by hand (NOT via the bcbio pipeline), using the 2-pass alignment, as recommended by the Broad; but having provided a GTF to STAR (as recommended by STAR's author). The exact steps I followed are here: https://groups.google.com/d/msg/rna-star/0YGQtuwcmTY/NxgAapIJAwAJ.

Am I correct in assuming, they would need to be sorted & de-duped? Should users also call STAR with the "--outSAMstrandField intronMotif" parameter (necessary to add the XS param to the bams that are generated)?

mjafin commented 8 years ago

@pwaltman VarDict is generally unfussy about bam files but it's generally best to avoid using any Broad tools being applied to BAMs. Sorting and indexing is necessary, de-duping is optional. We typically use the BAMs as they come from bcbio. For unstranded data bcbio uses the infronMotif setting, see https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/ngsalign/star.py#L67

ekg commented 8 years ago

@mjafin and others, freebayes should now handle the spliced reads. If this isn't the case please ping me over on the freebayes repo. The only problem I remember was during left alignment of indels. I don't have the commit I'd handy but I believe it has been resolved.

mjafin commented 8 years ago

Thanks Erik, that should give us plenty of free and commercial options now to run variant calling in RNA-seq. I'll close this issue even though not everything here has been necessarily implemented (RNA-editing)

bcbio / bcbio-nextgen

RNA-seq variant calling best practices #389