chapmanb / bcbio.variation

Toolkit to analyze genomic variation data, built on the GATK with Clojure
66 stars 15 forks source link

getEnd() / END KEY Error #13

Closed BioInfo closed 10 years ago

BioInfo commented 10 years ago

When running

variant-compare I get an error

org.broad.tribble.TribbleException: Badly formed variant context at location 1:948929; getEnd() was 948936 but this VariantContext contains an END key with value 948944

From the VCF:

1 948929 . GGCCCACA G 100.813537528125 PASS SAMPLE=NA12878;DP=48;END=948936;VP=6;AF=0.125 ;BIAS=2:0;REFBIAS=25:17;VARBIAS=6:0;PMEAN=29.3;PSTD=12.97;QUAL=39;QSTD=2.3;SBF=0.0766397001445299;ODDRATIO=0 GT:DP 0/1:4 8 1 948937 . GCCCACAG G 89.6981987750241 PASS SAMPLE=NA12878;DP=49;END=948944;VP=6;AF=0.122 45;BIAS=2:0;REFBIAS=26:16;VARBIAS=6:0;PMEAN=40.7;PSTD=5.75;QUAL=34.7;QSTD=3.5;SBF=0.159450278009751;ODDRATIO=0 GT:DP 0/1:4 9

chapmanb commented 10 years ago

Justin; Thanks for the report. It appears as if the END tags aren't getting updated/merged correctly. Out of curiosity, what program generates that VCF? I'll take a look at stripping END tags when running comparisons.

chapmanb commented 10 years ago

Justin; Thanks again for the report. It looks like this is a bug within GATK's LeftAlignVariants where it adjusts the variant and then gets the END tag out of sync. I pushed a fix to remove these END tags from indels, since they are redundant and intended more for non-exact structural variants, before passing through. A new test release with the fix is available (https://github.com/chapmanb/bcbio.variation/releases/tag/v0.1.3-SNAPSHOT-20131231) if you have a chance to confirm. Thanks much.

mjafin commented 10 years ago

Hi Brad, I'm seeing this error with the latest developmental version. The variant producing the error is as follows:

1       889158  .       GA      CC      163.0   PASS    ADJAF=0;AF=1;BIAS=0:2;DP=22;END=889159;GC=66.34;HIAF=1.00000;LSEQ=ACTGGGCCAC;MQ=60.00;MSI=0;ODDRATIO=0;PMEAN=28.3;PSTD=1;QD=7.41;QSTD=1;QUAL=35.3;REFBIAS=0:0;RSEQ=ACCTTGAGCA;SAMPLE=NA12878-1;SBF=1;SHIFT3=0;SN=16;VARBIAS=19:6;VP=25;EFF=DOWNSTREAM(MODIFIER||4468|||NOC2L|retained_intron|CODING|ENST00000469563||1),DOWNSTREAM(MODIFIER||648|||NOC2L|processed_transcript|CODING|ENST00000487214||1),INTRON(MODIFIER||||749|NOC2L|protein_coding|CODING|ENST00000327044|8|1),INTRON(MODIFIER|||||NOC2L|retained_intron|CODING|ENST00000477976|6|1)  GT:AF:DP:VP     1/1:1:25:25

Using bcbio.variation 0.1.5. The tool that inserts the END tag in there is a variant caller developed internally that we're hoping to release to the outside world soon enough. I guess I could just rip out the END-tags from the caller as a short term solution

mjafin commented 10 years ago

As an aside, the above variant is given as two consecutive SNPs in the NIST call sets, and reported as two consecutive SNPs by FreeBayes. I wonder why they're not given as an MNP in the NIST file? I'm probably missing something, but in the BAM file the variants are present in the same reads, supporting an MNP.

chapmanb commented 10 years ago

Miika; When you're using bcbio.variation, you need to have preclean: true in the configuration options (https://github.com/chapmanb/bcbio.variation#configuration-file) to enable the END tag stripping. I added a small test case and it will be stripped and proceed okay if you do this. This parameter turned off by default since it's resource intensive and within bcbio-nextgen we normally fix VCFs before passing to bcbio.variation. How are you feeding your inputs into bcbio.variation?

In terms of the representation, NIST and bcbio.variation both normalize MNPs into the individual SNPs and indels for comparison purposes. Since variant callers represent these differently, this is the only way to ensure clean comparisons. Additionally, most callers output an anchor reference base on MNPs like this, so this should be represented as:

1       889158  .       CGA      CCC

Practically, if you feed the output of your variant callers to vcfallelicprimitives this will take care of normalizing these and cleaning up the END tags. That's probably the best way to handle these and ensure the output is compatible with other downstream software.

mjafin commented 10 years ago

Thanks Brad, all makes a lot of sense! I'm using this as part of a bcbio-run, so not directly.

This is unrelated to bcbio.variation, but for the new caller I'm integrating, I shouldn't be running vcfallelicprimitives by default on all variant outputs, should I?

Further, I had a look at the freebayes folder (outside validate) and FreeBayes called the consecutive bases as independent SNPs rather than MNPs. I wonder if this is a FreeBayes thing or if bcbio enables vcfallelicprimitives on FreeBayes output somehow?

chapmanb commented 10 years ago

Miika; I do think it would be worth running vcfallelicprimitives on your output by default. We do this for FreeBayes:

https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/variation/freebayes.py#L80

You still retain the phasing information in the VCF and it makes the outputs easier to merge and compare.

mjafin commented 10 years ago

Thanks again Brad, I'll bear that in mind!

mjafin commented 10 years ago

Ah, one more quick q, if the MNP is reported as two independent SNPs, then won't the effect annotation treat them as two separate possible codon changes rather than the one real?

chapmanb commented 10 years ago

Miika; That's right. That is a downside of the allelic primitive approach. It's good to reconsider surrounding mutations in any hit of interest in general, but the only concern is that a single mutation alone would not trigger prioritization, but two together would. Honestly the more common case is that two nearby mutations mitigate predicted high effects: one causes an insertion out-of-phase and the second corrects it.