broadinstitute / picard

A set of command line tools (in Java) for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.
https://broadinstitute.github.io/picard/
MIT License
983 stars 368 forks source link

Picard LiftoverVcf Error "Duplicate allele added to VariantContext" #1353

Closed bhanugandham closed 5 years ago

bhanugandham commented 5 years ago

User Report:

Hi, I've run into a frustrating problem: when lifting over certain VCFs from b37 to hg38, I'm running into LiftoverVcf exiting with errors like Exception in thread "main" java.lang.IllegalArgumentException: Duplicate allele added to VariantContext

I've got it isolated to an example failing variant, but I'm at a loss for how to fix or prevent this error, since they seem scattered among VCFs I've generated with GATK 3.8-1-0-gf15c1c3ef GenotypeGVCFs.

Picard version 2.20.2

$ java -jar picard.jar LiftoverVcf --version
2.20.2-SNAPSHOT

Failing variant example:

$ java -jar picard.jar LiftoverVcf C=/gpfs/gpfs2/cooperlab/resources/liftover_chain_files/b37ToHg38.over.chain.gz I=bad_simple.vcf O=test.vcf R=hg38.fa REJECT=test_b37.reject.vcf
INFO    2019-06-22 00:35:10 LiftoverVcf 

********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
**********    LiftoverVcf -C /gpfs/gpfs2/cooperlab/resources/liftover_chain_files/b37ToHg38.over.chain.gz -I bad_simple.vcf -O test.vcf -R hg38.fa -REJECT test_b37.reject.vcf
**********

00:35:10.862 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gpfs/gpfs1/home/jlawlor/test_liftover/round_4/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Sat Jun 22 00:35:10 CDT 2019] LiftoverVcf INPUT=bad_simple.vcf OUTPUT=test.vcf CHAIN=/gpfs/gpfs2/cooperlab/resources/liftover_chain_files/b37ToHg38.over.chain.gz REJECT=test_b37.reject.vcf REFERENCE_SEQUENCE=hg38.fa    WARN_ON_MISSING_CONTIG=false LOG_FAILED_INTERVALS=true WRITE_ORIGINAL_POSITION=false WRITE_ORIGINAL_ALLELES=false LIFTOVER_MIN_MATCH=1.0 ALLOW_MISSING_FIELDS_IN_HEADER=false RECOVER_SWAPPED_REF_ALT=false TAGS_TO_REVERSE=[AF] TAGS_TO_DROP=[MAX_AF] DISABLE_SORT=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Sat Jun 22 00:35:10 CDT 2019] Executing as jlawlor@hpc0005 on Linux 3.10.0-327.3.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_102-b14; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.20.2-SNAPSHOT
INFO    2019-06-22 00:35:11 LiftoverVcf Loading up the target reference genome.
INFO    2019-06-22 00:35:21 LiftoverVcf Lifting variants over and sorting (not yet writing the output file.)
[Sat Jun 22 00:35:21 CDT 2019] picard.vcf.LiftoverVcf done. Elapsed time: 0.18 minutes.
Runtime.totalMemory()=6996623360
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.IllegalArgumentException: Duplicate allele added to VariantContext: T
    at htsjdk.variant.variantcontext.VariantContext.makeAlleles(VariantContext.java:1493)
    at htsjdk.variant.variantcontext.VariantContext.<init>(VariantContext.java:379)
    at htsjdk.variant.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:579)
    at htsjdk.variant.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:573)
    at picard.util.LiftoverUtils.liftVariant(LiftoverUtils.java:117)
    at picard.vcf.LiftoverVcf.doWork(LiftoverVcf.java:426)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)

with VCF bad_simple.vcf

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##DRAGENCommandLine=<ID=dragen,Version="SW: 01.011.269.3.2.8, HW: 01.011.269",Date="Tue May 21 12:16:34 CDT 2019",CommandLineOptions="-f -r /staging/reference/GRCh37/GRCh37.fa.k_21.f_16.m_149 --fastq-list /staging/fastq/SL385519_fastqs/SL385519_list.csv --output-directory /staging/bam/ --output-file-prefix SL385519 --enable-duplicate-marking true --enable-map-align-output true --enable-variant-caller true --vc-sample-name SL385519 --vc-emit-ref-confidence GVCF --dbsnp /staging/reference/GRCh37/dbsnp_135.b37.vcf">
##FILTER=<ID=DRAGENHardQUAL,Description="Set if true:QUAL < 10.4139">
##FILTER=<ID=LowDepth,Description="Set if true:DP < 1">
##FILTER=<ID=LowGQ,Description="Set if true:GQ = 0">
##FILTER=<ID=PloidyConflict,Description="Genotype call from variant caller not consistent with chromosome ploidy">
##FILTER=<ID=VQSRTrancheINDEL99.00to99.90,Description="Truth sensitivity tranche level for INDEL model at VQS Lod: -12.1756 <= x < -1.3496">
##FILTER=<ID=VQSRTrancheINDEL99.90to100.00+,Description="Truth sensitivity tranche level for INDEL model at VQS Lod < -1409.7427">
##FILTER=<ID=VQSRTrancheINDEL99.90to100.00,Description="Truth sensitivity tranche level for INDEL model at VQS Lod: -1409.7427 <= x < -12.1756">
##FILTER=<ID=VQSRTrancheSNP99.00to99.90,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -4.6589 <= x < 0.343">
##FILTER=<ID=VQSRTrancheSNP99.90to100.00+,Description="Truth sensitivity tranche level for SNP model at VQS Lod < -39592.3492">
##FILTER=<ID=VQSRTrancheSNP99.90to100.00,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -39592.3492 <= x < -4.6589">
##FILTER=<ID=lod_fstar,Description="Variant does not meet likelihood threshold (default threshold is 6.3)">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allelic frequency for alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=F1R2,Number=R,Type=Integer,Description="Count of reads in F1R2 pair orientation supporting each allele">
##FORMAT=<ID=F2R1,Number=R,Type=Integer,Description="Count of reads in F2R1 pair orientation supporting each allele">
##FORMAT=<ID=FT,Number=.,Type=String,Description="Genotype-level filter">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Normalized likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Phred-scaled posterior probabilities for genotypes as defined in the VCF specification">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=ICNT,Number=2,Type=Integer,Description="Counts of INDEL informative reads based on the reference confidence model">
##FORMAT=<ID=LOD,Number=1,Type=Float,Description="Per-sample variant LOD score">
##FORMAT=<ID=MB,Number=4,Type=Integer,Description="Per-sample component statistics to detect mate bias">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=PP,Number=G,Type=Integer,Description="Phred-scaled posterior genotype probabilities">
##FORMAT=<ID=PRI,Number=G,Type=Float,Description="Phred-scaled prior probabilities for genotypes">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias">
##FORMAT=<ID=SPL,Number=.,Type=Integer,Description="Normalized, Phred-scaled likelihoods for SNPs based on the reference confidence model">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
##INFO=<ID=FGT,Number=0,Type=Flag,Description="ForceGT variant call">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=FractionInformativeReads,Number=1,Type=Float,Description="The fraction of informative reads out of the total reads">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=LOD,Number=1,Type=Float,Description="Variant LOD score">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=NEGATIVE_TRAIN_SITE,Number=0,Type=Flag,Description="This variant was used to build the negative training set of bad variants">
##INFO=<ID=NML,Number=0,Type=Flag,Description="Normal (non-ForceGT) variant call">
##INFO=<ID=POSITIVE_TRAIN_SITE,Number=0,Type=Flag,Description="This variant was used to build the positive training set of good variants">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=R2_5P_bias,Number=1,Type=Float,Description="Score based on mate bias and distance from 5 prime end">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
##INFO=<ID=VQSLOD,Number=1,Type=Float,Description="Log odds of being a true variant versus being false under the trained gaussian mixture model">
##INFO=<ID=culprit,Number=1,Type=String,Description="The annotation which was the worst performing in the Gaussian mixture model, likely the reason why the variant was filtered out">
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
##contig=<ID=6,length=171115067>
##contig=<ID=7,length=159138663>
##contig=<ID=8,length=146364022>
##contig=<ID=9,length=141213431>
##contig=<ID=10,length=135534747>
##contig=<ID=11,length=135006516>
##contig=<ID=12,length=133851895>
##contig=<ID=13,length=115169878>
##contig=<ID=14,length=107349540>
##contig=<ID=15,length=102531392>
##contig=<ID=16,length=90354753>
##contig=<ID=17,length=81195210>
##contig=<ID=18,length=78077248>
##contig=<ID=19,length=59128983>
##contig=<ID=20,length=63025520>
##contig=<ID=21,length=48129895>
##contig=<ID=22,length=51304566>
##contig=<ID=X,length=155270560>
##contig=<ID=Y,length=59373566>
##contig=<ID=MT,length=16569>
##contig=<ID=GL000207.1,length=4262>
##contig=<ID=GL000226.1,length=15008>
##contig=<ID=GL000229.1,length=19913>
##contig=<ID=GL000231.1,length=27386>
##contig=<ID=GL000210.1,length=27682>
##contig=<ID=GL000239.1,length=33824>
##contig=<ID=GL000235.1,length=34474>
##contig=<ID=GL000201.1,length=36148>
##contig=<ID=GL000247.1,length=36422>
##contig=<ID=GL000245.1,length=36651>
##contig=<ID=GL000197.1,length=37175>
##contig=<ID=GL000203.1,length=37498>
##contig=<ID=GL000246.1,length=38154>
##contig=<ID=GL000249.1,length=38502>
##contig=<ID=GL000196.1,length=38914>
##contig=<ID=GL000248.1,length=39786>
##contig=<ID=GL000244.1,length=39929>
##contig=<ID=GL000238.1,length=39939>
##contig=<ID=GL000202.1,length=40103>
##contig=<ID=GL000234.1,length=40531>
##contig=<ID=GL000232.1,length=40652>
##contig=<ID=GL000206.1,length=41001>
##contig=<ID=GL000240.1,length=41933>
##contig=<ID=GL000236.1,length=41934>
##contig=<ID=GL000241.1,length=42152>
##contig=<ID=GL000243.1,length=43341>
##contig=<ID=GL000242.1,length=43523>
##contig=<ID=GL000230.1,length=43691>
##contig=<ID=GL000237.1,length=45867>
##contig=<ID=GL000233.1,length=45941>
##contig=<ID=GL000204.1,length=81310>
##contig=<ID=GL000198.1,length=90085>
##contig=<ID=GL000208.1,length=92689>
##contig=<ID=GL000191.1,length=106433>
##contig=<ID=GL000227.1,length=128374>
##contig=<ID=GL000228.1,length=129120>
##contig=<ID=GL000214.1,length=137718>
##contig=<ID=GL000221.1,length=155397>
##contig=<ID=GL000209.1,length=159169>
##contig=<ID=GL000218.1,length=161147>
##contig=<ID=GL000220.1,length=161802>
##contig=<ID=GL000213.1,length=164239>
##contig=<ID=GL000211.1,length=166566>
##contig=<ID=GL000199.1,length=169874>
##contig=<ID=GL000217.1,length=172149>
##contig=<ID=GL000216.1,length=172294>
##contig=<ID=GL000215.1,length=172545>
##contig=<ID=GL000205.1,length=174588>
##contig=<ID=GL000219.1,length=179198>
##contig=<ID=GL000224.1,length=179693>
##contig=<ID=GL000223.1,length=180455>
##contig=<ID=GL000195.1,length=182896>
##contig=<ID=GL000212.1,length=186858>
##contig=<ID=GL000222.1,length=186861>
##contig=<ID=GL000200.1,length=187035>
##contig=<ID=GL000193.1,length=189789>
##contig=<ID=GL000194.1,length=191469>
##contig=<ID=GL000225.1,length=211173>
##contig=<ID=GL000192.1,length=547496>
##reference=file:///gpfs/gpfs1/myerslab/reference/genomes/bwa-0.7.8/GRCh37.fa
##bcftools_viewVersion=1.7+htslib-1.4.1
##bcftools_viewCommand=view -h batch_65.vcf.gz; Date=Fri Jun 21 23:56:28 2019
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
1   143283417   .   ACCG    A,* 512.03  VQSRTrancheINDEL99.00to99.90    .

Successful variant example It doesn't seem to be solely a problem with indels or * alternates, because this variant (from nearby) has no problems:

java -jar picard.jar LiftoverVcf C=/gpfs/gpfs2/cooperlab/resources/liftover_chain_files/b37ToHg38.over.chain.gz I=good_simple.vcf O=test.vcf R=hg38.fa REJECT=test_b37.reject.vcf
INFO    2019-06-22 00:34:30 LiftoverVcf 

********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
**********    LiftoverVcf -C /gpfs/gpfs2/cooperlab/resources/liftover_chain_files/b37ToHg38.over.chain.gz -I good_simple.vcf -O test.vcf -R hg38.fa -REJECT test_b37.reject.vcf
**********

00:34:30.734 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gpfs/gpfs1/home/jlawlor/test_liftover/round_4/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Sat Jun 22 00:34:30 CDT 2019] LiftoverVcf INPUT=good_simple.vcf OUTPUT=test.vcf CHAIN=/gpfs/gpfs2/cooperlab/resources/liftover_chain_files/b37ToHg38.over.chain.gz REJECT=test_b37.reject.vcf REFERENCE_SEQUENCE=hg38.fa    WARN_ON_MISSING_CONTIG=false LOG_FAILED_INTERVALS=true WRITE_ORIGINAL_POSITION=false WRITE_ORIGINAL_ALLELES=false LIFTOVER_MIN_MATCH=1.0 ALLOW_MISSING_FIELDS_IN_HEADER=false RECOVER_SWAPPED_REF_ALT=false TAGS_TO_REVERSE=[AF] TAGS_TO_DROP=[MAX_AF] DISABLE_SORT=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Sat Jun 22 00:34:30 CDT 2019] Executing as jlawlor@hpc0005 on Linux 3.10.0-327.3.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_102-b14; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.20.2-SNAPSHOT
INFO    2019-06-22 00:34:30 LiftoverVcf Loading up the target reference genome.
INFO    2019-06-22 00:34:41 LiftoverVcf Lifting variants over and sorting (not yet writing the output file.)
INFO    2019-06-22 00:34:41 LiftoverVcf Processed 1 variants.
INFO    2019-06-22 00:34:41 LiftoverVcf 0 variants failed to liftover.
INFO    2019-06-22 00:34:41 LiftoverVcf 0 variants lifted over but had mismatching reference alleles after lift over.
INFO    2019-06-22 00:34:41 LiftoverVcf 0.0000% of variants were not successfully lifted over and written to the output.
INFO    2019-06-22 00:34:41 LiftoverVcf liftover success by source contig:
INFO    2019-06-22 00:34:41 LiftoverVcf 1: 1 / 1 (100.0000%)
INFO    2019-06-22 00:34:41 LiftoverVcf lifted variants by target contig:
INFO    2019-06-22 00:34:41 LiftoverVcf chr21: 1
WARNING 2019-06-22 00:34:41 LiftoverVcf 0 variants with a swapped REF/ALT were identified, but were not recovered.  See RECOVER_SWAPPED_REF_ALT and associated caveats.
INFO    2019-06-22 00:34:41 LiftoverVcf Writing out sorted records to final VCF.
[Sat Jun 22 00:34:41 CDT 2019] picard.vcf.LiftoverVcf done. Elapsed time: 0.18 minutes.

with VCF good_simple.vcf (same header as previous example)

##bcftools_viewCommand=view -h 65_1.vcf; Date=Sat Jun 22 00:14:22 2019
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
1   143283452   .   A   ACACG,* 275.1   VQSRTrancheINDEL99.00to99.90    .

Resources I'm using:

  1. chain file from https://raw.githubusercontent.com/broadinstitute/gatk/master/scripts/funcotator/data_sources/gnomAD/b37ToHg38.over.chain
  2. reference from ftp://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
  3. Sequence dictionaries from picard CreateSequenceDictionary

File provenance: GRCh37 GVCFs generated by DRAGEN variant caller in --vc-emit-ref-confidence GVCF mode, joint-genotyped with other samples with GATK 3.8-1-0-gf15c1c3ef

I've also tried:

  1. Lifting over from b37 -> hg19 (successful) and then hg19 -> hg38 (same failure) using the chain files and hg19 reference from UCSC; all of the above using the reference from the GATK Resource Bundles.
  2. Adjusting LIFTOVER_MIN_MATCH which results in no variants successfully mapping (preventing the java error)
  3. Adjusting RECOVER_SWAPPED_REF which has no effect on this error
  4. CrossMap (v. 3.4 runs into python errors; v. 3.3 has mapping problems with b37 when "chr" isn't used in chromosome names)

Any advice would be appreciated! Thanks. :)

This Issue was generated from your forums

fleharty commented 5 years ago

@bhanugandham I believe this was fixed by https://github.com/broadinstitute/picard/pull/1339

This PR has been merged, but we haven't had a Picard release since then. I would recommend creating a new release, (@snovod do you think we could get a new release?).

Alternatively we could ask the user to create a new jar directly from source, that should work.

bhanugandham commented 5 years ago

@fleharty Thank you for the update. Should I close this issue ticket then?

fleharty commented 5 years ago

@bhanugandham There is a new release of picard at, https://github.com/broadinstitute/picard/releases/tag/2.20.3 that should resolve this issue.

bhanugandham commented 5 years ago

great! thank you @fleharty