bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

GATK4 GatherVcfs errors caused by Structural Variants Misplaced in VarDicts vcf files. #3321

Open sfpacman opened 4 years ago

sfpacman commented 4 years ago

Version info

bcbio_nextgen.py -n 16 sample_yaml

Your sample configuration file:

upload:
  dir: /mnt/speed/yup_temp/bcbio/test/fastqs
details:
  - files: [NA12878_R1.fastq.gz,NA12878_R2.fastq.gz]
    description: wgs_NA12878
    analysis: 'variant'
    genome_build: hg38
    lane: wgs_NA12878
    algorithm:
      aligner: bwa
      trim_reads: false
      mark_duplicates: true
      recalibrate: true
      realign: false
      variantcaller: [gatk-haplotype,freebayes,vardict,strelka2]
      svcaller: [gatk-cnv,manta]
      ensemble:
       numpass: 1
       use_filtered: false
      tools_off: [gemini]
      tools_on: [qualimap_full, picard]

Observed behavior

Error message or bcbio output: It seems that the structural (inversion) variants in the VarDict VCFs are processed improperly, which causes GatherVcfs to fail.

INFO    2020-08-09 16:51:33     GatherVcfs      Checking inputs.
INFO    2020-08-09 16:51:33     GatherVcfs      Checking file headers and first records to ensure compatibility.
ERROR   2020-08-09 16:51:34     GatherVcfs      There was a problem with gathering the INPUT.java.lang.IllegalArgumentException: First record in file /mnt/speed/yup_temp/bcbio/test/fastqs/vardict/chr1/wgs_NA12878-chr1_218612889_236431639.vcf.gz is not after first record in previous file /mnt/speed/yup_temp/bcbio/test/fastqs/vardict/chr1/wgs_NA12878-chr1_202392014_218612588.vcf.gz
[Sun Aug 09 16:51:34 PDT 2020] picard.vcf.GatherVcfs done. Elapsed time: 0.04 minutes.
Runtime.totalMemory()=691404800
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Tool returned: 1
Using GATK jar /mnt/speed/yup_temp/bcbio/anaconda/share/gatk4-4.1.7.0-0/gatk-package-4.1.7.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xms681m -Xmx41694m -XX:+UseSerialGC -Djava.io.tmpdir=/mnt/speed/yup_temp/bcbio/test/fastqs/bcbiotx/tmplt0f8bfy -jar /mnt/speed/yup_temp/bcbio/anaconda/share/gatk4-4.1.7.0-0/gatk-package-4.1.7.0-local.jar GatherVcfs -I /mnt/speed/yup_temp/bcbio/test/fastqs/vardict/wgs_NA12878-files.list -O /mnt/speed/yup_temp/bcbio/test/fastqs/bcbiotx/tmp1noi1tat/wgs_NA12878.vcf.gz
' returned non-zero exit status 4
[wgs_NA12878-chr1_218612889_236431639.vcf.gz](https://github.com/bcbio/bcbio-nextgen/files/5048672/wgs_NA12878-chr1_218612889_236431639.vcf.gz)
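
The error message above states the condition that failed: the first record of each input file must come after the first record of the previous file. A minimal sketch of that comparison is given below, so the per-region files can be pre-checked before gathering. This is not Picard's implementation; the function name and the input shape (a list of `(filename, chrom, pos)` tuples in gather order, plus the contig order from the VCF header) are assumptions made for illustration.

```python
def first_record_order_violations(first_records, contig_order):
    """Return (current_file, previous_file) pairs that break the ordering.

    first_records: list of (filename, chrom, pos) tuples, one per input
        file, in the order the files would be gathered (assumed input shape).
    contig_order: list of contig names in reference order.
    """
    rank = {contig: i for i, contig in enumerate(contig_order)}
    violations = []
    for prev, cur in zip(first_records, first_records[1:]):
        prev_key = (rank[prev[1]], prev[2])
        cur_key = (rank[cur[1]], cur[2])
        # Mirrors the wording of the error: the current file's first
        # record must sort strictly after the previous file's first record.
        if cur_key <= prev_key:
            violations.append((cur[0], prev[0]))
    return violations
```

Any pair returned here points at a file whose first record sits earlier in the genome than expected, which is what the misplaced inversion calls produce.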

Expected behavior

The variants should be placed in the correct VarDict VCF file.

Log files

Additional context

The VCF in question is attached. The first two variants in wgs_NA12878-chr1_218612889_236431639.vcf.gz are causing the trouble. I manually fixed one variant in a different VarDict VCF, but it turns out this is not an isolated incident.

chr1    16725245        .       G       <INN>   161     PASS    SAMPLE=wgs_NA12878;TYPE=INV;DP=143;VD=48;AF=0.3357;BIAS=2:2;REFBIAS=39:55;VARBIAS=19:29;PMEAN=37.8;PSTD=1;QUAL=29;QSTD=1;SBF=0.85856;ODDRATIO=1.08167;MQ=60;SN=47;HIAF=0.3406;ADJAF=0.3357;SHIFT3=0;MSI=0;MSILEN=0;NM=0.2;HICNT=47;HICOV=138;LSEQ=AAAGCCCGCCGGCTTCTGCA;RSEQ=ACATGTCAGAGCTCTCTTTG;DUPRATE=0;SVTYPE=INV;SVLEN=218051196;SPLITREAD=48;SPANPAIR=0   GT:DP:VD:AD:AF:RD:ALD   0/1:143:48:94,48:0.3357:39,55:19,29
chr1    113416470       .       T       <INN>   69      PASS    SAMPLE=wgs_NA12878;TYPE=INV;DP=27;VD=5;AF=0.1852;BIAS=2:2;REFBIAS=17:9;VARBIAS=2:3;PMEAN=10.6;PSTD=1;QUAL=30;QSTD=1;SBF=0.34986;ODDRATIO=2.73376;MQ=25;SN=10;HIAF=0.1724;ADJAF=0.1852;SHIFT3=0;MSI=0;MSILEN=0;NM=3;HICNT=5;HICOV=29;LSEQ=ATATATTAATTATATATAAT;RSEQ=TATATAAATATATATATTAA;DUPRATE=0;SVTYPE=INV;SVLEN=106544302;SPLITREAD=4;SPANPAIR=3     GT:DP:VD:AD:AF:RD:ALD   0/1:27:5:26,5:0.1852:17,9:2,3
chr1    218612571       .       GCAAAAGCTTTGTTTCTTTTTTTTTTTTTTTTTTTTTTTTTGAGACGGAGTCTTGCTCTGTCGCCCAGGCTGGAGTGCAGTGGCGGGATCTCGGCTCACTGCAAGCTCCGCCTCCCGGGTTCACGCCATTCTCCTGCCTCAGCCTCCCAAGCAGCTGGGACTACAGGCACCCGCCACCACGCCCGGCTAATTTTTTGTATTTTTAGTAGAGACGGGGTTTCACCGTTTTAGCCGGGATGGTCTCGATCTCCTGACCTCGTGATCCGCCCGCCTCGGCCTCCCAAAGTGCTGGGATTACAGGCGTGAGCCACCGCGCCCGGC       G       130     PASS    SAMPLE=wgs_NA12878;TYPE=Deletion;DP=8;VD=22;AF=1;BIAS=2:2;REFBIAS=2:1;VARBIAS=9:13;PMEAN=32;PSTD=1;QUAL=29.3;QSTD=1;SBF=0.56478;ODDRATIO=2.76781;MQ=59.4;SN=44;HIAF=0.9565;ADJAF=1;SHIFT3=17;MSI=2;MSILEN=1;NM=1;HICNT=22;HICOV=23;LSEQ=GGCATTTAAAGACCGCAAAG;RSEQ=CAAAAGCTTTGTTTCTTCCA;DUPRATE=0;SPLITREAD=17;SPANPAIR=0        GT:DP:VD:AD:AF:RD:ALD   1/1:8:22:3,22:1:2,1:9,13
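
Because the problem recurs across files, a small script can help locate the affected records. The sketch below flags records whose start position lies outside the region encoded in the file name. The helper names, the assumed file name pattern (`<sample>-<chrom>_<start>_<end>.vcf.gz`, as seen in this issue, with no hyphens in the contig name) and the `slack` default are illustrative assumptions, not part of bcbio.

```python
import gzip
import re

# Assumed naming scheme, taken from the files in this issue:
# <sample>-<chrom>_<start>_<end>.vcf.gz
REGION_RE = re.compile(
    r"-(?P<chrom>[^-]+)_(?P<start>\d+)_(?P<end>\d+)\.vcf(?:\.gz)?$"
)


def region_from_filename(name):
    """Extract (chrom, start, end) from a region-split VCF file name."""
    match = REGION_RE.search(name)
    if match is None:
        raise ValueError(f"no region found in {name!r}")
    return match.group("chrom"), int(match.group("start")), int(match.group("end"))


def misplaced_records(vcf_lines, chrom, start, end, slack=1000):
    """Yield (chrom, pos) for records that start outside the region.

    slack is a hypothetical tolerance in bases, so that an indel starting
    slightly upstream of the region boundary is not reported.
    """
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        rec_chrom, pos = fields[0], int(fields[1])
        if rec_chrom != chrom or pos < start - slack or pos > end + slack:
            yield rec_chrom, pos


def scan(path):
    """Return the misplaced records found in one bgzipped/gzipped VCF."""
    chrom, start, end = region_from_filename(path)
    with gzip.open(path, "rt") as handle:
        return list(misplaced_records(handle, chrom, start, end))
```

Calling `scan("wgs_NA12878-chr1_218612889_236431639.vcf.gz")` on the attached file would be expected to report the two inversion records at the top of the file, while the deletion starting a few hundred bases before the region start falls within the default slack.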

naumenko-sa commented 4 years ago

Hi @sfpacman!

Could you please try analysis: variant2 instead of analysis: variant? See https://bcbio-nextgen.readthedocs.io/en/latest/contents/somatic_variants.html?highlight=variant2#review-parameters-in-the-yaml-file

The next issue might be gatk-cnv. In this use case (WGS SV detection) we work with manta, lumpy, delly, wham, and cnvkit; gatk-cnv is for the tumor/normal or tumor + panel of normals use case.

Sergey

sfpacman commented 4 years ago

@naumenko-sa Thanks for the reply. I tried your suggestion: I switched to variant2 and re-ran with only vardict. I still have the same problem, with structural variants placed in the wrong VCF file.

naumenko-sa commented 4 years ago

Thanks! Could you attach bcbio-nextgen-commands.log to see where exactly it breaks?

Please also take a look here: https://github.com/bcbio/bcbio-nextgen/issues/3332

S

sfpacman commented 4 years ago

The log is attached (last 620 lines): 202009_bcbio-nextgen-commands.log