broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.68k stars 587 forks source link

GATK 4.1.4 for joint calling of hifi and illumina reads #8372

Open YanisChrys opened 1 year ago

YanisChrys commented 1 year ago

Hello,

I am trying to use gatk/4.1.4.1 and picard/2.22.0 to do joint variant calling of PacBio HiFi reads and Illumina short-reads. My pipeline is basically the recommended one (without base recalibration) where after HaplotypeCaller, I use GenomicsDBImport and GenotypeGVCFs and end up with the vcf file.

The HiFI reads have normal hifi phred scores up to 93 and the illumina are encoded with phred33 quality scores.

The output of the joint variant calling is strange and it causes issues when I try to use it with tools like plink and it makes me wonder if the joint variant calling went badly as well. This is even after variant filtration where I filter for QUAL<30 and QD<2

Here is an example of what the first few lines of my vcf calls look like. Notice the QUAL column that looks like Num,Num:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Illumina_sample0        sample1      sample2      sample_3  sample4     sample5     sample6      hifi_sample
Chr1    10608   .       GCA     G       174,34  PASS    AC=2;AF=0.143;AN=14;BaseQRankSum=0.00;DP=55;ExcessHet=3.3579;FS=7.404;MLEAC=2;MLEAF=0.143;MQ=57.97;MQRankSum=0.00;QD=15.85;ReadPosRankSum=1.00;SOR=3.611    GT:AD:DP:GQ:PL  ./.:0,0:0:.:0,0,0       0/1:3,3:6:99:117,0,117  0/0:5,0:5:15:0,15,141   0/1:3,2:5:71:71,0,120   0/0:3,0:3:9:0,9,113 0/0:7,0:7:21:0,21,190   0/0:5,0:5:15:0,15,175   0/0:20,0:20:60:0,60,900
Chr1    10616   .       G       A       903,42  PASS    AC=8;AF=0.571;AN=14;BaseQRankSum=0.00;DP=66;ExcessHet=0.7136;FS=9.277;MLEAC=9;MLEAF=0.643;MQ=59.80;MQRankSum=0.00;QD=20.08;ReadPosRankSum=1.00;SOR=0.251    GT:AD:DP:GQ:PL  ./.:0,0:0:.:0,0,0       0/0:10,0:10:30:0,30,367 1/1:0,5:5:15:141,15,0   0/1:2,3:5:75:110,0,75   0/0:3,0:3:9:0,9,113 1/1:0,8:8:24:232,24,0   1/1:0,7:7:21:224,21,0   0/1:14,6:20:99:210,0,570

I believe this can be reproduced with any illumina+hifi samples. Are you aware of this problem? Could the snp calling be wrong because of the different phred score encoding? What do you suggest to do in these cases?

I have read your response in this issue: https://gatk.broadinstitute.org/hc/en-us/community/posts/360072716972-Variant-calling-with-PacBio-HiFi-reads and have used minimap2 in a similar way. (minimap2 -acyYL --secondary=no --MD --eqx -x asm20 -k 20)

YanisChrys commented 12 months ago

Hello, this is still an issue for me as I am still getting two values for the QUAL column even when not using hifi reads but using paired and unpaired illumina together, instead. Is there a proposed solution to this?

astridboehne commented 11 months ago

Here the error log Using GATK jar /share/scientific_bin/gatk/4.1.4.1/gatk-package-4.1.4.1-local.jar Running: java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /share/scientific_bin/gatk/4.1.4.1/gatk-package-4.1.4.1-local.jar IndexFeatureFile -I output/called/final/allsites.filtered.vcf 00:57:08.257 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/share/scientific_bin/gatk/4.1.4.1/gatk-package-4.1.4.1-local.jar!/com/intel/gkl/native/libgkl_compression.so Sep 11, 2023 12:57:08 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine INFO: Failed to detect whether we are running on Google Compute Engine. 00:57:08.789 INFO IndexFeatureFile - ------------------------------------------------------------ 00:57:08.789 INFO IndexFeatureFile - The Genome Analysis Toolkit (GATK) v4.1.4.1 00:57:08.789 INFO IndexFeatureFile - For support and documentation go to https://software.broadinstitute.org/gatk/ 00:57:08.790 INFO IndexFeatureFile - Executing as ychrysostomakis@compute-0-3.local on Linux v3.10.0-1160.53.1.el7.x86_64 amd64 00:57:08.790 INFO IndexFeatureFile - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_231-b11 00:57:08.790 INFO IndexFeatureFile - Start Date/Time: 11. September 2023 00:57:07 MESZ 00:57:08.790 INFO IndexFeatureFile - ------------------------------------------------------------ 00:57:08.790 INFO IndexFeatureFile - ------------------------------------------------------------ 00:57:08.790 INFO IndexFeatureFile - HTSJDK Version: 2.21.0 00:57:08.790 INFO IndexFeatureFile - Picard Version: 2.21.2 00:57:08.790 INFO IndexFeatureFile - HTSJDK Defaults.COMPRESSION_LEVEL : 2 00:57:08.790 INFO IndexFeatureFile - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false 00:57:08.790 INFO IndexFeatureFile - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true 00:57:08.790 INFO IndexFeatureFile - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false 00:57:08.790 INFO IndexFeatureFile - Deflater: IntelDeflater 00:57:08.790 INFO IndexFeatureFile - Inflater: IntelInflater 00:57:08.790 INFO IndexFeatureFile - GCS max retries/reopens: 20 00:57:08.790 INFO IndexFeatureFile - Requester pays: disabled 00:57:08.790 INFO IndexFeatureFile - Initializing engine 00:57:08.790 INFO IndexFeatureFile - Done initializing engine 00:57:09.166 INFO FeatureManager - Using codec VCFCodec to read file file:///share/pool/CompGenomVert/phoxy_snp_calling/VC_HIFI/output/called/final/allsites.filtered.vcf 00:57:09.250 INFO ProgressMeter - Starting traversal 00:57:09.250 INFO ProgressMeter - Current Locus Elapsed Minutes Records Processed Records/Minute 00:57:09.251 INFO IndexFeatureFile - Shutting down engine [11. September 2023 00:57:09 MESZ] org.broadinstitute.hellbender.tools.IndexFeatureFile done. Elapsed time: 0.02 minutes. Runtime.totalMemory()=2114453504 java.lang.NumberFormatException: For input string: "175,24" at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043) at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110) at java.lang.Double.parseDouble(Double.java:538) at htsjdk.variant.vcf.VCFUtils.parseVcfDouble(VCFUtils.java:262) at htsjdk.variant.vcf.AbstractVCFCodec.parseQual(AbstractVCFCodec.java:620) at htsjdk.variant.vcf.AbstractVCFCodec.parseVCFLine(AbstractVCFCodec.java:422) at htsjdk.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:384) at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:328) at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:48) at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:70) at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:37) at htsjdk.tribble.AbstractFeatureCodec.decodeLoc(AbstractFeatureCodec.java:43) at org.broadinstitute.hellbender.utils.codecs.ProgressReportingDelegatingCodec.decodeLoc(ProgressReportingDelegatingCodec.java:46) at htsjdk.tribble.index.IndexFactory$FeatureIterator.readNextFeature(IndexFactory.java:689) at htsjdk.tribble.index.IndexFactory$FeatureIterator.(IndexFactory.java:606) at htsjdk.tribble.index.IndexFactory.createDynamicIndex(IndexFactory.java:446) at org.broadinstitute.hellbender.tools.IndexFeatureFile.createAppropriateIndexInMemory(IndexFeatureFile.java:118) at org.broadinstitute.hellbender.tools.IndexFeatureFile.doWork(IndexFeatureFile.java:75) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210) at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163) at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206) at org.broadinstitute.hellbender.Main.main(Main.java:292) Full Traceback (most recent call last): File "/home/ychrysostomakis/.local/lib/python3.9/site-packages/snakemake/executors/init.py", line 2578, in run_wrapper run( File "/share/pool/CompGenomVert/phoxy_snp_calling/VC_HIFI/joint_snp_calling.smk", line 422, in __rule_index_feature_file File "/home/ychrysostomakis/.local/lib/python3.9/site-packages/snakemake/shell.py", line 300, in new raise sp.CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command 'set -euo pipefail;
module load gatk/4.1.4.1 gatk IndexFeatureFile -I output/called/final/allsites.filtered.vcf' returned non-zero exit status 3.