HKU-BAL / Clair3

Clair3 - Symphonizing pileup and full-alignment for high-performance long-read variant calling
243 stars 26 forks source link

gvcf and gatk #151

Closed shinlin77 closed 1 year ago

shinlin77 commented 1 year ago

I've run_clair3.sh etc. with the --gvcf flag. However, when I run the output phased_merge_output.vcf.gz in to gatk CombineGVCFs, I get the following error

A USER ERROR has occurred: The list of input alleles must contain as an allele but that is not the case at position 10108; please use the Haplotype Caller with gVCF output to generate appropriate records

Is there a problem with the clair3 output? Thanks.

tuannguyen8390 commented 1 year ago

Hi @shinlin77,

To clarify is this about the step where you run to merge multiple samples together ?

gatk CombineGVCFs -R ${genome} --variant sample1.gvcf.gz --variant sample2.gvcf.gz -O Combine.gvcf.gz

Your comment mention using phased_merge_output.vcf.gz. CombineGVCFs only works with gvcf.gz file I believe ?

shinlin77 commented 1 year ago

Oh OK, I see. I'm supposed to use merge_output.gvcf.gz. Is the phase info recoverable from this output? Or is there supposed to be such thing as a phased_merge_output.gcvf.gz?

tuannguyen8390 commented 1 year ago

Depends on what you are looking to do, I guess? Are you doing population scale phasing? After CombineGVCF & GenotypeGVCF I believe you can rephase with Eagle/Beagle if you are just interested in SNP. If you have structural variant data, you can then split the VCF file per individual & co-phasing of SNP/SV with LongPhase.

Modernism-01 commented 7 months ago

Hi @tuannguyen8390,

I finished the gvcf calling with below command. However, when I run the gatk CombineGVCFs, it was failed, and I did not find that others suffer from the similar problem. could you help to give me some suggestions?

the Clair3 command:

singularity exec \
  -B ${INPUT_DIR}:${INPUT_DIR} \
  -B ${OUTPUT_DIR}:${OUTPUT_DIR} \
  $IMAGE_PATH/clair3_latest.sif \
  /opt/bin/run_clair3.sh \
  --bam_fn=${INPUT_DIR}/${BAM} \
  --ref_fn=${REF} \
  --threads=${THREADS} \
  --platform="ont" \
  --model_path="/opt/models/r941_prom_sup_g5014" \
  --output=${OUTPUT_DIR}/${SLURM_JOB_NAME} \
  --sample_name=${SLURM_JOB_NAME} \
  --print_ref_calls \
  --gvcf \
  --no_phasing_for_fa \
  --include_all_ctgs

the gatk CombineGVCFs command:

gatk CombineGVCFs \
   -R $REF \
   --variant $gvcf1 \
   --variant $gvcf2 \
   --variant $gvcf3 \
   --variant $gvcf4 \
   --variant $gvcf5 \
   -O $output1 && echo "Combine completed!" || { echo "Combine failed!"; exit 1; }

gatk --java-options "-Xmx64g" GenotypeGVCFs \
   -R $REF \
   -V $output1 \
   -O $output2 && echo "GenotypeGVCFs completed!" ||  { echo "GenotypeGVCFs failed!"; exit 1; }

the log file for running gatk CombineGVCFs:

15:22:44.435 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/public/home/zenglingsen/04.software/02.Anaconda/Or/envs/pytorch/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so 15:22:45.443 INFO CombineGVCFs - ------------------------------------------------------------ 15:22:45.449 INFO CombineGVCFs - The Genome Analysis Toolkit (GATK) v4.5.0.0 15:22:45.449 INFO CombineGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/ 15:22:45.450 INFO CombineGVCFs - Executing as zenglingsen@comput75 on Linux v3.10.0-862.el7.x86_64 amd64 15:22:45.450 INFO CombineGVCFs - Java runtime: OpenJDK 64-Bit Server VM v17.0.3-internal+0-adhoc..src 15:22:45.451 INFO CombineGVCFs - Start Date/Time: 2024年3月16日 CST 下午3:22:44 15:22:45.451 INFO CombineGVCFs - ------------------------------------------------------------ 15:22:45.451 INFO CombineGVCFs - ------------------------------------------------------------ 15:22:45.453 INFO CombineGVCFs - HTSJDK Version: 4.1.0 15:22:45.453 INFO CombineGVCFs - Picard Version: 3.1.1 15:22:45.453 INFO CombineGVCFs - Built for Spark Version: 3.5.0 15:22:45.454 INFO CombineGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2 15:22:45.455 INFO CombineGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false 15:22:45.455 INFO CombineGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true 15:22:45.456 INFO CombineGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false 15:22:45.456 INFO CombineGVCFs - Deflater: IntelDeflater 15:22:45.456 INFO CombineGVCFs - Inflater: IntelInflater 15:22:45.457 INFO CombineGVCFs - GCS max retries/reopens: 20 15:22:45.457 INFO CombineGVCFs - Requester pays: disabled 15:22:45.458 INFO CombineGVCFs - Initializing engine 15:22:45.894 INFO FeatureManager - Using codec VCFCodec to read file file:///public/home/zenglingsen/01.data/01.ONT_data/01.ONT_20X_fastq_SNP_calling/04.clair3/05.gvcf_sorted/AW.sorted.gvcf.gz 15:22:46.196 INFO FeatureManager - Using codec VCFCodec to read file file:///public/home/zenglingsen/01.data/01.ONT_data/01.ONT_20X_fastq_SNP_calling/04.clair3/05.gvcf_sorted/BKS.sorted.gvcf.gz 15:22:46.320 INFO FeatureManager - Using codec VCFCodec to read file file:///public/home/zenglingsen/01.data/01.ONT_data/01.ONT_20X_fastq_SNP_calling/04.clair3/05.gvcf_sorted/BMA.sorted.gvcf.gz 15:22:46.453 INFO FeatureManager - Using codec VCFCodec to read file file:///public/home/zenglingsen/01.data/01.ONT_data/01.ONT_20X_fastq_SNP_calling/04.clair3/05.gvcf_sorted/LW.sorted.gvcf.gz 15:22:46.670 INFO FeatureManager - Using codec VCFCodec to read file file:///public/home/zenglingsen/01.data/01.ONT_data/01.ONT_20X_fastq_SNP_calling/04.clair3/05.gvcf_sorted/PTR.sorted.gvcf.gz 15:22:48.033 INFO CombineGVCFs - Done initializing engine 15:22:48.060 INFO ProgressMeter - Starting traversal 15:22:48.060 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute 15:22:48.271 WARN ReferenceConfidenceVariantContextMerger - Detected invalid annotations: When trying to merge variant contexts at location NC_010443.5:112 the annotation F=true was not a numerical value and was ignored 15:22:58.069 INFO ProgressMeter - NC_010443.5:510839 0.2 1914000 11480555.8 15:23:08.069 INFO ProgressMeter - NC_010443.5:1071766 0.3 4058000 12169132.3 .......... .......... .......... 02:55:06.284 INFO ProgressMeter - NC_010462.3:43284166 692.3 8628619000 12463632.1 02:55:31.047 INFO CombineGVCFs - Shutting down engine [2024年3月17日 CST 上午2:55:32] org.broadinstitute.hellbender.tools.walkers.CombineGVCFs done. Elapsed time: 692.81 minutes. Runtime.totalMemory()=7079985152 *org.broadinstitute.hellbender.exceptions.GATKException: Exception thrown at NW_018084777.1:92144 [VC /public/home/zenglingsen/01.data/01.ONT_data/01.ONT_20X_fastq_SNP_calling/04.clair3/05.gvcf_sorted/AW.sorted.gvcf.gz @ NW_018084777.1:92144-92146 Q0.00 of type=SYMBOLIC alleles=[A, ] attr={END=92146} GT=GT:GQ:MIN_DP:PL 0/0:4:1:0,3,29 filters= at org.broadinstitute.hellbender.engine.MultiVariantWalker.lambda$traverse$1(MultiVariantWalker.java:145) at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197) at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:179) at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197) at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133) at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1845) at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509) at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596) at org.broadinstitute.hellbender.engine.MultiVariantWalker.traverse(MultiVariantWalker.java:136) at org.broadinstitute.hellbender.engine.MultiVariantWalkerGroupedOnStart.traverse(MultiVariantWalkerGroupedOnStart.java:165) at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1098) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:149) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:198) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:217) at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:166) at org.broadinstitute.hellbender.Main.mainEntry(Main.java:209) at org.broadinstitute.hellbender.Main.main(Main.java:306) Caused by: java.lang.ArrayIndexOutOfBoundsException: arraycopy: length -3 is negative** at java.base/java.lang.System.arraycopy(Native Method) at java.base/java.util.Arrays.copyOfRange(Arrays.java:3823) at org.broadinstitute.hellbender.tools.walkers.CombineGVCFs.createIntermediateVariants(CombineGVCFs.java:228) at org.broadinstitute.hellbender.tools.walkers.CombineGVCFs.apply(CombineGVCFs.java:174) at org.broadinstitute.hellbender.engine.MultiVariantWalkerGroupedOnStart.apply(MultiVariantWalkerGroupedOnStart.java:133) at org.broadinstitute.hellbender.engine.MultiVariantWalkerGroupedOnStart.apply(MultiVariantWalkerGroupedOnStart.java:108) at org.broadinstitute.hellbender.engine.MultiVariantWalker.lambda$traverse$1(MultiVariantWalker.java:139) ... 21 moreUsing GATK jar /public/home/zenglingsen/04.software/02.Anaconda/Or/envs/pytorch/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar Running: java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /public/home/zenglingsen/04.software/02.Anaconda/Or/envs/pytorch/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar CombineGVCFs -R /public/home/zenglingsen/01.data/03.Reference/GCF_000003025.6_Sscrofa11.1_genomic.fna --variant /public/home/zenglingsen/01.data/01.ONT_data/01.ONT_20X_fastq_SNP_calling/04.clair3/05.gvcf_sorted/AW.sorted.gvcf.gz --variant /public/home/zenglingsen/01.data/01.ONT_data/01.ONT_20X_fastq_SNP_calling/04.clair3/05.gvcf_sorted/BKS.sorted.gvcf.gz --variant /public/home/zenglingsen/01.data/01.ONT_data/01.ONT_20X_fastq_SNP_calling/04.clair3/05.gvcf_sorted/BMA.sorted.gvcf.gz --variant /public/home/zenglingsen/01.data/01.ONT_data/01.ONT_20X_fastq_SNP_calling/04.clair3/05.gvcf_sorted/LW.sorted.gvcf.gz --variant /public/home/zenglingsen/01.data/01.ONT_data/01.ONT_20X_fastq_SNP_calling/04.clair3/05.gvcf_sorted/PTR.sorted.gvcf.gz -O /public/home/zenglingsen/01.data/01.ONT_data/01.ONT_20X_fastq_SNP_calling/04.clair3/05.gvcf_sorted/benchmark.merged.gvcf.gz

the version of gatk is: The Genome Analysis Toolkit (GATK) v4.5.0.0 HTSJDK Version: 4.1.0 Picard Version: 3.1.1

the details for NW_018084777.1:92144-92146 position in AW gvcf file:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT AW

NW_018084777.1 92144 . A 0 . END=92146 GT:GQ:MIN_DP:PL 0/0:4:1:0,3,29

tuannguyen8390 commented 6 months ago

I'm ... not too sure actually. I don't recall running into any of similar problem unfortunately. I used v4.3.0.0 and clair 1.0.4 and didn't see this so wondering if this is something introduced later ?

aquaskyline commented 6 months ago

The position record looks fine. Java complained 'Caused by: java.lang.ArrayIndexOutOfBoundsException: arraycopy: length -3 is negative'. It's hard to pinpoint where the problem is, but the call stack seems to suggest the problem happened in an early stage of GATK (CombineGVCFs.createIntermediateVariants(CombineGVCFs.java:228)). The length -3 is negative error can either be bug in GATK or some mistakes in the VCF or BAM. A google search led to this thread suggesting that BAM can cause such a problem in Picard.