PacificBiosciences / paraphase

HiFi-based caller for highly similar paralogous genes
BSD 3-Clause Clear License
23 stars 4 forks source link

paraphase output VCFs cannot be parsed by htslib tools #16

Closed gabeng closed 2 months ago

gabeng commented 4 months ago

Tools that rely on htslib like IGV and GATK return errors when loading VCF files generated by paraphase 3.0 0. The example is HG001. HG001_CBS_variants.vcf.gz

IGV

WARNING [Feb. 11,2024 17:46] [Variant] NumberFormatException on line: chr21 43052880    .   CAAAAAAAAAA C   .   PASS    HapIDs=CBS_hap1,CBS_hap2    GT:DP:AD    ./.:39,56:.,32 
 Attempting to reformat by replacing ,., with ,0,
SEVERE [Feb. 11,2024 17:46] [TrackLoader] Error parsing LineIteratorImpl(SynchronousLineReader) at the first record, for input source: HG001_CBS_variants.vcf
SEVERE [Feb. 11,2024 17:46] [TrackLoader] htsjdk.tribble.TribbleException$MalformedFeatureFile: Error parsing LineIteratorImpl(SynchronousLineReader) at the first record, for input source: HG001_CBS_variants.vcf
    at htsjdk@3.0.0/htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.readNextRecord(TribbleIndexedFeatureReader.java:397)
    at htsjdk@3.0.0/htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.<init>(TribbleIndexedFeatureReader.java:343)
    at htsjdk@3.0.0/htsjdk.tribble.TribbleIndexedFeatureReader.iterator(TribbleIndexedFeatureReader.java:310)
    at org.igv/org.broad.igv.feature.tribble.TribbleReaderWrapper.iterator(TribbleReaderWrapper.java:82)
    at org.igv/org.broad.igv.track.TribbleFeatureSource$NonIndexedFeatureSource.<init>(TribbleFeatureSource.java:338)
    at org.igv/org.broad.igv.track.TribbleFeatureSource.getFeatureSource(TribbleFeatureSource.java:134)
    at org.igv/org.broad.igv.track.TribbleFeatureSource.getFeatureSource(TribbleFeatureSource.java:72)
    at org.igv/org.broad.igv.track.TrackLoader.loadVCF(TrackLoader.java:359)
    at org.igv/org.broad.igv.track.TrackLoader.loadTribbleFile(TrackLoader.java:455)
    at org.igv/org.broad.igv.track.TrackLoader.load(TrackLoader.java:205)
    at org.igv/org.broad.igv.ui.IGV.load(IGV.java:1249)
    at org.igv/org.broad.igv.ui.IGV.lambda$load$5(IGV.java:1281)
    at org.igv/org.broad.igv.util.LongRunningTask.call(LongRunningTask.java:72)
    at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.lang.NumberFormatException: For input string: "39,56"
    at java.base/java.lang.NumberFormatException.forInputString(Unknown Source)
    at java.base/java.lang.Integer.parseInt(Unknown Source)
    at java.base/java.lang.Integer.parseInt(Unknown Source)
    at htsjdk@3.0.0/htsjdk.variant.vcf.AbstractVCFCodec.createGenotypeMap(AbstractVCFCodec.java:820)
    at htsjdk@3.0.0/htsjdk.variant.vcf.AbstractVCFCodec$LazyVCFGenotypesParser.parse(AbstractVCFCodec.java:121)
    at htsjdk@3.0.0/htsjdk.variant.variantcontext.LazyGenotypesContext.decode(LazyGenotypesContext.java:158)
    at htsjdk@3.0.0/htsjdk.variant.variantcontext.LazyGenotypesContext.ensureSampleNameMap(LazyGenotypesContext.java:180)
    at htsjdk@3.0.0/htsjdk.variant.variantcontext.GenotypesContext.getSampleNames(GenotypesContext.java:646)
    at htsjdk@3.0.0/htsjdk.variant.variantcontext.VariantContext.getSampleNames(VariantContext.java:1109)
    at org.igv/org.broad.igv.variant.vcf.VCFVariant.getSampleNames(VCFVariant.java:277)
    at org.igv/org.broad.igv.variant.vcf.VCFVariant.init(VCFVariant.java:73)
    at org.igv/org.broad.igv.variant.vcf.VCFVariant.<init>(VCFVariant.java:66)
    at org.igv/org.broad.igv.feature.tribble.VCFWrapperCodec.decode(VCFWrapperCodec.java:86)
    at org.igv/org.broad.igv.feature.tribble.VCFWrapperCodec.decode(VCFWrapperCodec.java:43)
    at htsjdk@3.0.0/htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:70)
    at htsjdk@3.0.0/htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:37)
    at htsjdk@3.0.0/htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.readNextRecord(TribbleIndexedFeatureReader.java:376)
    ... 16 more

gatk ValidateVariants

Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xms1699m -Xmx8000m -XX:+UseSerialGC -Djava.io.tmpdir=/data/ -jar gatk4-4.1.9.0-0/gatk-package-4.1.9.0-local.jar ValidateVariants -R hg38-noalt.fa --warn-on-errors -V HG001_CBS_variants.vcf
12:55:22.504 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:gatk4-4.1.9.0-0/gatk-package-4.1.9.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Dec 30, 2023 12:55:22 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
12:55:22.670 INFO  ValidateVariants - ------------------------------------------------------------
12:55:22.671 INFO  ValidateVariants - The Genome Analysis Toolkit (GATK) v4.1.9.0
12:55:22.671 INFO  ValidateVariants - For support and documentation go to https://software.broadinstitute.org/gatk/
12:55:22.672 INFO  ValidateVariants - Initializing engine
12:55:22.926 INFO  FeatureManager - Using codec VCFCodec to read file file:///HG001_CBS_variants.vcf
12:55:22.938 INFO  ValidateVariants - Done initializing engine
12:55:22.938 WARN  ValidateVariants - IDS validation cannot be done because no DBSNP file was provided
12:55:22.938 WARN  ValidateVariants - Other possible validations will still be performed
12:55:22.938 INFO  ProgressMeter - Starting traversal
12:55:22.938 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
12:55:22.960 INFO  ValidateVariants - Shutting down engine
[December 30, 2023 12:55:22 PM GMT] org.broadinstitute.hellbender.tools.walkers.variantutils.ValidateVariants done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=1723334656
java.lang.NumberFormatException: For input string: "39,56"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Integer.parseInt(Integer.java:580)
        at java.lang.Integer.parseInt(Integer.java:615)
        at htsjdk.variant.vcf.AbstractVCFCodec.createGenotypeMap(AbstractVCFCodec.java:820)
        at htsjdk.variant.vcf.AbstractVCFCodec$LazyVCFGenotypesParser.parse(AbstractVCFCodec.java:121)
        at htsjdk.variant.variantcontext.LazyGenotypesContext.decode(LazyGenotypesContext.java:158)
        at htsjdk.variant.variantcontext.LazyGenotypesContext.getGenotypes(LazyGenotypesContext.java:148)
        at htsjdk.variant.variantcontext.GenotypesContext.iterator(GenotypesContext.java:465)
        at htsjdk.variant.variantcontext.VariantContext.validateAlternateAlleles(VariantContext.java:1303)
        at org.broadinstitute.hellbender.tools.walkers.variantutils.ValidateVariants.applyValidationType(ValidateVariants.java:383)
        at org.broadinstitute.hellbender.tools.walkers.variantutils.ValidateVariants.apply(ValidateVariants.java:263)
        at org.broadinstitute.hellbender.engine.VariantWalker.lambda$traverse$0(VariantWalker.java:104)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.Iterator.forEachRemaining(Iterator.java:116)
        at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
        at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
        at org.broadinstitute.hellbender.engine.VariantWalker.traverse(VariantWalker.java:102)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1049)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
        at org.broadinstitute.hellbender.Main.main(Main.java:289)
xiao-chen-xc commented 4 months ago

Hi @gabeng thanks for reporting this issue. This will be addressed in the next release of Paraphase.

gabeng commented 2 months ago

Hi @xiao-chen-xc, Do you have any info on a timeline when to expect the next release? Unfortunately, the issue renders the output quite unusable for downstream analysis and benchmarking.

xiao-chen-xc commented 2 months ago

I'm aiming to release by the end of this week.

xiao-chen-xc commented 2 months ago

Please try v3.1.1. If you run into problems or have any suggestions on further improving the VCF, please don't hesitate to let me know!

gabeng commented 2 months ago

Hi @xiao-chen-xc, I can confirm the issue has been solved. Thank you!