dmiller903 / CompoundHetVIP

A pipeline for the identification of Compound Heterozygous Variants
MIT License
8 stars 2 forks source link

Cannot generate _parsed.vcf.gz? #4

Closed greekkey closed 2 years ago

greekkey commented 2 years ago

Hi. I am now using my own data. But I have a problem in step 1. It seems that the program cannot finish this step to generate the _parsed.vcf.gz files for the parents. I also tried using gvcf files, the _parsed.vcf.gz can be generated but for the parents, the result files are still _parsed.vcf. And those "_parsed.vcf" file are binary files, and cannot be viewed. ( https://drive.google.com/drive/folders/13GVr5yHDAoHOwmvqHxHt8QWuN6EIuz9O?usp=sharing ) Does CompoundHetVIP have specific requirement for the input vcf / gvcf files? Jacob

dmiller903 commented 2 years ago

Did you name the suffix "_parsed.vcf"? The files are always gzipped, so it makes sense the files are binary. Even if you don't include ".gz" on the suffix, the file will still be gzipped. I was able to open your parent parsed files using vim and view the contents. Just add ".gz" to the parent output files. Or did you use the default parameters and ".gz" was not added to the end of the file for some reason?

greekkey commented 2 years ago

@dmiller903 Hi! I have solved the problem. You were right. The generated "_parsed.vcf" were actually gzipped files, so I could open those files by manually adding ".gz". However, those "_parsed.vcf.gz" could not be directly used in Step 2 because they were not "BGZF-compressed" but "gzip-compressed". I manually unzipped those files and generated "BGZF-compressed" using bcftools. Then I could finish Step 2.

Yes, I used the default settings in Step 1. Actually I used the same command as I tested your examples, just changed the name of input files. I don't know why this happened. Jacob

dmiller903 commented 2 years ago

Glad you were able to figure it out. The script is supposed to bgzip the file at the end. I'm not sure why it didn't. Do you have the log from the first output (the file in the Google Drive that ends in ".out" is just the docker log id), and what you put as input to execute the docker? I'll try to figure out why you're files were gzipped and not bgzipped.

greekkey commented 2 years ago

Hi, @dmiller903 I'm back. I'm sorry that I had to left office due to the new control of COVID-19 in my city. Before I left my office, I ran the program on a number of pedigrees. Now I got the results. I mainly used gvcf files. Some of the gvcf files were generated using the GATK pipeline, some were generated by the DRAGEN pipeline. I mainly encountered three problems.

greekkey commented 2 years ago

1) In Step 1. This is what I mentioned previously. When generating the _parsed.vcf.gz, it was always work well for the patient, but sometime failed for the parents. Very strange, I got a file named as "fa / mo_parsed.vcf" but actually a gzip compressed vcf.gz file. I can solve it by making a simple SHELL script to add .gz to the "vcf" file, unzip it and bgzip again, but it is just a temporary solution, if you can find the cause, there should be a better solution.

greekkey commented 2 years ago

2) In Step 5. I made the .fam file by capturing the Sample ID from the original vcf files, but still got a Error in this step, sanying "Mismatched IDs between --vcf file and ${Pedigree}_trio.fam". This only happened in SOME pedigrees. Then I manually changed the Sample ID in the original vcf file and started from the beginning, and it worked. But I still don't know why this happened. Does the program have special requirement for the file name and Sample ID?

greekkey commented 2 years ago
  1. In Step 2. At the last sub-step (GenotypeGVCFs) in Step2, an error "java.lang.NumberFormatException: For input string: "."" was encountered, and the {Pedigree}_trio.vcf.gz was very small with only a few lines in Chr1. I only met this problem with the DRAGEN-generated gcvf files. This is the most urgent one because I have no effective solution for this problem now.
greekkey commented 2 years ago

The complete information of this problem

Using GATK jar /root/miniconda2/share/gatk4-4.0.5.1-0/gatk-package-4.0.5.1-local.jar Running: java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx4g -jar /root/miniconda2/share/gatk4-4.0.5.1-0/gatk-package-4.0.5.1-local.jar GenotypeGVCFs -R /references/Homo_sapiens_assembly38.fasta -V /tmp/temp.vcf.gz -O CS123/CS123_trio.vcf.gz 06:21:20.050 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/root/miniconda2/share/gatk4-4.0.5.1-0/gatk-package-4.0.5.1-local.jar!/com/intel/gkl/native/libgkl_compression.so 06:21:20.463 INFO GenotypeGVCFs - ------------------------------------------------------------ 06:21:20.464 INFO GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.0.5.1 06:21:20.464 INFO GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/ 06:21:20.464 INFO GenotypeGVCFs - Executing as root@9cdc9052f275 on Linux v5.13.0-27-generic amd64 06:21:20.464 INFO GenotypeGVCFs - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_112-b16 06:21:20.465 INFO GenotypeGVCFs - Start Date/Time: January 26, 2022 6:21:20 AM UTC 06:21:20.465 INFO GenotypeGVCFs - ------------------------------------------------------------ 06:21:20.465 INFO GenotypeGVCFs - ------------------------------------------------------------ 06:21:20.465 INFO GenotypeGVCFs - HTSJDK Version: 2.15.1 06:21:20.465 INFO GenotypeGVCFs - Picard Version: 2.18.2 06:21:20.465 INFO GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2 06:21:20.465 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false 06:21:20.465 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true 06:21:20.465 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false 06:21:20.466 INFO GenotypeGVCFs - Deflater: IntelDeflater 06:21:20.466 INFO GenotypeGVCFs - Inflater: IntelInflater 06:21:20.466 INFO GenotypeGVCFs - GCS max retries/reopens: 20 06:21:20.466 INFO GenotypeGVCFs - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes 06:21:20.466 INFO GenotypeGVCFs - Initializing engine 06:21:20.849 INFO FeatureManager - Using codec VCFCodec to read file file:///tmp/temp.vcf.gz 06:21:21.014 INFO GenotypeGVCFs - Done initializing engine 06:21:21.083 INFO ProgressMeter - Starting traversal 06:21:21.084 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute 06:21:21.210 INFO GenotypeGVCFs - Shutting down engine [January 26, 2022 6:21:21 AM UTC] org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs done. Elapsed time: 0.02 minutes. Runtime.totalMemory()=644349952 java.lang.NumberFormatException: For input string: "." at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Integer.parseInt(Integer.java:569) at java.lang.Integer.parseInt(Integer.java:615) at org.broadinstitute.hellbender.tools.walkers.annotator.StrandBiasTest.encodeSBBS(StrandBiasTest.java:200) at org.broadinstitute.hellbender.tools.walkers.annotator.StrandBiasTest.getTableFromSamples(StrandBiasTest.java:89) at org.broadinstitute.hellbender.tools.walkers.annotator.FisherStrand.calculateAnnotationFromGTfield(FisherStrand.java:58) at org.broadinstitute.hellbender.tools.walkers.annotator.StrandBiasTest.annotate(StrandBiasTest.java:40) at org.broadinstitute.hellbender.tools.walkers.annotator.VariantAnnotatorEngine.annotateContext(VariantAnnotatorEngine.java:270) at org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs.regenotypeVC(GenotypeGVCFs.java:247) at org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs.apply(GenotypeGVCFs.java:201) at org.broadinstitute.hellbender.engine.VariantWalkerBase.lambda$traverse$0(VariantWalkerBase.java:109) at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) at java.util.Iterator.forEachRemaining(Iterator.java:116) at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) at org.broadinstitute.hellbender.engine.VariantWalkerBase.traverse(VariantWalkerBase.java:107) at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:994) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:135) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:180) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:199) at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160) at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203) at org.broadinstitute.hellbender.Main.main(Main.java:289)

greekkey commented 2 years ago

I can't make the related files public on the internet. May I have your email address? I can share with you the related files directly.

dmiller903 commented 2 years ago

It looks like the program may be expecting an integer at a specific position in the file but there is a string instead. dmiller903@gmail.com

greekkey commented 2 years ago

@dmiller903 Thanks! I shared the files on google drive and sent the link to you via email. Please check.

dmiller903 commented 2 years ago

For step 1, there were scenarios where the chromosome wasn’t in the dictionary it was being added to and the script would break before it got to the bgzip step; it wasn't throwing an error like it should. I've fixed this.

For Step 2 I added an additional argument (—disable_strand_bias_test) to the script to get around a strand bias issue. Some VCF files may have a "." for strand bias and GATK doesn't know what to do with it because it's not an integer. Disabling the strand bias check using --disable_strand_bias_test y as part of the command should fix this.

For step 5, the samples in the .fam file must occur in the same order that they occur in the input VCF file. I added clarification about this to the Example pdf.

The github has been updated with the script fixes, and the Docker has been re-issued on Docker hub. It can be obtained using:

docker pull dmill903/compound-het-vip:latest or docker pull dmill903/compound-het-vip:1.1.1