TNTurnerLab / Tortoise

Tortoise is the CPU workflow of the HAT (https://github.com/TNTurnerLab/HAT) tools.
MIT License

Tortoise stopped with error #2

Closed · lauragails closed this issue 11 months ago

lauragails commented 1 year ago

Hello!

I am running Tortoise on a PacBio long-read trio to call de novo variants. It ran for about 2.1 days and, I think, made it very far! From what I can see, the haplotype calling finished:

although I see a lot of rows like:

15:14:51.985 WARN DepthPerSampleHC - Annotation will not be calculated at position chr1:10616 and possibly subsequent; genotype for sample PBG_3484_0000009718 is not called

This is ultimately why I think it finished successfully:

09:45:47.868 INFO  VectorLoglessPairHMM - Time spent in setup for JNI call : 63.23320811800001
09:45:47.868 INFO  PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 27991.821773314
09:45:47.868 INFO  SmithWatermanAligner - Total compute time in java Smith-Waterman : 24901.88 sec
09:45:47.869 INFO  HaplotypeCaller - Shutting down engine
[June 2, 2023 9:45:47 AM EDT] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 2,551.14 minutes.
Runtime.totalMemory()=10189012992

But the biggest issue from my perspective is this error that appeared (for each sample analyzed):

A USER ERROR has occurred: Contig chr1_KI270706v1_random not present in the sequence dictionary [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM]

***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
Using GATK jar /hpc/packages/minerva-centos7/gatk/4.2.0.0/gatk-4.2.0.0/gatk-package-4.2.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx64g -jar /hpc/packages/minerva-centos7/gatk/4.2.0.0/gatk-4.2.0.0/gatk-package-4.2.0.0-local.jar HaplotypeCaller -R /sc/arion/projects/buxbaj01a/sloofl01/pac_bio/files/references/GRCh38.chrom.fasta -I /sc/arion/projects/buxbaj01a/sloofl01/pac_bio/results/Tortoise/052023/data_dir/PBG_3484_0000009726.GRCh38.bam -O /sc/arion/projects/buxbaj01a/sloofl01/pac_bio/results/Tortoise/052023/dnv_wf_cpu/gatk_out/PBG_3484_0000009726.gatk.cpu.g.vcf.gz --standard-min-confidence-threshold-for-calling 30 -ERC GVCF
[Fri Jun  2 09:45:47 2023]
Error in rule gatk:
    jobid: 1
    output: /sc/arion/projects/buxbaj01a/sloofl01/pac_bio/results/Tortoise/052023/dnv_wf_cpu/gatk_out/PBG_3484_0000009726.gatk.cpu.g.vcf.gz
    shell:

    export PATH=/opt/conda/bin:$PATH

    mkdir -p /sc/arion/projects/buxbaj01a/sloofl01/pac_bio/results/Tortoise/052023/dnv_wf_cpu/gatk_out

    gatk --java-options "-Xmx64g" HaplotypeCaller -R /sc/arion/projects/buxbaj01a/sloofl01/pac_bio/files/references/GRCh38.chrom.fasta -I /sc/arion/projects/buxbaj01a/sloofl01/pac_bio/results/Tortoise/052023/data_dir/PBG_3484_0000009726.GRCh38.bam -O /sc/arion/projects/buxbaj01a/sloofl01/pac_bio/results/Tortoise/052023/dnv_wf_cpu/gatk_out/PBG_3484_0000009726.gatk.cpu.g.vcf.gz --standard-min-confidence-threshold-for-calling 30 -ERC GVCF

    touch /sc/arion/projects/buxbaj01a/sloofl01/pac_bio/results/Tortoise/052023/dnv_wf_cpu/gatk_out/PBG_3484_0000009726.gatk.cpu.g.vcf.gz.tbi

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job gatk since they might be corrupted:
/sc/arion/projects/buxbaj01a/sloofl01/pac_bio/results/Tortoise/052023/dnv_wf_cpu/gatk_out/PBG_3484_0000009726.gatk.cpu.g.vcf.gz

Is there a way to get this to work without making new BAM files that strip out all of the contigs not present in the sequence dictionary? Thank you for checking/helping me debug!
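The mismatch can be confirmed by comparing the contig names in the reference's .dict file against the contig the error names. A minimal sketch (the example @SQ lines are illustrative, standing in for the real GRCh38.chrom.dict):

```python
# Sketch: extract contig names from a GATK/Picard sequence dictionary (.dict),
# which is a SAM-style header of @SQ lines carrying SN: (name) and LN: (length) tags.
def dict_contigs(dict_text):
    names = []
    for line in dict_text.splitlines():
        if line.startswith("@SQ"):
            for field in line.split("\t"):
                if field.startswith("SN:"):
                    names.append(field[3:])
    return names

# Illustrative lines as they would appear in a dictionary built from a
# "regular chromosomes only" reference FASTA.
example = """@HD\tVN:1.6
@SQ\tSN:chr1\tLN:248956422
@SQ\tSN:chr2\tLN:242193529
"""

contigs = set(dict_contigs(example))
print("chr1" in contigs)                    # True: present in the dictionary
print("chr1_KI270706v1_random" in contigs)  # False: the contig GATK complained about
```

If the BAM header (e.g. from `samtools view -H`) lists contigs that this check reports as missing, GATK will raise exactly the "not present in the sequence dictionary" user error seen above.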

jng2 commented 1 year ago

Hi there!

GATK has an intervals option, -L (https://gatk.broadinstitute.org/hc/en-us/articles/360035531852-Intervals-and-interval-lists), which lets you pick which contigs to use. Adding -L to the GATK call should resolve your issue.
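For example, the failing command from the log could be rerun restricted to the dictionary's contigs; a sketch (the intervals.list path is a placeholder, everything else is taken from the log):

```shell
# Same HaplotypeCaller invocation as in the Snakemake rule, with -L added
# so GATK only touches contigs that exist in the sequence dictionary.
gatk --java-options "-Xmx64g" HaplotypeCaller \
  -R /sc/arion/projects/buxbaj01a/sloofl01/pac_bio/files/references/GRCh38.chrom.fasta \
  -I /sc/arion/projects/buxbaj01a/sloofl01/pac_bio/results/Tortoise/052023/data_dir/PBG_3484_0000009726.GRCh38.bam \
  -O /sc/arion/projects/buxbaj01a/sloofl01/pac_bio/results/Tortoise/052023/dnv_wf_cpu/gatk_out/PBG_3484_0000009726.gatk.cpu.g.vcf.gz \
  --standard-min-confidence-threshold-for-calling 30 \
  -ERC GVCF \
  -L /path/to/intervals.list
```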

Please let me know if this doesn't work.

Best,

Jeff

lauragails commented 1 year ago

Thank you! Now that I know what to look for, I see that I can add INTERVAL_FILE as an input (i.e., read into line 83 in tortoise_1.2.smk); that is "interval_file" in config.json.

I checked the other variables in config.json, and I am sure the others are either correct or correctly left empty, so knock on wood, that should do it!

The reason (I think) I got this error is that I generated the dictionary from the reference FASTA that came with the PacBio reference directory, which only includes the "regular" (i.e., non-scaffold) chromosomes. Do you know if there's a way to "save" this info? If not, since the dictionary only contains the regular chromosomes, I'll restrict to those regions and re-run.
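If it helps anyone later, the change amounts to pointing "interval_file" in config.json at the list; a minimal sketch (the path is a placeholder and other config fields are omitted — this is not Tortoise's full config):

```json
{
  "interval_file": "/path/to/intervals.list"
}
```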

The quick response to my original question is especially appreciated. I wish I had asked earlier!

jng2 commented 1 year ago

Sorry, I'm not sure what you mean by "save" this info. I think restricting the run to the "regular" chromosomes should do the trick.

lauragails commented 1 year ago

I mean being able to use the information that's not in the dictionary generated from a basic reference file. Maybe you could do it by making a more detailed .list file, since the manual says you don't need a dictionary if your list gives chromosome and start/stop positions.

But fortunately it also accepts plain contig names (e.g., "chr1") without start/stop positions, as long as a dictionary is specified. So fingers crossed, it will work!
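For reference, a GATK .list interval file takes one entry per line, either a bare contig name or a chr:start-stop range; an illustrative example (the coordinates are hypothetical):

```
chr1
chr2
chrX:1-156040895
```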

Thanks again.

lauragails commented 1 year ago

Progress! But when I run on the HPC using Singularity and the Google DeepVariant installed there, whose versions match the ones you suggest, I get a module-not-found error (below).

It is entirely possible/probable that this is because Python 3.9 is not installed (yet). Also, I don't think we can load two modules at once (unless Python 2 is invoked as "python2", while "python" refers to whatever 3.x module is loaded). I am noting this because the front page suggests that both be installed.

Do you know where this module would be used, and what software I can use to make sure it is installed? third_party.nucleus.io.clif_postproc

Thank you!

I0624 03:06:10.781211 47766986753856 postprocess_variants.py:971] Using sample name from call_variants output. Sample name: [redacted]
2023-06-24 03:06:10.786312: I deepvariant/postprocess_variants.cc:88] Read from: /tmp/tmpncogiwdj/call_variants_output.tfrecord.gz
2023-06-24 03:06:56.226107: I deepvariant/postprocess_variants.cc:103] Total #entries in single_site_calls = 11112971
I0624 03:08:29.075627 47766986753856 postprocess_variants.py:1036] CVO sorting took 2.3048147598902387 minutes
I0624 03:08:29.077477 47766986753856 postprocess_variants.py:1039] Transforming call_variants_output to variants.
I0624 03:33:55.377552 47766986753856 postprocess_variants.py:1079] Processing variants (and writing to temporary file) took 25.438181654612222 minutes
I0624 03:47:44.937586 47766986753856 postprocess_variants.py:1092] Finished writing VCF and gVCF in 13.825993386904399 minutes.
I0624 03:47:44.998353 47766986753856 genomics_reader.py:222] Reading /path/to/redacted.cpu.dv.vcf.gz with NativeVcfReader
I0624 03:54:05.123418 47766986753856 postprocess_variants.py:1103] Generating VCF stats took 6.336423444747925 minutes.

real    47m58.453s
user    47m8.501s
sys     0m48.710s
 File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/tmp/Bazel.runfiles_hbm44wtf/runfiles/absl_py/absl/app.py", line 300, in run
  File "/tmp/Bazel.runfiles_hbm44wtf/runfiles/absl_py/absl/app.py", line 251, in _run_main
  File "/tmp/Bazel.runfiles_hbm44wtf/runfiles/com_google_deepvariant/deepvariant/postprocess_variants.py", line 1099, in main
  File "/tmp/Bazel.runfiles_hbm44wtf/runfiles/com_google_deepvariant/third_party/nucleus/io/genomics_reader.py", line 244, in iterate
  File "/tmp/Bazel.runfiles_hbm44wtf/runfiles/com_google_deepvariant/third_party/nucleus/io/vcf.py", line 201, in iterate
ModuleNotFoundError: No module named 'third_party.nucleus.io.clif_postproc'

real    38m25.744s
user    37m57.789s
sys     0m26.734s

Edit: This was in postprocess_variants.log. call_variants.log and make_examples.log look good, and indexed VCF files were created (which also look fine visually; I haven't dug into them).

jng2 commented 1 year ago

Hi,

Sorry about the delayed response. I'm not sure where you would find that; based on some Googling, I believe it is downloaded when DeepVariant is built. Here is an (old) closed DeepVariant issue where I think the person had a similar problem: https://github.com/google/deepvariant/issues/103

jng2 commented 11 months ago

I'm going to close this issue. Please let me know if you have further problems.

lauragails commented 10 months ago

I was able to get this working! The original error was because the tmp directory was full.

When that was fixed, I got a separate internal issue because I can't use Python 2 and Python 3 at the same time (loading the Python 3 module on our system broke a behind-the-scenes file for Python 2, so I pushed the final scripts through by hand).

Which leads me to: for the CPG file and _position file in your described outputs, you're grepping lines that have a comment and piping them to the file, but I don't see any output going there? It could be because I'm copy-pasting.

But I do see the de novos that I need(!!), so I'm not concerned about this per se.

Thank you so much!