Closed: lauragails closed this issue 11 months ago
Hi there!
GATK has an interval command, -L (https://gatk.broadinstitute.org/hc/en-us/articles/360035531852-Intervals-and-interval-lists), which allows you to pick which contigs you would like to use. So adding -L to the GATK call should help with your issue.
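For concreteness, -L can be repeated once per contig. Here is a minimal sketch of assembling such a call; the tool invocation, file names, and contig names are placeholders, not anyone's actual inputs:

```python
# Sketch: restricting a GATK call to chosen contigs via repeated -L flags.
# File paths and contig names below are hypothetical.
contigs = ["chr1", "chr2", "chrX"]

cmd = ["gatk", "HaplotypeCaller",
       "-R", "reference.fasta",
       "-I", "sample.bam",
       "-O", "sample.g.vcf.gz"]
for c in contigs:
    cmd += ["-L", c]  # one -L per contig restricts calling to that contig

print(" ".join(cmd))
```

Passing a file of intervals to a single -L (an interval list) works the same way and is easier to maintain for many regions.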
Please let me know if this doesn't work.
Best,
Jeff
Thank you! Now that I know what to look for, I see that I can add INTERVAL_FILE as an input (i.e., read in at line 83 of tortoise_1.2.smk), which is "interval_file" in config.json.
I checked the other variables in config.json and I am sure that the others are either correct or correctly left empty, so knock on wood, that should do it!
The reason (I think) I got this error is that I generated a dictionary from the reference FASTA that came with the PacBio reference directory (which only includes the "regular", i.e. non-scaffold, chromosomes). Do you know if there's a way to "save" this info? If not, since the dictionary only contains the "regular" chromosomes, I'll restrict to those regions and re-run.
The quick response to my original question is especially appreciated. I wish I had asked earlier!
Sorry, I'm not sure what you mean by "save" this info. I think restricting the run to the "regular" chromosomes should do the trick.
I mean being able to use the information that's not in the dictionary generated from a basic reference file. Maybe you could do it by making a more complex .list file, since the manual says you don't need a dictionary if you have a list with chromosome and start/stop positions.
But fortunately it also allows inputting "chr__" without start/stop positions, as long as a dictionary is specified. So fingers crossed, it will work!
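As a sketch of that idea, the chromosome plus start/stop entries for a .list file can be pulled straight from the @SQ lines of the sequence dictionary. The dictionary content and the "scaffold" name filter below are made-up examples:

```python
# Sketch: deriving a "regular chromosomes only" interval list from a
# sequence dictionary (.dict). The dictionary text is an invented example.
dict_text = (
    "@HD\tVN:1.6\n"
    "@SQ\tSN:chr1\tLN:248956422\n"
    "@SQ\tSN:chr2\tLN:242193529\n"
    "@SQ\tSN:chrUn_scaffold12\tLN:34512\n"
)

intervals = []
for line in dict_text.splitlines():
    if not line.startswith("@SQ"):
        continue
    fields = dict(f.split(":", 1) for f in line.split("\t")[1:])
    name, length = fields["SN"], int(fields["LN"])
    if "scaffold" not in name:                  # keep only "regular" chromosomes
        intervals.append(f"{name}:1-{length}")  # chrom:start-stop form

print("\n".join(intervals))
```

Writing those lines to a file would give an interval list restricted to the dictionary's "regular" chromosomes.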
Thanks again.
Progress! But when I run on HPC using Singularity and the Google DeepVariant installed there, whose versions match the ones you suggest, I get a module-not-found error (below).
It is entirely possible/probable that this is because Python 3.9 is not installed (yet). Also, I don't think we can load two modules at once (unless Python 2 is called using "python2" vs. "python", which refers to whichever 3.XX module is loaded). I am noting this because the front page suggests that both be installed.
Do you know where this module would be used, and what software I can use to make sure it is installed? The module is third_party.nucleus.io.clif_postproc
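One quick way to check whether a module is visible, run from the same interpreter the pipeline uses, is importlib. This only reports importability; it won't say which package should provide the module:

```python
# Sketch: check whether a dotted module path is importable from this
# interpreter. find_spec raises ModuleNotFoundError if a parent package
# is itself missing, so we catch that case too.
import importlib.util

try:
    spec = importlib.util.find_spec("third_party.nucleus.io.clif_postproc")
except ModuleNotFoundError:
    spec = None

print("importable" if spec is not None else "missing")
```

If it prints "missing" inside the container but the rest of DeepVariant runs, the problem is likely the build/runfiles environment rather than your inputs.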
Thank you!
I0624 03:06:10.781211 47766986753856 postprocess_variants.py:971] Using sample name from call_variants output. Sample name: [redacted]
2023-06-24 03:06:10.786312: I deepvariant/postprocess_variants.cc:88] Read from: /tmp/tmpncogiwdj/call_variants_output.tfrecord.gz
2023-06-24 03:06:56.226107: I deepvariant/postprocess_variants.cc:103] Total #entries in single_site_calls = 11112971
I0624 03:08:29.075627 47766986753856 postprocess_variants.py:1036] CVO sorting took 2.3048147598902387 minutes
I0624 03:08:29.077477 47766986753856 postprocess_variants.py:1039] Transforming call_variants_output to variants.
I0624 03:33:55.377552 47766986753856 postprocess_variants.py:1079] Processing variants (and writing to temporary file) took 25.438181654612222 minutes
I0624 03:47:44.937586 47766986753856 postprocess_variants.py:1092] Finished writing VCF and gVCF in 13.825993386904399 minutes.
I0624 03:47:44.998353 47766986753856 genomics_reader.py:222] Reading /path/to/redacted.cpu.dv.vcf.gz with NativeVcfReader
I0624 03:54:05.123418 47766986753856 postprocess_variants.py:1103] Generating VCF stats took 6.336423444747925 minutes.
real 47m58.453s
user 47m8.501s
sys 0m48.710s
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/tmp/Bazel.runfiles_hbm44wtf/runfiles/absl_py/absl/app.py", line 300, in run
File "/tmp/Bazel.runfiles_hbm44wtf/runfiles/absl_py/absl/app.py", line 251, in _run_main
File "/tmp/Bazel.runfiles_hbm44wtf/runfiles/com_google_deepvariant/deepvariant/postprocess_variants.py", line 1099, in main
File "/tmp/Bazel.runfiles_hbm44wtf/runfiles/com_google_deepvariant/third_party/nucleus/io/genomics_reader.py", line 244, in iterate
File "/tmp/Bazel.runfiles_hbm44wtf/runfiles/com_google_deepvariant/third_party/nucleus/io/vcf.py", line 201, in iterate
ModuleNotFoundError: No module named 'third_party.nucleus.io.clif_postproc'
real 38m25.744s
user 37m57.789s
sys 0m26.734s
Edit: This was in postprocess_variants.log. It does appear that call_variants.log and make_examples.log look good, and indexed VCF files were created (they visually look good as well; I haven't dug into them).
Hi,
Sorry about the delayed response. I'm not too sure where you would find that; based on some Googling, I believe it is downloaded when DeepVariant is built. Here is an (old) closed issue from DeepVariant where I think the person had a similar issue to yours: https://github.com/google/deepvariant/issues/103
I'm going to close this issue. Please let me know if you have further problems.
I was able to get this working! The original error was because the tmp directory was full.
When that was fixed, I got a separate internal issue because I can't use python2 and python3 at the same time (loading the python3 module on our system messed up an under-the-hood file for python2), so I pushed the final scripts through by hand.
Which leads me to: for the CPG file and _position file in your described outputs, I see you're grepping lines that have a comment and piping them to the file, but I don't see any output going there. It could be because I'm copy-pasting.
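For what it's worth, the grep-a-comment-and-pipe step described in the outputs amounts to something like the following; the input lines here are invented examples, not the pipeline's real files:

```python
# Sketch: a Python equivalent of `grep '#' input > output`.
# The input lines are made-up examples.
lines = [
    "##fileformat=VCFv4.2",  # header/comment line
    "#CHROM\tPOS\tID",       # header/comment line
    "chr1\t10616\t.",        # data line
]

comment_lines = [l for l in lines if "#" in l]  # what `grep '#'` would keep
print("\n".join(comment_lines))
```

If the real input has no lines containing '#', the output file would legitimately be empty, which could explain seeing no output there.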
But I do see the de novos that I need(!!), so I'm not concerned about this per se.
Thank you so much!
Hello!
I am running Tortoise on a PacBio long-read trio to call de novo variants. It ran for about 2.1 days, and I think it made it very far! From what I can see, the haplotype calling finished:
although I see a lot of rows like:
15:14:51.985 WARN DepthPerSampleHC - Annotation will not be calculated at position chr1:10616 and possibly subsequent; genotype for sample PBG_3484_0000009718 is not called
This is ultimately why I think it finished successfully:
But the biggest issue from my perspective is this error that appeared (for each sample analyzed):
Is there a way I can get this to work without making new BAM files that remove all of the contigs not referenced in the dictionary? Thank you for checking/helping me debug!
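In case it helps with debugging, one way to list which BAM-header contigs are missing from the dictionary, without rewriting any BAM files, is to compare the @SQ lines of the two headers. The headers below are invented examples; in practice they could come from `samtools view -H` on the BAM and from the .dict file:

```python
# Sketch: find BAM-header contigs absent from the sequence dictionary.
# Both headers below are invented examples.
bam_header = (
    "@SQ\tSN:chr1\tLN:248956422\n"
    "@SQ\tSN:chr2\tLN:242193529\n"
    "@SQ\tSN:chrUn_scaffold12\tLN:34512\n"
)
dict_header = (
    "@SQ\tSN:chr1\tLN:248956422\n"
    "@SQ\tSN:chr2\tLN:242193529\n"
)

def contigs(header):
    """Collect sequence names (SN fields) from @SQ header lines."""
    return {f.split(":", 1)[1]
            for line in header.splitlines() if line.startswith("@SQ")
            for f in line.split("\t") if f.startswith("SN:")}

missing = sorted(contigs(bam_header) - contigs(dict_header))
print(missing)  # contigs you would exclude via -L rather than re-making BAMs
```

The names that come out are exactly the contigs an interval restriction (-L) would need to avoid, so no new BAM files are required.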