Open dwbellott opened 1 month ago
Thanks for reporting bugs! It seems that the correction/polishing module went wrong (it might be due to short tandem repeats -- just my guess). For a quick fix, could you run jtk with `to_polish = false` in your config.toml?
Since the polishing step never changes the topology of the graph, if the resulting graph is too complicated or has many short nodes, I suspect jtk cannot handle the dataset, regardless of whether the polishing module panics or not.
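Pulling together the options mentioned in this thread, the relevant lines of config.toml would look roughly like this (a sketch of just these three keys, not a complete or verified jtk configuration -- the rest of your config stays as-is):

```toml
# Workaround from this thread: skip the polishing step that panics.
to_polish = false
# Options used later in the thread while debugging:
verbose = 2
threads = 1
```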
Thanks for the quick response. I tried what you suggested but ended up getting the same error:
[2024-10-16T06:20:26Z DEBUG haplotyper::phmm_likelihood_correction] ARI 3782 2 1.000 1.000
[2024-10-16T06:20:26Z DEBUG haplotyper::phmm_likelihood_correction] ARI 3786 2 1.000 1.000
[2024-10-16T06:20:26Z DEBUG haplotyper::phmm_likelihood_correction] ARI 3789 1 0.000 0.000
[2024-10-16T06:20:26Z DEBUG haplotyper::phmm_likelihood_correction] ARI 3793 2 1.000 1.000
thread '<unnamed>' panicked at haplotyper/src/phmm_likelihood_correction.rs:353:9:
index out of bounds: the len is 6 but the index is 6
thread '<unnamed>' panicked at haplotyper/src/phmm_likelihood_correction.rs:353:9:
index out of bounds: the len is 7 but the index is 7
[2024-10-16T06:20:26Z DEBUG haplotyper::phmm_likelihood_correction] ARI 3814 1 0.000 0.000
[2024-10-16T06:20:26Z DEBUG haplotyper::phmm_likelihood_correction] ARI 3815 2 1.000 1.000
Thanks for the clarification. I will dig into this issue this weekend.
A little update -- I did manage to get jtk to run on a smaller dataset (spanning only 2 Mb instead of 10 Mb), so whatever is causing the panics is probably related to dataset size. You may want to adjust your recommendations to users accordingly.
I did notice that the resulting haplotypes are not as complete as I had hoped: my read N50 is about 100 kb, the N50 of the assembled haplotypes is about 200 kb, the longest haplotype is 459 kb, and my longest input read is 591 kb. Loading both the reads and the assembled contigs into IGV, I can see that the resulting haplotypes are missing about half of the heterozygous SNVs, and about half of the SNVs that are present in the assemblies are mistakenly called as homozygous.
Are there some settings I can tweak to increase the sensitivity to SNVs (and perhaps the contiguity as well)?
Thanks for the information. And sorry to hear that our program essentially did not "assemble" anything useful on your data.
On my side, I am still investigating. I re-ran the latest build on the chr1:10M-15M region and the MHC region of HG002 and confirmed that jtk fully resolved both. The coverage or the complexity of your region might be the source of the issue.
If you have time, could you run the latest build on your data with `verbose = 2` and send me the full log? You can get the latest build either via Docker (recommended):

docker run --rm public.ecr.aws/r1e4j1j8/jtk:ac8d6ca2c055cfa2839258aaacde78960d51946c

or by building from source:

git clone -b refactor https://github.com/ban-m/jtk.git && cd jtk && cargo build --release && ./target/release/jtk
Sure, thanks for the help! Sorry for the delay in responding.
I ran your latest build on my data in three regions (using hs1 as the reference):

- chr1:10,000,000-15,000,000 (this crashed early this morning with an OOM error; I'll restart it, but it might take a while)
- chr6:28,874,808-32,874,807 (the MHC region)
- chr22:20,000,000-22,000,000 (the region I was interested in)
With the new version I do get distinct haplotypes covering almost all of the MHC region, though it does not give distinct haplotypes over the olfactory receptor cluster near the beginning of the region.
The new version also calls distinct haplotypes over more of my region (about 600 kb out of 2 Mb?), but it misses a lot. The longest haplotype call does contain my longest read (but is only ~5 kb longer).
In both regions, if I open the haplotypes and the reads in IGV, I notice that the SNV calls are still incomplete and inaccurate (they don't seem much better than before).
Thanks a lot! One thing I noticed is that the coverage seems to be less than 30x, which I do not recommend (I used 60x coverage, 30x per haplotype, in the published article). But I agree that providing options for shallow coverage would be helpful ... please let me think about it a bit.
Yes, I only have about 16x coverage; that's why I was curious whether there were any parameters I could tweak. Thanks for giving this some thought!
I was able to run jtk successfully with the example reads, but I ran into an issue using my own nanopore reads.
Here is the tail of the output, running with `verbose = 2`. jtk panicked at this same point 20 times before it failed (although I am running with `threads = 1`). If I just grep for 'panic' in the output I can see:
Following issue #8, I checked whether I had any duplicate read names, but I do not.
Any idea what is causing this?
Thanks!