ban-m / jtk

JTK -- a regional diploid genome assembler
MIT License
23 stars 2 forks source link

panicked at haplotyper/src/phmm_likelihood_correction.rs:353:9 #10

Open dwbellott opened 1 month ago

dwbellott commented 1 month ago

I was able to run jtk successfully with the example reads, but I ran into an issue using my own nanopore reads.

Here is the tail of the output, running with verbose = 2

[2024-10-11T12:04:33Z DEBUG haplotyper::phmm_likelihood_correction] ARI 3774    2   1.000   1.000
[2024-10-11T12:04:33Z DEBUG haplotyper::phmm_likelihood_correction] ARI 3775    1   0.000   0.000
[2024-10-11T12:04:33Z DEBUG haplotyper::phmm_likelihood_correction] ARI 3776    2   1.000   1.000
thread '<unnamed>' panicked at haplotyper/src/phmm_likelihood_correction.rs:353:9:
index out of bounds: the len is 6 but the index is 6
[2024-10-11T12:04:33Z DEBUG haplotyper::phmm_likelihood_correction] ARI 3788    2   1.000   1.000
[2024-10-11T12:04:33Z DEBUG haplotyper::phmm_likelihood_correction] ARI 3790    2   1.000   1.000
[2024-10-11T12:04:33Z DEBUG haplotyper::phmm_likelihood_correction] ARI 3794    2   1.000   1.000
[2024-10-11T12:04:33Z DEBUG haplotyper::phmm_likelihood_correction] ARI 3797    2   1.000   1.000
[2024-10-11T12:04:33Z DEBUG haplotyper::phmm_likelihood_correction] ARI 3811    1   0.000   0.000

jtk panicked at this same point 20 times before it failed (although I am running with threads = 1) . If I just grep for 'panic' in the output I can see:

thread '<unnamed>' panicked at haplotyper/src/phmm_likelihood_correction.rs:353:9:
index out of bounds: the len is 7 but the index is 7
thread '<unnamed>' panicked at haplotyper/src/phmm_likelihood_correction.rs:353:9:
index out of bounds: the len is 7 but the index is 7
thread '<unnamed>' panicked at haplotyper/src/phmm_likelihood_correction.rs:353:9:
index out of bounds: the len is 5 but the index is 5
thread '<unnamed>' panicked at haplotyper/src/phmm_likelihood_correction.rs:353:9:
index out of bounds: the len is 6 but the index is 6
thread '<unnamed>' panicked at haplotyper/src/phmm_likelihood_correction.rs:353:9:
index out of bounds: the len is 5 but the index is 5
thread '<unnamed>' panicked at haplotyper/src/phmm_likelihood_correction.rs:353:9:
index out of bounds: the len is 7 but the index is 7
thread '<unnamed>' panicked at haplotyper/src/phmm_likelihood_correction.rs:353:9:
index out of bounds: the len is 5 but the index is 5
thread '<unnamed>' panicked at haplotyper/src/phmm_likelihood_correction.rs:353:9:
index out of bounds: the len is 5 but the index is 5
thread '<unnamed>' panicked at haplotyper/src/phmm_likelihood_correction.rs:353:9:
index out of bounds: the len is 7 but the index is 7
thread '<unnamed>' panicked at haplotyper/src/phmm_likelihood_correction.rs:353:9:
index out of bounds: the len is 7 but the index is 7
thread '<unnamed>' panicked at haplotyper/src/phmm_likelihood_correction.rs:353:9:
index out of bounds: the len is 6 but the index is 6
thread '<unnamed>' panicked at haplotyper/src/phmm_likelihood_correction.rs:353:9:
index out of bounds: the len is 5 but the index is 5
thread '<unnamed>' panicked at haplotyper/src/phmm_likelihood_correction.rs:353:9:
index out of bounds: the len is 5 but the index is 5
thread '<unnamed>' panicked at haplotyper/src/phmm_likelihood_correction.rs:353:9:
index out of bounds: the len is 6 but the index is 6
thread '<unnamed>' panicked at haplotyper/src/phmm_likelihood_correction.rs:353:9:
index out of bounds: the len is 7 but the index is 7
thread '<unnamed>' panicked at haplotyper/src/phmm_likelihood_correction.rs:353:9:
index out of bounds: the len is 6 but the index is 6
thread '<unnamed>' panicked at haplotyper/src/phmm_likelihood_correction.rs:353:9:
index out of bounds: the len is 5 but the index is 5
thread '<unnamed>' panicked at haplotyper/src/phmm_likelihood_correction.rs:353:9:
index out of bounds: the len is 5 but the index is 5
thread '<unnamed>' panicked at haplotyper/src/phmm_likelihood_correction.rs:353:9:
index out of bounds: the len is 7 but the index is 7
thread '<unnamed>' panicked at haplotyper/src/phmm_likelihood_correction.rs:353:9:
index out of bounds: the len is 6 but the index is 6

Following issue #8, I checked to see if I had any duplicate read names, but I do not.

Any idea what is causing this?

Thanks!

ban-m commented 1 month ago

Thanks for reporting bugs! It seems that the correction/polishing module went wrong (might be due to short tandem repeats -- just my guess).

For a quick fix, could you exec jtk with to_polish = false in your config.toml ?

Since the polishing step never changes the topology of the graph, if the resulting graph is too complicated/having many short nodes, I guess jtk could not handle the dataset (regardless of the polishing module panics or not).

dwbellott commented 1 month ago

Thanks for the quick response, I tried what you suggested, but ended up getting the same error:

[2024-10-16T06:20:26Z DEBUG haplotyper::phmm_likelihood_correction] ARI 3782    2   1.000   1.000
[2024-10-16T06:20:26Z DEBUG haplotyper::phmm_likelihood_correction] ARI 3786    2   1.000   1.000
[2024-10-16T06:20:26Z DEBUG haplotyper::phmm_likelihood_correction] ARI 3789    1   0.000   0.000
[2024-10-16T06:20:26Z DEBUG haplotyper::phmm_likelihood_correction] ARI 3793    2   1.000   1.000
thread '<unnamed>' panicked at haplotyper/src/phmm_likelihood_correction.rs:353:9:
index out of bounds: the len is 6 but the index is 6
thread '<unnamed>' panicked at haplotyper/src/phmm_likelihood_correction.rs:353:9:
index out of bounds: the len is 7 but the index is 7
[2024-10-16T06:20:26Z DEBUG haplotyper::phmm_likelihood_correction] ARI 3814    1   0.000   0.000
[2024-10-16T06:20:26Z DEBUG haplotyper::phmm_likelihood_correction] ARI 3815    2   1.000   1.000
ban-m commented 1 month ago

Thanks for the clarification. I will dig this issue on this weekend

dwbellott commented 1 month ago

A little update -- I did manage to get jtk to run on a smaller dataset (only spanning 2Mb instead of 10Mb), so whatever is causing the panics is probably related to dataset size. You may want to adjust your recommendations to users.

I did notice that the resulting haplotypes are not as complete as I might have hoped -- my read n50 is about 100kb, and the n50 for the assembled haplotypes is about 200kb, the longest haplotype is 459kb, my longest read in the input data is 591kb. Loading up both the reads and the assembled contigs in IGV, I can see that the resulting haplotypes are missing about half the heterozygous SNVs, and about half of the SNVs that are present in the assemblies are mistakenly called as homozygous.

Are there some settings I can tweak to increase the sensitivity to SNVs (and perhaps the contiguity as well)?

ban-m commented 1 month ago

Thanks for the information. Also sorry to hear that essentially our program did not "assemble" anything useful on your data.

On my side, I am still investigating the situation. I re-run the latest build on the Chr1:10M-15M nt region & the MHC region of the HG002 to see that jtk fully resolved these regions. The coverage or the complexity of the region might be the source of issue....

If you have time, could you use the latest build on your data with verbose = 2, and tell me the full log? You can use it either

  1. docker run --rm public.ecr.aws/r1e4j1j8/jtk:ac8d6ca2c055cfa2839258aaacde78960d51946c (recommended)
  2. git clone -b refactor https://github.com/ban-m/jtk.git && cd jtk && cargo build --release && ./target/release/jtk
dwbellott commented 1 month ago

Sure, thanks for the help! Sorry for the delay in responding.

I ran your latest build on my data in three regions (using hs1 as the reference): chr1:10,000,000-15,000,000 (this crashed early this morning with an OOM error; I'll restart it, but it might take a while). chr6:28,874,808-32,874,807 (the MHC region) chr22:20,000,000-22,000,000 (the region I was interested in)

With the new version I do get distinct haplotypes covering almost all of the MHC region, though it doesn't give distinct haplotypes over the olfactory receptor cluster near the beginning of the region.

The new version does call distinct haplotypes over more of my region (about 600kb out of 2Mb?) but it misses a lot. The longest haplotype call does contain my longest read (but is only ~5kb longer).

In both regions, If I open up the haplotypes and the reads in IGV, I still notice that the SNVs are still not complete or accurate (they don't seem much better).

chr22.output.txt mhc.output.txt

ban-m commented 3 weeks ago

Thanks a lot! One thing I noticed is that the coverage seemed to be less than 30, which I do not recommend (I used 60x coverage, 30x for each haplotype in the published article). But I know providing options for shallow coverage would be helpful ... please let me think a bit.

dwbellott commented 3 weeks ago

Yes, I do only have about 16x coverage, that's why I was curious if there were any parameters I could tweak. Thanks for giving this some thought!