chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads

Histogram count goes crazy at ultra-low coverage #62

Open ASLeonard opened 3 years ago

ASLeonard commented 3 years ago

Hi, to preface this: I know this is not a standard use case, so I'm not expecting a happy resolution.

I've been trying to test how low the coverage can go, to see which metrics (auN, asmgene, QV, etc.) start to break and when, on a 2.7 Gbp mammalian genome.
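
For reference, each run is essentially a downsample followed by a default hifiasm run; the commands below are only a sketch (seqtk as the subsampler, and the exact fraction and thread count, are illustrative):

# downsample the full HiFi read set to roughly 12x (fraction illustrative)
seqtk sample -s100 hifi_reads.fq.gz 0.3 | gzip > hifi_12x.fq.gz
# assemble with default hifiasm settings; the k-mer histogram is written to stderr
hifiasm -o asm_12x -t 32 hifi_12x.fq.gz 2> asm_12x.log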

The assembly completed fine down to 12x, within 60 CPU hours and 60 GB peak RAM. The initial k-mer histogram ended with these values:

[M::ha_hist_line]    32: * 63612
[M::ha_hist_line]    33: * 56286
[M::ha_hist_line]    34: * 45328

The next step was down to 9x coverage, which crashed after 4 CPU hours when it hit the 108 GB of RAM I had requested. The initial histogram values ended with:

[M::ha_hist_line]  4090:  15
[M::ha_hist_line]  4091:  21
[M::ha_hist_line]  4092:  15
[M::ha_hist_line]  4093:  21
[M::ha_hist_line]  4094:  19
[M::ha_hist_line]  4095: **************************************************************************************************** 91824
[M::ha_hist_line]  rest:  0
[M::ha_analyze_count] left: count[100] = 39120
[M::ha_analyze_count] right: none
[M::ha_ft_gen] peak_hom: 4095; peak_het: 100
[M::ha_ft_gen::649.463*11.00@49.199GB] ==> filtered out 91843 k-mers occurring 4094 or more times
[M::ha_opt_update_cov] updated max_n_chain to 20475
[M::ha_pt_gen::1139.165*8.16] ==> counted 103014757 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)

Interestingly, the last line [M::ha_pt_gen] count[4095] = 0 (for sanity check) doesn't match the count of [M::ha_hist_line] 4095: 91824.

I would guess the low coverage is interfering with the k-mer counting, but I was surprised that hifiasm worked smoothly from 40x down to 12x and then broke completely at 9x. I could only find an option (-D) for dropping frequent k-mers, but couldn't see anything for dropping infrequent ones (maybe fewer bits for the bloom filter via -f?).
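
To make that concrete, the sort of tweak I had in mind is below; the values are illustrative and I haven't verified that either flag actually addresses this:

# -D drops k-mers occurring more than D*coverage times (default 5.0)
# -f sets the number of bits for the bloom filter (default 37; 0 disables it)
hifiasm -o asm_9x -t 32 -D 10 -f 36 hifi_9x.fq.gz 2> asm_9x.log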

Do you think this is just a (very reasonable) coverage limit that can't be crossed, or are there some settings I could adjust to force a result?

Thanks, Alex

chhylp123 commented 3 years ago

Sorry for the late reply. Is it possible that you can show the data at 9x coverage?

ASLeonard commented 3 years ago

By "show the data at 9x coverage", do you mean posting some summary statistics for the data, or sharing the data itself with you via FTP?

chhylp123 commented 3 years ago

Sharing the data via FTP might be better, so that we can do some debugging. I cannot be sure of the exact reason for now.

ASLeonard commented 3 years ago

I've shared the data with your listed email address.

chhylp123 commented 3 years ago

Got the email. Thanks a lot.

ASLeonard commented 3 years ago

Just tried reassembling the same data with the latest version, but without much improvement. The histogram still has the same strange distribution (same as in #66), and the job gets killed after using twice as much RAM as the higher-coverage datasets that succeeded.
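
For completeness, the rerun was just a plain default run on the new build (output prefix and thread count illustrative):

hifiasm --version
hifiasm -o asm_9x_latest -t 32 hifi_9x.fq.gz 2> asm_9x_latest.log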

chhylp123 commented 3 years ago

Please give me a few days. I'm debugging this problem on the dataset you sent me. I need to release v0.14 first, since it fixes many small bugs. Sorry for the delay.

ASLeonard commented 3 years ago

No worries, I'm in no rush. This was just testing the limits of low coverage rather than producing a primary assembly.

tallnuttrbgv commented 6 months ago

Was this resolved? I am seeing the same problem on a large genome (5 Gbp), but with about 40x reads. Exactly the same strange histogram, and hifiasm either times out or runs out of memory.