bacpop / PopPUNK

PopPUNK 👨‍🎤 (POPulation Partitioning Using Nucleotide Kmers)
https://www.bacpop.org/poppunk
Apache License 2.0

distance calculations hang forever #316

Closed: rderelle closed this issue 1 day ago

rderelle commented 2 days ago

Hi John,

I'm trying to use PopPUNK to classify 125k Mtb genomes.

Versions: poppunk 2.6.5, installed with Conda

Command used and output returned:

sketchlib sketch -l list_files_poppunk.txt -o poppunk2 -s 10000 -k 17,29,4 --cpus 12
sketchlib query dist poppunk2 -o dist2 --cpus 16

Describe the bug: The first command worked well and created the file "poppunk2.h5". However, the second command seems to hang forever without creating any output file (I tried twice with different numbers of CPUs). Here is the shell output:

Calculating distances using 16 thread(s)
Progress (CPU): 3.3% Progress (CPU): 6.7% Progress (CPU): 10.1% Progress (CPU): 13.3% Progress (CPU): 16.6% Progress (CPU): 19.8% Progress (CPU): 23.1% Progress (CPU): 26.3% Progress (CPU): 29.6% Progress (CPU): 32.8% Progress (CPU): 36.1% Progress (CPU): 39.3% Progress (CPU): 42.6% Progress (CPU): 45.8% Progress (CPU): 49.1% Progress (CPU): 52.3% Progress (CPU): 55.6% Progress (CPU): 58.8% Progress (CPU): 60.4%
No non-zero Jaccard distances
Fitting k-mer gradient failed, for:
SAMEA5875845
vs.
SAMN03253058
0.00400641 0.000300481 0.000200321 0.000200321

Check for low quality genomes
Progress (CPU): 63.6% Progress (CPU): 66.9% Progress (CPU): 70.1% Progress (CPU): 73.4% Progress (CPU): 76.6% Progress (CPU): 79.9% Progress (CPU): 83.1% Progress (CPU): 86.4% Progress (CPU): 89.6% Progress (CPU): 92.9% Progress (CPU): 96.1% Progress (CPU): 99.4% Progress (CPU): 100.0%

After that the job hangs for hours without output. Any help would be much appreciated as I'm currently stuck with this issue.

Many thanks, Romain

johnlees commented 2 days ago

This error should cause a crash, but might fail to do so on some sketchlib versions. Which version of sketchlib do you have?

Anyway, the issue is that one or both of SAMEA5875845 and SAMN03253058 need to be removed, as they share no k-mers.
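For anyone hitting the same error, removing the flagged samples from the input list before re-sketching could look roughly like the sketch below. It assumes the list file passed to `-l` has the sample name as the first whitespace-separated column on each line; the function name `drop_samples` and the output filename in the usage note are hypothetical, not part of PopPUNK or pp-sketchlib.

```python
def drop_samples(lines, to_remove):
    """Return the lines whose first column (assumed sample name) is not in to_remove."""
    to_remove = set(to_remove)
    kept = []
    for line in lines:
        fields = line.split()
        # Skip blank lines and any line whose sample name is excluded
        if fields and fields[0] not in to_remove:
            kept.append(line)
    return kept
```

To use it, read `list_files_poppunk.txt` with `open(...).readlines()`, pass the lines and `{"SAMEA5875845", "SAMN03253058"}` to `drop_samples`, and write the result to a new file (for example a hypothetical `list_files_filtered.txt`) to give to `sketchlib sketch -l`.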

johnlees commented 2 days ago

Also for Mtb you may want to increase the sketch size to 10^5

rderelle commented 2 days ago

Thanks! I will then check these 2 samples. For information, I'm using pp-sketchlib v2.1.4. Also I'll increase the sketch size.

johnlees commented 2 days ago

> I'm using pp-sketchlib v2.1.4.

That should be fine, but stopping the parallel code hasn't always been particularly reliable, sorry!

johnlees commented 2 days ago

I would also suggest running an initial test set of ~10k genomes to get it working and to estimate how long the full analysis will take.
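One way to draw such a test set is to subsample the input list at random. The following is only a sketch: it assumes one sample per line in the list file, and `subsample_lines` and its `seed` parameter are hypothetical names chosen for illustration.

```python
import random


def subsample_lines(lines, n, seed=1):
    """Pick up to n non-blank lines at random, reproducibly, keeping file order."""
    lines = [line for line in lines if line.strip()]
    if n >= len(lines):
        return lines
    rng = random.Random(seed)  # fixed seed so the same subset is drawn each time
    chosen = sorted(rng.sample(range(len(lines)), n))
    return [lines[i] for i in chosen]
```

Calling `subsample_lines(all_lines, 10000)` on the lines of the full list and writing the result to a separate file gives a smaller input for a trial `sketchlib sketch` and `sketchlib query dist` run.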

rderelle commented 2 days ago

I increased the sketch size to 100000 and removed 6 genomes not classified as Mtb by any other method (including SAMEA5875845).

Sketching took 50 min -> 81G file. Distances are taking about 10 min per 1% -> estimated computational time of ~16 h, which is fine.

Thanks a lot.
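The estimate above can be reproduced with a little arithmetic. The sketch below assumes the progress rate stays constant and that run time for an all-vs-all comparison scales with the number of sample pairs, n(n-1)/2; both are simplifying assumptions, and the function names are hypothetical.

```python
def n_pairs(n):
    """Number of unordered sample pairs in an all-vs-all comparison."""
    return n * (n - 1) // 2


def eta_hours(minutes_per_percent):
    """Total run time in hours, assuming a constant rate per 1% of progress."""
    return minutes_per_percent * 100 / 60


def scale_runtime(test_hours, n_test, n_full):
    """Extrapolate a test run to the full set, assuming time is proportional to pairs."""
    return test_hours * n_pairs(n_full) / n_pairs(n_test)
```

With 10 min per 1%, `eta_hours(10)` gives roughly 16.7 hours, in line with the figure quoted above. For about 125k genomes, `n_pairs(125000)` is about 7.8 billion pairs, which is why a ~10k test run (about 50 million pairs) is a cheap way to check the setup first.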

rderelle commented 1 day ago

The distance calculations have successfully finished. Thanks.