bacpop / PopPUNK

PopPUNK 👨‍🎤 (POPulation Partitioning Using Nucleotide Kmers)
https://www.bacpop.org/poppunk
Apache License 2.0

Incorrect merge reporting from novel queries step #268

Closed johnlees closed 1 year ago

johnlees commented 1 year ago

Versions

PopPUNK master, pp-sketchlib 2.1.1

Command used and output returned

/home/shorsfield/software/PopPUNK/poppunk_assign-runner.py --db /media/mirrored-hdd/shorsfield/jobs/poppunk-models/Salmonella/sal_sketch40k --query /media/mirrored-hdd/shorsfield/jobs/poppunk-models/Salmonella/Salmonella_query_popPUNK_ids_1k.txt --model-dir /media/mirrored-hdd/shorsfield/jobs/poppunk-models/Salmonella/dbscan_indiv_refine_both --core --threads 40 --output /media/mirrored-hdd/shorsfield/jobs/poppunk-models/Salmonella/query_assignments_1k_master

from dir /media/mirrored-hdd/shorsfield/jobs/poppunk-models/Salmonella/query_assignments_1k_master

Describe the bug

After the 'novel queries' step, all original clusters are reported as merged. The output itself seems OK, and the merge report from the earlier step is also fine.

johnlees commented 1 year ago

This looks like a 'works on my machine' issue:

python ~/installs/PopPUNK/poppunk_assign-runner.py --db /media/mirrored-hdd/shorsfield/jobs/poppunk-models/Salmonella/sal_sketch40k --output /media/mirrored-hdd/jlees/salmonella_assign --threads 8 --query Salmonella_query_popPUNK_ids_1k.txt --model-dir dbscan_indiv_refine_both --update-db
PopPUNK: assign
        (with backend: sketchlib v2.1.0
         sketchlib: /home/jlees/miniconda3/envs/pp-py39/lib/python3.9/site-packages/pp_sketchlib-2.1.1-py3.9-linux-x86_64.egg/pp_sketchlib.cpython-39-x86_64-linux-gnu.so)
Mode: Assigning clusters of query sequences

Graph-tools OpenMP parallelisation enabled: with 8 threads
Looking for existing sketches in /media/mirrored-hdd/jlees/salmonella_assign/salmonella_assign.h5
Loading previously refined model
Completed model loading
WARNING: versions of input databases sketches are different, results may not be compatible
Calculating distances using 8 thread(s)
Progress (CPU): 100.0%
Loading network from dbscan_indiv_refine_both/dbscan_indiv_refine_both_graph.gt
Network loaded: 48180 samples
Calculating all query-query distances
Using existing random match chances in DB
Calculating distances using 8 thread(s)
Progress (CPU): 100.0%
Clusters 2,1235 have merged into 2_1235
Clusters 5,1003 have merged into 5_1003
Clusters 6,737 have merged into 6_737
Clusters 32,784 have merged into 32_784
Clusters 72,1368 have merged into 72_1368
Clusters 105,1766 have merged into 105_1766
Clusters 152,795 have merged into 152_795
Clusters 206,1390,1473,1678 have merged into 206_1390_1473_1678
Clusters 232,1700 have merged into 232_1700
Updating reference database to /media/mirrored-hdd/jlees/salmonella_assign
Updating random match chances
Calculating random match chances using Monte Carlo
Removing 41780 sequences

Done

from /media/mirrored-hdd/shorsfield/jobs/poppunk-models/Salmonella
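
To compare the merge reports from the two runs, the "Clusters ... have merged into ..." lines can be pulled out of each saved log. This is a minimal sketch, assuming the output was redirected to a text file. `parse_merges` and `inconsistent_merges` are hypothetical helpers, not part of PopPUNK, and the regular expression is based only on the line format shown in the output above.

```python
import re

# Matches lines such as "Clusters 2,1235 have merged into 2_1235".
# Assumption: the format is inferred from the log above, not from PopPUNK's source.
MERGE_RE = re.compile(r"^Clusters (\S+) have merged into (\S+)$")


def parse_merges(log_text):
    """Return {merged_name: [original cluster names]} for each merge line."""
    merges = {}
    for line in log_text.splitlines():
        match = MERGE_RE.match(line.strip())
        if match:
            merges[match.group(2)] = match.group(1).split(",")
    return merges


def inconsistent_merges(merges):
    """Return merged names that are not the '_'-joined original names."""
    return [name for name, parts in merges.items() if "_".join(parts) != name]


# Two merge lines copied from the output above, plus a non-merge line.
example = """Clusters 2,1235 have merged into 2_1235
Clusters 206,1390,1473,1678 have merged into 206_1390_1473_1678
Done"""

merges = parse_merges(example)
print(merges)
print(inconsistent_merges(merges))
```

Running `parse_merges` on the logs from both machines and comparing the resulting dictionaries would show which merges appear only in the failing run.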