Running PopPUNK v2.5.0 with --multi-boundary --> OverflowError

avonm commented 1 year ago

Versions:
PopPUNK v2.5.0
PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
(with backend: sketchlib v2.0.0
sketchlib: /opt/conda/lib/python3.9/site-packages/pp_sketchlib.cpython-39-x86_64-linux-gnu.so)

Command used and output returned:
poppunk --fit-model refine --model-dir /tmp/Ecoli_n79k_QCd_dbscan_k18_k32 --ref-db /tmp/Ecoli_n79k_db_k18_k32_221115 --output /tmp/Ecoli_n79k_QCd_dbscan_refine_multi_1 --multi-boundary 30 --threads 20

Describe the bug:
Below is the output I get, showing both the progress of the run and the error. The plan is to run poppunk_iterate after this.

Graph-tools OpenMP parallelisation enabled: with 20 threads
Mode: Fitting refine model to reference database

Loading DBSCAN model
Completed model loading
Loaded previous model of type: dbscan
Initial model-based network construction based on DBSCAN fit
Trying to optimise score globally
Search range (0.001,0.057) to (0.014,0.304)
Searching core intercept from 0.006 to 0.042
Searching accessory intercept from 0.064 to 0.448
█████████████████████████████████| 40/40
Trying to optimise score locally

Optimization terminated successfully;
The returned value satisfies the termination criteria
(using xtol = 1e-05 )
Creating multiple boundary fits
Search range (0.000,0.044) to (0.006,0.164)
Searching core intercept from 0.004 to 0.022
Searching accessory intercept from 0.044 to 0.231
█▏ | 1/30
Traceback (most recent call last):
  File "/opt/conda/bin/poppunk", line 11, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.9/site-packages/PopPUNK/__main__.py", line 469, in main
    assignments = new_model.fit(distMat, refList, model,
  File "/opt/conda/lib/python3.9/site-packages/PopPUNK/models.py", line 808, in fit
    multi_refine(scaled_X,
  File "/opt/conda/lib/python3.9/site-packages/PopPUNK/refine.py", line 296, in multi_refine
    growNetwork(sample_names,
  File "/opt/conda/lib/python3.9/site-packages/PopPUNK/refine.py", line 442, in growNetwork
    G.add_edge_list(edge_list)
  File "/opt/conda/lib/python3.9/site-packages/graph_tool/__init__.py", line 2501, in add_edge_list
    libcore.add_edge_list_iter(self.__graph, edge_list, eprops)
OverflowError: can't convert negative value to unsigned int
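The error message indicates that a negative value reached a conversion to an unsigned integer while edges were being added in `add_edge_list`. When debugging something like this, one option is to check the edge list before it is handed to the graph library. The following is a minimal sketch, assuming the edges are `(source, target)` tuples of integer vertex indices; `check_edge_list` is a hypothetical helper and is not part of PopPUNK or graph-tool.

```python
from typing import Iterable, List, Tuple


def check_edge_list(edges: Iterable[Tuple[int, int]], n_vertices: int) -> List[Tuple[int, int]]:
    """Return the edges as a list, raising ValueError on any out-of-range index.

    Hypothetical debugging helper: it only checks that every vertex index
    lies in [0, n_vertices), which is what an unsigned index type requires.
    """
    checked = []
    for position, (source, target) in enumerate(edges):
        for vertex in (source, target):
            if vertex < 0 or vertex >= n_vertices:
                raise ValueError(
                    f"edge {position} = ({source}, {target}) has index {vertex} "
                    f"outside [0, {n_vertices})"
                )
        checked.append((source, target))
    return checked


if __name__ == "__main__":
    # A valid edge list passes through unchanged.
    print(check_edge_list([(0, 1), (1, 2)], n_vertices=3))
    # A negative index is reported with its position instead of failing
    # later inside compiled code.
    try:
        check_edge_list([(0, 1), (-5, 2)], n_vertices=3)
    except ValueError as err:
        print(f"bad edge list: {err}")
```

A check like this does not explain where a negative index comes from, but it would show which edge carries it and at which position in the list.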

johnlees commented 1 year ago

Seems like this is probably a bug. The search range starting at 0.000 might be the problem. Does it work with:

a smaller dataset?
another input model?
refine, without multi-boundary?
using a manual start position?

It would also be helpful to see the output plots.

avonm commented 1 year ago

Hi John, I am running refine now. It has worked before, so I hope it works this time too. I'll also try running multi-boundary with a smaller dataset. I've attached the plots I have so far: those from creating the database, from the model fit, and from running refine with the --multi-boundary flag.

Attached plots:
Ecoli_n79k_QCd_dbscan_refine_multi_1_refined_fit
Ecoli_n79k_QCd_dbscan_k18_k32_dbscan
Ecoli_n79k_db_k18_k32_221115_distanceDistribution
Ecoli_n79k_db_k18_k32_221115_fit_example_1.pdf
Ecoli_n79k_db_k18_k32_221115_fit_example_2.pdf
Ecoli_n79k_db_k18_k32_221115_fit_example_3.pdf
Ecoli_n79k_db_k18_k32_221115_fit_example_4.pdf
Ecoli_n79k_db_k18_k32_221115_fit_example_5.pdf

johnlees commented 1 year ago

This is E. coli, right? Broadly, it looks good.

I would suggest:

Run --qc-db to remove the points with accessory distances > 0.6.
Move the upper range down: find the second cluster (in blue, around (0.004, 0.18)) in your DBSCAN fit output, and use that as mean1 with --manual-start in the refinement.
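To see which samples a cutoff such as accessory distance > 0.6 would single out, the pairwise distances can be tallied per sample. The sketch below is only an illustration on made-up values: the tuple layout, the sample names and the function `count_high_accessory_pairs` are all assumptions, and this is not how `--qc-db` is implemented.

```python
from collections import Counter
from typing import Iterable, Tuple

# (sample_a, sample_b, core_distance, accessory_distance); layout is assumed.
Pair = Tuple[str, str, float, float]


def count_high_accessory_pairs(pairs: Iterable[Pair], max_accessory: float = 0.6) -> Counter:
    """Count, per sample, the pairs whose accessory distance exceeds the cutoff."""
    counts: Counter = Counter()
    for sample_a, sample_b, _core, accessory in pairs:
        if accessory > max_accessory:
            counts[sample_a] += 1
            counts[sample_b] += 1
    return counts


if __name__ == "__main__":
    # Made-up distances: s3 is far from both other samples in accessory distance.
    example_pairs = [
        ("s1", "s2", 0.002, 0.10),
        ("s1", "s3", 0.010, 0.65),
        ("s2", "s3", 0.011, 0.70),
    ]
    counts = count_high_accessory_pairs(example_pairs)
    for sample, n_pairs_over in counts.most_common():
        print(sample, n_pairs_over)
```

Samples that appear in many over-threshold pairs are the likely outliers; samples that appear only because of their distance to one such outlier have low counts.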

A difficulty is that (especially with a large dataset) there are many large strains and you end up with a large network (i.e. the contour line around the origin).

This could well be a bug in the multi-boundary code, but I think I'd need a similar failure on a smaller dataset to reproduce and fix it.

avonm commented 1 year ago

Refine works with the database where multi-boundary failed; see the output below. Based on this, do you think it might be a bug in the multi-boundary code? I am also trying to run multi-boundary on a smaller dataset, so I'll get back to you when that is done.

`Graph-tools OpenMP parallelisation enabled: with 20 threads
Mode: Fitting refine model to reference database

Loading DBSCAN model
Completed model loading
Loaded previous model of type: dbscan
Initial model-based network construction based on DBSCAN fit
Trying to optimise score globally
Searching core intercept from 0.006 to 0.042
Searching accessory intercept from 0.064 to 0.448
█████████████████████████████████| 20/20
Trying to optimise score locally
Optimization terminated successfully;
The returned value satisfies the termination criteria
(using xtol = 1e-05 )
Network summary:
Components 6740
Density 0.0370
Transitivity 0.9812
Mean betweenness 0.1650
Weighted-mean betweenness 0.0828
Score 0.9448
Score (w/ betweenness) 0.7890
Score (w/ weighted-betweenness) 0.8666
Removing 65155 sequences
Done`

johnlees commented 1 year ago

Yeah, I'm guessing we have a bug there. Please do let me know if there's a smaller dataset we can use to reproduce this, and I can try to fix it.
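One way to get a smaller dataset for reproduction is to draw a reproducible random subset of the sample names and build a new database from it. A minimal sketch, assuming the sample names are available as a list of strings; `subsample_names` and the placeholder names are hypothetical, and how the subset is then passed to PopPUNK is not shown here.

```python
import random
from typing import List, Sequence


def subsample_names(names: Sequence[str], size: int, seed: int = 1) -> List[str]:
    """Pick `size` names at random with a fixed seed, keeping the input order."""
    if size > len(names):
        raise ValueError(f"cannot draw {size} names from {len(names)}")
    rng = random.Random(seed)  # local generator, so the global state is untouched
    chosen = set(rng.sample(range(len(names)), size))
    return [name for index, name in enumerate(names) if index in chosen]


if __name__ == "__main__":
    # Placeholder names standing in for a real sample list.
    all_names = [f"sample_{i:05d}" for i in range(79_000)]
    subset = subsample_names(all_names, size=10_000, seed=1)
    print(len(subset), subset[:3])
```

Fixing the seed means the same subset can be regenerated later, which helps when a failure on the subset needs to be shared and reproduced.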

avonm commented 1 year ago

I ran the same commands on a smaller dataset (10k genomes) and it worked with multi-boundary. See output below. Will try and run on the full dataset after tweaking the QC. Maybe the potential bug is related to a large dataset?

`avm@node-14-26:/tmp$ module load poppunk/2.5.0-c2
Module loaded. For more information run 'module help poppunk/2.5.0-c2'.
avm@node-14-26:/tmp$ poppunk --fit-model refine --model-dir /tmp/Ecoli_n79k_db_k18_k32_n10k_QCd_acc_dist_dbscan_221206 --ref-db /tmp/Ecoli_n79k_db_k18_k32_n10k_QCd_acc_dist_221206 --output /tmp/Ecoli_n79k_db_k18_k32_n10k_QCd_acc_dist_dbscan_multi_221206 --multi-boundary 30 --threads 20
PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
(with backend: sketchlib v2.0.0
sketchlib: /opt/conda/lib/python3.9/site-packages/pp_sketchlib.cpython-39-x86_64-linux-gnu.so)

Graph-tools OpenMP parallelisation enabled: with 20 threads
Mode: Fitting refine model to reference database

Loading DBSCAN model
Completed model loading
Loaded previous model of type: dbscan
Initial model-based network construction based on DBSCAN fit
Trying to optimise score globally
Search range (0.001,0.056) to (0.014,0.304)
Searching core intercept from 0.006 to 0.041
Searching accessory intercept from 0.064 to 0.455
█████████████████████████████████| 40/40
Trying to optimise score locally

Optimization terminated successfully;
The returned value satisfies the termination criteria
(using xtol = 1e-05 )
Creating multiple boundary fits
Search range (0.000,0.044) to (0.004,0.112)
Searching core intercept from 0.004 to 0.014
Searching accessory intercept from 0.044 to 0.152
█████████████████████████████████| 30/30
Network summary:
Components 1209
Density 0.0431
Transitivity 0.9970
Mean betweenness 0.2072
Weighted-mean betweenness 0.0877
Score 0.9541
Score (w/ betweenness) 0.7564
Score (w/ weighted-betweenness) 0.8704
Removing 7840 sequences

Done`

avonm commented 1 year ago

Just want to let you know that poppunk with multi-boundary now worked:

`avm@node-14-26:/tmp$ module load poppunk/2.5.0-c2
Module loaded. For more information run 'module help poppunk/2.5.0-c2'.
avm@node-14-26:/tmp$ poppunk --fit-model refine --model-dir /tmp/Ecoli_n79k_db_k18_k32_QCd_acc_dist_dbscan_221206 --ref-db /tmp/Ecoli_n79k_db_k18_k32_QCd_acc_dist_221130 --output /tmp/Ecoli_n79k_db_k18_k32_QCd_acc_dist_dbscan_multi_221206 --multi-boundary 30 --threads 20
PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
(with backend: sketchlib v2.0.0
sketchlib: /opt/conda/lib/python3.9/site-packages/pp_sketchlib.cpython-39-x86_64-linux-gnu.so)

Graph-tools OpenMP parallelisation enabled: with 20 threads
Mode: Fitting refine model to reference database

Loading DBSCAN model
Completed model loading
Loaded previous model of type: dbscan
Initial model-based network construction based on DBSCAN fit
Trying to optimise score globally
Search range (0.001,0.057) to (0.014,0.304)
Searching core intercept from 0.006 to 0.040
Searching accessory intercept from 0.065 to 0.461
███████████████████████████████████| 1/1
Creating multiple boundary fits
Search range (0.000,0.044) to (0.001,0.057)
Searching core intercept from 0.004 to 0.006
Searching accessory intercept from 0.044 to 0.065
█████████████████████████████████| 30/30
Network summary:
Components 13331
Density 0.0227
Transitivity 0.8491
Mean betweenness 0.1724
Weighted-mean betweenness 0.0735
Score 0.8299
Score (w/ betweenness) 0.6868
Score (w/ weighted-betweenness) 0.7689
Removing 54898 sequences

Done`

johnlees commented 1 year ago

Ok! So perhaps a resource issue in the first try?
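For context on how resource demands scale with dataset size, the number of pairwise comparisons grows quadratically with the number of samples. The sketch below is plain arithmetic under stated assumptions: roughly 79,000 samples for the full dataset (inferred from the "n79k" in the database names), 10,000 for the subset, and two 32-bit values (core and accessory distance) stored per pair. Whether any of these figures is related to the error in this thread is not established.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def n_pairs(n_samples: int) -> int:
    """Number of unordered sample pairs, n * (n - 1) / 2."""
    return n_samples * (n_samples - 1) // 2


def distance_bytes(n_samples: int, values_per_pair: int = 2, bytes_per_value: int = 4) -> int:
    """Bytes needed for all pairwise distances (storage layout is an assumption)."""
    return n_pairs(n_samples) * values_per_pair * bytes_per_value


if __name__ == "__main__":
    # Sample counts are approximations taken from the thread, not exact figures.
    for label, n in [("subset", 10_000), ("full dataset", 79_000)]:
        pairs = n_pairs(n)
        gib = distance_bytes(n) / 2**30
        print(
            f"{label}: {n} samples, {pairs} pairs, "
            f"about {gib:.1f} GiB of distances, "
            f"pair count exceeds int32 range: {pairs > INT32_MAX}"
        )
```

Going from 10,000 to 79,000 samples multiplies the pair count by about 62, so a run that is comfortable on the subset can be much closer to memory limits on the full dataset.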

bacpop / PopPUNK

Running PopPUNK v2.5.0 with --multi-boundary --> OverflowError #248