Closed by avonm 7 months ago
Seems like this is probably a bug. The search range starting at 0.000 might be the problem. Does it work with:
It would also be helpful to see the output plots.
Hi John, I am running refine now; it has worked before, so I'm hoping it works this time too. I'll also try running multi-boundary with a smaller dataset. I've attached the plots I have so far. These include the plots from creating the database, from the model fit, and from running refine with the --multi-boundary flag.
Ecoli_n79k_db_k18_k32_221115_fit_example_1.pdf Ecoli_n79k_db_k18_k32_221115_fit_example_2.pdf Ecoli_n79k_db_k18_k32_221115_fit_example_3.pdf Ecoli_n79k_db_k18_k32_221115_fit_example_4.pdf Ecoli_n79k_db_k18_k32_221115_fit_example_5.pdf
This is E. coli right? Broadly, it looks good
I would suggest:
- `--qc-db` to remove the points with accessory distances > 0.6
- `--manual-start` in the refinement

A difficulty is that (especially with a large dataset) there are many large strains and you end up with a large network (i.e. the contour line around the origin).
This could well be a bug in the multi-boundary code, but I think I'd need to see the same failure on a smaller dataset to reproduce and fix it.
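The accessory-distance cutoff suggested above can be pictured with a small sketch. It assumes the pairwise distances are available as a NumPy array with core distances in column 0 and accessory distances in column 1. The function name `flag_high_accessory` and the toy values are hypothetical, and this is not how `--qc-db` works internally; it only illustrates what a 0.6 threshold on accessory distance would select.

```python
import numpy as np

def flag_high_accessory(dists, max_a_dist=0.6):
    """Return a boolean mask of pairs whose accessory distance exceeds the cutoff.

    dists: array of shape (n_pairs, 2); column 0 = core, column 1 = accessory.
    """
    dists = np.asarray(dists, dtype=float)
    return dists[:, 1] > max_a_dist

# Toy distances (hypothetical values, not taken from the E. coli database)
toy = np.array([
    [0.002, 0.05],
    [0.010, 0.30],
    [0.030, 0.65],
    [0.040, 0.72],
])
mask = flag_high_accessory(toy)
kept = toy[~mask]  # pairs at or below the cutoff
print(f"{mask.sum()} of {len(toy)} pairs exceed the accessory cutoff")
```

Plotting the flagged pairs on the core/accessory scatter is a quick way to see which points a cutoff like this would drop before refitting.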
Refine works with the database where multi-boundary failed. See below for output. Do you then think it might be a bug in multi-boundary code based on this? I am trying with a smaller dataset to run multi-boundary too. So I'll get back to you when that is done.
`Graph-tools OpenMP parallelisation enabled: with 20 threads
Mode: Fitting refine model to reference database
Loading DBSCAN model
Completed model loading
Loaded previous model of type: dbscan
Initial model-based network construction based on DBSCAN fit
Trying to optimise score globally
Searching core intercept from 0.006 to 0.042
Searching accessory intercept from 0.064 to 0.448
█████████████████████████████████| 20/20
Trying to optimise score locally
Optimization terminated successfully;
The returned value satisfies the termination criteria
(using xtol = 1e-05 )
Network summary:
Components 6740
Density 0.0370
Transitivity 0.9812
Mean betweenness 0.1650
Weighted-mean betweenness 0.0828
Score 0.9448
Score (w/ betweenness) 0.7890
Score (w/ weighted-betweenness) 0.8666
Removing 65155 sequences
Done`
Yeah, I'm guessing we have a bug there. Please do let me know if there's a smaller dataset we can use to reproduce this, and I can try to fix it.
I ran the same commands on a smaller dataset (10k genomes) and it worked with multi-boundary. See output below. Will try and run on the full dataset after tweaking the QC. Maybe the potential bug is related to a large dataset?
`avm@node-14-26:/tmp$ module load poppunk/2.5.0-c2
Module loaded. For more information run 'module help poppunk/2.5.0-c2'.
avm@node-14-26:/tmp$ poppunk --fit-model refine --model-dir /tmp/Ecoli_n79k_db_k18_k32_n10k_QCd_acc_dist_dbscan_221206 --ref-db /tmp/Ecoli_n79k_db_k18_k32_n10k_QCd_acc_dist_221206 --output /tmp/Ecoli_n79k_db_k18_k32_n10k_QCd_acc_dist_dbscan_multi_221206 --multi-boundary 30 --threads 20
PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
(with backend: sketchlib v2.0.0
sketchlib: /opt/conda/lib/python3.9/site-packages/pp_sketchlib.cpython-39-x86_64-linux-gnu.so)
Graph-tools OpenMP parallelisation enabled: with 20 threads
Mode: Fitting refine model to reference database
Loading DBSCAN model
Completed model loading
Loaded previous model of type: dbscan
Initial model-based network construction based on DBSCAN fit
Trying to optimise score globally
Search range (0.001,0.056) to (0.014,0.304)
Searching core intercept from 0.006 to 0.041
Searching accessory intercept from 0.064 to 0.455
█████████████████████████████████| 40/40
Trying to optimise score locally
Optimization terminated successfully;
The returned value satisfies the termination criteria
(using xtol = 1e-05 )
Creating multiple boundary fits
Search range (0.000,0.044) to (0.004,0.112)
Searching core intercept from 0.004 to 0.014
Searching accessory intercept from 0.044 to 0.152
█████████████████████████████████| 30/30
Network summary:
Components 1209
Density 0.0431
Transitivity 0.9970
Mean betweenness 0.2072
Weighted-mean betweenness 0.0877
Score 0.9541
Score (w/ betweenness) 0.7564
Score (w/ weighted-betweenness) 0.8704
Removing 7840 sequences
Done`
Just want to let you know that poppunk with multi-boundary has now worked:
`avm@node-14-26:/tmp$ module load poppunk/2.5.0-c2
Module loaded. For more information run 'module help poppunk/2.5.0-c2'.
avm@node-14-26:/tmp$ poppunk --fit-model refine --model-dir /tmp/Ecoli_n79k_db_k18_k32_QCd_acc_dist_dbscan_221206 --ref-db /tmp/Ecoli_n79k_db_k18_k32_QCd_acc_dist_221130 --output /tmp/Ecoli_n79k_db_k18_k32_QCd_acc_dist_dbscan_multi_221206 --multi-boundary 30 --threads 20
PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
(with backend: sketchlib v2.0.0
sketchlib: /opt/conda/lib/python3.9/site-packages/pp_sketchlib.cpython-39-x86_64-linux-gnu.so)
Graph-tools OpenMP parallelisation enabled: with 20 threads
Mode: Fitting refine model to reference database
Loading DBSCAN model
Completed model loading
Loaded previous model of type: dbscan
Initial model-based network construction based on DBSCAN fit
Trying to optimise score globally
Search range (0.001,0.057) to (0.014,0.304)
Searching core intercept from 0.006 to 0.040
Searching accessory intercept from 0.065 to 0.461
█████████████████████████████████| 1/1
Creating multiple boundary fits
Search range (0.000,0.044) to (0.001,0.057)
Searching core intercept from 0.004 to 0.006
Searching accessory intercept from 0.044 to 0.065
█████████████████████████████████| 30/30
Network summary:
Components 13331
Density 0.0227
Transitivity 0.8491
Mean betweenness 0.1724
Weighted-mean betweenness 0.0735
Score 0.8299
Score (w/ betweenness) 0.6868
Score (w/ weighted-betweenness) 0.7689
Removing 54898 sequences
Done`
Ok! So perhaps a resource issue in the first try?
Versions
Poppunk v2.5.0
PopPUNK (POPulation Partitioning Using Nucleotide Kmers) (with backend: sketchlib v2.0.0 sketchlib: /opt/conda/lib/python3.9/site-packages/pp_sketchlib.cpython-39-x86_64-linux-gnu.so)

Command used and output returned
poppunk --fit-model refine --model-dir /tmp/Ecoli_n79k_QCd_dbscan_k18_k32 --ref-db /tmp/Ecoli_n79k_db_k18_k32_221115 --output /tmp/Ecoli_n79k_QCd_dbscan_refine_multi_1 --multi-boundary 30 --threads 20

Describe the bug
Below is the output I get, showing both the progress of the run and the error. The plan is to run poppunk_iterate after this.
Graph-tools OpenMP parallelisation enabled: with 20 threads
Mode: Fitting refine model to reference database
Loading DBSCAN model
Completed model loading
Loaded previous model of type: dbscan
Initial model-based network construction based on DBSCAN fit
Trying to optimise score globally
Search range (0.001,0.057) to (0.014,0.304)
Searching core intercept from 0.006 to 0.042
Searching accessory intercept from 0.064 to 0.448
█████████████████████████████████| 40/40
Trying to optimise score locally
Optimization terminated successfully;
The returned value satisfies the termination criteria
(using xtol = 1e-05 )
Creating multiple boundary fits
Search range (0.000,0.044) to (0.006,0.164)
Searching core intercept from 0.004 to 0.022
Searching accessory intercept from 0.044 to 0.231
██ | 1/30
Traceback (most recent call last):
  File "/opt/conda/bin/poppunk", line 11, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.9/site-packages/PopPUNK/__main__.py", line 469, in main
    assignments = new_model.fit(distMat, refList, model,
  File "/opt/conda/lib/python3.9/site-packages/PopPUNK/models.py", line 808, in fit
    multi_refine(scaled_X,
  File "/opt/conda/lib/python3.9/site-packages/PopPUNK/refine.py", line 296, in multi_refine
    growNetwork(sample_names,
  File "/opt/conda/lib/python3.9/site-packages/PopPUNK/refine.py", line 442, in growNetwork
    G.add_edge_list(edge_list)
  File "/opt/conda/lib/python3.9/site-packages/graph_tool/__init__.py", line 2501, in add_edge_list
    libcore.add_edge_list_iter(self.__graph, edge_list, eprops)
OverflowError: can't convert negative value to unsigned int
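
The `OverflowError` indicates that at least one value in the edge list passed to `add_edge_list` was negative. As a debugging aid, a check along the following lines could be run on the edge list before it is added to the graph. This is only a sketch: `find_invalid_edges` is a hypothetical helper, not part of PopPUNK or graph-tool, and it assumes the edge list is a sequence of `(source, target)` integer pairs.

```python
import numpy as np

def find_invalid_edges(edge_list, n_vertices):
    """Return the rows of edge_list whose indices fall outside [0, n_vertices)."""
    # Assumption: each edge is a (source, target) pair of integer vertex indices
    edges = np.asarray(edge_list, dtype=np.int64).reshape(-1, 2)
    bad = (edges < 0).any(axis=1) | (edges >= n_vertices).any(axis=1)
    return edges[bad]

# Toy example: one negative index and one index past the last vertex
edges = [(0, 1), (2, -5), (3, 4), (1, 7)]
invalid = find_invalid_edges(edges, n_vertices=5)
print(invalid)
```

If a check like this reports negative indices only for very large inputs, integer overflow somewhere upstream would be one candidate explanation, although nothing in this thread confirms that.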