bacpop / PopPUNK

PopPUNK 👨‍🎤 (POPulation Partitioning Using Nucleotide Kmers)
https://www.bacpop.org/poppunk
Apache License 2.0
88 stars 18 forks source link

Problem fitting Salmonella Typhimurium database- convergence error #183

Closed andreaniml closed 2 years ago

andreaniml commented 2 years ago

Hello!

Thanks for the amazing tool! Looking forward to using it on my analyses, but I need a little help

Versions poppunk 2.4.0 poppunk_sketch 1.7.3

Command used and output returned After creating a database with: poppunk --create-db --output Salmonella_mink15_maxk35_sketch_size100000 --r-files reference_list.txt --threads 4 --mi n-k 15 --max-k 35 --sketch-size 100000 --plot-fit 5

ran: poppunk --fit-model bgmm --ref-db Salmonella_mink15_maxk35_sketch_size100000 --output Teste_com_K8 --K 8 poppunk --fit-model bgmm --ref-db Salmonella_mink15_maxk35_sketch_size100000 --output Teste_com_K12b --K 12 -- threads 6 poppunk --fit-model bgmm --ref-db Salmonella_mink15_maxk35_sketch_size100000 --output Teste_com_K5 --K 5

Describe the bug I am trying to fit a database of 456 genomes of Salmonella Typhimurium to a bgmm model and there are some issues.

First, I'm having problems with hitting good score:

If I run the program with --K 8 I get this scores:

Network summary:

    Components                              56
    Density                                 0.1040
    Transitivity                            0.8277
    Mean betweenness                        0.2590
    Weighted-mean betweenness               0.2266
    Score                                   0.7417
    Score (w/ betweenness)                  0.5496
    Score (w/ weighted-betweenness)         0.5736

Removing 307 sequences

If I raise this to --K 12 I get these:

Network summary:

    Components                              188
    Density                                 0.0365
    Transitivity                            0.9484
    Mean betweenness                        0.2066
    Weighted-mean betweenness               0.1875
    Score                                   0.9138
    Score (w/ betweenness)                  0.7250
    Score (w/ weighted-betweenness)         0.7424

Removing 212 sequences

Which are indeed better, however, I'm unsure if I can trust it becaue: 1) Is it okay/normal thar I have that difference within Score, Score (w/ betweenness), Score (w/ weighted-betweenness)? 2) Everytime I run the fit-model it pops this warning and I don't know if I can ignore it (my guess is that I can't, but I have no idea of what to do): /home/malu/anaconda3/envs/PopPunk/lib/python3.9/site-packages/sklearn/mixture/_base.py:265: ConvergenceWarning: Initialization 5 did not converge. Try different init parameters, or increase max_iter, tol or check for degenerate data. warnings.warn('Initialization %d did not converge. ' 3) Not sure if related, but I can't create a Cytoscape visualization ( I get "OSError: error reading from file '':basic_ios::clear: iostream error") (I can't create a cytoscape visualization even when using the listeria_example dataset) 4) dbscan is unable to identify clusters (Failed to find distinct clusters in this dataset)

The convergence error in 2) disapears if I lower the K value below 7, for K5 I get:

Network summary: Components 118 Density 0.0507 Transitivity 0.9223 Mean betweenness 0.3412 Weighted-mean betweenness 0.1778 Score 0.8756 Score (w/ betweenness) 0.5768 Score (w/ weighted-betweenness) 0.7199 Removing 256 sequences

Which is a nice fit I guess? But I'm not comfortable ignoring the previous results without understanding what caused them

My guess is that maybe my samples are too similar, and that's expected. One thing I wanted to do was filter-out redundancy using poppunk (my goal is to determine the evolution of a characteristic on Salmonella Typhimurium, hopefully, cleaning my dataset a little bit would reduce potential noise).

Here's the density plot Salmonella_mink15_maxk35_sketch_size100000_distanceDistribution

And here is the K plot of K5, I'm a little bothered that that big spot at the left is subdivided in 3 "blobs" Teste_com_K5_DPGMM_fit

It's okay to use the K5 results? Why isn't Cytoscape working?

Thanks!

johnlees commented 2 years ago

Firstly, can I just check that you've seen this part of the documentation: https://poppunk.readthedocs.io/en/latest/model_fitting.html#how-good-is-my-fit This might help give an overview of what you're aiming for here

Is it okay/normal thar I have that difference within Score, Score (w/ betweenness), Score (w/ weighted-betweenness)?

Yes, that's fine. The alternative scores only really help with datasets >10k samples https://poppunk.readthedocs.io/en/latest/model_fitting.html#alternative-network-scores

Everytime I run the fit-model it pops this warning and I don't know if I can ignore it (my guess is that I can't, but I have no idea of what to do): /home/malu/anaconda3/envs/PopPunk/lib/python3.9/site-packages/sklearn/mixture/_base.py:265: ConvergenceWarning: Initialization 5 did not converge. Try different init parameters, or increase max_iter, tol or check for degenerate data. warnings.warn('Initialization %d did not converge. '

This is actually ok. In BGMM mode the fit is run five times with different start points, and the best fit taken. This just shows that one of the attempts didn't work. It does suggest that the number of components and fit might not be working that well for the data, but it's not an error.

Not sure if related, but I can't create a Cytoscape visualization ( I get "OSError: error reading from file '':basic_ios::clear: iostream error") (I can't create a cytoscape visualization even when using the listeria_example dataset) dbscan is unable to identify clusters (Failed to find distinct clusters in this dataset)

I am not sure what might be causing this based on the information here. If you'd like me to investigate further, please open a new issue ideally with a reproducible example, including the full command run, and your conda environment information.

dbscan is unable to identify clusters (Failed to find distinct clusters in this dataset)

This can happen, and seems more common with smaller datasets. You can try altering --D or --min-cluster-prop (see https://poppunk.readthedocs.io/en/latest/model_fitting.html#dbscan)

Which is a nice fit I guess? But I'm not comfortable ignoring the previous results without understanding what caused them

This is of course good science! Overall however, you are doing the right thing by looking at the density plot, and the plot of the fit. This is the best way to assess whether the fit is doing what you want, and the network scores etc can be a useful quantitative companion to decide between choices. In any mode, only the two clusters used to draw the boundary matter, so as long as they are correct everything else will work fine.

I think you have two ways you could cluster this data, either like your K = 5 fit which splits the main component closest to the origin in two (you can see in the density plot there is a centre at a = 0.05 and another one very close to zero, which the fit is pretty much getting right), or capturing all of the points in the blob closest to the origin as a single component. You could do the latter with K = 2 or 3 I imagine, or using refine fit, or just using --threshold 0.002.

My guess is that maybe my samples are too similar, and that's expected. One thing I wanted to do was filter-out redundancy using poppunk (my goal is to determine the evolution of a characteristic on Salmonella Typhimurium, hopefully, cleaning my dataset a little bit would reduce potential noise).

So it depends on your purpose. pi < 0.001 is pretty closely related (I see you've increased the sketch size which is sensible here), and just taking that as a threshold (as you've clearly got a large separation in your plot) would identify all the closely related samples.

andreaniml commented 2 years ago

Thank you so much for the detailed answer! I will try out the --threshold 0.002 K=2 or 3 were giving out poor fits .

When I try to use refine as in poppunk --fit-model refine --ref-db Salmonella_mink15_maxk35_sketch_size100000 --model-dir Teste_com_K5 --output refinedK5

I get this message:

Traceback (most recent call last): File "/home/malu/miniconda3/envs/poppunk/bin/poppunk", line 11, in sys.exit(main()) File "/home/malu/miniconda3/envs/poppunk/lib/python3.9/site-packages/PopPUNK/main.py", line 411, in main assignments = new_model.fit(distMat, refList, model, File "/home/malu/miniconda3/envs/poppunk/lib/python3.9/site-packages/PopPUNK/models.py", line 781, in fit refineFit(X/self.scale, File "/home/malu/miniconda3/envs/poppunk/lib/python3.9/site-packages/PopPUNK/refine.py", line 203, in refineFit raise RuntimeError("Optimisation failed: produced a boundary outside of allowed range\n") RuntimeError: Optimisation failed: produced a boundary outside of allowed range

It only runs if I add the --unconstrained flag (it does not improve that much, I get a 0.8947 score)

Soon I'll open an issue on cytoscape crashing

Thanks again for the answer!