bacpop / PopPUNK

PopPUNK 👨‍🎤 (POPulation Partitioning Using Nucleotide Kmers)
https://www.bacpop.org/poppunk
Apache License 2.0

Clustering may not be reproducible with some large data sets #247

Closed: stitam closed this issue 7 months ago

stitam commented 1 year ago

Versions

PopPUNK v2.5.0 in a Singularity container. Link to the Dockerfile: https://github.com/StaPH-B/docker-builds/tree/master/poppunk/2.5.0. I am not sure about the pp-sketchlib version because the command returned "executable file not found in $PATH".
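(For reference, the pp-sketchlib version can usually be read from its Python bindings inside the container; a minimal sketch, assuming the image is named poppunk_2.5.0.sif and that the module exposes a version attribute, which may differ between releases:

singularity exec poppunk_2.5.0.sif python -c "import pp_sketchlib; print(pp_sketchlib.version)"
)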

Command used and output returned

poppunk --create-db --output ppdb_13_29_4_0.05_4 --r-files rlist.txt --threads 30 --plot-fit 5 --min-k 13 --max-k 29 --k-step 4 --min-cluster-prop 1e-05 --max-zero-dist 0.005

poppunk --fit-model dbscan --ref-db ppdb_13_29_4_0.05_4 --output results_13_29_4_0.05_4 --threads 30

Describe the bug

My main issue is that poppunk fails to cluster genomes appropriately when I run the tool on 15,000 genomes (before scaling up I ran it on about 500 genomes, and clustering completed successfully).

First, I ran poppunk with default parameters and everything clustered into a single PopPUNK cluster. Then I set up an experiment in which I varied the following parameters: min-k (13 or 16), max-k (29 or 31), k-step (3 or 4), max-zero-dist (0.005 or 0.05), and min-cluster-prop (0.00001 or 0.0001). The tag 13_29_4_0.05_4 in the section above means min-k 13, max-k 29, k-step 4, max-zero-dist 0.05, and min-cluster-prop 0.0001 (the final number is the -log10 of the value).

Most parameter combinations produced terrible results. Surprisingly, one completed successfully: the one with default parameters. So I ran poppunk with default parameters a second time, and it failed again. The most obvious explanation is that I keep overlooking something (working on it), but if that is not the case, I am wondering whether poppunk generates random numbers anywhere. Can you please clarify this? Many thanks.

johnlees commented 1 year ago

> My main issue is that poppunk fails to cluster genomes appropriately when I run the tool on 15,000 genomes (before scaling up I ran it on about 500 genomes, and clustering completed successfully).

It may be best in that case to just use your model fitted on the 500 genomes, and then use poppunk_assign with --update-db to extend the database to the extra samples.
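A minimal sketch of that workflow, assuming the 500-genome database directory is ppdb_500 and the remaining genomes are listed in qlist.txt (both names here are illustrative):

poppunk_assign --db ppdb_500 --query qlist.txt --output ppdb_extended --update-db --threads 30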

> First, I ran poppunk with default parameters and everything clustered into a single PopPUNK cluster. Then I set up an experiment in which I varied the following parameters: min-k (13 or 16), max-k (29 or 31), k-step (3 or 4), max-zero-dist (0.005 or 0.05), and min-cluster-prop (0.00001 or 0.0001). The tag 13_29_4_0.05_4 in the section above means min-k 13, max-k 29, k-step 4, max-zero-dist 0.05, and min-cluster-prop 0.0001 (the final number is the -log10 of the value).

None of these parameters really makes any difference to the model fitting step. You'd be best off first making your database and using --plot-fit to check that the k-mer lengths are working well, then using --qc-db to remove any low-quality samples (and checking the distance plots before and after).
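For example, a minimal QC sketch on the database built above (the thresholds here are illustrative; see the QC documentation for the full set of options):

poppunk --qc-db --ref-db ppdb_13_29_4_0.05_4 --output ppdb_qc --max-zero-dist 0.005 --max-a-dist 0.5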

Please see the documentation on model fitting for more information on that step: https://poppunk.readthedocs.io/en/latest/model_fitting.html. For dbscan you have --D and --min-cluster-prop, but most likely for this data you want to try another model, probably refine. Check your distance plot to see which is best.
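A minimal sketch of a refine fit, starting from the dbscan fit above (the --model-dir and output names are assumptions based on the earlier directories):

poppunk --fit-model refine --ref-db ppdb_13_29_4_0.05_4 --model-dir results_13_29_4_0.05_4 --output results_refine_13_29_4_0.05_4 --threads 30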

> Most parameter combinations produced terrible results. Surprisingly, one completed successfully: the one with default parameters. So I ran poppunk with default parameters a second time, and it failed again. The most obvious explanation is that I keep overlooking something (working on it), but if that is not the case, I am wondering whether poppunk generates random numbers anywhere. Can you please clarify this? Many thanks.

You need to try different models, and decide on which by looking at your data from the first step.

PopPUNK uses random numbers in a few places: the bgmm and dbscan models both use them for their initial states (which can determine whether the final model converges or not; bgmm automatically has a few retries and takes the best). They are also used in correcting small k-mer distances, although this typically doesn't make much difference.

NB: we are not aiming for every model-fitting run to be reproducible; stable clustering and reproducibility are achieved through use of poppunk_assign with a previously defined model.

stitam commented 1 year ago

Many thanks @johnlees for helping me with this.

I ran --create-db with a subset of genomes and two argument combinations: (1) default and (2) --min-k 16 --max-k 31 --k-step 3. The --plot-fit output looked okay (linear with a negative slope) for both. The difference between the two outcomes was that, for the lowest k-mer length, the raw and corrected matching k-mer proportions were quite distant with the default settings. I couldn't find anything about this in the manual, but it felt like a problem (this is why I started adjusting the k-mer parameters in the first place).
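For reference, the second combination was run roughly like this (a sketch; the subset list and output names are illustrative):

poppunk --create-db --output ppdb_subset_16_31_3 --r-files rlist_subset.txt --threads 30 --plot-fit 5 --min-k 16 --max-k 31 --k-step 3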

Also, with the default combination my distanceDistribution plot was non-monotonic: it had a maximum. With the alternative k-mer parameters there was no maximum, and it looked similar to the figures in the PopPUNK paper.

Can you confirm that the distant raw/corrected proportions and the maximum I saw are indicative of poor sketching?

johnlees commented 1 year ago

> Can you confirm that the distant raw/corrected proportions and the maximum I saw are indicative of poor sketching?

Yeah, it does sound like you just needed to start with a longer minimum k-mer length in this case.

stitam commented 1 year ago

I think I am getting closer to the source of the issue. I generated subsets using various strategies, ran the analysis, and finally visualised the networks using Cytoscape. It seems that every now and then a few genomes link together otherwise-distant clusters. When I try larger sets I (by chance) add more of these linkers, and ultimately everything collapses into one huge cluster plus some leftovers. I suppose the next step forward is ensuring these linkers are not added to the database.

https://poppunk.readthedocs.io/en/latest/visualisation.html
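For anyone following along, the Cytoscape input (a .graphml network) was generated roughly like this (a sketch; the database and directory names are illustrative):

poppunk_visualise --ref-db ppdb_subset --model-dir results_subset --output viz_subset --cytoscape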

johnlees commented 1 year ago

Have you tried running --qc-db? This sounds like potentially low-quality data, and we also have options like --max-merge in assign mode to remove these.
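A sketch of how that might look when extending a QC'd database (assuming --max-merge caps the number of cluster merges a query is allowed to cause; check poppunk_assign --help for the exact semantics, and note the file names here are illustrative):

poppunk_assign --db ppdb_qc --query qlist.txt --output ppdb_extended --update-db --max-merge 3 --threads 30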