Using Multiboundary/iterative option

andreaniml commented 1 year ago

Hi

I'm having trouble understanding the iterative poppunk output. I have two files, [prefix]_clusters.csv and [prefix]_cutoff_clusters.csv. If I want to check the clusters determined by the iterative process I would need to take the 'cutoff_clusters' file, right?

And still on that matter, I'm not sure how to set up the cutoff; I don't understand how it works. I got more clusters with a cutoff of 0.9 than with a cutoff of 0.45, and if I try a really small cutoff the number goes up again. What am I missing?

Many thanks!

johnlees commented 1 year ago

The cutoff-defined clusters are a second, optional step. The clusters defined in the pre-print are in the first file, not the cutoff file.

Regarding the cutoff clusters: @BZhao95 are you able to advise here?

BZhao95 commented 1 year ago

Hello,

The “[prefix]_cutoff_clusters.csv” file gives you the clusters using a certain cutoff value.

Regarding the cluster counts you got, it would help if you could share a picture of the unrooted iterative PopPUNK tree (the tree is saved as [prefix]_iterate.tree.nwk).

andreaniml commented 1 year ago

Hi! This is the tree for 0.01 cutoff, I think it's not easy to read given the number of samples (same tree, different layout, tips removed)

As I am testing different dataset compositions and this one is rather messy, I don't think I will take it further. However, I'm curious to know why the number of clusters rose when I used a large cutoff.

BZhao95 commented 1 year ago

Hi

Thank you for the tree plots. We can see that there is a very big cluster (e.g. cluster1) in your tree with some singletons or small cousin clusters next to it. The average core distance (ACD) of its parent cluster is greatly affected by these singletons or small clusters in your dataset. I am going to simplify your tree and will use 3 example figures to answer your question.

Here, Cluster1 is the big cluster with a relatively low ACD value (0.4), Cluster3 is a small cluster with an extremely high ACD (0.9), and Cluster2 is the parent cluster of Cluster1 and Cluster3. The ACD of this whole branch (Cluster2) is 0.45.

Figure A: When you choose a very small cutoff, no node is selected, leaving a set of 8 clusters (8 singletons).

Figure B: When the cutoff is 0.5, cluster2 is selected. Only 1 cluster is left.

Figure C: When a higher cutoff of 0.95 is adopted, cluster3 is chosen and the remaining 6 isolates are left as singletons (cluster3 + 6 singletons = 7 clusters).
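To make the arithmetic in the three figures concrete, here is a small toy model in Python. It is not PopPUNK's code. The selection rule (pick the node with the highest ACD that is still below the cutoff, and leave every isolate outside it as a singleton) is only an assumption chosen because it reproduces the counts in Figures A to C. The isolate names `s1` to `s8` and the 6 + 2 split between Cluster1 and Cluster3 are also illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A node of the simplified tree: a name, its ACD, and its isolates."""
    name: str
    acd: float
    members: frozenset


def count_clusters(nodes, all_isolates, cutoff):
    """Toy rule (an assumption, not PopPUNK's implementation):
    select the node with the highest ACD below the cutoff;
    every isolate outside that node becomes a singleton."""
    eligible = [n for n in nodes if n.acd < cutoff]
    if not eligible:
        # No node selected: every isolate is its own cluster.
        return len(all_isolates)
    chosen = max(eligible, key=lambda n: n.acd)
    return 1 + len(all_isolates - chosen.members)


# Hypothetical isolates matching the simplified example (8 in total).
isolates = frozenset(f"s{i}" for i in range(1, 9))
cluster1 = Node("cluster1", 0.40, frozenset(f"s{i}" for i in range(1, 7)))
cluster3 = Node("cluster3", 0.90, frozenset({"s7", "s8"}))
cluster2 = Node("cluster2", 0.45, cluster1.members | cluster3.members)
nodes = [cluster1, cluster2, cluster3]

for cutoff in (0.01, 0.5, 0.95):
    print(cutoff, count_clusters(nodes, isolates, cutoff))
```

Under this toy rule the count goes 8, 1, 7 for cutoffs 0.01, 0.5 and 0.95. The non-monotonic behaviour comes entirely from Cluster3 having a higher ACD than its parent Cluster2.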

I hope it explains why the number of clusters rose when you used a larger cutoff.

Suggestions:

To solve this problem, you can check “[prefix]_clusters.csv” to see whether you have a small cluster with an extremely high ACD. If so, go back to the PopPUNK distribution plot (“[prefix]_distanceDistribution.png”) and check whether there is a distant component. You can add “--qc-db” (https://poppunk.readthedocs.io/en/latest/qc.html) to remove low-quality (distant) samples from your dataset. In the example above, removing the two isolates in cluster3 would solve the problem.
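As a starting point for the first check above, the following sketch lists the small clusters in a cluster assignment file. It assumes a CSV with `Taxon` and `Cluster` columns; those column names and the `max_size=2` threshold are assumptions, so adjust them to match your file. It does not compute ACD, it only flags clusters small enough to be worth inspecting.

```python
import csv
import io
from collections import Counter


def cluster_sizes(csv_text, cluster_col="Cluster"):
    """Count how many rows (isolates) are assigned to each cluster."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[cluster_col] for row in reader)


def small_clusters(sizes, max_size=2):
    """Return the labels of clusters with at most max_size isolates."""
    return sorted(label for label, n in sizes.items() if n <= max_size)


# Hypothetical file contents mirroring the simplified example.
example = """Taxon,Cluster
s1,1
s2,1
s3,1
s4,1
s5,1
s6,1
s7,3
s8,3
"""

sizes = cluster_sizes(example)
print(small_clusters(sizes))
```

For a real file, read its contents with `open(path).read()` and pass the text to `cluster_sizes`.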

andreaniml commented 1 year ago

Hi! Thank you very much for taking the time to answer my question with such detail and the suggestions, my distance plot has indeed some high values of accessory distances (although I'm not surprised as I'm dealing with different serovars of Salmonella).
I will run the qc command, see if anything is changed and go from there.

Cheers!

bacpop / PopPUNK

Using Multiboundary/iterative option #235