Relationship between threshold and # clusters

jbmelander commented 1 year ago

The relationship between the phase1_detect_threshold and the number of found clusters is a bit surprising to me. For example, if I use a detect_sign of 0 and a threshold of 1.5, >20000 snippets are found, but this results in only 3 clusters. If I use a threshold of 8.5, ~5000 snippets are found, resulting in ~30 clusters. I would have expected that including more snippets resulted in more clusters. Do you have any advice for optimizing this parameter?

jbmelander commented 1 year ago

Similarly, increasing training_duration_sec results in fewer clusters. I am wondering whether I should increase the number of components alongside the number of snippets used for training.

magland commented 1 year ago

> The relationship between the phase1_detect_threshold and the number of found clusters is a bit surprising to me. For example, if I use a detect_sign of 0 and a threshold of 1.5, >20000 snippets are found, but this results in only 3 clusters. If I use a threshold of 8.5, ~5000 snippets are found, resulting in ~30 clusters. I would have expected that including more snippets resulted in more clusters. Do you have any advice for optimizing this parameter?

This is tricky. When the threshold is too low, I often find that distinct clusters get merged together, because each of them can be merged with a large noise cluster. That would explain why you get fewer clusters with a lower detect threshold.
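To get a rough feel for how much noise a low threshold lets in, here is a minimal back-of-the-envelope sketch. It is not part of mountainsort5. It assumes the threshold is expressed in units of the noise standard deviation (for example, after whitening) and that the noise is Gaussian, both of which are assumptions that may not hold for a given dataset. The function name `noise_crossing_fraction` is hypothetical. With detect_sign 0 (both polarities), roughly 13% of pure-noise samples exceed a threshold of 1.5, while essentially none exceed 8.5.

```python
import math


def noise_crossing_fraction(threshold: float, two_sided: bool = True) -> float:
    """Fraction of pure-noise samples whose amplitude exceeds `threshold`.

    Assumes unit-variance Gaussian noise (an assumption, not something the
    sorter guarantees). `two_sided=True` corresponds to detecting both
    positive and negative peaks (detect_sign of 0 in the thread above).
    """
    # Upper tail probability of a standard normal: P(Z > threshold)
    one_sided = 0.5 * math.erfc(threshold / math.sqrt(2.0))
    return 2.0 * one_sided if two_sided else one_sided


if __name__ == "__main__":
    # Thresholds mentioned in the thread
    for thr in (1.5, 8.5):
        frac = noise_crossing_fraction(thr)
        print(f"threshold={thr}: fraction of noise samples above = {frac:.3e}")
```

This counts samples, not detected events: a real detector typically keeps only local peaks within some time window, so the actual number of noise snippets will be smaller than this fraction suggests. The point is only that at a threshold of 1.5 the detected set is likely to be dominated by noise.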

I don't have any guidance for you on this, since I think the optimal choice will depend very much on the type of dataset.

jbmelander commented 1 year ago

OK. Thanks. I just wanted to confirm that this was expected behavior.

flatironinstitute / mountainsort5
