TutteInstitute / evoc

Embedding Vector Oriented Clustering
BSD 2-Clause "Simplified" License

Feedback on `best_layer` selection #8

Open zilch42 opened 1 month ago

zilch42 commented 1 month ago

Hi there,

I'm really enjoying looking through this library, and I wanted to share some thoughts on how the 'best layer' returned by fit_predict is selected. First, to check my understanding: is the layer that produces the fewest outliers assumed to be the one that best fits the data, and is that therefore the layer fit_predict returns?
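To make the question concrete, here is a minimal sketch of that reading. It is not EVoC's actual code: the helper name `fewest_outliers_layer` is hypothetical, layers are plain lists of labels, and it assumes outliers are labelled -1 (the usual HDBSCAN-style convention).

```python
def fewest_outliers_layer(cluster_layers):
    """Return the index of the layer with the fewest outlier labels.

    Hypothetical helper. Assumes each layer is a sequence of integer
    labels and that outliers are marked with -1.
    """
    outlier_counts = [
        sum(1 for label in layer if label == -1) for layer in cluster_layers
    ]
    # On a tie, min() keeps the earliest index, i.e. the more granular layer.
    return min(range(len(cluster_layers)), key=outlier_counts.__getitem__)


# Toy label layers, ordered from most granular to least granular.
layers = [
    [0, 1, -1, -1, 2, -1],  # 3 outliers
    [0, 0, -1, 1, 1, 1],    # 1 outlier
    [0, 0, 0, -1, -1, 0],   # 2 outliers
]
best_index = fewest_outliers_layer(layers)
```

With this toy input the selected layer is index 1 rather than the most granular layer at index 0, which mirrors the behaviour described in this issue.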

Initially I thought it was a bug that fit_predict wasn't returning the most granular layer contained in cluster_layers (in my case it was returning cluster_layers[1]) until I went looking through the code and found the best_layer calculation.

It wasn't intuitive or clear to me that the most granular layer wasn't the one returned, so I think that needs to be spelled out in the docstrings of both EVoC and fit_predict.

Secondly, I think it would be good to have an option controlling which layer fit_predict returns. While it is easy enough to get the most granular layer explicitly from cluster_layers[0], in some use cases (e.g. using EVoC as a drop-in clusterer in BERTopic) BERTopic is just going to call fit_predict and use whatever EVoC thinks is best. If the user sets base_min_cluster_size to control the level of granularity they expect in the resulting clusters, but EVoC then chooses a different layer as the best layer, the user won't get the granularity they expect.

It might be nice to introduce something like layer_selection = ['best', 'bottom', 'top'] so the user can force fit_predict to return the most granular layer if desired. 'best' could be called fewest_outliers or something similar, to be more explicit about how 'best' is determined. The option could also accept an integer to select a specific layer (with the top layer returned if the integer is out of range).
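A rough sketch of how such an option might behave, written as a standalone function instead of a change to EVoC itself. The function name `select_layer` and the parameter name `layer_selection` are hypothetical; the sketch assumes layers are ordered from most granular (index 0) to least granular (last index) and that outliers are labelled -1.

```python
def select_layer(cluster_layers, layer_selection="fewest_outliers"):
    """Pick one layer of labels according to ``layer_selection``.

    Hypothetical sketch, not part of EVoC. Assumes ``cluster_layers`` is
    ordered from most granular to least granular, with outliers as -1.
    """
    if len(cluster_layers) == 0:
        raise ValueError("cluster_layers is empty")

    if isinstance(layer_selection, int) and not isinstance(layer_selection, bool):
        # An integer selects that layer; out of range falls back to the top.
        if 0 <= layer_selection < len(cluster_layers):
            return cluster_layers[layer_selection]
        return cluster_layers[-1]

    if layer_selection == "bottom":
        return cluster_layers[0]  # most granular layer
    if layer_selection == "top":
        return cluster_layers[-1]  # least granular layer
    if layer_selection in ("best", "fewest_outliers"):
        # Fewest outliers wins; ties go to the more granular layer.
        return min(cluster_layers, key=lambda layer: sum(1 for x in layer if x == -1))

    raise ValueError(f"unknown layer_selection: {layer_selection!r}")


# Toy label layers for illustration only.
layers = [
    [0, 1, -1, -1, 2, -1],
    [0, 0, -1, 1, 1, 1],
    [0, 0, 0, -1, -1, 0],
]
bottom = select_layer(layers, "bottom")
best = select_layer(layers, "fewest_outliers")
clamped = select_layer(layers, 10)
```

Accepting 'best' as an alias for fewest_outliers would keep the current default behaviour while making the explicit name available.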

Just some ideas.

lmcinnes commented 1 month ago

Yes, I think a few more options there might be useful. I haven't really gotten to the point of properly documenting everything, and have been travelling and on vacation for the past month. I'll see what I can do when I get back to working on this and have time available.