Open zilch42 opened 1 month ago
Yes, I think a few more options there might be useful. I haven't really gotten to the point of properly documenting everything, and have been travelling and on vacation for the past month. I'll see what I can do when I get back to working on this and have time available.
Hi there,
I'm really enjoying looking through this library. I wanted to share some thoughts on how `fit_predict` selects the 'best layer' it returns.

Firstly, I just wanted to check my understanding. Is the approach that the layer producing the fewest outliers is assumed to be the one that best fits the data, and is therefore the one returned by `fit_predict`?

Initially I thought it was a bug that `fit_predict` wasn't returning the most granular layer contained in `cluster_layers` (in my case it was returning `cluster_layers[1]`) until I went looking through the code and found the `best_layer` calculation.

It wasn't intuitive or clear to me that the most granular layer isn't necessarily the one returned, so I think that needs to be outlined in the docstrings of both `EVoC` and `fit_predict`.

Secondly, I think it would be good to have some option for which layer is returned by `fit_predict`. While it is easy enough to get the most granular layer from `cluster_layers[0]` explicitly, for some use cases (e.g. using EVoC as a drop-in clusterer in `BERTopic`), BERTopic is just going to call `fit_predict` and use whatever layer EVoC decides is best. If the user sets `base_min_cluster_size` to try to control the level of granularity they expect in the resulting clusters, but then `EVoC` chooses a different layer as the best layer, the user won't get the level of granularity they expect.

It might be nice to introduce something like `layer_selection = ['best', 'bottom', 'top']` so the user can force `fit_predict` to return the most granular layer if desired. `best` could be called `fewest_outliers` or something to be more explicit about how 'best' is being determined. It could also take an integer to just select a given layer (with the top layer returned if the integer is out of range).

Just some ideas.
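
To make the suggestion concrete, here is a rough sketch of the selection logic as a standalone helper. It is not EVoC's code: `select_layer` and the `layer_selection` argument are hypothetical names, and the sketch assumes that the layers are ordered from most granular (index 0) to coarsest (last index), that outliers are labelled `-1`, and that my reading of 'best' as 'fewest outliers' is right.

```python
import numpy as np


def select_layer(cluster_layers, layer_selection="fewest_outliers"):
    """Return one label array from a list of cluster layers.

    Hypothetical helper, not part of EVoC. Assumes cluster_layers[0] is the
    most granular layer, the last entry is the coarsest, and -1 marks outliers.
    """
    if len(cluster_layers) == 0:
        raise ValueError("cluster_layers is empty")

    top = len(cluster_layers) - 1
    is_int = isinstance(layer_selection, (int, np.integer))
    if is_int and not isinstance(layer_selection, bool):
        # Any integer outside 0..top (including negatives) falls back to the top layer.
        index = int(layer_selection) if 0 <= layer_selection <= top else top
    elif layer_selection == "bottom":
        index = 0
    elif layer_selection == "top":
        index = top
    elif layer_selection == "fewest_outliers":
        # Count points labelled -1 in each layer and take the smallest count.
        # np.argmin returns the first minimum, so ties go to the more granular layer.
        outlier_counts = [int(np.sum(np.asarray(layer) == -1)) for layer in cluster_layers]
        index = int(np.argmin(outlier_counts))
    else:
        raise ValueError(f"Unknown layer_selection: {layer_selection!r}")

    return np.asarray(cluster_layers[index])


# Toy layers for six points, most granular first.
layers = [
    np.array([0, 1, -1, -1, 2, 2]),   # 2 outliers
    np.array([0, 0, 0, -1, 1, 1]),    # 1 outlier
    np.array([0, 0, -1, -1, 0, 0]),   # 2 outliers
]

default_labels = select_layer(layers)            # picks layers[1]
granular_labels = select_layer(layers, "bottom") # picks layers[0]
fallback_labels = select_layer(layers, 7)        # out of range, picks layers[2]
```

With something like this, a BERTopic user who only ever calls `fit_predict` could ask for `"bottom"` up front and get the granularity implied by `base_min_cluster_size`, while the default behaviour stays as it is today.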