MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

different clusters, different runs #1565

Closed aph61 closed 10 months ago

aph61 commented 11 months ago

Hi,

I'd like some input on an idea before I start hacking on it.

The algorithm gives different topics (probability matrices) between runs, and the differences can be substantial (I've had cases where the probability for a single sentence varied by 0.3). The cause is the stochastic nature of UMAP.

One proposal is that by fixing the random seed the results become deterministic. My problem with this approach is that it freezes the solution, which is often good enough but incorrect for projections. I solved this by running the complete simulation n times and averaging the probabilities in the sentence-topic matrix over the different runs. That gives very repeatable results (two probability matrices, each averaged over 7 runs).

Right now I calculate the final probability matrix as the average over several simulations and use it to determine the topics of a sentence, keywords, etc. Is it a valid approach (is it even possible) to average over several UMAP projections to make the final result more repeatable?

```python
def _reduce_dimensionality(self,
                           embeddings: Union[np.ndarray, csr_matrix],
                           y: Union[List[int], np.ndarray] = None,
                           partial_fit: bool = False) -> np.ndarray:
    ...
    umap_embeddings = self.umap_model.transform(embeddings)
```
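The averaging I have in mind could be sketched like this (a minimal numpy sketch; `average_probability_matrices` is a hypothetical helper, not part of BERTopic, and it assumes every run yields the same number of topics in the same order, which in practice would require matching topics across runs first):

```python
import numpy as np

def average_probability_matrices(prob_matrices):
    """Average sentence-topic probability matrices over several runs.

    Assumes every run produced the same number of topics, in the same
    order; in practice topics would first have to be matched across runs.
    """
    stacked = np.stack(prob_matrices)  # shape: (n_runs, n_docs, n_topics)
    return stacked.mean(axis=0)        # shape: (n_docs, n_topics)

# Toy example: seven "runs" over 4 documents and 3 topics
rng = np.random.default_rng(42)
runs = [rng.dirichlet([1.0, 1.0, 1.0], size=4) for _ in range(7)]
avg = average_probability_matrices(runs)
```

Since each row of every run is a probability distribution, the averaged rows still sum to 1.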

thanks

Andreas

PS: another advantage (?) would be that you need less data to get a reliable model??

MaartenGr commented 11 months ago

> One proposal is that by fixing the random seed the results become deterministic. My problem with this approach is that it freezes the solution, which is often good enough but incorrect for projections. I solved this by running the complete simulation n times and averaging the probabilities in the sentence-topic matrix over the different runs. That gives very repeatable results (two probability matrices, each averaged over 7 runs).

I am a bit confused. You mention here that you average the probabilities in the sentence-topic matrix (which would be probs) to generate repeatable results, but here:

> Right now I calculate the final probability matrix as the average over several simulations and use it to determine the topics of a sentence, keywords, etc. Is it a valid approach (is it even possible) to average over several UMAP projections to make the final result more repeatable?

you mention averaging the UMAP embeddings to create more repeatable results. If you are averaging the probs, how would you do that if the number of topics differs between runs?

Either way, both are worthwhile approaches to the potential underlying problem, but averaging probabilities could be tricky if a different number of topics is created in each run. You could indeed average the embeddings, since the document order (before clustering) is preserved across runs; that seems to make the most sense here.
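That embedding-averaging direction could be sketched as follows (numpy only; `average_umap_embeddings` is a hypothetical helper, and note that UMAP layouts from different seeds can be rotated or reflected relative to each other, so real runs would likely need a Procrustes-style alignment before averaging):

```python
import numpy as np

def average_umap_embeddings(embedding_runs):
    """Element-wise average of low-dimensional embeddings from several runs.

    Assumes the document order is identical across runs. UMAP layouts from
    different seeds may be rotated/reflected relative to each other, so the
    runs might need to be aligned (e.g. Procrustes) before averaging.
    """
    return np.mean(np.stack(embedding_runs), axis=0)

# Toy example: three "runs" embedding 5 documents into 2 dimensions
rng = np.random.default_rng(0)
runs = [rng.normal(size=(5, 2)) for _ in range(3)]
avg = average_umap_embeddings(runs)
```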

One thing to note, though, is that there is a risk in focusing on the "stochastic" nature of UMAP. If you try to optimize the projections' stability across different seeds, shouldn't you then also average over different parameter settings of UMAP itself? Stable results are indeed important, but the parameters themselves can also be viewed as something to tweak.

> My problem with this approach is that it freezes the solution, which is often good enough but incorrect for projections.

This is not necessarily the case, and something you would have to test yourself. Essentially, you are saying that because of the stochastic nature of UMAP some projections could be quite inaccurate. This shows up as large differences in the probabilities, which are generated through HDBSCAN. So technically, the problem could also lie with HDBSCAN instead. For example, it could be highly sensitive to small changes in the projections and their respective locations, even though the differences between projections are marginal.

I am not saying the issue lies with HDBSCAN, but stochasticity is not an issue when the embeddings are accurate, even if they differ between runs. Just an idea.

> PS: another advantage (?) would be that you need less data to get a reliable model??

Not necessarily; the underlying UMAP model still needs to learn representations from the input data, and although it is stochastic, simply running it a couple of times would not be as performant as adding more data.

Technically, methods like k-means are also stochastic, depending on the random initialization of the clusters, but more data would generally still be better.
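As an aside, scikit-learn's k-means already builds in one answer to this kind of stochasticity: its `n_init` parameter restarts the algorithm from several random initializations and keeps the run with the lowest inertia (the toy data and parameter values below are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs of toy 2-D data, 20 points each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
               rng.normal(5.0, 0.1, size=(20, 2))])

# n_init=10: run k-means from 10 random initializations, keep the best
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
```

With blobs this well separated, every restart converges to the same two clusters, so the randomness washes out.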

aph61 commented 10 months ago

Hi Maarten,

Thanks for your comments. I've given it some further thought, and I believe that what I want is theoretically not possible. You're mapping an N-dimensional space to fewer dimensions, and a deterministic method (steepest descent, for example) will get you stuck in local minima; you'll also have more minima (topics) to deal with. You overcome the false-minimum problem by introducing randomness.

One strategy is to run a number of unconstrained (minimum-size) simulations and take the average number of topics for subsequent simulations. Once the number of topics is fixed, you can internally run UMAP several times to improve your estimates of the topics (content, size).
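That two-step strategy could be sketched as follows; `run_topic_model` here is a purely illustrative stub (a real run would fit the full UMAP + HDBSCAN pipeline, and BERTopic's `nr_topics` parameter could play the role of the fixed topic count):

```python
import numpy as np

def run_topic_model(nr_topics=None, seed=None):
    """Illustrative stub that only returns a topic count.

    A real implementation would fit the full pipeline; unconstrained runs
    return a varying number of topics, constrained runs a fixed one.
    """
    if nr_topics is not None:
        return nr_topics
    rng = np.random.default_rng(seed)
    return int(rng.integers(8, 13))  # topic count varies per run

# Step 1: several unconstrained runs to estimate the number of topics
counts = [run_topic_model(seed=s) for s in range(7)]
k = int(round(np.mean(counts)))

# Step 2: rerun with the topic count fixed; averaging over runs is
# now well-defined because every run has the same number of topics
final_k = run_topic_model(nr_topics=k)
```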

I have not tried the last steps yet, but I made 3D plots of an existing topic distribution and obtained great results (akin to https://umap-learn.readthedocs.io/en/latest/supervised.html).

My take: UMAP averaging does not work for fully unconstrained, unsupervised model building. Averaging is possible if you run a constrained unsupervised simulation; averaging over UMAP runs then gives you a better estimate of the topic model (the projection of the N-dimensional space).

Intuitively, I'd say that to calculate the topic distribution of a new document you'd still need to run the transform several times. The estimate of the distribution of the new document remains stochastic, just relative to a better model than before.

Andreas

PS: Maybe it's best to move this to the Discussions section