It highly depends on the documents and the use case for which you are optimizing, so unfortunately I cannot give any general advice. Having said that, it might be interesting to use `n_neighbors=10` and `min_topic_size=100`, as that already gives you 112 topics, which is relatively close to the 50-100 meaningful topics. Using those parameters and then setting `nr_topics=50`, `nr_topics=100`, or somewhere in between, might result in the meaningful topics that you are looking for.
Do note that I would highly advise focusing on manual inspection of the topics first to get an idea of whether 50 or 100 topics actually are interpretable or make sense. Some documents simply have many topics and trying to reduce that value, by forcing topics together, may result in vague and difficult to interpret topics.
Personally, I like to focus mostly on `min_topic_size`, as it directly relates to the number of topics that will be generated. Typically, the default `n_neighbors` works quite well, and although optimizing it might improve the results, I think `min_topic_size` will get you much further and faster.
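For reference, those parameters map onto the API roughly like this (a minimal sketch; the corpus loader below is just a stand-in, and the values mirror the suggestion above):

```python
from sklearn.datasets import fetch_20newsgroups
from umap import UMAP
from bertopic import BERTopic

# Stand-in corpus for illustration only.
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

# n_neighbors is a UMAP parameter, so it is set on a custom UMAP model that is
# passed to BERTopic. The other UMAP values mirror BERTopic's defaults.
umap_model = UMAP(n_neighbors=10, n_components=5, min_dist=0.0, metric="cosine")

# min_topic_size and nr_topics are BERTopic parameters.
topic_model = BERTopic(
    umap_model=umap_model,
    min_topic_size=100,  # gave roughly 112 topics on your corpus, per the numbers above
    nr_topics=50,        # or 100, or anything in between
)

topics, probs = topic_model.fit_transform(docs)
```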
Thank you for the advice, @MaartenGr!
I will change my strategy to use the default value for `n_neighbors` and try different values for `min_topic_size` below and above 100, then inspect the topics manually to gauge how meaningful they are.
Here are some tentative results of my experimentation so far, in case it's interesting or useful for someone else.
| min_topic_size | topics | topics_auto_reduced | topics_change | topic_none | topics_auto_reduced_none | topics_none_change |
|---|---|---|---|---|---|---|
| 10 | 1113 | 811 | 27% | 38163 | 38163 | 0% |
| 20 | 553 | 323 | 42% | 36221 | 36221 | 0% |
| 40 | 290 | 193 | 33% | 37180 | 37180 | 0% |
| 60 | 181 | 96 | 47% | 33070 | 33070 | 0% |
| 80 | 131 | 40 | 69% | 35162 | 39724 | -13% |
| 100 | 112 | 39 | 65% | 34459 | 38797 | -13% |
| 120 | 90 | 55 | 39% | 36987 | 36987 | 0% |
| 140 | 71 | 32 | 55% | 35593 | 35593 | 0% |
| 160 | 67 | 47 | 30% | 38104 | 38104 | 0% |
| 180 | 56 | 30 | 46% | 36887 | 36887 | 0% |
| 200 | 48 | 24 | 50% | 37366 | 37366 | 0% |
| 220 | 44 | 22 | 50% | 39044 | 39044 | 0% |
| 240 | 42 | 20 | 52% | 37468 | 37468 | 0% |
| 260 | 38 | 21 | 45% | 38707 | 38707 | 0% |
| 280 | 6 | 6 | 0% | 84249 | 84249 | 0% |
| 300 | 35 | 26 | 26% | 40767 | 40767 | 0% |
| 350 | 6 | 6 | 0% | 83163 | 83163 | 0% |
`n_neighbors` was set to the default value of 15 for all of the above runs.
In particular, I was surprised to see the number of topics increase with `min_topic_size=300`, and then drop back down to 6 with `min_topic_size=350`. How peculiar!
The columns labeled `*_none` are counts of documents assigned to "topic" -1. For `min_topic_size=80` and `min_topic_size=100`, there was a 13% increase after automatic topic reduction. I expected to see an increase in -1 after all reductions.
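For context, the columns can be computed roughly like this (a sketch rather than my exact script; it assumes a recent BERTopic version where `reduce_topics` updates the model in place, and uses a stand-in corpus):

```python
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

# Stand-in corpus for illustration only.
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

topic_model = BERTopic(min_topic_size=100)  # n_neighbors left at the UMAP default of 15
topics, _ = topic_model.fit_transform(docs)

n_topics = len(set(topics)) - (1 if -1 in topics else 0)  # "topics" column
n_none = topics.count(-1)                                 # "topic_none" column

topic_model.reduce_topics(docs, nr_topics="auto")         # automatic topic reduction
reduced = topic_model.topics_
n_topics_reduced = len(set(reduced)) - (1 if -1 in reduced else 0)  # "topics_auto_reduced"
n_none_reduced = reduced.count(-1)                                  # "topics_auto_reduced_none"

print(n_topics, n_topics_reduced, n_none, n_none_reduced)
```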
Great, thank you for sharing this!
I have also tried hundreds of experiments and found that the number of outliers can be drastically reduced, but doing so causes them to be merged into the largest clusters, thus leading to poor interpretability.
I've been doing a lot of work around the effect of config parameters on the number of topics, the size of topics, and the number of outliers. A couple of comments/suggestions regarding UMAP and HDBSCAN parameter searches:

1. Tuning these parameters through the full BERTopic model is very expensive. For example, if you want to change an HDBSCAN parameter (the most lightweight of the options), you will wind up re-running UMAP every time, which is very expensive.
2. UMAP might be the issue. I would separate that investigation from the HDBSCAN parameters.
3. In my preliminary work I find that (in agreement with the UMAP documentation) nearest neighbors is important. I strongly suspect that problems in UMAP will be reflected in the separation of major clusters. For example, if you look at a UMAP visualization and can clearly see, say, 5 different clusters, where 4 of them are close to each other and separated by a noticeably large space from the 5th, which is very large, then this is a case that might not solve very well. It may be that the 4 clusters and the fifth cluster have geometries that are difficult for UMAP to model correctly. Tweaking the UMAP parameters might work (and has for me without too much trouble), but you could also consider segmenting the data (in this case, separating clusters 1-4 from cluster 5) and modeling the parts separately.
4. Once you have a UMAP reduction you can live with (or think you can), running LOTS of HDBSCAN trials is very inexpensive, and this is where you will likely see the most dramatic returns.
5. With HDBSCAN, changing `min_samples` and `min_cluster_size` has large and often very unexpected effects (`min_samples` is relative to `min_cluster_size`), so you should try lots of different combinations.
6. To do all this efficiently, I recommend pulling out the embeddings and running UMAP and HDBSCAN independently from BERTopic to avoid the otherwise very expensive overhead.
7. Visualize, visualize, visualize. Reading through lots of textual results is VERY hard to follow; it is almost impossible to clearly see the effects of relatively small changes in the parameters. I suggest creating 2D scatter plots of the embedding reductions themselves as well as using a parameter-tracking tool (like wandb.ai) so you can really see what is going on.
If you are interested, there is code attached to #574 and I'm happy to answer questions/help. I have spent a fair amount of time on this and am growing more confident that there are some general approaches here that are doable and worth doing, although because I have only worked with a relatively small number of corpora, who knows? Personally, I've had great results so far with this approach. It really cut down on the number of outliers and greatly improved the coherence and comprehensibility of the topics.
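To illustrate points 4-7, the decoupled workflow looks roughly like this (the embedding model, parameter values, and corpus below are illustrative, not prescriptive):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

# Stand-in corpus for illustration only.
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

# 1. Embed once; this is the expensive step you never want to repeat per trial.
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(docs, show_progress_bar=True)

# 2. Reduce once per UMAP setting (still costly, but done far less often).
umap_5d = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine").fit_transform(embeddings)

# 3. Sweep HDBSCAN cheaply on the fixed reduction.
for min_cluster_size in (50, 100, 200):
    for min_samples in (5, 10, 25):
        labels = HDBSCAN(min_cluster_size=min_cluster_size, min_samples=min_samples).fit_predict(umap_5d)
        n_clusters = labels.max() + 1
        n_outliers = int((labels == -1).sum())
        print(f"mcs={min_cluster_size} ms={min_samples} -> {n_clusters} clusters, {n_outliers} outliers")

# 4. Visualize: a separate 2D reduction makes the cluster structure easy to eyeball.
umap_2d = UMAP(n_neighbors=15, n_components=2, min_dist=0.0, metric="cosine").fit_transform(embeddings)
plt.scatter(umap_2d[:, 0], umap_2d[:, 1], s=1)
plt.show()
```

Once you have settled on parameters this way, the same UMAP and HDBSCAN models can be handed back to BERTopic via its `umap_model` and `hdbscan_model` arguments.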
I agree with you. Cutting down on the number of outliers can greatly improve the coherence score. However, the highest coherence score doesn't represent the best results in my case. Many outliers get merged into topic 0, making the keywords less meaningful.
I might have been unclear in my last comment. I'm not talking about coherence scores; I mean observed coherence, a subjective judgement that each of the topic vocabularies is coherent. I've pretty much given up on the usual metrics, as they simply haven't performed very well on the limited set of corpora that I've been using. What is getting me interested is that visualizations of the underlying embeddings give a very good representation of the actual structure of the underlying data, and the clustering algorithms do a very good job, assuming that the data itself is conducive to a single pass of a given clusterer. If it isn't, then splitting the data into more manageable parts seems to be very viable.
Thank you for your input, everyone!
Some of this is above my head, as I'm pretty new to NLP and topic modeling.
I think the results are sufficient for my current use case, at least for the time being, but I will work on improving the model after I have delivered a first end-to-end working version of my analysis and application.
Now that the data gathering, clean-up, and topic modeling are sufficient, I'll be working on the web application for the business users to explore and consume the model to get feedback. Then I will come back to improve the model.
I understand that finding good values for `n_neighbors` and `min_topic_size` is difficult, as it depends on the number of documents, the length of those documents, how many topics one would like to obtain, and so on. I'm currently training my model with a grid of parameter combinations to get a feeling for what works.
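The sweep is roughly of the following shape (a sketch only; the parameter values below are illustrative placeholders rather than my actual 50-permutation grid, and the corpus loader is a stand-in):

```python
import itertools
from sklearn.datasets import fetch_20newsgroups
from umap import UMAP
from bertopic import BERTopic

# Stand-in corpus for illustration only.
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

# Illustrative candidate values only.
n_neighbors_values = (5, 10, 15, 20, 25)
min_topic_size_values = (25, 50, 75, 100, 150)

results = []
for n_neighbors, min_topic_size in itertools.product(n_neighbors_values, min_topic_size_values):
    umap_model = UMAP(n_neighbors=n_neighbors, n_components=5, min_dist=0.0, metric="cosine")
    topic_model = BERTopic(umap_model=umap_model, min_topic_size=min_topic_size)
    topics, _ = topic_model.fit_transform(docs)
    n_topics = len(set(topics)) - (1 if -1 in topics else 0)
    n_outliers = topics.count(-1)
    results.append((n_neighbors, min_topic_size, n_topics, n_outliers))
    print(results[-1])
```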
It will take 5-6 hours to try those 50 permutations on my machine with 286,817 documents.
With `n_neighbors=15` and `min_topic_size=10` (the default values), I get 1114 topics before topic reduction. Automatic topic reduction gets it down to 812 topics, but many small and ambiguous outliers exist.

With `n_neighbors=10` and `min_topic_size=100`, I get 112 topics before topic reduction. Automatic topic reduction gets it down to 37 topics. `n_neighbors=10` and `min_topic_size=50` yields 32 topics after reduction.

"Topic" -1 remains relatively stable at 35,000-38,000 documents.
I aim to end up somewhere near 50-100 meaningful topics that can be read and interpreted by human beings.
What advice or rules of thumb, if any, can I apply to shorten the search for more appropriate parameters?