MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

The option nr_topics seems useless #14

Closed · lzy318 closed this issue 3 years ago

lzy318 commented 3 years ago

Hello! This work is remarkable! I got a problem when I trained a topic model using Chinese text data and my own sentence embeddings:

[Screenshot: training output showing the number of topics being reduced to 30]

The output printed by the program suggests that the number of topics was reduced to 30, but when I accessed the results using get_topics(), I found there were still 93 topics. Why did this happen?
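
For reference, a minimal sketch of the setup described above, written against a recent BERTopic API (docs and embeddings are placeholders for the Chinese documents and their precomputed sentence embeddings):

from bertopic import BERTopic

# docs: list of documents; embeddings: precomputed sentence embeddings of shape (n_docs, dim)
topic_model = BERTopic(nr_topics=30)
topics, probs = topic_model.fit_transform(docs, embeddings)

# Expected roughly 30 topics after reduction, but get_topics() still reported 93
print(len(topic_model.get_topics()))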

By the way, I sometimes run into an out-of-memory error when running this package on my data; I think the reason is that I have millions of texts. Do you have any suggestions on how to apply this package to millions of texts?

MaartenGr commented 3 years ago

To answer your first question, I am not entirely sure why it still gives back 93 topics... Just to make sure, which version of BERTopic are you currently using?

The out-of-memory error is indeed likely due to the number of texts you passed to the model. I believe this error most likely happens in the class-based TF-IDF fitting rather than in the UMAP or HDBSCAN fitting. The resulting sparse matrix from the class-based TF-IDF is quite large because the generated vocabulary is similarly large. This could be prevented by setting a minimum frequency in the CountVectorizer (which is currently not accessible). Can you show me the exact error message you are getting? That will help me pinpoint where the error occurs.
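
As a rough illustration only (not possible in 0.3.1, where the CountVectorizer is not exposed, but later BERTopic releases accept a custom vectorizer through the vectorizer_model argument), limiting the vocabulary would look something like this:

from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# Drop terms that appear in fewer than 10 documents to keep the
# c-TF-IDF vocabulary, and therefore the sparse matrix, small.
vectorizer = CountVectorizer(min_df=10)
topic_model = BERTopic(vectorizer_model=vectorizer)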

lzy318 commented 3 years ago

Thanks for your kind and quick reply. I am using bertopic-0.3.1; I am not sure if the topic issue is caused by Chinese characters. I think the out-of-memory error happened during the UMAP fitting, as the program didn't print any further messages before it stopped. I believe UMAP performs a nearest-neighbour search, which consumes a lot of memory. I guess it is not a big problem, as I can train the model on a random subset and then assign topics to the remaining documents with the trained model, right?



MaartenGr commented 3 years ago

Okay, I found the issue: some topics were mapped to each other, which resulted in fewer topics being merged than I actually intended. Upgrade to the newest version (0.3.2) and you should have no more issues with the number of reduced topics:

pip install --upgrade bertopic

With respect to the memory error, I agree that it would be best to simply use a subset of the data. If you are already working with millions of texts, a smaller subset will most likely give you the same performance. At some point, adding many new data points no longer has a significant effect on the quality of the topics, as they are already represented in the subset.
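
A minimal sketch of that fit-on-a-subset, transform-the-rest workflow, assuming a recent BERTopic API (docs, embeddings, and the sample size of 100,000 are placeholders):

import random

from bertopic import BERTopic

# docs: list of texts; embeddings: numpy array of precomputed sentence embeddings.
# Fit on a random sample of the corpus, then assign topics to all documents.
sample_idx = random.sample(range(len(docs)), 100_000)
sample_docs = [docs[i] for i in sample_idx]
sample_emb = embeddings[sample_idx]

topic_model = BERTopic(nr_topics=30)
topic_model.fit(sample_docs, sample_emb)
topics, probs = topic_model.transform(docs, embeddings)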

lzy318 commented 3 years ago

Thanks a lot, it works now. I appreciate the quick and helpful reply.