MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Combining BERTopic with customized word embeddings and applying it to large documents #1183

Closed: ZaneFra closed this issue 1 year ago

ZaneFra commented 1 year ago

I am currently attempting to apply BERTopic to company annual reports. However, said documents contain more than 512 tokens, which is the limit for BERT-based models. Is splitting each document into sentences and then applying the model to each sentence a good way to work around this issue?

Another doubt concerns customized embeddings. All of the examples shown use the sentence-transformers library to generate embeddings at the sentence level. Given the peculiar nature of my data, I want to use finBERT, which has been trained specifically on a financial corpus. However, this model generates word embeddings rather than sentence embeddings. If I train BERTopic with these word embeddings, does it have any substantial impact on its performance? Do I need to make changes in other areas of the model to use word embeddings?

MaartenGr commented 1 year ago

> I am currently attempting to apply BERTopic to company annual reports. However, said documents contain more than 512 tokens, which is the limit for BERT-based models. Is splitting each document into sentences and then applying the model to each sentence a good way to work around this issue?

Generally, that is indeed a good procedure for taking those token limits into account. There are a few other methods you can use as well. For example, if you expect that an annual report covers the same topic in its first 512 tokens as in all of the tokens afterward, then it is not necessary to split the document up into sentences/paragraphs. The great thing about using something like c-TF-IDF is that it has no token limits and therefore takes the entire document into account. If that is not the case, you could still leave the documents as they are and use approximate_distribution afterward to get a more fine-grained insight into the topic distribution of each document.

Having said that, just splitting them up into sentences generally works quite well.
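A minimal sketch of both routes might look as follows. The report texts are placeholders and the use of NLTK for sentence splitting is an assumption for illustration, not something prescribed in this thread:

```python
from nltk.tokenize import sent_tokenize  # nltk.download("punkt") may be needed first
from bertopic import BERTopic

# Placeholder documents; in practice these would be the full annual report texts
reports = ["...full text of annual report 1...", "...full text of annual report 2..."]

# Route 1: split each report into sentences and model topics at the sentence level
sentences = [sentence for report in reports for sentence in sent_tokenize(report)]
sentence_model = BERTopic()
sentence_topics, sentence_probs = sentence_model.fit_transform(sentences)

# Route 2: keep the full reports as documents (the embedding model will truncate them),
# then use approximate_distribution for a more fine-grained, token-limit-free view
report_model = BERTopic()
report_topics, report_probs = report_model.fit_transform(reports)
topic_distr, _ = report_model.approximate_distribution(reports, window=8, stride=4)
```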

> Another doubt concerns customized embeddings. All of the examples shown use the sentence-transformers library to generate embeddings at the sentence level. Given the peculiar nature of my data, I want to use finBERT, which has been trained specifically on a financial corpus. However, this model generates word embeddings rather than sentence embeddings. If I train BERTopic with these word embeddings, does it have any substantial impact on its performance? Do I need to make changes in other areas of the model to use word embeddings?

That is very difficult to tell beforehand, as it depends on quite a number of things: the quality of the word embeddings, whether they are optimized for similarity tasks, the data and its quality, etc. However, if the sentence-transformers models cannot capture your specific data well due to domain-specific terms, it might be worthwhile to use a different model. This, however, would require trial and error.
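As a rough sketch of one way to try this, you could mean-pool finBERT's word embeddings into a single vector per document and pass those precomputed vectors to BERTopic. The checkpoint name ("ProsusAI/finbert") and the pooling strategy are assumptions made for illustration; whether this beats a sentence-transformers model is exactly the trial-and-error part:

```python
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from bertopic import BERTopic

tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModel.from_pretrained("ProsusAI/finbert")

def embed(texts):
    """Mean-pool the last hidden state to get one vector per text (a simple pooling choice)."""
    vectors = []
    for text in texts:
        inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state  # shape: (1, n_tokens, dim)
        vectors.append(hidden.mean(dim=1).squeeze(0).numpy())
    return np.array(vectors)

# Placeholder documents, e.g. the sentences or paragraphs obtained above
docs = ["...sentence or paragraph 1...", "...sentence or paragraph 2..."]
embeddings = embed(docs)

# Pass the precomputed embeddings directly; BERTopic then skips its own embedding step
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs, embeddings=embeddings)
```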

MaartenGr commented 1 year ago

Due to inactivity, I'll be closing this issue. Let me know if you want me to re-open the issue!