sutgeorge opened this issue 7 months ago
Instead of this:
self.bertopic_model = BERTopic(embedding_model=lambda x: romanian_embedding_model(x, **tokenizer_kwargs), verbose=True, nr_topics='auto', n_gram_range=(1, 2))
I think you should do this:
self.bertopic_model = BERTopic(embedding_model=romanian_embedding_model, verbose=True, nr_topics='auto', n_gram_range=(1, 2))
Also, I would advise against using nr_topics="auto" unless you have experimented with the underlying cluster model first. See the best practices.
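For reference, a rough sketch of what controlling the number of topics through the cluster model might look like (the min_cluster_size value below is only an illustrative guess, not a recommendation for your data):

```python
from hdbscan import HDBSCAN
from bertopic import BERTopic

# Raising min_cluster_size merges small clusters and therefore lowers the number of
# topics without relying on nr_topics="auto". The value 50 is only a placeholder.
hdbscan_model = HDBSCAN(
    min_cluster_size=50,
    metric="euclidean",
    cluster_selection_method="eom",
    prediction_data=True,
)

topic_model = BERTopic(hdbscan_model=hdbscan_model, verbose=True, n_gram_range=(1, 2))
```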
The reason I used the lambda function and the tokenizer_kwargs was mostly that the model complained about the input size limit, which my documents exceed (they are longer than 512 tokens). I was unable to find an alternative. I suppose the only solution to this problem would be to filter out specific parts of speech I don't care about, such as adverbs (stopwords are already removed). Obviously, the documents could also be truncated, but I'm afraid that approach might throw away a lot of useful data: words are not equivalent to tokens, and since I had no idea how to measure the exact number of tokens to slice off the input, I resorted to the kwargs trick.
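For context, a rough sketch of how the token count could be measured and capped with the model's own tokenizer, assuming a Hugging Face tokenizer exists for the Romanian model (the model name below is only a placeholder, not the one actually used):

```python
from transformers import AutoTokenizer

# "readerbench/RoBERT-base" is only a placeholder; substitute the actual Romanian model.
tokenizer = AutoTokenizer.from_pretrained("readerbench/RoBERT-base")

def truncate_to_max_tokens(doc: str, max_tokens: int = 512) -> str:
    # Tokenize, cut off at the model's limit, then decode back to text so the
    # embedding model never receives more tokens than it accepts.
    token_ids = tokenizer.encode(doc, truncation=True, max_length=max_tokens)
    return tokenizer.decode(token_ids, skip_special_tokens=True)

docs = ["..."]  # the corpus of (potentially very long) documents
docs_truncated = [truncate_to_max_tokens(d) for d in docs]
```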
Thanks for the advice. I will remove nr_topics and might attempt to tune the HDBSCAN component to reduce the number of topics.
You could also use sentence-transformers instead to load the model. I believe it might handle the truncation a bit better. Also, note that a multi-lingual embedding model might outperform a BERT-like model that was not trained specifically to generate embeddings for semantic similarity.
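A rough sketch of what that could look like (the model name below is only an example of a multilingual checkpoint, not a specific recommendation):

```python
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

# "paraphrase-multilingual-MiniLM-L12-v2" is just an example of a multilingual model;
# any sentence-transformers-compatible checkpoint could be used instead.
embedding_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# sentence-transformers truncates inputs to max_seq_length internally, so documents
# longer than the model's limit no longer raise input-size errors.
embedding_model.max_seq_length = 512

topic_model = BERTopic(embedding_model=embedding_model, verbose=True, n_gram_range=(1, 2))
```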
Hello,

Instead of using sentence-transformers/all-MiniLM-L6-v2, I wanted to try out a custom embedding model from Huggingface. I read through previous open and closed issues and found the following approaches:

Code:
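As an illustration only (the exact snippets from those issues aren't reproduced here), one documented approach is to pass a transformers feature-extraction pipeline as the embedding model; the Romanian model name below is just a placeholder:

```python
from transformers import pipeline
from bertopic import BERTopic

# "readerbench/RoBERT-base" is only a placeholder for the custom Romanian model.
# BERTopic accepts a transformers feature-extraction pipeline as its embedding model.
embedding_model = pipeline("feature-extraction", model="readerbench/RoBERT-base")

topic_model = BERTopic(embedding_model=embedding_model, verbose=True)
```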
However, by using a logger like ..., I discovered that BERTopic doesn't actually load the Huggingface model. It simply loads sentence-transformers/all-MiniLM-L6-v2. Why is this the case?

Proof:
I will use the multilingual version until this issue is fixed.
Thank you for your patience 💯