MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Huggingface transformer does not load as expected #1952

Open sutgeorge opened 7 months ago

sutgeorge commented 7 months ago

Hello,

Instead of using sentence-transformers/all-MiniLM-L6-v2, I wanted to try out a custom embedding model from Hugging Face. I read through previously opened and closed issues and found the following approaches:

(screenshot: the approaches collected from previous issues)

Code:

    from bertopic import BERTopic
    from transformers import AutoModel, AutoTokenizer, pipeline

    # Earlier attempt: load the model and tokenizer by name inside the pipeline call.
    # romanian_embedding_model = pipeline("feature-extraction", model="readerbench/RoBERT-large", tokenizer="readerbench/RoBERT-large")
    # tokenizer_kwargs = {'padding': True, 'truncation': True, 'max_length': 512, 'return_tensors': 'pt'}
    # self.bertopic_model = BERTopic(embedding_model=lambda x: romanian_embedding_model(x, **tokenizer_kwargs), verbose=True, nr_topics='auto', n_gram_range=(1, 2))
    # topics_1, probs_1 = self.bertopic_model.fit_transform(self.nontruncated_documents)

    # Current attempt: load the model and tokenizer explicitly, wrap them in a
    # feature-extraction pipeline, and call it through a lambda with tokenizer kwargs.
    embedding_model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
    tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
    romanian_embedding_model = pipeline("feature-extraction", model=embedding_model, tokenizer=tokenizer)
    tokenizer_kwargs = {'padding': True, 'truncation': True, 'max_length': 512, 'return_tensors': 'pt'}
    self.bertopic_model = BERTopic(embedding_model=lambda x: romanian_embedding_model(x, **tokenizer_kwargs), verbose=True, nr_topics='auto', n_gram_range=(1, 2))
    topics_1, probs_1 = self.bertopic_model.fit_transform(self.nontruncated_documents)
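
For completeness, an alternative would be to precompute the embeddings and pass them straight to fit_transform, which sidesteps the embedding_model backend selection entirely. This is only a rough, untested sketch; it assumes a transformers version where the feature-extraction pipeline accepts tokenize_kwargs, and documents stands in for self.nontruncated_documents:

    import numpy as np
    from bertopic import BERTopic
    from transformers import pipeline

    # Sketch: compute one mean-pooled vector per document, then hand the matrix to BERTopic.
    extractor = pipeline(
        "feature-extraction",
        model="dumitrescustefan/bert-base-romanian-cased-v1",
        tokenize_kwargs={"truncation": True, "max_length": 512},  # stay within the 512-token limit
    )

    documents = ["..."]  # placeholder for self.nontruncated_documents

    # extractor(doc) returns a nested list of shape (1, n_tokens, hidden_size);
    # mean-pooling over the tokens yields one fixed-size vector per document.
    embeddings = np.vstack([np.asarray(extractor(doc))[0].mean(axis=0) for doc in documents])

    topic_model = BERTopic(verbose=True, n_gram_range=(1, 2))
    topics, probs = topic_model.fit_transform(documents, embeddings)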

However, by using a logger like

import logging
logging.basicConfig()
logger = logging.getLogger('Something')
logger.setLevel(logging.INFO)

... I discovered that BERTopic doesn't actually load the Hugging Face model; it simply loads sentence-transformers/all-MiniLM-L6-v2. Why is this the case?

Proof:

(screenshot: log output showing sentence-transformers/all-MiniLM-L6-v2 being loaded instead of the custom model)

I will use the multilingual version until this issue is fixed.

Thank you for your patience 💯

MaartenGr commented 7 months ago

Instead of this:

self.bertopic_model = BERTopic(embedding_model=lambda x: romanian_embedding_model(x, **tokenizer_kwargs), verbose=True, nr_topics='auto', n_gram_range=(1, 2))

I think you should do this:

self.bertopic_model = BERTopic(embedding_model=romanian_embedding_model, verbose=True, nr_topics='auto', n_gram_range=(1, 2))

Also, I would advise against using nr_topics="auto" unless you have already experimented with the underlying cluster model. See the best practices guide.
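
Untested, but roughly along these lines; tokenize_kwargs is an assumption on my side to deal with your 512-token limit, and min_cluster_size=50 is just an example value:

    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from transformers import pipeline

    # Pass the pipeline object itself so BERTopic can recognize it as a Hugging Face backend.
    romanian_embedding_model = pipeline(
        "feature-extraction",
        model="dumitrescustefan/bert-base-romanian-cased-v1",
        tokenize_kwargs={"truncation": True, "max_length": 512},
    )

    # Control the number of topics through the cluster model instead of nr_topics="auto":
    # a larger min_cluster_size generally produces fewer, broader topics.
    hdbscan_model = HDBSCAN(min_cluster_size=50, metric="euclidean",
                            cluster_selection_method="eom", prediction_data=True)

    topic_model = BERTopic(
        embedding_model=romanian_embedding_model,
        hdbscan_model=hdbscan_model,
        verbose=True,
        n_gram_range=(1, 2),
    )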

sutgeorge commented 7 months ago

The reason I used the lambda function and the tokenizer_kwargs was mostly that the model complained about the input size limit, which was exceeded in my case (my documents are longer than 512 tokens), and I was unable to find an alternative. I suppose the only solution would be to filter out specific parts of speech I don't care about, such as adverbs (stopwords are already removed). The documents could obviously be truncated as well, but I'm afraid that approach might throw away a lot of useful data: words are not equivalent to tokens, and since I have no idea how to measure the exact number of tokens to slice off the input, I had to resort to the kwargs trick.
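
One thing I might try is measuring the token count with the same tokenizer and truncating at the token level instead of guessing a word count. A rough sketch (the helper names are mine):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")

    def count_tokens(text: str) -> int:
        # Number of tokens the model would actually see for this document.
        return len(tokenizer(text, add_special_tokens=True)["input_ids"])

    def truncate_to_max_tokens(text: str, max_length: int = 510) -> str:
        # Tokenize with truncation, then decode back to text, so the document is cut
        # at the model's token limit rather than at an estimated word count.
        # 510 leaves room for the [CLS] and [SEP] tokens added later.
        ids = tokenizer(text, truncation=True, max_length=max_length,
                        add_special_tokens=False)["input_ids"]
        return tokenizer.decode(ids)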

Thanks for the advice; I will remove nr_topics and might instead attempt to tune the HDBSCAN component to reduce the number of topics.

MaartenGr commented 6 months ago

> The reason I used the lambda function and the tokenizer_kwargs was mostly that the model complained about the input size limit, which was exceeded in my case (my documents are longer than 512 tokens). [...] Since I have no idea how to measure the exact number of tokens to slice off the input, I had to resort to the kwargs trick.

You could also use sentence-transformers to load the model instead; I believe it might handle the truncation a bit better. Also, note that a multilingual embedding model might outperform a BERT-like model that was not trained specifically to generate embeddings for semantic similarity.
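
Roughly like this, untested; paraphrase-multilingual-MiniLM-L12-v2 is only one example of a multilingual model trained for semantic similarity:

    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer

    # Loading the plain BERT checkpoint through sentence-transformers; it adds a mean
    # pooling layer automatically and truncates inputs to max_seq_length.
    embedding_model = SentenceTransformer("dumitrescustefan/bert-base-romanian-cased-v1")
    embedding_model.max_seq_length = 512

    # Alternatively, a model that was trained specifically for sentence embeddings:
    # embedding_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    topic_model = BERTopic(embedding_model=embedding_model, verbose=True, n_gram_range=(1, 2))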