MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6.06k stars, 755 forks

flan-t5 for #1720

Open mjaved-nz opened 9 months ago

mjaved-nz commented 9 months ago

Hi @MaartenGr,

I hope you are doing well. I am getting the following error when using the flan-t5 model for topic representation. Any solution for this? Thanks

from transformers import pipeline
from bertopic import BERTopic
from bertopic.representation import TextGeneration

prompt = "I have a topic described by the following keywords: [KEYWORDS]. Based on the previous keywords, what is this topic about?"

# Create your representation model
generator = pipeline('text2text-generation', model='google/flan-t5-base')
representation_model = TextGeneration(generator)

topic_model_t5 = BERTopic(representation_model=representation_model)

Error:

----> 2 topics, probs = topic_model_t5.fit_transform(docs)
      3 print(topic_model_t5.get_topic_info())

/usr/local/lib/python3.10/dist-packages/bertopic/representation/_textgeneration.py in extract_topics(self, topic_model, documents, c_tf_idf, topics)
    145
    146             # Prepare prompt
--> 147             truncated_docs = [truncate_document(topic_model, self.doc_length, self.tokenizer, doc) for doc in docs]
    148             prompt = self._create_prompt(truncated_docs, topic, topics)
    149             self.prompts.append(prompt)

TypeError: 'NoneType' object is not iterable

MaartenGr commented 9 months ago

It seems like it cannot iterate over the documents for whatever reason. Did you make sure that all documents are non-empty? That also means that very short documents containing only, for instance, "\n" should be removed.
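One way to apply the check suggested above is to drop blank entries before calling fit_transform. This is only a minimal sketch: the helper name drop_blank_documents is hypothetical and not part of BERTopic.

```python
def drop_blank_documents(docs):
    """Keep only string documents with at least one non-whitespace character."""
    # doc.strip() is empty for "", "\n", "   " and similar whitespace-only entries
    return [doc for doc in docs if isinstance(doc, str) and doc.strip()]


raw_docs = ["first document", "", "\n", "   ", None, "second document"]
clean_docs = drop_blank_documents(raw_docs)
removed = len(raw_docs) - len(clean_docs)
```

Comparing len(raw_docs) with len(clean_docs) shows how many entries were filtered out, which is a quick way to confirm whether blank documents were present at all.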

mjaved-nz commented 9 months ago

I don't have any empty documents; the minimum length of the documents is 21. The same set of documents works fine with other LLMs.

MaartenGr commented 9 months ago

> I don't have any empty documents; the minimum length of the documents is 21.

How did you calculate the length of the document? Tokenization schemes of the underlying model might handle certain documents differently.

> The same set of documents works fine with other LLMs.

Which other LLMs did you try? Did it work with TextGeneration or something else?

Also, how many documents did you train your model on? If there are only a couple of documents per topic, it might not properly return a representative document.
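To illustrate the question above about how length was calculated: a character count and a token count can disagree, so a document that looks long enough by characters may still yield few or no tokens. The following is a minimal sketch, assuming a hypothetical helper token_lengths and using whitespace splitting as a stand-in for the model's real tokenizer.

```python
def token_lengths(docs, tokenize=str.split):
    """Return the number of tokens per document under the given tokenizer."""
    return [len(tokenize(doc)) for doc in docs]


docs = ["topic models group similar documents", "\n \t \n", "ok"]

# Character-based lengths: the whitespace-only document still counts 5 characters
char_lengths = [len(doc) for doc in docs]

# Token-based lengths: the same document produces no tokens at all
lengths = token_lengths(docs)

# Indices of documents that produce zero tokens and are candidates for removal
suspicious = [i for i, n in enumerate(lengths) if n == 0]
```

With Hugging Face transformers, the tokenize method of a tokenizer loaded via AutoTokenizer.from_pretrained("google/flan-t5-base") could be passed as the tokenize argument instead of str.split.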

manveersadhal commented 9 months ago

Hi @MaartenGr - thank you for creating this fantastic library.

I think the cause is that when the DEFAULT_PROMPT is used (which has no [DOCUMENTS]) or a user-supplied prompt does not contain "[DOCUMENTS]", the docs in repr_docs_mappings are all assigned a value of None. The error occurs when trying to iterate over None.

I created a pull request to address this. Please review and merge if you find this to be a suitable fix.
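The cause described above can be sketched in isolation. This is not the BERTopic source and not necessarily what the pull request changes. The function names below are hypothetical stand-ins that only mimic the described behavior: topics map to None when the prompt lacks "[DOCUMENTS]", and iterating over None raises the reported TypeError. The second prompt, containing "[DOCUMENTS]", is also made up for illustration.

```python
def build_repr_docs_mappings(prompt, representative_docs):
    """Hypothetical stand-in: topics only get documents if the prompt asks for them."""
    if "[DOCUMENTS]" in prompt:
        return dict(representative_docs)
    return {topic: None for topic in representative_docs}


def truncate_all(docs, max_chars=20):
    """Unguarded version: iterating over None raises TypeError."""
    return [doc[:max_chars] for doc in docs]


def truncate_all_guarded(docs, max_chars=20):
    """Guarded version: pass None through instead of iterating over it."""
    if docs is None:
        return None
    return [doc[:max_chars] for doc in docs]


representative_docs = {
    0: ["a long document about topic modeling"],
    1: ["another document"],
}
keywords_only = (
    "I have a topic described by the following keywords: [KEYWORDS]. "
    "Based on the previous keywords, what is this topic about?"
)
with_documents = (
    "I have a topic that contains the following documents: [DOCUMENTS]. "
    "It is described by the following keywords: [KEYWORDS]. "
    "What is this topic about?"
)

# Without [DOCUMENTS], every topic maps to None
mapping = build_repr_docs_mappings(keywords_only, representative_docs)

# The unguarded loop reproduces the error from the traceback
try:
    truncate_all(mapping[0])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# The guarded loop leaves None untouched and does not raise
guarded = {topic: truncate_all_guarded(docs) for topic, docs in mapping.items()}
```

Under the same assumption, a custom prompt that does contain "[DOCUMENTS]" would presumably take the other branch and never hit the None case, which may serve as a workaround until a fix is merged.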