MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

"doc_length" doesn't work with llama3.1 #2185

Open mjin990 opened 6 days ago

mjin990 commented 6 days ago

Describe the bug

I am using BERTopic with llama3.1 for topic modelling. My text is long, so I set doc_length in TextGeneration().

Error:

  File "/home/bert/lib/python3.11/site-packages/bertopic/representation/_utils.py", line 57, in truncate_document
    return truncated_document
           ^^^^^^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'truncated_document' where it is not associated with a value

Reproduction

**Here is my code:**

from bertopic import BERTopic
from bertopic.representation import TextGeneration

# generator, prompt, embedding_model, umap_model, hdbscan_model, and nr_topics
# are defined earlier (not shown in this report).
llama3 = TextGeneration(generator, prompt=prompt, nr_docs=4, doc_length=3000)
representation_model = {
    "Llama3": llama3
}
topic_model = BERTopic(
    embedding_model=embedding_model,
    representation_model=representation_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    nr_topics=nr_topics,
    min_topic_size=10,
    verbose=True,
)

BERTopic Version

0.16.4

MaartenGr commented 6 days ago

Thanks for sharing! I believe you also need to specify the tokenizer for it to work. There's also a PR open for a fix that I will check out later this week. That said, it should work once you specify the tokenizer.

mjin990 commented 6 days ago

Thanks for your quick reply! Yes, it works after adding the tokenizer.
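
For anyone landing here with the same traceback, here is a minimal sketch of why setting doc_length without a tokenizer can end in an UnboundLocalError. The function `truncate_document_sketch` is a hypothetical, simplified stand-in written for illustration. It is not BERTopic's actual source, and it assumes the truncation helper only assigns its result inside tokenizer-specific branches.

```python
# Hypothetical, simplified stand-in for a truncation helper.
# This is NOT BERTopic's real implementation; it only illustrates the failure mode.
def truncate_document_sketch(document, doc_length=None, tokenizer=None):
    if doc_length is not None:
        if tokenizer == "char":
            # Count length in characters.
            truncated = document[:doc_length]
        elif tokenizer == "whitespace":
            # Count length in whitespace-separated tokens.
            truncated = " ".join(document.split()[:doc_length])
        # No branch handles tokenizer=None, so `truncated` is never assigned
        # and the return below raises UnboundLocalError.
        return truncated
    return document


text = "one two three four five"

# doc_length set, tokenizer left at None: reproduces the error class from the report.
try:
    truncate_document_sketch(text, doc_length=3)
    error_name = None
except UnboundLocalError as exc:
    error_name = type(exc).__name__

# doc_length set together with a tokenizer: truncation works.
whitespace_result = truncate_document_sketch(text, doc_length=3, tokenizer="whitespace")
char_result = truncate_document_sketch(text, doc_length=3, tokenizer="char")

print(error_name)         # UnboundLocalError
print(whitespace_result)  # one two three
print(char_result)        # one
```

In the reproduction above, the equivalent workaround would presumably be to pass a tokenizer next to doc_length, for example `TextGeneration(generator, prompt=prompt, nr_docs=4, doc_length=3000, tokenizer="whitespace")`. Which tokenizer value fits best depends on whether doc_length should count characters, words, or model tokens, so check the TextGeneration documentation for your installed version.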