Closed: zhimin-z closed this issue 1 year ago.
These previous posts, https://github.com/MaartenGr/BERTopic/issues/727, https://github.com/MaartenGr/BERTopic/issues/725, and https://github.com/MaartenGr/BERTopic/issues/378, seem to be of no help to me. I cannot visualize the topics, but I did successfully run the topic modeling. BTW, the dataset is a small one (~345 paragraphs, each with fewer than 3000 words).
Does it have anything to do with too few topics?
That indeed might be the case: with only 2 topics, each data point cannot take 2 nearest neighbors as defined in UMAP. I believe that for this to run correctly, you would need at least 3 topics. Perhaps it would be worthwhile to tweak HDBSCAN a bit in order to create more topics, for example by lowering min_cluster_size. Having said that, with small datasets I would typically recommend something like k-Means instead, which allows you to set n_clusters and can capture clusters in smaller datasets a bit more straightforwardly.
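To illustrate the k-Means route, here is a minimal sketch. The toy corpus, the TF-IDF embeddings, and n_clusters=3 are all made-up stand-ins, and the commented-out last line assumes a BERTopic version that accepts an sklearn-style clustering model through its hdbscan_model argument:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the real documents
docs = [
    "cats purr and chase mice", "dogs bark at strangers",
    "kittens love yarn", "puppies chew shoes",
    "stocks rallied on earnings", "markets fell on inflation fears",
]

# TF-IDF vectors as a cheap stand-in for sentence embeddings
X = TfidfVectorizer().fit_transform(docs)

# Unlike HDBSCAN, k-Means lets you fix the number of clusters directly,
# so even a tiny dataset always yields exactly n_clusters groups
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.labels_)

# Plugging the same model into BERTopic would look like:
# topic_model = BERTopic(hdbscan_model=KMeans(n_clusters=3))
```

The key point is that k-Means never marks points as noise and never refuses to form a cluster, which is why it behaves more predictably on small corpora.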
@zhimin-z @MaartenGr Could one of you resolve this issue? If yes, with which parameters?
I need a way to extract topics or keywords from short news headlines like
Is that even possible with BERTopic?
@fabmeyer If you are running into the issue of having too few topics, then you can use the min_topic_size parameter for that. Reducing that value will increase the number of topics. If you are using a custom HDBSCAN model, then you can use min_cluster_size for that. Finally, if you are interested in extracting keywords without needing some overarching topics, you can use KeyBERT instead.
@MaartenGr Thanks for your fast reply, Maarten. I rather need something like overarching topics. I have seen that you also have a version that can run with LLMs. Which of your many libraries is the best for overarching topic extraction/mining? :D
@fabmeyer No problem! It depends on the size of your data. If you just have a couple of documents (e.g., < 100) then it would make sense to either just label the documents yourself or use something like KeyBERT. For that amount of data, I'm not sure whether there is actually a use case for topic modeling. However, it could definitely still work with a clustering model like k-Means in BERTopic.
For larger datasets, BERTopic is definitely something that fits within most use cases due to its modular nature. You can simply pick and choose whichever algorithm suits your use case best.
Either way, for overarching topic extraction I would definitely go for BERTopic.
@MaartenGr The problem is that I need to extract topics for every single news headline in isolation, i.e., summarize a news headline in just a few words, instead of topic mining over a large corpus...
@fabmeyer If you just need to summarize a news headline in isolation, then there is no need to do topic mining at all. You can just ask an LLM to do that for you. Something like this:
```python
from torch import bfloat16
from transformers import pipeline

# Load the LLM
pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",
    torch_dtype=bfloat16,
    device_map="auto"
)

# Ask the LLM to summarize a news headline
prompt = "Summarize this headline for me: [HEADLINE]."
outputs = pipe(
    prompt,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.1,
    top_p=0.95
)
print(outputs[0]["generated_text"])
```
You could also use KeyBERT and its newly released KeyLLM to ask for keywords, summarization, or anything else in isolation.
@MaartenGr Yeah actually I am trying that out right now with KeyLLM + Mistral7b. Thanks again.
Hello OP, I think we may be working on a similar project. I am using BERTopic to cluster topics in Chinese text. Could you share your program code for the wandb sweeps? Thank you very much~
After a hyperparameter sweep with wandb, I found the best hyperparameters and reran the training:
However, this gives me the following error:
What should I do now? @MaartenGr