manmax31 opened this issue 1 year ago (status: Open)
It might just be that the issue is from the "disappearance" of the -1 class through the outlier reduction. I would advise doing the following instead:
# Reduce outliers
new_topics = topic_model.reduce_outliers(
    docs, topics, probabilities=probs, strategy="embeddings"
)
topic_model.update_topics(docs, topics=new_topics)
# Update the attribute that checks whether there are still outliers
topic_model._outliers = 0
# Set LLM labels
qwen_labels = [
    label[0][0].split("\n")[0].strip()
    for label in topic_model.get_topics(full=True)["Qwen"].values()
]
topic_model.set_topic_labels(qwen_labels)
I believe this is a known issue for which there is a PR available that I need to check a bit more in-depth.
Thank you, but it still throws the same error.
Could you check whether qwen_labels indeed contains fewer labels than are found in topic_model.topic_labels_?
It is the other way around: qwen_labels has 1 more label than topic_model.topic_labels_.
In that case, I would advise checking whether the order of qwen_labels matches that of topic_model.topic_labels_ and topic_model.custom_labels_. I expect qwen_labels to contain one label too many, which should be removed. I think it might be the label for the outlier class, but you will have to check.
Hello,
I faced the same problem.
How can I remove the outlier class from qwen_labels?
@Keamww2021 Simply remove the first label (the outlier label) from the list and I believe it should work. Do note, though, that it is difficult to say without seeing your exact code, versions, environment, etc.
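A slightly more defensive variant of that suggestion is to filter by topic id instead of by position, so the label for -1 is dropped only when the model no longer tracks that topic. The sketch below runs on stand-in data. align_labels is a hypothetical helper, and the dict mimics what topic_model.get_topics(full=True)["Qwen"] is assumed to return: a mapping from topic id to a list of (text, score) pairs.

```python
def align_labels(aspect_topics, valid_topic_ids):
    """Return one cleaned label per topic id the model still tracks.

    Hypothetical helper for illustration; not part of BERTopic.
    Labels are returned in ascending topic-id order.
    """
    valid = set(valid_topic_ids)
    return [
        # Keep only the first line of the LLM output, as in the thread.
        words[0][0].split("\n")[0].strip()
        for topic_id, words in sorted(aspect_topics.items())
        if topic_id in valid
    ]


# Stand-in for topic_model.get_topics(full=True)["Qwen"] (assumed shape).
qwen_aspect = {
    -1: [("Miscellaneous\nextra text", 1)],
    0: [("Cats\nextra text", 1)],
    1: [(" Dogs ", 1)],
}
# Stand-in for topic_model.topic_labels_.keys() after outlier reduction.
remaining_topic_ids = [0, 1]

qwen_labels = align_labels(qwen_aspect, remaining_topic_ids)
print(qwen_labels)
```

The resulting list has exactly one label per remaining topic, which is the condition the error message asks for.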
I have been following your tutorial on how to use llama to get better topic names.
The only difference between yours and mine is that I am using Alibaba's Qwen 7b model, which I find beats any 7b or 13b model. I am setting the labels after doing outlier reduction using the embeddings strategy.
The issue is: if I reduce outliers using embeddings, the -1 topic goes away and hence I get the error: Make sure that topic_labels contains the same number of labels as that there are topics.
If I use the c-tf-idf or distributions strategy to reduce outliers, there is no issue.
Would you have any suggestions?
Here is the code: