Closed drob-xx closed 2 years ago
Thanks for sharing this! In part, the goal of get_topic_info is to output the topics in a frequency-based descending manner so that you can quickly see which topics are the largest. For that reason, I sorted the values simply by the count but it seems that I did not account for the outlier topic being smaller than most other topics. I think sorting by "Topic" should work but I am not sure if there are also exceptions to this, especially when updating topics. Typically, the topics are sorted by their frequency (except for -1) and thus sorting by topics should not be an issue in most cases.
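The caveat about updated topics can be made concrete with a minimal sketch in plain Python. It is not BERTopic code: the helper name `topic_order_matches_frequency` and both size dictionaries are made up for illustration. It checks whether sorting by topic ID gives the same order as sorting by descending count once the outlier topic is set aside.

```python
def topic_order_matches_frequency(topic_sizes, outlier=-1):
    """Return True if ascending topic IDs match descending counts.

    The outlier topic is ignored, since its size is unrelated to its ID.
    """
    regular = {t: c for t, c in topic_sizes.items() if t != outlier}
    by_topic = sorted(regular)
    # Ties in count are broken by topic ID so equal-sized topics still match.
    by_count = sorted(regular, key=lambda t: (-regular[t], t))
    return by_topic == by_count


# Hypothetical sizes right after fitting: IDs 0..n follow descending size.
fitted = {-1: 5, 0: 120, 1: 80, 2: 40}
# Hypothetical sizes after an update that changed counts but not the IDs.
updated = {-1: 5, 0: 60, 1: 80, 2: 40}

print(topic_order_matches_frequency(fitted))   # True
print(topic_order_matches_frequency(updated))  # False
```

When the check returns False, sorting by "Topic" still puts -1 first but no longer lists the largest topics at the top.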
I understand. It isn't critical, but this was unexpected behavior since the topics are re-numbered sequentially from -1 to n, where 0 to n are in order of size. It is entirely possible to have a -1 that is the smallest or in the middle, so I'm not sure that the assumption is warranted. While it is minor, when it happens it is easy to miss, and if you are relying on the order it can mess things up.
Agreed, I think I will change this in the upcoming release with your suggestion, simply sorting by "Topic". I will also test it out by reducing topics and/or updating them after having fitted a model once to see what happens there.
This was fixed in v0.11 and this issue will be closed. If you continue to run into this problem, let me know and I'll make sure to re-open the issue.
I'm formatting output to use in a graph. I have two new BERTopic instances with similar settings, except that the input text and the hdbscan_models are different. Here is what I get when calling get_topic_info():
Note that in the first output the -1 topic is first, and in the second it is last. In the second model the count for -1 is very small, smaller than all the others. I think the bug is on line 769 of _bertopic:
In this case:
    info = pd.DataFrame(BERT_2.topic_sizes.items(), columns=['Topic', 'Count']).sort_values("Count", ascending=False)
    info["Name"] = info.Topic.map(BERT_2.topic_names)
    info
produces:
Sorting by Topic/ascending fixes this - but may break something else. I didn't trace back to figure out if the topic list is already re-ordered in all cases or not.
    info = pd.DataFrame(BERT_2.topic_sizes.items(), columns=['Topic', 'Count']).sort_values("Topic", ascending=True)
    info["Name"] = info.Topic.map(BERT_2.topic_names)
    info
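For anyone who wants to see the difference without fitting a model, here is a small stand-alone sketch of the two sort orders. It only needs pandas. The `topic_sizes` and `topic_names` dictionaries and the `build_info` helper are invented stand-ins for the model attributes used above, with the -1 topic deliberately made the smallest.

```python
import pandas as pd

# Hypothetical sizes: the outlier topic (-1) is smaller than every other topic.
topic_sizes = {0: 120, 1: 80, 2: 40, -1: 5}
# Hypothetical names, keyed by topic ID.
topic_names = {-1: "-1_outliers", 0: "0_alpha", 1: "1_beta", 2: "2_gamma"}


def build_info(sort_by, ascending):
    """Build a topic-info style table sorted by the given column."""
    info = pd.DataFrame(list(topic_sizes.items()), columns=["Topic", "Count"])
    info = info.sort_values(sort_by, ascending=ascending)
    info["Name"] = info.Topic.map(topic_names)
    return info.reset_index(drop=True)


by_count = build_info("Count", ascending=False)  # the ordering reported above
by_topic = build_info("Topic", ascending=True)   # the suggested ordering

print(by_count.Topic.tolist())  # [0, 1, 2, -1]
print(by_topic.Topic.tolist())  # [-1, 0, 1, 2]
```

Sorting by "Count" pushes -1 to the bottom here only because it happens to be the smallest topic. Sorting by "Topic" keeps it in the first row regardless of its size.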