MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Bug in get_topic_info? #581

Closed · drob-xx closed 2 years ago

drob-xx commented 2 years ago

I'm formatting output to use in a graph. I have two new BERTopic instances with similar settings, except that the input texts and the hdbscan_models differ. Here is what I get when calling get_topic_info():

BERT_1.get_topic_info().to_dict()['Name']

{0: '-1_people_new_years_year',
 1: '0_police_told_court_mr',
 2: '1_new_like_just_people',
 3: '2_people_government_military_president',
 4: '3_mr_labour_people_year',
 5: '4_dog_animals_animal_dogs',
 6: '5_president_obama_election_trump',
 7: '6_storm_weather_snow_water',
 8: '7_apple_facebook_users_new',
 9: '8_plane_flight_passengers_ship',
 10: '9_mexico_president_government_mexican'}
BERT_2.get_topic_info().to_dict()['Name']

{0: '0_league_club_game_season',
 1: '1_gold_race_olympic_world',
 2: '2_murray_tennis_open_wimbledon',
 3: '3_hamilton_race_rosberg_mercedes',
 4: '4_golf_mcilroy_open_woods',
 5: '5_fight_mayweather_boxing_champion',
 6: '-1_yn_pistorius_steenkamp_ar'}

Note that in the first output the -1 topic comes first, while in the second it comes last. In the second model the -1 topic is very small, smaller than all the others. I think the bug is on line 769 of _bertopic:

        info = pd.DataFrame(self.topic_sizes.items(), columns=['Topic', 'Count']).sort_values("Count", ascending=False)
        info["Name"] = info.Topic.map(self.topic_names)

In this case:

        info = pd.DataFrame(BERT_2.topic_sizes.items(), columns=['Topic', 'Count']).sort_values("Count", ascending=False)
        info["Name"] = info.Topic.map(BERT_2.topic_names)
        info

produces:


   Topic  Count  Name
0      0   3746  0_league_club_game_season
1      1    503  1_gold_race_olympic_world
2      2    284  2_murray_tennis_open_wimbledon
3      3    175  3_hamilton_race_rosberg_mercedes
4      4    161  4_golf_mcilroy_open_woods
5      5    113  5_fight_mayweather_boxing_champion
6     -1     69  -1_yn_pistorius_steenkamp_ar

Sorting by "Topic" in ascending order fixes this, but it may break something else; I didn't trace back to check whether the topic list is already re-ordered in all cases.

        info = pd.DataFrame(BERT_2.topic_sizes.items(), columns=['Topic', 'Count']).sort_values("Topic", ascending=True)
        info["Name"] = info.Topic.map(BERT_2.topic_names)
        info

   Topic  Count  Name
6     -1     69  -1_yn_pistorius_steenkamp_ar
0      0   3746  0_league_club_game_season
1      1    503  1_gold_race_olympic_world
2      2    284  2_murray_tennis_open_wimbledon
3      3    175  3_hamilton_race_rosberg_mercedes
4      4    161  4_golf_mcilroy_open_woods
5      5    113  5_fight_mayweather_boxing_champion
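[Editor's note] The behavior above can be reproduced without a fitted model, using stand-in dictionaries that mimic the model's topic_sizes and topic_names attributes (the values below are copied from the BERT_2 output in this issue):

```python
import pandas as pd

# Stand-in data mimicking BERT_2.topic_sizes / BERT_2.topic_names
topic_sizes = {0: 3746, 1: 503, 2: 284, 3: 175, 4: 161, 5: 113, -1: 69}
topic_names = {
    -1: "-1_yn_pistorius_steenkamp_ar",
    0: "0_league_club_game_season",
    1: "1_gold_race_olympic_world",
    2: "2_murray_tennis_open_wimbledon",
    3: "3_hamilton_race_rosberg_mercedes",
    4: "4_golf_mcilroy_open_woods",
    5: "5_fight_mayweather_boxing_champion",
}

# Current behavior: sorting by Count pushes a small -1 topic to the bottom
by_count = pd.DataFrame(topic_sizes.items(), columns=["Topic", "Count"]).sort_values("Count", ascending=False)
by_count["Name"] = by_count.Topic.map(topic_names)
print(by_count.Topic.tolist())  # [0, 1, 2, 3, 4, 5, -1]

# Proposed fix: sorting by Topic keeps -1 first regardless of its size
by_topic = pd.DataFrame(topic_sizes.items(), columns=["Topic", "Count"]).sort_values("Topic", ascending=True)
by_topic["Name"] = by_topic.Topic.map(topic_names)
print(by_topic.Topic.tolist())  # [-1, 0, 1, 2, 3, 4, 5]
```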
MaartenGr commented 2 years ago

Thanks for sharing this! In part, the goal of get_topic_info is to output the topics in descending order of frequency so that you can quickly see which topics are the largest. For that reason, I sorted the values simply by count, but it seems I did not account for the outlier topic being smaller than most other topics. I think sorting by "Topic" should work, but I am not sure whether there are exceptions to this, especially when updating topics. Typically, the topics are sorted by their frequency (except for -1), so sorting by topic should not be an issue in most cases.
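[Editor's note] If one did not want to rely on topics being renumbered by size, both properties (outlier topic pinned first, remaining topics in descending order of frequency) can be enforced explicitly. A minimal sketch with illustrative stand-in data, not the library's actual implementation:

```python
import pandas as pd

# Illustrative stand-in counts; -1 is the outlier topic and is the smallest here
topic_sizes = {0: 3746, 1: 503, -1: 69, 2: 284}
topic_names = {-1: "-1_outliers", 0: "0_largest", 1: "1_mid", 2: "2_small"}

info = pd.DataFrame(topic_sizes.items(), columns=["Topic", "Count"])
# Pin the outlier topic (-1) to the top, then sort the rest by descending count
info["_outlier"] = info.Topic == -1
info = (info.sort_values(["_outlier", "Count"], ascending=[False, False])
            .drop(columns="_outlier")
            .reset_index(drop=True))
info["Name"] = info.Topic.map(topic_names)
print(info.Topic.tolist())  # [-1, 0, 1, 2]
```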

drob-xx commented 2 years ago

I understand. It isn't critical, but this was unexpected behavior: the topics are re-numbered sequentially from -1 to n, where topics 0 through n are ordered by size. It is entirely possible for -1 to be the smallest topic or to fall somewhere in the middle, so I'm not sure the assumption is warranted. While minor, when it happens it is easy to miss, and if you are relying on the order it can mess things up.

MaartenGr commented 2 years ago

Agreed. I think I will change this in the upcoming release with your suggestion and simply sort by "Topic". I will also test it by reducing and/or updating topics after having fitted a model to see what happens there.

MaartenGr commented 2 years ago

This was fixed in v0.11 and this issue will be closed. If you continue to run into this problem, let me know and I'll make sure to re-open the issue.