MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Topic modeling regression in 0.14.0 with nr_topics #1043

Open · damosuzuki opened this issue 1 year ago

damosuzuki commented 1 year ago

I have noticed a drop in topic modeling quality in 0.14.0 when specifying the nr_topics parameter.

Here is my test script:

from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from sklearn.datasets import fetch_20newsgroups

# Load the 20 newsgroups training set, stripping headers/footers/quotes
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))

# Reduce the impact of frequent words in the c-TF-IDF representation
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

# Reduce to as many topics as there are newsgroup categories (20)
topic_model = BERTopic(nr_topics=len(newsgroups_train['target_names']),
                       ctfidf_model=ctfidf_model,
                       calculate_probabilities=True)
topic_model.fit(newsgroups_train['data'])

print(topic_model.get_topic_info())

With bertopic==0.13.0:

    Topic  Count                                           Name
0      -1   4688  -1_maxaxaxaxaxaxaxaxaxaxaxaxaxaxax_for_on_you
1       0    700                             0_car_bike_cars_my
2       1    638                       1_drive_scsi_drives_disk
3       2    575                    2_gun_guns_militia_firearms
4       3    547                  3_key_encryption_clipper_chip
5       4    539                         4_team_hockey_550_game
6       5    527                 5_patients_msg_medical_disease
7       6    483                    6_year_baseball_pitching_he
8       7    405                       7_card_monitor_video_vga
9       8    375                  8_israel_turkish_jews_israeli
10      9    317                          9_ditto_ites_hello_hi
11     10    199                           10_god_jesus_hell_he
12     11    182               11_window_widget_colormap_server
13     12    173                    12_morality_truth_god_moral
14     13    172                    13_fbi_koresh_compound_batf
15     14    171                   14_amp_condition_scope_offer
16     15    141               15_atheists_atheism_god_universe
17     16    131                    16_printer_fonts_font_print
18     17    118                     17_ted_post_challenges_you
19     18    118                      18_windows_dos_cview_swap
20     19    115     19_xfree86_libxmulibxmuso_symbol_undefined

And with bertopic==0.14.0:

    Topic  Count                                               Name
0      -1   3334                                   -1_you_it_for_is
1       0   4402                                   0_for_with_on_be
2       1    620                       1_god_stephanopoulos_that_mr
3       2    559                      2_patients_medical_msg_health
4       3    437                          3_space_launch_nasa_lunar
5       4    436                     4_israel_were_turkish_armenian
6       5    376                                5_car_bike_cars_dog
7       6    296                        6_gun_guns_firearms_militia
8       7    230                     7_morality_objective_gay_moral
9       8    139               8_symbol_xterm_libxmulibxmuso_server
10      9    119                             9_printer_ink_print_hp
11     10     94                      10_requests_send_address_list
12     11     88                  11_radar_detector_detectors_radio
13     12     42                      12_church_pope_schism_mormons
14     13     40              13_ground_battery_grounding_conductor
15     14     36                        14_tax_taxes_deficit_income
16     15     24            15_marriage_married_ceremony_commitment
17     16     20  16_maxaxaxaxaxaxaxaxaxaxaxaxaxaxax_mg9vg9vg9vg...
18     17     12                              17_ditto_hello_hi_too
19     18     10                    18_professors_tas_phds_teaching
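For what it's worth, the shift is easy to quantify from a fitted model. Below is a rough sketch (run once per bertopic version) that sums the get_topic_info() counts for the outlier topic (-1) and the largest remaining topic; the column names are as returned by get_topic_info():

# Rough comparison of how many documents end up in the outlier topic (-1)
# and in the single largest non-outlier topic for the fitted model above.
info = topic_model.get_topic_info()
total = info['Count'].sum()

outlier_count = info.loc[info['Topic'] == -1, 'Count'].sum()
largest_count = info.loc[info['Topic'] != -1, 'Count'].max()

print(f"outlier share: {outlier_count / total:.2%}")
print(f"largest topic share: {largest_count / total:.2%}")

From the tables above, the largest non-outlier topic holds about 6% of the documents under 0.13.0, while under 0.14.0 topic 0 alone holds roughly 39%.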
MaartenGr commented 1 year ago

That is indeed quite the difference! I had updated the underlying algorithm of nr_topics to prevent any topics from being merged into the outlier topic, and I was quite happy with the results, but this seems to show something entirely different. I will test this in a bit more detail to see whether the same thing happens with other datasets. If so, it might be a bug, or I might simply revert to the old algorithm.
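In the meantime, a two-step route may be worth comparing against the in-fit reduction: fit without nr_topics and call reduce_topics afterwards. This is only a sketch to help isolate where the merging happens, not a confirmed fix, and it assumes the reduce_topics(docs, nr_topics=...) signature:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
docs = newsgroups_train['data']

# Fit without nr_topics so no reduction happens during fit
topic_model = BERTopic(calculate_probabilities=True)
topic_model.fit(docs)

# Reduce to 20 topics (the number of newsgroup categories) as a separate step
topic_model.reduce_topics(docs, nr_topics=len(newsgroups_train['target_names']))
print(topic_model.get_topic_info())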