MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

partial_fit throws warning and inf scores #2098

Open IsaacGreenMachine opened 1 month ago

IsaacGreenMachine commented 1 month ago

Describe the bug

Running partial_fit starts to throw the following warning after ~100 iterations:

~/.venv/lib/python3.12/site-packages/bertopic/vectorizers/_ctfidf.py:84 RuntimeWarning: overflow encountered in divide
  idf = np.log((avg_nr_samples / df) + 1)
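
For anyone trying to reason about the warning, here is a minimal NumPy-only sketch of how an expression of this shape can overflow to inf. It copies only the formula from the warning line; the input values are hypothetical, chosen so that one column sum is extremely small relative to `avg_nr_samples`. It is not a diagnosis of what happens inside BERTopic.

```python
import warnings

import numpy as np


def idf_like(avg_nr_samples, df):
    """Same expression as in the warning: log(avg_nr_samples / df + 1)."""
    return np.log((avg_nr_samples / df) + 1)


# Hypothetical inputs: the second column sum is close to zero.
avg_nr_samples = np.float64(1e10)
df = np.array([50.0, 1e-300], dtype=np.float64)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    idf = idf_like(avg_nr_samples, df)

print(idf)                                # second entry is inf
print([str(w.message) for w in caught])   # the RuntimeWarning raised by the division
```

The division is what overflows: the quotient exceeds the largest float64, becomes inf, and the log of inf is still inf, which is consistent with inf scores showing up downstream.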

Reproduction

I'm on an M3 MacBook Pro with Python 3.12.4, scikit-learn 1.5.1, bertopic 0.16.3, numpy 1.26.4, and scipy 1.14.0.

Here is a slightly modified version of the partial_fit example from the docs:

from bertopic import BERTopic
from bertopic.vectorizers import OnlineCountVectorizer
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA

# Prepare sub-models that support online learning
umap_model = IncrementalPCA(n_components=5)
cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
vectorizer_model = OnlineCountVectorizer(stop_words="english", decay=.01)

topic_model = BERTopic(umap_model=umap_model,
                       hdbscan_model=cluster_model,
                       vectorizer_model=vectorizer_model
                       )

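# `df` and `embeddings` are not defined in this snippet: `df` has a "message"
# column of documents, and `embeddings` is sliced with the same row indices
batch_size = 128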
for idx in range(0, len(df), batch_size):
    docs = df["message"].iloc[idx:idx+batch_size]
    embeds = embeddings[idx:idx+batch_size]
    topic_model.partial_fit(list(docs.astype('string')), embeds)

BERTopic Version

0.16.3

IsaacGreenMachine commented 1 month ago

Here's the output, by the way. The scores are infinity (please ignore the numbers used for the cluster names).

Here's the topic info after fully training with partial_fit on my dataset:

topic_model.get_topic(1, full=True)
{'Main': [('1054375', inf),
  ('982395', inf),
  ('1013164', inf),
  ('1031508', inf),
  ('1031576', inf),
  ('957591', inf),
  ('1043766', inf),
  ('1054256', inf),
  ('1054349', inf),
  ('1054355', inf)]}
topic_model.get_topic_info()
    Topic   Count   Name    Representation  Representative_Docs
0   0   109104  0_30470_32665_14_29 [30470, 32665, 14, 29, 31, 10, 40, 44, 53, 45]  NaN
1   1   52972   1_1054375_982395_1013164_1031508    [1054375, 982395, 1013164, 1031508, 1031576, 9...   NaN
2   2   129841  2_369045_369865_371828_371766   [369045, 369865, 371828, 371766, 365563, 36557...   NaN
3   3   66873   3_27_13_16_38   [27, 13, 16, 38, 44, 3350, 46, 3574, 3831, 4222]    NaN
4   4   41935   4_12_22_31_14   [12, 22, 31, 14, 10, 43, 39, 42, 37, 38]    NaN
5   5   67877   5_220939_220545_222224_220700   [220939, 220545, 222224, 220700, 220669, 21894...   NaN
6   6   48487   6_1767968_1593953_1683623_1593883   [1767968, 1593953, 1683623, 1593883, 1683534, ...   NaN
7   7   35557   7_552517_545624_543607_552708   [552517, 545624, 543607, 552708, 543675, 55253...   NaN
8   8   75309   8_14_15_13_16   [14, 15, 13, 16, 42, 38, 29, 53, 10, 3675]  NaN
9   9   77294   9_218852_220508_218868_218942   [218852, 220508, 218868, 218942, 220498, 15, 3...   NaN
10  10  60717   10_599079_574468_571526_588418  [599079, 574468, 571526, 588418, 598261, 59877...   NaN
11  11  46438   11_4756774_4756616_4747992_4756619  [4756774, 4756616, 4747992, 4756619, 4756573, ...   NaN
12  12  91285   12_31_42_3350_40    [31, 42, 3350, 40, 14, 3389, 3381, 48, 3727, 44]    NaN
13  13  67976   13_387158_391193_392310_384981  [387158, 391193, 392310, 384981, 392368, 39232...   NaN
14  14  60478   14_19_13_38_41  [19, 13, 38, 41, 10, 44, 48, 52, 3727, 4204]    NaN
15  15  60942   15_240524_241155_240871_241228  [240524, 241155, 240871, 241228, 241243, 24117...   NaN
16  16  87910   16_218382_10_14_28  [218382, 10, 14, 28, 27, 29, 31, 37, 43, 39]    NaN
17  17  22748   17_849849_815686_839704_826626  [849849, 815686, 839704, 826626, 795510, 84975...   NaN
18  18  58772   18_517017_498821_515931_476445  [517017, 498821, 515931, 476445, 516308, 51588...   NaN
19  19  56241   19_220610_220836_39_13  [220610, 220836, 39, 13, 44, 48, 38, 14, 10, 51]    NaN
20  20  110401  20_28305_16_10_12   [28305, 16, 10, 12, 27, 37, 29, 22, 41, 44] NaN
21  21  121051  21_15_17_12_27  [15, 17, 12, 27, 10, 18, 14, 37, 43, 42]    NaN
22  22  53491   22_32963_27_37_3381 [32963, 27, 37, 3381, 38, 3532, 3574, 3404, 39...   NaN
23  23  58245   23_30470_10_37_38   [30470, 10, 37, 38, 44, 29, 48, 3727, 4254, 4331]   NaN
24  24  39084   24_444630_437352_436133_440889  [444630, 437352, 436133, 440889, 436174, 43615...   NaN
25  25  13123   25_2883254_2842667_2866698_2866738  [2883254, 2842667, 2866698, 2866738, 2866690, ...   NaN
26  26  60774   26_239877_240079_240123_240453  [239877, 240079, 240123, 240453, 240659, 24070...   NaN
27  27  27280   27_1160426_1115483_1147040_1110211  [1160426, 1115483, 1147040, 1110211, 1110213, ...   NaN
28  28  70162   28_368293_364473_365277_364744  [368293, 364473, 365277, 364744, 365791, 36457...   NaN
29  29  78156   29_224484_222741_222735_224365  [224484, 222741, 222735, 224365, 222872, 22404...   NaN
30  30  39348   30_29260_31_18_12   [29260, 31, 18, 12, 43, 10, 42, 44, 3389, 53]   NaN
31  31  59289   31_33739_212113_212106_12   [33739, 212113, 212106, 12, 28, 43, 31, 14, 33...   NaN
32  32  79654   32_27230_27_14_10   [27230, 27, 14, 10, 22, 29, 38, 37, 41, 3350]   NaN
33  33  76397   33_222872_223204_14_10  [222872, 223204, 14, 10, 16, 22, 37, 47, 44, 29]    NaN
34  34  70535   34_30267_17_10_43   [30267, 17, 10, 43, 27, 47, 31, 37, 3397, 29]   NaN
35  35  40708   35_33454_13_16_43   [33454, 13, 16, 43, 29, 22, 41, 46, 52, 3574]   NaN
36  36  33484   36_30470_28_16_13   [30470, 28, 16, 13, 3389, 44, 10, 48, 37, 4222] NaN
37  37  72767   37_222875_222950_223145_223221  [222875, 222950, 223145, 223221, 224365, 22330...   NaN
38  38  36732   38_253412_251896_253272_253500  [253412, 251896, 253272, 253500, 253488, 25324...   NaN
39  39  42591   39_250962_250920_248635_251452  [250962, 250920, 248635, 251452, 248993, 24901...   NaN
40  40  17007   40_4714314_4699011_4714064_4714090  [4714314, 4699011, 4714064, 4714090, 4714056, ...   NaN
41  41  68058   41_27321_27363_27393_27407  [27321, 27363, 27393, 27407, 14, 17, 18, 29, 2...   NaN
42  42  94062   42_27321_27411_27552_27259  [27321, 27411, 27552, 27259, 16, 14, 31, 27, 2...   NaN
43  43  79166   43_15_10_29_28  [15, 10, 29, 28, 3350, 37, 27, 3397, 53, 3381]  NaN
44  44  79278   44_879543_839601_849938_850763  [879543, 839601, 849938, 850763, 849858, 87072...   NaN
45  45  54102   45_224506_223904_224478_224046  [224506, 223904, 224478, 224046, 222820, 23759...   NaN
46  46  65796   46_14_27_17_29  [14, 27, 17, 29, 37, 38, 48, 3555, 52, 3582]    NaN
47  47  75014   47_38_52_29_3515    [38, 52, 29, 3515, 3695, 3636, 10, 3831, 3989,...   NaN
48  48  50618   48_251880_249120_251382_251243  [251880, 249120, 251382, 251243, 250920, 25040...   NaN
49  49  42643   49_239463_239393_238194_239129  [239463, 239393, 238194, 239129, 239184, 23920...   NaN

MaartenGr commented 1 month ago

I'm not entirely sure why this happens, but it might be worthwhile to use the decay or delete_min_df parameters of OnlineCountVectorizer to prevent certain counts from blowing up. It would be worth a shot.
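
To make the suggestion concrete, here is a toy NumPy model of what the two parameters are documented to do: `decay` shrinks the accumulated word counts by a fixed fraction at each iteration, and `delete_min_df` drops words whose accumulated count does not exceed a threshold. The function `simulate`, the two-word vocabulary, and all the numbers are hypothetical. This is a sketch of the idea, not BERTopic's implementation, and it does not by itself explain the inf scores above.

```python
import numpy as np


def simulate(n_batches, decay, delete_min_df=None):
    """Toy running word totals with decay and an optional pruning threshold."""
    totals = np.zeros(2)
    for i in range(n_batches):
        # Word 0 occurs 5 times in every batch.
        # Word 1 occurs once in the first batch and never again.
        batch = np.array([5.0, 1.0 if i == 0 else 0.0])
        totals = totals * (1 - decay) + batch
    if delete_min_df is not None:
        # Pruned once at the end for simplicity; word 1 only ever shrinks,
        # so pruning at every iteration would give the same result here.
        totals = totals[totals > delete_min_df]
    return totals


print(simulate(500, decay=0.01))                     # word 1 has decayed to a small total
print(simulate(500, decay=0.01, delete_min_df=0.1))  # word 1 has been pruned
```

With decay alone, a word that stops appearing keeps an ever smaller total and therefore an ever larger value of avg_nr_samples / df. Adding a threshold removes such words before their totals get that small, which is one reason to try both parameters together on the OnlineCountVectorizer rather than decay alone.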