MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Empty topic representation for online topic modeling #1222

Closed florine-henriot closed 1 year ago

florine-henriot commented 1 year ago

Hello Maarten!

I am trying to cluster articles with the online topic modeling method. I have separated my dataset by month, and I am doing a .partial_fit() on each month of my dataset. I am using IncrementalPCA, MiniBatchKMeans and the OnlineCountVectorizer. Once I have trained the model on my whole dataset, I noticed that I have some topics without any representation and no keywords, and I realized that this happens when no new documents are added to a topic. For example, this is the topic representation I have for one partial fit:

[screenshot: topic representations after one partial_fit]

And then what I have for the next partial fit :

[screenshot: topic representations after the next partial_fit, with an empty representation for topic 8]

I noticed that no new documents had been added to topic 8, so I have no representation for it. And every time I get no topic representation, it is when no new documents have been added to the topic. I tried changing the decay and min_df parameters, but I still get the same problem. I have not found anything to fix this issue. Do you know if there is a way to fix this?

Thank you!

MaartenGr commented 1 year ago

That is strange indeed! Could you share your entire code? That will make it a bit easier to see what is happening here.

florine-henriot commented 1 year ago

Thank you for your quick answer! Sure, here is the code for my BERTopic model: [screenshot of the BERTopic configuration]

And here is my code for the partial_fit method: [screenshot of the partial_fit loop]

I'm just writing a txt file at each iteration to see what is happening.
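
Since the screenshots are not reproduced here, below is a rough sketch of the kind of setup being described. The component choices follow the description above, but the exact parameter values, the `monthly_docs` variable, and the logging are assumptions:

```python
from sklearn.decomposition import IncrementalPCA
from sklearn.cluster import MiniBatchKMeans
from bertopic import BERTopic
from bertopic.vectorizers import OnlineCountVectorizer
from bertopic.representation import PartOfSpeech

# Components that all support incremental training via .partial_fit()
umap_model = IncrementalPCA(n_components=5)
cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
vectorizer_model = OnlineCountVectorizer(stop_words="english", decay=.01)
representation_model = PartOfSpeech("en_core_web_sm")  # assumed spaCy model

topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=cluster_model,
    vectorizer_model=vectorizer_model,
    representation_model=representation_model,
)

# One partial_fit per monthly chunk, dumping the topics to a txt file each time
for month, docs in monthly_docs.items():  # monthly_docs: placeholder, e.g. {"2023-01": [...], ...}
    topic_model.partial_fit(docs)
    with open(f"topics_{month}.txt", "w") as f:
        f.write(str(topic_model.get_topics()))
```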

jburdo1 commented 1 year ago

Following on from this, I think there may be an issue with the decay argument in OnlineCountVectorizer. For my model, when I remove that argument, the vectors are generated to completion. When I include decay=.01 (as florine has above), I get a "RuntimeWarning: overflow encountered in true_divide" along with follow-on numpy errors that seem to (maybe?) indicate a null or NaN value within a vector.

MaartenGr commented 1 year ago

That is how the decay parameter works. It reduces the counts in the bag-of-words matrix with each iteration to make sure that newer information has more weight. It might be that the value of .01 is simply too high for the frequency with which the model is being trained. Either lowering the value further or simply not setting it should solve the issue.
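
As a rough illustration of the effect (not BERTopic's internal code), the decay value can be thought of as shrinking previously accumulated counts at every partial_fit, so topics that receive no new documents keep fading:

```python
# Rough illustration only: with a decay of .01, previously accumulated
# term counts shrink by roughly 1% at every partial_fit call.
decay = 0.01
count = 100.0  # term count for a topic that receives no new documents

for iteration in range(1, 6):
    count *= (1 - decay)  # old counts fade so newer data weighs more
    print(f"after partial_fit {iteration}: {count:.2f}")

# A topic that never receives new documents keeps shrinking toward zero,
# which can eventually lead to (near-)empty rows and divide warnings
# when the c-TF-IDF representation is computed.
```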

florine-henriot commented 1 year ago

Hi, thank you for your answers. I tried without setting the decay parameter, and the results are even worse than when I set decay = .01. Here are the results for my first partial_fit:

[screenshot: topic representations after the first partial_fit]

And here are the results for my second partial_fit:

[screenshot: topic representations after the second partial_fit]

I also tried with decay = 0, and I have similar results for the second fit.

I also have another question: is there a mapping between the results obtained from one partial_fit to the next? If I remember correctly, I saw that there was a mapping between the different topic representations, but I noticed that the keywords I obtain are quite different. For example, here are the keywords I obtain for topic 0 of my first partial_fit:

{0: [('year', 0.05152180849788245), ('community', 0.046929186455780814), ('digital', 0.03635118657061755), ('tiger', 0.036274630674348975), ('chinese', 0.036274630674348975), ('mechanical', 0.03424785892664652), ('transformation', 0.0338970895067121), ('team', 0.0337541109955536), ('collective', 0.030963910840698373), ('digitalisation', 0.030963910840698373)]}

And here are the ones I get for my second partial_fit :

{0: [('child', 0.03843448424455639), ('wellbeing', 0.02900759247136183), ('match', 0.02289947465156453), ('follow', 0.02278532690216033), ('opportunity', 0.02239565155539829), ('launch', 0.02071000515270328), ('football', 0.01619938539992005), ('predecessor', 0.01619938539992005), ('physiological', 0.01619938539992005), ('ill', 0.01619938539992005)]}

I got those results without setting the decay and min_df parameters, and only one document was added to the topic between the two partial fits. Is it actually supposed to change that much, or am I doing something wrong? Or is it because I'm not setting the min_df and decay parameters?

I also tried to change the topic representations myself by adding a condition: if I end up with an empty representation for a specific topic, I take back that topic's representation from the previous partial fit. That way, I have a list of tuples with the old keywords and the old c-TF-IDF scores, and I then change the topic representation obtained with model.get_topics(). As a result, the dictionary I obtain has no empty representations. But when I print model.get_topic_info(), I still get an empty line in the dataframe. Is there a way to also change this, so I can have the old topic representation in the dataframe obtained with model.get_topic_info()?
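
For reference, here is a sketch of the kind of backfilling described above. The `monthly_docs` variable and the emptiness check are assumptions, and whether writing the patched dictionary back into the model would also update get_topic_info() is exactly the open question here:

```python
# Sketch of the backfilling workaround: keep the previous representation
# for any topic that comes back empty after a partial_fit.
previous_topics = {}

for month, docs in monthly_docs.items():  # monthly_docs: placeholder monthly batches
    topic_model.partial_fit(docs)
    current_topics = topic_model.get_topics()

    for topic_id, words in current_topics.items():
        # Treat a missing list or a list of blank words as "empty";
        # the exact shape of an empty representation may vary by version.
        is_empty = not words or all(not word for word, _ in words)
        if is_empty and topic_id in previous_topics:
            current_topics[topic_id] = previous_topics[topic_id]

    previous_topics = current_topics
```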

I would really appreciate some help with this. Thanks a lot!

MaartenGr commented 1 year ago

Just to be sure, the code is exactly as you mentioned in your previous message except for omitting the decay parameter? The rest is the same?

If so, then it might be worthwhile to minimize the code that you currently have and try to create a minimal example. For example, perhaps the nr_topics="auto" is the culprit here. Omitting that might help. Same with n_gram_range, top_n_words, etc.
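
As a sketch of such a minimal example (parameter values are placeholders), everything optional is stripped down to the three online components, and the removed parameters can then be re-added one at a time:

```python
from sklearn.decomposition import IncrementalPCA
from sklearn.cluster import MiniBatchKMeans
from bertopic import BERTopic
from bertopic.vectorizers import OnlineCountVectorizer

# Bare-bones online configuration: no decay, no min_df, no nr_topics="auto",
# no n_gram_range/top_n_words overrides, no representation model.
topic_model = BERTopic(
    umap_model=IncrementalPCA(n_components=5),
    hdbscan_model=MiniBatchKMeans(n_clusters=50, random_state=0),
    vectorizer_model=OnlineCountVectorizer(stop_words="english"),
)
# If this trains cleanly, re-introduce the removed parameters one at a time
# to find the one that causes the empty representations.
```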

florine-henriot commented 1 year ago

Thanks for your answer. Yes, I am using exactly the same code as I mentioned before. I tested each parameter one by one and found the problem: it's not the decay parameter, it's the PartOfSpeech representation model. I don't have this problem without it.

MaartenGr commented 1 year ago

Glad to hear that you resolved the issue! Most likely, there were not sufficient documents for each topic for the PartOfSpeech model to accurately extract the words in each partial fit.

noahberhe commented 1 year ago

Interesting... it looks like you shouldn't use any representation models for online learning in that case. I had the same issue with KeyBERT, which was similarly remedied by removing it.

Is there a way to use the representation model the first time I use partial_fit() but then switch it off for subsequent iterations?

MaartenGr commented 1 year ago

@noahberhe That is currently not possible but I do not think using it the first time during partial fit would be the solution here. I believe after having trained your model by passing all data through several partial fits, you can then use update_topics with any representation model to get better representations of your topics.
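
A sketch of what that could look like, assuming `doc_chunks` holds the monthly batches. Note that with partial_fit the `.topics_` attribute only contains the most recent batch, so the topic assignments have to be tracked manually; KeyBERTInspired is used here just as an example representation model:

```python
from bertopic.representation import KeyBERTInspired

all_docs, all_topics = [], []
for docs in doc_chunks:  # doc_chunks: placeholder iterable of document batches
    topic_model.partial_fit(docs)
    all_docs.extend(docs)
    all_topics.extend(topic_model.topics_)  # .topics_ only holds the last batch

# Restore the full document-topic mapping before updating the representations
topic_model.topics_ = all_topics

# Apply a representation model once, over all documents seen so far
topic_model.update_topics(all_docs, representation_model=KeyBERTInspired())
```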