MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Using representation_model in online learning #1628

Closed. Ceglowa closed this issue 6 months ago.

Ceglowa commented 1 year ago

Hi, I am running an online-learning scenario with River and OpenAI as the representation_model. I have the following code:

# Imports assumed for this snippet
from sklearn.decomposition import IncrementalPCA
from river import cluster
from bertopic import BERTopic
from bertopic.representation import OpenAI
from bertopic.vectorizers import OnlineCountVectorizer, ClassTfidfTransformer

# The OpenAI API key is assumed to be configured elsewhere
representation_model = OpenAI(model="gpt-3.5-turbo", exponential_backoff=True, chat=True)
umap_model = IncrementalPCA()
vectorizer_model = OnlineCountVectorizer(stop_words="english")
cluster_model = River(cluster.DBSTREAM())  # `River` is the wrapper class from the online-learning docs
ctfidf_model = ClassTfidfTransformer()

online_topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=cluster_model,
    ctfidf_model=ctfidf_model,
    vectorizer_model=vectorizer_model,
    representation_model=representation_model,
    verbose=True,
)

topics = []
# `chunks` is a small helper that yields batches of documents (sketched below)
for chunk in chunks(full_sentences, 5):
    online_topic_model.partial_fit(chunk)
    topics.extend(online_topic_model.topics_)
online_topic_model.topics_ = topics
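
For reference, the two helpers assumed above look roughly like this; the River wrapper follows the pattern shown in the BERTopic online-learning documentation, and chunks is just a simple batching generator:

from river import stream

class River:
    def __init__(self, model):
        self.model = model

    def partial_fit(self, umap_embeddings):
        # Update the river clustering model with this batch of reduced embeddings
        for umap_embedding, _ in stream.iter_array(umap_embeddings):
            self.model.learn_one(umap_embedding)

        # Assign a cluster label to every embedding in the batch
        labels = []
        for umap_embedding, _ in stream.iter_array(umap_embeddings):
            labels.append(self.model.predict_one(umap_embedding))

        self.labels_ = labels
        return self

def chunks(documents, size):
    # Yield successive batches of `size` documents
    for i in range(0, len(documents), size):
        yield documents[i:i + size]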

I checked the prompts being sent to OpenAI and found that each partial_fit call only uses the documents from that particular iteration. That makes sense for topic modelling, but it makes the representation_model somewhat problematic: a situation can arise where a topic has no documents in a given iteration, and then the prompt that is sent contains no documents at all.

Is there a way to make the representation_model use all documents, including those from past iterations?

MaartenGr commented 1 year ago

Is there a way to make the representation_model use all documents, including those from past iterations?

That is not possible. Incremental (online) learning assumes a training procedure that does not require keeping track of the documents. In incremental settings, it is uncommon to continuously track all data, since storing the data in the models can lead to memory errors or unwieldy models.

Instead, it might be worthwhile to check out the newly released .merge_models functionality in the main branch. It can behave similarly to partial_fit, but since it was not made specifically for incremental learning, it can do quite a bit more. You can find more information about that here.
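
A minimal sketch of what that could look like, assuming docs_old and docs_new are two batches of documents:

from bertopic import BERTopic

# Train two independent topic models on different batches of documents
topic_model_a = BERTopic(min_topic_size=10).fit(docs_old)
topic_model_b = BERTopic(min_topic_size=10).fit(docs_new)

# Combine them; topics from the second model that do not match any existing
# topic closely enough are added to the first model as new topics
merged_model = BERTopic.merge_models([topic_model_a, topic_model_b])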

Ceglowa commented 1 year ago

The merging functionality seems interesting. However, I have some questions about it. Do you believe it will help with the representation_model problem? If I have topic_1 trained on 5k documents and topic_2 trained on a different set of 5k documents, then each of those models has a different representation of the topics (in terms of topic titles). How would merging work here?

To make my use case clearer: I expect a constant flow of new documents that I want to group, I don't know how many topics there will be, and new topics will keep appearing. Additionally, I want to generate a title for each topic, and since new data keeps coming in I would like the titles to become smarter with each iteration. For example, in iteration 1 the prompt sent for topic X contains 5 representative documents; in the next iteration I would want a better prompt that contains 10 representative documents instead of 5.

MaartenGr commented 1 year ago

Do you believe it will help with the representation_model problem? If I have topic_1 trained on 5k documents and topic_2 trained on a different set of 5k documents, then each of those models has a different representation of the topics (in terms of topic titles). How would merging work here?

Topics will be merged depending on their similarity. If two topics are similar enough to one another, they will be merged; if they are dissimilar, they will be added to the original model as new topics. You can control the similarity threshold.
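
Roughly, the knob for this is min_similarity (model_a and model_b here are the two separately trained models):

# Topics in model_b whose similarity to every topic in model_a falls below
# the threshold are added as new topics; the others are merged
merged_model = BERTopic.merge_models([model_a, model_b], min_similarity=0.7)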

To make my use case clearer: I expect a constant flow of new documents that I want to group, I don't know how many topics there will be, and new topics will keep appearing. Additionally, I want to generate a title for each topic, and since new data keeps coming in I would like the titles to become smarter with each iteration. For example, in iteration 1 the prompt sent for topic X contains 5 representative documents; in the next iteration I would want a better prompt that contains 10 representative documents instead of 5.

.merge_models is mainly used to combine two trained topic models in an attempt to add new topics to a previously trained model. Your use case would be possible if you simply change the order in which you combine the models: if you have model_a and model_b and model_b is the improved one, then use .merge_models([model_b, model_a]) to have the topics from model_a integrated into model_b.
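
A rough sketch of how that could look for a continuous stream of batches, following the same ordering rule (doc_batches is an illustrative name; the model whose representations you want to keep is listed first):

from bertopic import BERTopic

# Fit a base model on the first batch of documents
base_model = BERTopic(min_topic_size=10).fit(doc_batches[0])

for batch in doc_batches[1:]:
    # Train a fresh model on the new batch only
    new_model = BERTopic(min_topic_size=10).fit(batch)

    # Keep the accumulated model first so only genuinely new topics
    # from `new_model` are added to it
    base_model = BERTopic.merge_models([base_model, new_model])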

For a more in-depth description, check out this documentation.

Ceglowa commented 6 months ago

Hi @MaartenGr. I managed to make use of the new model-merging feature. Thanks for the recommendation back then; merging models worked really well for my use case. Closing the ticket.