[Closed] Ceglowa closed this issue 6 months ago
> Is there some possibility to make it that the representation_model is using all documents (also those ones from the past)?
That is not possible. Incremental learning, or online learning, assumes a method of training that does not require keeping track of past documents. In incremental settings, it is uncommon to continuously track all data, as storing the documents inside the model could result in memory errors or an unwieldy model.
Instead, it might be worthwhile to check out the newly released .merge_models functionality in the main branch. It can behave similarly to partial_fit, but since it was not made specifically for incremental learning, it can do quite a bit more. You can find more information about that here.
The merging functionality seems interesting. However, I have some questions about it.
Do you believe that it will help with the representation_model problem? If I have a topic_1 that is trained on 5k documents and a topic_2 that is trained on a different set of 5k documents, then each of those models has a different representation of the topics (in terms of topic titles). How would merging work here?
To make my use case clearer: I am expecting a constant flow of new documents that I want to group. I don't know how many topics there will be, and new topics will keep coming in. Additionally, I want to generate a title for each topic. Since new data keeps arriving, I am thinking about making the titles smarter with each iteration. For example, in iteration 1 the prompt sent for topic X contains 5 representative documents; in the next iteration I would want a better prompt, so instead of 5 representative documents it would contain 10.
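The "growing prompt" idea described above can be sketched in plain Python: keep a per-topic pool of representative documents across iterations and include more of them in the prompt each time. All names here (update_pool, build_prompt) are hypothetical helpers for illustration, not part of BERTopic.

```python
def update_pool(pool, topic_id, new_docs):
    # Accumulate representative documents for a topic across iterations.
    pool.setdefault(topic_id, []).extend(new_docs)

def build_prompt(pool, topic_id, n_docs):
    # Build a topic-titling prompt from up to n_docs stored documents.
    docs = pool.get(topic_id, [])[:n_docs]
    joined = "\n".join(f"- {d}" for d in docs)
    return f"Give a short title for a topic with these documents:\n{joined}"

pool = {}

# Iteration 1: the prompt for topic 0 uses 5 representative documents.
update_pool(pool, 0, [f"doc_{i}" for i in range(5)])
prompt_1 = build_prompt(pool, 0, n_docs=5)

# Iteration 2: the pool has grown, so the prompt can now use 10 documents.
update_pool(pool, 0, [f"doc_{i}" for i in range(5, 10)])
prompt_2 = build_prompt(pool, 0, n_docs=10)
```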
> Do you believe that it will help with the representation_model problem? If I have a topic_1 that is trained on 5k documents and topic_2 that is trained on a different set of 5k documents then each of those models has a different representation of the topics (in terms of titles of the topics). Then how would merging work here?
Topics will be merged depending on their similarity: if two topics are similar enough to one another, they will be merged; if they are dissimilar, they will be added to the original model as new topics. You can control the similarity threshold.
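The merging behaviour described above can be illustrated with a toy sketch in plain Python. This is not BERTopic's actual implementation; it just mimics the idea that topics from the second model are absorbed when similar to an existing topic and appended otherwise, with min_similarity standing in for the controllable threshold. Topics are represented here as simple {word: weight} dicts.

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two {word: weight} dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = sqrt(sum(w * w for w in a.values()))
    nb = sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def merge_topic_models(base, new, min_similarity=0.7):
    # Toy merge: topics from `new` that resemble some topic in `base`
    # are absorbed (base wins); dissimilar ones are appended as new topics.
    merged = dict(base)
    next_id = max(base) + 1 if base else 0
    for rep in new.values():
        if any(cosine(rep, existing) >= min_similarity for existing in base.values()):
            continue  # similar enough: treated as the same topic
        merged[next_id] = rep
        next_id += 1
    return merged

model_a = {0: {"cat": 1.0, "dog": 0.8}, 1: {"stock": 1.0, "market": 0.9}}
model_b = {0: {"dog": 1.0, "cat": 0.7}, 1: {"soccer": 1.0, "goal": 0.8}}

# model_b's cat/dog topic is absorbed; its soccer/goal topic is appended.
merged = merge_topic_models(model_a, model_b, min_similarity=0.7)
```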
> To make my use-case more clear: I am expecting a constant flow of new documents that I want to group up. I don't know how many topics will be there. And also there will be new topics coming in. Additionally, I want to generate a title for each topic. However, since new data is coming in I am thinking about making the titles smarter and smarter with each new iteration. So, for example for iteration 1 for topic X, the prompt that is being sent has 5 representative documents. Then, next iteration I would want the prompt to be better. So, instead of 5 representative documents, it has now 10 representative documents.
.merge_models is mainly used to combine two trained topic models in an attempt to add new topics to a previously trained model. Your use case would be possible if you simply changed the order in which you combine the models. If you have model_a and model_b, and model_b is the improved one, then simply use .merge_models([model_b, model_a]) to have topics from model_a be integrated into model_b.
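Assuming two already-fitted BERTopic models, the call described above would look roughly like the sketch below (the min_similarity value is illustrative, not a recommendation):

```python
from bertopic import BERTopic

# model_b is the improved model; topics from model_a that are not
# similar enough to an existing topic in model_b are added as new topics.
merged_model = BERTopic.merge_models([model_b, model_a], min_similarity=0.7)
```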
For a more in-depth description, check out this documentation.
Hi @MaartenGr. I managed to make use of your new model-merging feature. Thanks for the recommendation back then. Merging the models worked really well for my use case. Closing the ticket.
Hi, I am running an online learning scenario with River and OpenAI as representation_model. I have the following code:
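A setup like the one described (River for incremental clustering, OpenAI for topic representations) might look roughly like the following sketch, modeled on BERTopic's online-topic-modeling examples. All model choices and the River wrapper here are assumptions for illustration, not the poster's actual code, and running it requires an OpenAI API key:

```python
from bertopic import BERTopic
from bertopic.representation import OpenAI
from bertopic.vectorizers import OnlineCountVectorizer
from river import cluster, stream
from sklearn.decomposition import IncrementalPCA
import openai

class River:
    """Wrapper so a River clusterer exposes the partial_fit API BERTopic expects."""
    def __init__(self, model):
        self.model = model

    def partial_fit(self, embeddings):
        for embedding, _ in stream.iter_array(embeddings):
            self.model.learn_one(embedding)
        self.labels_ = [
            self.model.predict_one(embedding)
            for embedding, _ in stream.iter_array(embeddings)
        ]
        return self

client = openai.OpenAI()  # assumes OPENAI_API_KEY is set in the environment

topic_model = BERTopic(
    umap_model=IncrementalPCA(n_components=5),    # incremental dim. reduction
    hdbscan_model=River(cluster.DBSTREAM()),      # incremental clustering
    vectorizer_model=OnlineCountVectorizer(),     # incremental bag-of-words
    representation_model=OpenAI(client, model="gpt-3.5-turbo"),
)

# document_chunks: an iterable of lists of strings (assumed to exist)
for chunk in document_chunks:
    topic_model.partial_fit(chunk)
```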
And I checked the prompts that are being sent to OpenAI. I found that each partial_fit call uses only the documents of that particular iteration. That makes sense in a topic-modelling scenario, but it makes the representation_model somewhat problematic: a topic may receive no documents in a given iteration, in which case the prompt that is sent contains no documents at all. Is there some possibility to make it that the representation_model is using all documents (also those ones from the past)?