Open 27pchrisl opened 1 year ago
In online topic modeling, specifically with .partial_fit, the extraction of representative documents was never meant to be supported. The notion of a representative document changes over time as new clusters are found and old clusters are updated. A user does not necessarily have access to all documents at all times, so recalculating representative documents is not straightforward.
Moreover, the use of private functions is not recommended and will not be supported in future development. Private functions are free to change over time and are not part of the public functionality of BERTopic. Any use of private functions is at the user's own risk.
Hi,
I'm using online topic modelling with River, calling it like this (e.g. for a list of 200 docs)
After the second call I want to check whether River generated new clusters and, if so, retrieve the representative docs for them. BERTopic does not do this after partial_fit, so I am running it manually. However, running the save method the second time results in an exception:
This appears to happen because a cluster generated in the first partial_fit call has zero samples in the second set of docs. In this case it would be great if the library skipped the empty cluster and just emitted the representative docs it does have.
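
To make the requested behaviour concrete, here is a minimal user-side sketch in plain Python. It is not BERTopic's implementation and calls no BERTopic or River API. It assumes you keep the documents of the current batch and the per-document topic ids yourself. The function names and the n_docs parameter are hypothetical, and "representative" here simply means the first few documents of each cluster in the batch, a placeholder for whatever ranking you actually want.

```python
from collections import defaultdict


def group_docs_by_topic(docs, topics):
    """Map each topic id to the documents assigned to it in this batch."""
    grouped = defaultdict(list)
    for doc, topic in zip(docs, topics):
        grouped[topic].append(doc)
    return dict(grouped)


def representative_docs_for_batch(docs, topics, known_topics, n_docs=3):
    """Return (new_topics, representative) for a single batch.

    known_topics: topic ids seen in earlier batches (tracked by the caller).
    Topics that have no documents in this batch are skipped rather than
    raising an error.
    """
    grouped = group_docs_by_topic(docs, topics)
    new_topics = sorted(set(grouped) - set(known_topics))

    representative = {}
    for topic in sorted(set(known_topics) | set(grouped)):
        batch_docs = grouped.get(topic, [])
        if not batch_docs:
            # Cluster created in an earlier batch, zero samples now: skip it.
            continue
        # Placeholder selection: take the first n_docs documents.
        representative[topic] = batch_docs[:n_docs]
    return new_topics, representative


# Hypothetical topic ids, standing in for the output of two partial_fit calls.
first_topics = [0, 0, 1]
second_docs = ["doc d", "doc e", "doc f"]
second_topics = [0, 2, 2]

new_topics, reps = representative_docs_for_batch(
    second_docs, second_topics, known_topics=set(first_topics)
)
print(new_topics)  # topic ids that first appear in the second batch
print(reps)        # topic 1 is absent because it has no docs in this batch
```

Because the sketch only uses the batch you pass in, it avoids private functions entirely, at the cost of ranking documents by position rather than by similarity to the topic.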