MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6.16k stars 764 forks

Representative docs requests may fail after multiple rounds of partial_fit in online topic modelling #1620

Open 27pchrisl opened 1 year ago

27pchrisl commented 1 year ago

Hi,

I'm using online topic modelling with River, calling it like this (e.g. for a list of 200 docs):

first = docs[:100]
second = docs[100:]

self.topic_model.partial_fit(first)
self.topic_model._save_representative_docs(docs)

self.topic_model.partial_fit(second)
self.topic_model._save_representative_docs(docs)

After the second call I want to check whether River generated new clusters and, if so, retrieve the representative docs for those clusters. BERTopic does not do this after partial_fit, so I am running it manually. However, calling the save method a second time raises an exception:

        if ensure_min_samples > 0:
            n_samples = _num_samples(array)
            if n_samples < ensure_min_samples:
>               raise ValueError(
                    "Found array with %d sample(s) (shape=%s) while a"
                    " minimum of %d is required%s."
                    % (n_samples, array.shape, ensure_min_samples, context)
                )
E               ValueError: Found array with 0 sample(s) (shape=(0, 1018)) while a minimum of 1 is required by the normalize function.

.venv/lib/python3.11/site-packages/sklearn/utils/validation.py:967: ValueError

This looks to be caused by a cluster that was generated in the first partial_fit call having zero samples in the second set of docs. In this case it would be great if the library skipped the empty cluster and just emitted the representative docs it does have.
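To make the requested behaviour concrete, here is a minimal sketch of selecting representative docs while skipping clusters that have no members. It is not BERTopic's implementation: the function names, the plain-list vectors, and the cosine ranking are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def representative_docs(docs, doc_vectors, labels, topic_vectors, top_n=3):
    """Hypothetical helper: return {topic: [docs]} for topics with at least one doc."""
    # Group document indices by their assigned topic label.
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        members[label].append(idx)

    result = {}
    for topic, topic_vec in topic_vectors.items():
        idxs = members.get(topic, [])
        if not idxs:
            # Empty cluster: skip it instead of normalizing a (0, n) array.
            continue
        # Rank the cluster's docs by similarity to the topic vector, best first.
        ranked = sorted(
            idxs, key=lambda i: cosine(doc_vectors[i], topic_vec), reverse=True
        )
        result[topic] = [docs[i] for i in ranked[:top_n]]
    return result


# Toy data (made up): topic 2 exists but has no documents assigned to it.
docs = ["cats purr", "dogs bark", "cats meow"]
doc_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
labels = [0, 1, 0]
topic_vectors = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.5, 0.5]}

reps = representative_docs(docs, doc_vectors, labels, topic_vectors, top_n=2)
```

With this guard, topic 2 is simply absent from the result rather than triggering the ValueError shown above.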

MaartenGr commented 1 year ago

In online topic modeling, specifically with .partial_fit, the extraction of representative documents was never meant to be supported. The notion of a representative document changes over time as new clusters are found and old clusters are updated. A user also does not necessarily have access to all documents at all times, so re-calculating representative documents is not straightforward.
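One way a user might work around this on their own side, without keeping every document, is to maintain a small running store of the best-scoring documents per topic as batches arrive. The sketch below is only an illustration under stated assumptions: the class `RunningRepresentatives` and its methods are hypothetical, and the per-document `scores` are assumed to be supplied by the caller (for example, a similarity to the assigned topic at the time of the batch). Because topics drift between batches, scores from different rounds are not strictly comparable, which is exactly the difficulty described above.

```python
import heapq
from collections import defaultdict


class RunningRepresentatives:
    """Hypothetical store keeping the top_n highest-scoring docs per topic."""

    def __init__(self, top_n=3):
        self.top_n = top_n
        # topic -> min-heap of (score, counter, doc)
        self._heaps = defaultdict(list)
        # Tie-breaker so that documents themselves are never compared.
        self._counter = 0

    def update(self, docs, topics, scores):
        """Fold one batch of (doc, topic, score) triples into the store."""
        for doc, topic, score in zip(docs, topics, scores):
            entry = (score, self._counter, doc)
            self._counter += 1
            heap = self._heaps[topic]
            if len(heap) < self.top_n:
                heapq.heappush(heap, entry)
            else:
                # Push the new entry, then drop the lowest-scoring one.
                heapq.heappushpop(heap, entry)

    def get(self, topic):
        """Return stored docs for a topic, best first; [] for unseen topics."""
        entries = sorted(self._heaps.get(topic, []), reverse=True)
        return [doc for _, _, doc in entries]


# Toy usage with made-up scores across two batches.
store = RunningRepresentatives(top_n=2)
store.update(["a", "b", "c"], [0, 0, 1], [0.2, 0.9, 0.5])
store.update(["d", "e"], [0, 2], [0.6, 0.7])
```

Here `store.get(0)` keeps only the two best documents seen so far for topic 0, and a topic that received nothing in a given batch just keeps its earlier entries.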

Moreover, using private functions is not recommended, and they will not be supported in future development. Private functions are free to change over time and are not part of BERTopic's public functionality. Any use of private functions is at the user's own risk.