MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Error in transform probabilities #1807

Open anirban-mu opened 8 months ago

anirban-mu commented 8 months ago

I periodically seem to encounter the following error:

Traceback (most recent call last):
  File "<string>", line 4, in <module>
  File "/opt/homebrew/lib/python3.11/site-packages/bertopic/_bertopic.py", line 550, in transform
    probabilities = self._map_probabilities(probabilities, original_topics=True)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/bertopic/_bertopic.py", line 4124, in _map_probabilities
    mapped_probabilities[:, to_topic] += probabilities[:, from_topic]
                                         ~~~~~~~~~~~~~^^^^^^^^^^^^^^^
IndexError: index 14 is out of bounds for axis 1 with size 14

I am unsure how to help debug this because it appears only in some runs. In each case there is a BERTopic model of the form BERTopic(embedding_model=embedding_model, umap_model=umap_model, hdbscan_model=hdbscan_model, representation_model=representation_model, calculate_probabilities=True); I have fitted the model successfully using fit_transform and then called transform to compute topics and probabilities on a new sample. In each case I provide both the documents and the embeddings. The code operates over a collection of sets of documents, so it is run as follows:

for key in topic_models:
    topics[key], _ = topic_models[key].fit_transform(datasets[key], embeddings[key])

I know the models fit successfully because I can obtain topics from them without any apparent error. It is only when calling transform that the error periodically manifests. Its stochastic appearance suggests it has something to do with the fitted topics, but I am entirely unclear how to debug it.
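
As a first diagnostic, here is a sketch of how the fitted mappings could be inspected before calling transform. It pokes at the internal topic_mapper_ attribute, which, going by the traceback, is what drives the probability mapping, so treat it as internal API:

# Sketch: inspect each fitted model's internal topic mappings.
# topic_mapper_ and its get_mappings method are BERTopic internals,
# so this is for debugging only.
for key, model in topic_models.items():
    mappings = model.topic_mapper_.get_mappings(original_topics=True)
    to_topics = set(mappings.values())
    print(
        key,
        "original topics:", len(mappings),
        "distinct mapped topics:", len(to_topics),
        "max mapped topic:", max(to_topics),
    )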

Looking at the relevant code in _map_probabilities:

# Map array of probabilities (probability for assigned topic per document)
if probabilities is not None:
    if len(probabilities.shape) == 2:
        mapped_probabilities = np.zeros((probabilities.shape[0],
                                         len(set(mappings.values())) - self._outliers))
        for from_topic, to_topic in mappings.items():
            if to_topic != -1 and from_topic != -1:
                mapped_probabilities[:, to_topic] += probabilities[:, from_topic]

        return mapped_probabilities

return probabilities

Is to_topic guaranteed to be sequential, or could there be a gap in the indices? I don't know the code base well enough to say, but len(set(mappings.values())) may be the issue. Maybe something like:

if probabilities is not None:
    if len(probabilities.shape) == 2:
        # Find the maximum 'to_topic' index, ensuring the array is large enough
        max_to_topic = max(mappings.values())

        # Initialize 'mapped_probabilities' with a size based on the maximum index found
        mapped_probabilities = np.zeros((probabilities.shape[0], max_to_topic + 1 - self._outliers))

        for from_topic, to_topic in mappings.items():
            if to_topic != -1 and from_topic != -1:
                # Safely add probabilities, knowing 'mapped_probabilities' has enough columns
                mapped_probabilities[:, to_topic] += probabilities[:, from_topic]

        # If necessary, additional steps to handle outliers or resize the array can be added here

        return mapped_probabilities

In this version, non-sequential indices are handled naturally. I do not, however, know whether non-sequential indices are symptomatic of a deeper issue. HTH.
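
To make the suspected failure mode concrete, here is a toy example with made-up mappings in which the mapped topic ids have a gap; sizing the array by len(set(mappings.values())) then raises the same kind of IndexError:

import numpy as np

# Toy illustration with made-up numbers: if the mapped topic ids skip an
# index, an array sized by len(set(mappings.values())) is too small for the
# largest to_topic.
probabilities = np.random.rand(5, 3)      # 5 documents, 3 original topics
mappings = {-1: -1, 0: 0, 1: 2, 2: 2}     # to_topic == 1 never occurs
outliers = 1                              # an outlier topic (-1) exists

n_cols = len(set(mappings.values())) - outliers   # 3 distinct values -> 2 cols
mapped = np.zeros((probabilities.shape[0], n_cols))
for from_topic, to_topic in mappings.items():
    if to_topic != -1 and from_topic != -1:
        mapped[:, to_topic] += probabilities[:, from_topic]  # IndexError here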

I should note that I am unclear about exactly what self._outliers does, so I left it in. Without it I would simply have used max_to_topic + 1, but I kept self._outliers because I have not had the time to look carefully at what it represents.

MaartenGr commented 8 months ago

Hmmm, it is difficult to say without seeing how you instantiated your models. Could you share your full code for that? There might be something going on with the variant of BERTopic that you are using or with any other changes you might have made to the model.

anirban-mu commented 8 months ago

I can't share the original code because the variable names etc. are all linked to things I cannot share. I have tried to create a MWE by removing those elements and replacing them (particularly the data), but I am unable to reproduce the error. Sorry, I know this is crucial to debugging, but whenever I try to create a MWE I end up with a fairly generic version that works.

MaartenGr commented 8 months ago

Hmmm, this is quite difficult. Without a way for me to reproduce the issue, I am not sure I can uncover what the exact problem is. It is like looking for a needle in a haystack without knowing what the needle actually is.

Let's approach it a bit differently then. Could you share what is inside BERTopic(embedding_model=embedding_model, umap_model=umap_model, hdbscan_model=hdbscan_model, representation_model=representation_model, calculate_probabilities=True)? These variables might give some clues.

anirban-mu commented 8 months ago

This is the code after stripping out variable names etc., so there is a non-zero probability that errors were introduced in the changes. In short, two models are run on each df: one on column1 and the other on column2. The object of interest is the topic probabilities these models produce when given both columns concatenated; this way the model estimated on column1 is used to assign probabilities for both column1 and column2, and likewise for the model on column2. Thus, for two columns and two datasets, I end up with four matrices of probabilities. The embeddings are precomputed and saved, and are stacked vertically to mirror the concatenation of the inputs.

# Import necessary libraries
import ast
import openai
import numba
import numpy as np
from openai import OpenAI
import pandas as pd
import umap.umap_ as umap
import sys
import os

from hdbscan import HDBSCAN
from bertopic import BERTopic
from bertopic.backend import OpenAIBackend
from bertopic.representation import OpenAI as OAI

openai.api_key = ""
OAI_client = OpenAI(
    api_key="",
)

# Step 1 - Extract embeddings using an OpenAI embedding model
# Changing embedding_model does not make a difference AFAIK
embedding_model = OpenAIBackend(
    embedding_model="text-embedding-3-large", delay_in_seconds=1, batch_size=1024
)

# Step 2 - Reduce dimensionality using UMAP
# UMAP parameters are chosen based on dataset characteristics and desired dimensionality reduction
umap_model = umap.UMAP(
    n_neighbors=2500, n_components=72, min_dist=0.01, metric="cosine"
)

# Step 3 - Cluster reduced embeddings using HDBSCAN
# The 'leaf' method is used for cluster selection for potentially better-defined clusters
hdbscan_model = HDBSCAN(
    cluster_selection_method="leaf", min_cluster_size=125, prediction_data=True
)

prompt_text = "Identify the primary topic in the reviews represented by the following documents and keywords: [DOCUMENTS] [KEYWORDS]. Provide only the topic label."

# Step 4 - Determine Topic representations using GPT-4 from OpenAI
# Changing model does not make a difference AFAIK
representation_model = OAI(
    client=OAI_client,
    model="gpt-4-turbo-preview",
    chat=True,
    exponential_backoff=True,
    nr_docs=12,
    prompt=prompt_text,
)

# Dictionary to hold BERTopic models
topic_models = {
    "a1": BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        representation_model=representation_model,
        calculate_probabilities=True,
    ),
    "a2": BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        representation_model=representation_model,
        calculate_probabilities=True,
    ),
    "b1": BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        representation_model=representation_model,
        calculate_probabilities=True,
    ),
    "b2": BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        representation_model=representation_model,
        calculate_probabilities=True,
    ),
    ### Dictionary has more models
}

# Dictionary to hold datasets
datasets = {
    "a1": some_documents,
    "a2": some_documents,
    "b1": some_documents,
    "b2": some_documents,
    ### More data
}

def load_embeddings(base_path, file_name):
    # Load the DataFrame from a pickle file
    df = pd.read_pickle(f"{base_path}/embedding_{file_name}.pkl")
    # Assuming the embeddings are already lists in the first column, directly convert to a NumPy array
    numpy_array = np.array([row for row in df.iloc[:, 0]])
    return numpy_array

# Load embeddings and process with UMAP
base_path = ""
embedding_names = [
    "a1",
    "a2",
    "b1",
    "b2",
]  # More names
embeddings = {name: load_embeddings(base_path, name) for name in embedding_names}

# Fit and transform the BERTopic models
topics = {}
original_probabilities = {}
for key in topic_models:
    topics[key], original_probabilities[key] = topic_models[key].fit_transform(
        datasets[key], embeddings[key]
    )

# Model-key pairs and the DataFrame holding both text columns
combined_datasets = {"big_a": ("a1", "a2", df_a), "big_b": ("b1", "b2", df_b)}

# Process each dataset
for name, (key1, key2, df) in combined_datasets.items():
    # Concatenate 'column1' and 'column2' columns
    combined_df = pd.concat(
        [df["column1"].to_frame(name="data"), df["column2"].to_frame(name="data")],
        axis=0,
    )
    setattr(sys.modules[__name__], f"combined_{name}", combined_df)

    # Concatenate embeddings
    combined_embedding = np.vstack([embeddings[key1], embeddings[key2]])
    setattr(sys.modules[__name__], f"combined_{name}_embedding", combined_embedding)

# Initialize dictionaries to store probabilities
probabilities_dict = {}

# Perform predictions using the models from the dictionary
for dataset in ["big_a", "big_b"]:
    for model_key in ["1", "2"]:
        # Build keys like "a1", "b2" from the dataset suffix and model number
        key = f"{dataset[-1]}{model_key}"
        _, probabilities_dict[key] = topic_models[key].transform(
            documents=getattr(sys.modules[__name__], f"combined_{dataset}"),
            embeddings=getattr(sys.modules[__name__], f"combined_{dataset}_embedding"),
        )

MaartenGr commented 8 months ago

In all honesty, I do not see anything in your code that might explain this issue. It should work, and I am quite surprised that it does not. There might be a workaround, though.

If you save the model using safetensors or pytorch and then load the model back in, the method for performing the prediction changes, which might prevent the issue from arising.
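
Roughly along these lines (a sketch; the paths are placeholders, and with an OpenAI backend you may need to re-attach the embedding model after loading):

from bertopic import BERTopic

# Sketch: save each fitted model with safetensors and reload it. A model
# loaded this way predicts via similarity to the topic embeddings rather
# than through hdbscan's prediction functions.
for key, model in topic_models.items():
    model.save(f"model_{key}", serialization="safetensors", save_ctfidf=True)

loaded_models = {key: BERTopic.load(f"model_{key}") for key in topic_models}
# Then call .transform(documents, embeddings) on the loaded models as before.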

anirban-mu commented 8 months ago

Ok, let me try that. I think it has something to do with the object passed back by UMAP/HDBSCAN, because when I change parameters it seems to fail more or less often. Thanks for looking into it.

MaartenGr commented 8 months ago

No problem, let me know if it works out!

anirban-mu commented 8 months ago

I am unable to resolve it. It has something to do with what HDBSCAN returns, as I am fairly certain everything is fine up to the clustering step. From there, failure happens unpredictably when the corresponding transform method is called in HDBSCAN and the probabilities are then mapped.

I don't think the issue is in HDBSCAN itself, but my conjecture is that HDBSCAN may assign no new documents to some topic, so that the returned probabilities array is smaller (maybe zeroed-out columns are dropped?). I say this because the single most predictive parameter is min_cluster_size: when it is larger (corresponding to clusters that apply to many documents, and hence likely to many new documents), I am less likely to see an error; when it is smaller, errors become more frequent.
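
A check along these lines might confirm this (a sketch with placeholder names; it redoes the UMAP step by hand and calls hdbscan's prediction function directly):

import hdbscan

# Sketch: compare the number of columns hdbscan returns at prediction time
# with the number of non-outlier topics BERTopic fitted. new_embeddings is a
# placeholder for the precomputed embeddings of the new documents.
model = topic_models["a1"]
umap_embeddings = model.umap_model.transform(new_embeddings)
membership = hdbscan.membership_vector(model.hdbscan_model, umap_embeddings)

topics_fitted = set(model.get_topics().keys())
n_topics = len(topics_fitted - {-1})
print("membership columns:", membership.shape[1], "fitted topics:", n_topics)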

As I cannot find a way to chase this down, I am going to leave the bug open in the hope that someone more familiar with the code base can track it down.

MaartenGr commented 8 months ago

Thank you for sharing this! Hopefully, someone else can help out by creating a reproducible example to track down the issue. Indeed, let's keep this open and see if others can provide some help.