MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
5.76k stars 716 forks source link

`transform` method not handling single embeddings or strings given to it. #2018

Open 1jamesthompson1 opened 1 month ago

1jamesthompson1 commented 1 month ago

Issue

Currently the transform method of the BERTopic has to recieve a list of strings as the documents and if embeddings are present then it needs to be a 2darray of shape (len(documents, embedding_dimension). This requires extra reshape calls when trying to call the transform method on a single document.

For example it would be nicer to do this:

document = 'This is a really interesting document'

document_embedding = np.random.rand(768)

topic_model.transform(document, document_embedding)
Example of what you currently have to do ```python document = 'This is a really interesting document' document_embedding = np.random.rand(768) topic_model.transform([document], document_embedding.reshape(1,-1)) ```

Currently if you run it you get this error.

BFile ~/code/BERTopic/bertopic/_utils.py:55, in check_embeddings_shape(embeddings, docs)
      else:
         if embeddings.shape[0] != len(docs):
--->         raise ValueError("Make sure that the embeddings are a numpy array with shape: "
                               "(len(docs), vector_dim) where vector_dim is the dimensionality "
                               "of the vector embeddings. ")

ValueError: Make sure that the embeddings are a numpy array with shape: (len(docs), vector_dim) where vector_dim is the dimensionality of the vector embeddings.

Why I think it is happening

This is because the embeddings shape check inside transform happens before:

  1. the document argument is converted into a list if it is str
  2. The embedding is reshaped from a 1d array to 2d array (Note that it currently never does this).

Thoughts on making it better Am I missing something here about np arrays or the transform function? It seems to me that wanting to transform a single document is common enough that the function could have that little bit of extra functionality.

I have already made the change on a fork for my uses did you agree with the idea and would you like a PR?

MaartenGr commented 1 month ago

It seems to me that wanting to transform a single document is common enough that the function could have that little bit of extra functionality.

It's interesting but in my experience, it rarely happens that users try it on a single document and also pass an embedding. Transforming a single document to an embedding is incredibly cheap, so there is no need to then also pass the embedding. I think that's why this issue hasn't been caught until now if I remember most issues of the last couple of years well enough.

Either way, a fix should indeed be necessary here to transform an embedding into the right shape when there is only one document.