MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Zero shot topic modelling #2168

Open ankitkr3 opened 1 month ago

ankitkr3 commented 1 month ago

Have you searched existing issues? 🔎

Describe the bug

I am getting a ValueError with no obvious cause: ValueError: Found array with 0 sample(s) (shape=(0, 1536)) while a minimum of 1 is required.

I have checked that the embeddings work fine on test inputs.

import openai
from bertopic import BERTopic
from bertopic.backend import OpenAIBackend
from bertopic.representation import OpenAI
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder key from the original report; replace with a real API key
my_key = "12323m2em2rm,2lr,2.f,."
client = openai.OpenAI(api_key=my_key)

# Embed documents with the OpenAI embedding backend
embedding_model = OpenAIBackend(client, "text-embedding-ada-002")

summarization_prompt = """
I have a topic that is described by the following keywords: [KEYWORDS]
In this topic, the following documents are a small but representative subset of all documents in the topic:
[DOCUMENTS]

Based on the information above, please give a description of this topic in one statement in the following format:
topic: <description>
"""

# Label each topic with a description generated by the chat model
representation_model = OpenAI(client=client, model="gpt-4o", chat=True, prompt=summarization_prompt,
                              nr_docs=5, delay_in_seconds=3)

# Note: defined in the original report but never passed to BERTopic
vectorizer_model = CountVectorizer(min_df=1)

# zeroshot_topic_list and df are defined elsewhere (not shown in the original report)
topic_model = BERTopic(
    embedding_model=embedding_model,
    min_topic_size=25,
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=0,
    representation_model=representation_model
)

# fit_transform returns both the topic assignments and their probabilities
topics, probs = topic_model.fit_transform(df['title'].values)

Reproduction

from bertopic import BERTopic

BERTopic Version

0.16.3

MaartenGr commented 1 month ago

Thanks for sharing. Could you add the full error log? Without it, it is difficult for me to say where exactly it is going wrong. Also, did you make sure that the documents are a list and not a pandas Series?

ankitkr3 commented 1 month ago

Full Error:

`OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[11], line 1
----> 1 topics, probs = topic_model.fit_transform(docs)

File ~/Library/Python/3.12/lib/python/site-packages/bertopic/_bertopic.py:457, in BERTopic.fit_transform(self, documents, embeddings, images, y)
    453     documents, embeddings, assigned_documents, assigned_embeddings = self._zeroshot_topic_modeling(
    454         documents, embeddings
    455     )
    456     # Filter UMAP embeddings to only non-assigned embeddings to be used for clustering
--> 457     umap_embeddings = self.umap_model.transform(embeddings)
    459 if len(documents) > 0:  # No zero-shot topics matched
    460     # Cluster reduced embeddings
    461     documents, probabilities = self._cluster_embeddings(umap_embeddings, documents, y=y)

File ~/Library/Python/3.12/lib/python/site-packages/umap/umap_.py:2935, in UMAP.transform(self, X, force_all_finite)
   2933     X = check_array(X, dtype=np.uint8, order="C", force_all_finite=force_all_finite)
   2934 else:
-> 2935     X = check_array(X, dtype=np.float32, accept_sparse="csr", order="C", force_all_finite=force_all_finite)
   2936 x_hash = joblib.hash(X)
   2937 if x_hash == self._input_hash:

File ~/Library/Python/3.12/lib/python/site-packages/sklearn/utils/validation.py:1087, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_writeable, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
   1085     n_samples = _num_samples(array)
   1086     if n_samples < ensure_min_samples:
-> 1087         raise ValueError(
   1088             "Found array with %d sample(s) (shape=%s) while a"
   1089             " minimum of %d is required%s."
   1090             % (n_samples, array.shape, ensure_min_samples, context)
   1091         )
   1093 if ensure_min_features > 0 and array.ndim == 2:
   1094     n_features = array.shape[1]

ValueError: Found array with 0 sample(s) (shape=(0, 1536)) while a minimum of 1 is required.`

The data is a list of strings, for example: ['Mahesh babu home tour | Mahesh babu dattata village |#maheshbabu #superstar', 'ROSHAN LATEST HIMACHALI SONG 2021 VINAY SAGAR ATUL SHARMA']

ankitkr3 commented 1 month ago

@MaartenGr ??

MaartenGr commented 1 month ago

@ankitkr3 I want to help everyone out as much as possible on this repository but I should mention that I am just a single developer providing all this work for free. This means it might take me a couple of days to respond since I work on this in the evenings and weekends.

Replying with ?? feels like my work and effort on this are not appreciated, so I ask you to be patient in the future.

Regarding the issue, it seems that the structure of the embeddings is the main problem, which might be a result of either the format of the documents or the embedding model. Can you try it again without using embedding_model? This helps me understand whether it is the OpenAI backend that is the issue.

Also, I see that you use df['title'].values, which gives back a NumPy array and not a list if I'm not mistaken. If you indeed passed a list of strings, then I wonder whether you actually used df['title'].values and not something else. Either way, perhaps using df['title'].values.tolist() solves the issue.
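
A minimal sketch of the conversion suggested above, using only the standard library. The helper name to_document_list is hypothetical and not part of BERTopic; it calls .tolist() when the input has one (as NumPy arrays and pandas Series do) and otherwise builds a plain list.

```python
from typing import Iterable, List


def to_document_list(docs: Iterable) -> List[str]:
    """Return the documents as a plain Python list of strings.

    Hypothetical helper for illustration; not part of BERTopic.
    """
    if isinstance(docs, str):
        raise TypeError("expected a collection of documents, got a single string")
    # NumPy arrays and pandas Series both expose .tolist()
    if hasattr(docs, "tolist"):
        docs = docs.tolist()
    # str() also coerces non-string entries (such as missing values) into text
    return [str(doc) for doc in docs]


titles = to_document_list(("I am going home", "I am going to the store"))
print(type(titles).__name__, len(titles))
```

With a DataFrame, the call would look like to_document_list(df['title'].values), and the result can be passed to fit_transform.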

ankitkr3 commented 1 month ago

I'm extremely sorry if it came across that way. Let me try that and I'll let you know.


ankitkr3 commented 1 month ago

@MaartenGr I'm still getting the same error after trying df['title'].values.tolist(). Could there be anything else?

MaartenGr commented 1 month ago

I can think of two other things.

First, have you tried it with a non-OpenAI embedding model? So simply without using the embedding_model at all for example.

Second, there might be an issue with zero-shot topic modeling for which I just pushed a new release that includes a fix. Using BERTopic v0.16.4 might help.
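
The traceback earlier in this thread fails at self.umap_model.transform(embeddings) with an array of shape (0, 1536), meaning no embeddings were left after the zero-shot step. One possible reading, offered as an assumption rather than a confirmed diagnosis, is that with zeroshot_min_similarity=0 every document clears the threshold and is assigned to a zero-shot topic, so nothing remains for clustering. The sketch below illustrates only that thresholding idea in plain Python; split_by_similarity and the toy vectors are hypothetical and are not BERTopic's actual implementation.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def split_by_similarity(
    doc_vecs: Sequence[Sequence[float]],
    topic_vecs: Sequence[Sequence[float]],
    min_similarity: float,
) -> Tuple[List[int], List[int]]:
    """Return (assigned, unassigned) document indices.

    A document counts as assigned when its best similarity to any
    predefined topic reaches the threshold (hypothetical rule).
    """
    assigned, unassigned = [], []
    for i, doc in enumerate(doc_vecs):
        best = max(cosine(doc, topic) for topic in topic_vecs)
        (assigned if best >= min_similarity else unassigned).append(i)
    return assigned, unassigned


# Toy 2-D "embeddings" with non-negative similarities to the single topic
docs = [[1.0, 0.1], [0.2, 1.0], [0.7, 0.7]]
topics = [[1.0, 0.0]]

for threshold in (0.0, 0.5):
    assigned, unassigned = split_by_similarity(docs, topics, threshold)
    print(threshold, "assigned:", assigned, "left for clustering:", unassigned)
```

In this toy setup a threshold of 0 leaves no documents for clustering, while 0.5 leaves one. Whether raising zeroshot_min_similarity avoids the error in a given BERTopic version is not verified here.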

yanivc-jfrog commented 2 weeks ago

I'm on BERTopic v0.16.4 and the error persists.

MaartenGr commented 2 weeks ago

@yanivc-jfrog Did you try it with a non-OpenAI embedding model? So simply without using the embedding_model at all for example.

Also, do you perhaps have a reproducible example that I can test locally?

yanivc-jfrog commented 2 weeks ago

Yes, I did it with this embedding model "avsolatorio/GIST-small-Embedding-v0" using SentenceTransformer, but even "sentence-transformers/all-mpnet-base-v2" returned the same problem. I also tried removing all parameters and just called it with all the defaults.

topic_model = BERTopic()
topic, topic_proba = topic_model.fit_transform(['I am going home', 'I am going to the store'])

The above still returns that warning - and also never finishes (stopped manually after 3 minutes)

yanivc-jfrog commented 2 weeks ago

topic_model = BERTopic()
topic, topic_proba = topic_model.fit_transform([
    'I am going home',
    'I am going to the supermarket',
    'I am going to the gym',
    'I am going to the store',
    'I am going to the court for a legal issue',
    'I am going to the groceries store',
    'I am going to the football court',
    'I am going to the basketball court',
    'I am going to the supermarket',
    'I am going to the gym',
])

The above works for me in Google Colab (without a warning), but not in my Jupyter notebook locally (warning + stuck forever), though both (Colab & my local notebook) have v0.16.4

MaartenGr commented 2 weeks ago

@yanivc-jfrog

The above still returns that warning - and also never finishes (stopped manually after 3 minutes)

What warning? The OP mentions an error and not a warning. So how could the model then never finish if it encounters an error?

Can you please create a full example, including the entire code and error log?

The above works for me in Google Colab (without a warning), but not in my Jupyter notebook locally (warning + stuck forever), though both (Colab & my local notebook) have v0.16.4

Have you tried installing BERTopic from a completely fresh environment? Based on your description, it seems there are issues with your environment. Starting new typically helps.
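
When comparing a working environment (Colab) with a failing one (a local notebook), it can help to list the installed versions of the relevant packages in both. A small sketch using only the standard library; the package list is an assumption about which dependencies matter here, and installed_versions is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable

# Assumed list of packages worth comparing; adjust as needed
PACKAGES = (
    "bertopic",
    "umap-learn",
    "hdbscan",
    "scikit-learn",
    "numpy",
    "sentence-transformers",
)


def installed_versions(packages: Iterable[str] = PACKAGES) -> Dict[str, str]:
    """Map each package name to its installed version, or 'not installed'."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = "not installed"
    return report


for name, ver in installed_versions().items():
    print(f"{name}: {ver}")
```

Running this in both environments and comparing the output line by line shows whether anything other than BERTopic itself differs.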