ankitkr3 opened 1 month ago
Thanks for sharing. Could you add the full error log? Without it, it is difficult for me to say where exactly it is going wrong. Also, did you make sure that the documents are a list and not a pandas Series?
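A quick sanity check would be something along these lines (a sketch; `docs` stands for whatever you pass to `fit_transform`):

```python
# The input should be a plain Python list of strings,
# not a pandas Series or a numpy array.
print(type(docs))     # expect <class 'list'>
print(type(docs[0]))  # expect <class 'str'>
```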
Full Error:
```
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[11], line 1
----> 1 topics, probs = topic_model.fit_transform(docs)

File ~/Library/Python/3.12/lib/python/site-packages/bertopic/_bertopic.py:457, in BERTopic.fit_transform(self, documents, embeddings, images, y)
    453 documents, embeddings, assigned_documents, assigned_embeddings = self._zeroshot_topic_modeling(
    454     documents, embeddings
    455 )
    456 # Filter UMAP embeddings to only non-assigned embeddings to be used for clustering
--> 457 umap_embeddings = self.umap_model.transform(embeddings)
    459 if len(documents) > 0:  # No zero-shot topics matched
    460     # Cluster reduced embeddings
    461     documents, probabilities = self._cluster_embeddings(umap_embeddings, documents, y=y)

File ~/Library/Python/3.12/lib/python/site-packages/umap/umap_.py:2935, in UMAP.transform(self, X, force_all_finite)
   2933     X = check_array(X, dtype=np.uint8, order="C", force_all_finite=force_all_finite)
   2934 else:
-> 2935     X = check_array(X, dtype=np.float32, accept_sparse="csr", order="C", force_all_finite=force_all_finite)
   2936 x_hash = joblib.hash(X)
   2937 if x_hash == self._input_hash:

File ~/Library/Python/3.12/lib/python/site-packages/sklearn/utils/validation.py:1087, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_writeable, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
   1085 n_samples = _num_samples(array)
   1086 if n_samples < ensure_min_samples:
-> 1087     raise ValueError(
   1088         "Found array with %d sample(s) (shape=%s) while a"
   1089         " minimum of %d is required%s."
   1090         % (n_samples, array.shape, ensure_min_samples, context)
   1091     )
   1093 if ensure_min_features > 0 and array.ndim == 2:
   1094     n_features = array.shape[1]

ValueError: Found array with 0 sample(s) (shape=(0, 1536)) while a minimum of 1 is required.
```
The data is a list of strings, like `['Mahesh babu home tour | Mahesh babu dattata village |#maheshbabu #superstar', 'ROSHAN LATEST HIMACHALI SONG 2021 VINAY SAGAR ATUL SHARMA']`.
@MaartenGr ??
@ankitkr3 I want to help everyone out as much as possible on this repository, but I should mention that I am just a single developer providing all this work for free. This means it might take me a couple of days to respond since I work on this in the evenings and weekends.
Replying with "??" feels like my work and effort on this are not appreciated, so I ask you to be patient in the future.
Regarding the issue, it seems that the structure of the embeddings is the main problem, which might be a result of either the format of the documents or the embedding model. Can you try it again without using `embedding_model`? This helps me understand whether it is the OpenAI backend that is the issue.
Also, I see that you use `df['title'].values`, which gives back a numpy array and not a list if I'm not mistaken. If you indeed passed a list of strings, then I wonder whether you actually used `df['title'].values` and not something else. Either way, perhaps using `df['title'].values.tolist()` solves the issue.
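For example, something along these lines guarantees a plain Python list of strings (a sketch; `df` and the `title` column are taken from your snippet):

```python
# Convert the pandas column to a plain list of Python strings first.
docs = df["title"].astype(str).tolist()
assert isinstance(docs, list) and all(isinstance(d, str) for d in docs)

topics, probs = topic_model.fit_transform(docs)
```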
Extremely sorry if you felt that way. Let me try it that way and let you know.
@MaartenGr still getting the same error after trying with `df['title'].values.tolist()`. Could there be anything else?
I can think of two other things.
First, have you tried it with a non-OpenAI embedding model? So, simply without using `embedding_model` at all, for example.
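A minimal sketch of that test, assuming `docs` is your list of titles, so BERTopic falls back to its default sentence-transformers backend:

```python
from bertopic import BERTopic

# No embedding_model passed: the default backend is used
# instead of the OpenAI one.
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
```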
Second, there might be an issue with zero-shot topic modeling, for which I just pushed a new release that includes a fix. Using BERTopic v0.16.4 might help.
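After upgrading (the standard `pip install --upgrade bertopic`), it is worth double-checking which version the notebook kernel actually imports; a quick sketch:

```python
# Run after upgrading and restarting the kernel.
import bertopic
print(bertopic.__version__)  # expect 0.16.4 (or later)
```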
I'm on BERTopic v0.16.4 and the error persists.
@yanivc-jfrog Did you try it with a non-OpenAI embedding model? So, simply without using `embedding_model` at all, for example.
Also, do you perhaps have a reproducible example that I can test locally?
Yes, I did it with the embedding model "avsolatorio/GIST-small-Embedding-v0" using SentenceTransformer, but even "sentence-transformers/all-mpnet-base-v2" returned the same problem. I also tried removing all parameters and just calling it with all the defaults.
```python
topic_model = BERTopic()
topic, topic_proba = topic_model.fit_transform(['I am going home', 'I am going to the store'])
```
The above still returns that warning - and also never finishes (stopped manually after 3 minutes)
```python
topic_model = BERTopic()
topic, topic_proba = topic_model.fit_transform([
    'I am going home',
    'I am going to the supermarket',
    'I am going to the gym',
    'I am going to the store',
    'I am going to the court for a legal issue',
    'I am going to the groceries store',
    'I am going to the football court',
    'I am going to the basketball court',
    'I am going to the supermarket',
    'I am going to the gym',
])
```

The above works for me in Google Colab (without a warning), but not in my Jupyter notebook locally (warning + stuck forever), though both (Colab & my local notebook) have v0.16.4.
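One possibly relevant detail (an assumption on my part, not something verified here): the two-document call above gives UMAP and HDBSCAN far fewer samples than their defaults expect. A sketch with reduced settings for a tiny test corpus:

```python
from bertopic import BERTopic
from hdbscan import HDBSCAN
from umap import UMAP

# Illustrative values for a handful of documents only; the defaults
# (e.g. UMAP's n_neighbors=15) assume a much larger corpus.
umap_model = UMAP(n_neighbors=2, n_components=2, min_dist=0.0, random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=2)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(docs)  # docs: the tiny list above
```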
@yanivc-jfrog
> The above still returns that warning - and also never finishes (stopped manually after 3 minutes)
What warning? The OP mentions an error, not a warning. And how could the model never finish if it encounters an error?
Can you please create a full example, including the entire code and error log?
> The above works for me in Google Colab (without a warning), but not in my Jupyter notebook locally (warning + stuck forever), though both (Colab & my local notebook) have v0.16.4.
Have you tried installing BERTopic in a completely fresh environment? Based on your description, it seems there are issues with your environment. Starting fresh typically helps.
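To compare the two setups, the same quick version dump in Colab and in the local notebook can surface mismatches (a sketch listing the packages that appear in the traceback):

```python
# Run in both environments and compare the output side by side.
import bertopic, hdbscan, numpy, sklearn, umap

for mod in (bertopic, umap, hdbscan, sklearn, numpy):
    print(mod.__name__, mod.__version__)
```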
Have you searched existing issues? 🔎
Describe the bug
Getting a ValueError for an unclear reason: `ValueError: Found array with 0 sample(s) (shape=(0, 1536)) while a minimum of 1 is required.`
I have checked that the embeddings are working fine in test runs.
Reproduction
BERTopic Version
0.16.3