arcadiahero commented 5 months ago

I barely changed the zero-shot example code, but I get this error. Could you help me solve it? Thanks.

from datasets import load_dataset
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired

# We select a subsample of 5000 abstracts from ArXiv
dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
docs = dataset["abstract"][:5_000]

# We define a number of topics that we know are in the documents
zeroshot_topic_list = ["Clustering", "Topic Modeling", "Large Language Models"]

# We fit our model using the zero-shot topics
# and we define a minimum similarity. For each document,
# if the similarity does not exceed that value, it will be used
# for clustering instead.
topic_model = BERTopic(
    embedding_model="thenlper/gte-small",
    min_topic_size=15,
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=.85,
    representation_model=KeyBERTInspired()
)
topics, _ = topic_model.fit_transform(docs)

MaartenGr commented 5 months ago

I believe this is a result of setting zeroshot_min_similarity too high. If you lower the value, the issue might resolve itself.
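
To see why the threshold matters, here is a minimal, self-contained sketch of how a similarity cutoff splits documents between the zero-shot topics and the clustering step. It is not BERTopic's internal code: the toy 2-D vectors, the `split_by_similarity` helper, and the thresholds are all hypothetical, and real embeddings would come from the embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def split_by_similarity(doc_vectors, topic_vectors, min_similarity):
    """Hypothetical helper: return (assigned, unassigned) document indices.

    A document goes to a zero-shot topic when its best cosine similarity
    to any topic vector exceeds min_similarity; otherwise it is left for
    clustering.
    """
    assigned, unassigned = [], []
    for i, doc in enumerate(doc_vectors):
        best = max(cosine(doc, topic) for topic in topic_vectors)
        if best > min_similarity:
            assigned.append(i)
        else:
            unassigned.append(i)
    return assigned, unassigned


# Toy embeddings (assumed values, for illustration only).
topics = [[1.0, 0.0], [0.0, 1.0]]
docs = [[0.9, 0.1], [0.6, 0.5], [0.5, 0.5]]

for threshold in (0.5, 0.85, 0.999):
    assigned, unassigned = split_by_similarity(docs, topics, threshold)
    print(threshold, "zero-shot:", assigned, "clustering:", unassigned)
```

With the highest threshold in this toy data, no document is assigned to a zero-shot topic, so every document falls through to clustering.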

MaartenGr commented 5 months ago

Note that there is also a preliminary fix available at https://github.com/MaartenGr/BERTopic/pull/1762 which should resolve the issue entirely.

hubernst commented 5 months ago

Hello,

Zero-shot is a perfect extension. Thank you so much. Unfortunately, I have the same problem as described above. I have already added your fix #1688 to _bertopic.py. With zeroshot_min_similarity=0.2 or even 0.8 the code runs, but values in between rarely succeed. Do you have a solution?

# All steps together
topic_model = BERTopic(
    verbose=True,
    min_topic_size=20,
    # nr_topics=5,
    zeroshot_topic_list=kategorien_1,
    zeroshot_min_similarity=.70,
    embedding_model=embedding_model,          # Step 1 - Extract embeddings
    umap_model=umap_model,                    # Step 2 - Reduce dimensionality
    hdbscan_model=hdbscan_model,              # Step 3 - Cluster reduced embeddings
    vectorizer_model=vectorizer_model,        # Step 4 - Tokenize topics
    ctfidf_model=ctfidf_model,                # Step 5 - Extract topic words
    representation_model=representation_model # Step 6 - (Optional) Fine-tune topic representations
)

2024-02-06 13:28:02,189 - BERTopic - Embedding - Transforming documents to embeddings.
100%|██████████| 1341/1341 [02:23<00:00, 9.35it/s]
2024-02-06 13:30:25,711 - BERTopic - Embedding - Completed ✓
2024-02-06 13:30:25,713 - BERTopic - Zeroshot Step 1 - Finding documents that could be assigned to either one of the zero-shot topics
2024-02-06 13:30:26,642 - BERTopic - Zeroshot Step 1 - Completed ✓
2024-02-06 13:30:26,643 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-06 13:30:29,187 - BERTopic - Dimensionality - Completed ✓
2024-02-06 13:30:29,190 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-06 13:30:29,207 - BERTopic - Cluster - Completed ✓
2024-02-06 13:30:29,214 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-02-06 13:30:38,532 - BERTopic - Representation - Completed ✓
2024-02-06 13:30:38,558 - BERTopic - Zeroshot Step 2 - Clustering documents that were not found in the zero-shot model...
2024-02-06 13:30:38,565 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-06 13:30:38,567 - BERTopic - Dimensionality - Completed ✓
2024-02-06 13:30:38,577 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-06 13:30:38,581 - BERTopic - Cluster - Completed ✓
2024-02-06 13:30:38,587 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-02-06 13:31:33,230 - BERTopic - Representation - Completed ✓
2024-02-06 13:31:33,298 - BERTopic - Zeroshot Step 2 - Completed ✓
2024-02-06 13:31:33,299 - BERTopic - Zeroshot Step 3 - Combining clustered topics with the zeroshot model

IndexError Traceback (most recent call last)
Input In [67], in <cell line: 2>()
      1 #topics, probabilities = topic_model.fit_transform(sentences_nlp)
----> 2 topics, probabilities = topic_model.fit_transform(freitextantwort_list)

File /opt/conda/envs/Python-3.10-Premium/lib/python3.10/site-packages/bertopic/_bertopic.py:448, in fit_transform(self, documents, embeddings, images, y)
    446 # Combine Zero-shot with outliers
    447 if self._is_zeroshot() and len(documents) != len(doc_ids):
--> 448 predictions = self._combine_zeroshot_topics(documents, assigned_documents, assigned_embeddings)
    449
    450 return predictions, self.probabilities_

File /opt/conda/envs/Python-3.10-Premium/lib/python3.10/site-packages/bertopic/_bertopic.py:3553, in _combine_zeroshot_topics(self, documents, assigned_documents, embeddings)
   3551 return self.topics_, self.probabilities_
   3552
-> 3553 # Merge the two topic models
   3554 merged_model = BERTopic.merge_models([zeroshot_model, self], min_similarity=1)
   3555

File /opt/conda/envs/Python-3.10-Premium/lib/python3.10/site-packages/bertopic/_bertopic.py:3166, in merge_models(cls, models, min_similarity, embedding_model)
   3164 merged_topics["topic_aspects"][aspect][str(new_topic_val)] = values[str(new_topic)]
   3165
-> 3166 # Add new embeddings
   3167 new_tensors = tensors[new_topic - selected_topics["_outliers"]]
   3168 merged_tensors = np.vstack([merged_tensors, new_tensors])

IndexError: index -2 is out of bounds for axis 0 with size 1

Thanks a lot

MaartenGr commented 5 months ago

@hubernst You mention using #1688 but the actual fix is found in #1762 which you should install through pip. Have you tried that? Make sure to start from a fresh and empty environment.

hubernst commented 5 months ago

Thanks for your really quick response. Unfortunately, I'm in a network environment without Git access, which is why I edited _bertopic.py directly as specified in the fix. And sorry, I meant #1762, of course.

MaartenGr commented 5 months ago

@hubernst Can you provide a reproducible example? You shared very limited code so it's unclear for example what is in representation_model or which versions you are using. Also, I get no issues using the code from the PR on my end using the examples in the related issues.
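
For anyone putting together such a report, a small sketch like the following can collect the installed versions using only the standard library. The `collect_versions` helper is hypothetical, and the package list is an assumption about what a typical BERTopic setup depends on; adjust it to your environment.

```python
from importlib import metadata


def collect_versions(packages):
    """Map each distribution name to its installed version, or 'not installed'."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


# Assumed package names; not an authoritative dependency list.
packages = [
    "bertopic",
    "sentence-transformers",
    "umap-learn",
    "hdbscan",
    "scikit-learn",
    "numpy",
]

for name, version in collect_versions(packages).items():
    print(f"{name}=={version}")
```

Pasting this output next to the code makes it easier to tell whether a locally patched _bertopic.py matches the version the fix was written against.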

hubernst commented 4 months ago

Hi, thanks for your answer. I'm using BERTopic version 0.16.0 and Python 3.10. My code looks like this:

import sentence_transformers
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import KeyBERTInspired

# Step 1 - Extract embeddings
embedding_model = sentence_transformers.SentenceTransformer('/userfs/assets/data_asset/huggingface/paraphrase-multilingual-MiniLM-L12-v2')
# Step 2 - Reduce dimensionality
umap_model = UMAP(n_neighbors=10, n_components=10, min_dist=0.0, metric='cosine', random_state=42)
# Step 3 - Cluster reduced embeddings
hdbscan_model = HDBSCAN(min_cluster_size=30, metric='euclidean', cluster_selection_method='eom', prediction_data=False)
# Step 4 - Tokenize topics
vectorizer_model = CountVectorizer(ngram_range=(1, 3), stop_words=stopwords_german)
# Step 5 - Create topic representation
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True, bm25_weighting=True)
# Step 6 - (Optional) Fine-tune topic representations with
# a `bertopic.representation` model
representation_model = KeyBERTInspired()
# All steps together
topic_model = BERTopic(
    verbose=True,
    min_topic_size=30,
    #nr_topics = 5,
    zeroshot_topic_list=kategorien_1,
    zeroshot_min_similarity=.45,
    embedding_model=embedding_model,          # Step 1 - Extract embeddings
    umap_model=umap_model,                    # Step 2 - Reduce dimensionality
    hdbscan_model=hdbscan_model,              # Step 3 - Cluster reduced embeddings
    vectorizer_model=vectorizer_model,        # Step 4 - Tokenize topics
    ctfidf_model=ctfidf_model,                # Step 5 - Extract topic words
    representation_model=representation_model # Step 6 - (Optional) Fine-tune topic representations
)
topics = topic_model.fit_transform(freitextantwort_list)

2024-02-09 15:44:24,639 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100% 42/42 [00:18<00:00, 4.04it/s]
2024-02-09 15:44:43,544 - BERTopic - Embedding - Completed ✓
2024-02-09 15:44:43,546 - BERTopic - Zeroshot Step 1 - Finding documents that could be assigned to either one of the zero-shot topics
2024-02-09 15:44:43,747 - BERTopic - Zeroshot Step 1 - Completed ✓
2024-02-09 15:44:43,748 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-09 15:44:56,807 - BERTopic - Dimensionality - Completed ✓
2024-02-09 15:44:56,808 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-09 15:44:56,835 - BERTopic - Cluster - Completed ✓
2024-02-09 15:44:56,841 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-02-09 15:44:58,442 - BERTopic - Representation - Completed ✓
2024-02-09 15:44:58,469 - BERTopic - Zeroshot Step 2 - Clustering documents that were not found in the zero-shot model...
2024-02-09 15:44:58,475 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-09 15:44:58,477 - BERTopic - Dimensionality - Completed ✓
2024-02-09 15:44:58,481 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-09 15:44:58,484 - BERTopic - Cluster - Completed ✓
2024-02-09 15:44:58,490 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-02-09 15:45:10,157 - BERTopic - Representation - Completed ✓
2024-02-09 15:45:10,231 - BERTopic - Zeroshot Step 2 - Completed ✓
2024-02-09 15:45:10,232 - BERTopic - Zeroshot Step 3 - Combining clustered topics with the zeroshot model

IndexError Traceback (most recent call last)
Input In [55], in <cell line: 2>()
      1 #topics, probabilities = topic_model.fit_transform(sentences_nlp)
----> 2 topics = topic_model.fit_transform(freitextantwort_list)

File /opt/conda/envs/Python-3.10-Premium/lib/python3.10/site-packages/bertopic/_bertopic.py:448, in BERTopic.fit_transform(self, documents, embeddings, images, y)
    446 # Combine Zero-shot with outliers
    447 if self._is_zeroshot() and len(documents) != len(doc_ids):
--> 448 predictions = self._combine_zeroshot_topics(documents, assigned_documents, assigned_embeddings)
    450 return predictions, self.probabilities_

File /opt/conda/envs/Python-3.10-Premium/lib/python3.10/site-packages/bertopic/_bertopic.py:3554, in BERTopic._combine_zeroshot_topics(self, documents, assigned_documents, embeddings)
   3551 return self.topics_, self.probabilities_
   3553 # Merge the two topic models
-> 3554 merged_model = BERTopic.merge_models([zeroshot_model, self], min_similarity=1)
   3556 # Update topic labels and representative docs of the zero-shot model
   3557 for topic in range(len(set(y))):

File /opt/conda/envs/Python-3.10-Premium/lib/python3.10/site-packages/bertopic/_bertopic.py:3167, in BERTopic.merge_models(cls, models, min_similarity, embedding_model)
   3164 merged_topics["topic_aspects"][aspect][str(new_topic_val)] = values[str(new_topic)]
   3166 # Add new embeddings
-> 3167 new_tensors = tensors[new_topic - selected_topics["_outliers"]]
   3168 merged_tensors = np.vstack([merged_tensors, new_tensors])
   3170 # Topic Mapper

IndexError: index -2 is out of bounds for axis 0 with size 1

It works if I am not using zero-shot topic modeling.

Best regards

MaartenGr commented 4 months ago

I think this issue then relates to https://github.com/MaartenGr/BERTopic/issues/1797 which should be relatively straightforward to fix. I would advise keeping an eye on that issue until a fix is released.

MaartenGr commented 4 months ago

@hubernst

I created a PR in https://github.com/MaartenGr/BERTopic/pull/1804 that should solve both issues, the ordering of the embeddings as well as moving the outlier class back to the 0th position (which is necessary for many other functions).

Could you test whether it works for you?
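
For readers following the traceback, the failing line indexes `tensors` with `new_topic - selected_topics["_outliers"]`. The sketch below only illustrates that arithmetic with made-up values: `pick_embedding`, `topic_embeddings`, `new_topic`, and `outliers` are hypothetical stand-ins, not values taken from BERTopic's state, and a plain list is used in place of a NumPy array.

```python
def pick_embedding(topic_embeddings, new_topic, outliers):
    """Mimic the indexing pattern from the traceback on a plain list."""
    index = new_topic - outliers
    # Negative indices wrap around only while abs(index) <= len(sequence);
    # beyond that, Python raises IndexError.
    return index, topic_embeddings[index]


# One stored embedding row, mirroring "axis 0 with size 1".
topic_embeddings = [[0.1, 0.2, 0.3]]

# In-range case: an offset of 0 selects the only row.
print(pick_embedding(topic_embeddings, new_topic=1, outliers=1))

# Out-of-range case: hypothetical values whose difference is -2.
try:
    pick_embedding(topic_embeddings, new_topic=-1, outliers=1)
except IndexError as err:
    print("IndexError:", err)
```

A NumPy array with a single row indexed at -2 fails in the same way, with the message shown in the traceback. This is why the position of the outlier class and the ordering of the embeddings both matter for the merge step.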

hubernst commented 4 months ago

Hello, yes, of course I will check it, thank you for the fix! Hopefully today, tomorrow afternoon at the latest.

hubernst commented 4 months ago

Hi, thanks for the quick help. For the problem described here, the fix #1804 works! That is, I can now specify different values for zeroshot_min_similarity. Unfortunately, the fix does not solve issue #1792; I can also comment on that there. There is also an error with topics_per_class(). Sorry.

MaartenGr commented 4 months ago

Glad to hear that it resolved at least this issue ;-) I added my response to that specific issue there.

MaartenGr / BERTopic

IndexError: index -2 is out of bounds for axis 0 with size 1 for the zero shot code. #1749
