MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

ValueError: zero-size array to reduction operation maximum which has no identity #1010

Closed zhimin-z closed 1 year ago

zhimin-z commented 1 year ago

After a hyperparameter sweep with wandb, I found the best hyperparameters and reran the training:

import os
import pandas as pd
from contextualized_topic_models.evaluation.measures import InvertedRBO, TopicDiversity, CoherenceCV, CoherenceNPMI, CoherenceUMASS, CoherenceUCI
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import KeyBERTInspired
from bertopic import BERTopic
from hdbscan import HDBSCAN
from umap import UMAP

# output the best topic model

# Step 1 - Extract embeddings
embedding_model = SentenceTransformer("all-mpnet-base-v2")

# Step 2 - Reduce dimensionality
umap_model = UMAP(n_neighbors=10, n_components=8,
                  metric='manhattan', low_memory=False)

# Step 3 - Cluster reduced embeddings
hdbscan_model = HDBSCAN()

# Step 4 - Tokenize topics
vectorizer_model = TfidfVectorizer(stop_words="english", ngram_range=(1, 3))

# Step 5 - Create topic representation
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

# Step 6 - (Optional) Fine-tune topic representations with a `bertopic.representation` model
representation_model = KeyBERTInspired()

# All steps together
topic_model = BERTopic(
    embedding_model=embedding_model,            # Step 1 - Extract embeddings
    umap_model=umap_model,                      # Step 2 - Reduce dimensionality
    hdbscan_model=hdbscan_model,                # Step 3 - Cluster reduced embeddings
    vectorizer_model=vectorizer_model,          # Step 4 - Tokenize topics
    ctfidf_model=ctfidf_model,                  # Step 5 - Extract topic words
    representation_model=representation_model,  # Step 6 - (Optional) Fine-tune topic representations
    # verbose=True                              # Step 7 - Track model stages
)

df_issues = pd.read_json(os.path.join(
    path_labeling, 'issues_topic_modeling.json'))
docs = df_issues['Issue_preprocessed_content_gpt_summary'].tolist()

topic_model = topic_model.fit(docs)
topic_model.save(os.path.join(path_labeling_best, 'Topic model'))

fig = topic_model.visualize_topics()
fig.write_html(os.path.join(path_labeling_best, 'Topic visualization.html'))

hierarchical_topics = topic_model.hierarchical_topics(docs)
embeddings = embedding_model.encode(docs, show_progress_bar=False)
reduced_embeddings = umap_model.transform(embeddings)
fig = topic_model.visualize_hierarchical_documents(
    docs, hierarchical_topics=hierarchical_topics, embeddings=reduced_embeddings)
fig.write_html(os.path.join(path_labeling_best,
               'Hierarchical document visualization.html'))

fig = topic_model.visualize_barchart()
fig.write_html(os.path.join(path_labeling_best, 'Term visualization.html'))

fig = topic_model.visualize_heatmap()
fig.write_html(os.path.join(path_labeling_best,
               'Topic similarity visualization.html'))

fig = topic_model.visualize_term_rank()
fig.write_html(os.path.join(path_labeling_best,
               'Term score decline visualization.html'))

info_df = topic_model.get_topic_info()
info_df

However, this gives me the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-63eab4a34363> in <module>
     38 topic_model.save(os.path.join(path_labeling_best, 'Topic model'))
     39 
---> 40 fig = topic_model.visualize_topics()
     41 fig.write_html(os.path.join(path_labeling_best, 'Topic visualization.html'))
     42 

~\AppData\Roaming\Python\Python39\site-packages\bertopic\_bertopic.py in visualize_topics(self, topics, top_n_topics, width, height)
   2035         """
   2036         check_is_fitted(self)
-> 2037         return plotting.visualize_topics(self,
   2038                                          topics=topics,
   2039                                          top_n_topics=top_n_topics,

~\AppData\Roaming\Python\Python39\site-packages\bertopic\plotting\_topics.py in visualize_topics(topic_model, topics, top_n_topics, custom_labels, title, width, height)
     71     embeddings = topic_model.c_tf_idf_.toarray()[indices]
     72     embeddings = MinMaxScaler().fit_transform(embeddings)
---> 73     embeddings = UMAP(n_neighbors=2, n_components=2, metric='hellinger', random_state=42).fit_transform(embeddings)
     74 
     75     # Visualize with plotly

~\AppData\Roaming\Python\Python39\site-packages\umap\umap_.py in fit_transform(self, X, y)
   2770             Local radii of data points in the embedding (log-transformed).
...
---> 40     return umr_maximum(a, axis, None, out, keepdims, initial, where)
     41 
     42 def _amin(a, axis=None, out=None, keepdims=False,

ValueError: zero-size array to reduction operation maximum which has no identity

What should I do now? @MaartenGr

zhimin-z commented 1 year ago

These previous posts https://github.com/MaartenGr/BERTopic/issues/727, https://github.com/MaartenGr/BERTopic/issues/725, and https://github.com/MaartenGr/BERTopic/issues/378 seem to be of no help to me. I cannot visualize the topics, but I did successfully run the topic modeling. BTW, the dataset is a small one (~345 paragraphs, each with fewer than 3,000 words).

zhimin-z commented 1 year ago

Does it have anything to do with too few topics?

[screenshot of topic_model.get_topic_info() showing only two topics]

MaartenGr commented 1 year ago

That indeed might be the case: with only 2 topics, each data point cannot take 2 nearest neighbors as defined in UMAP. I believe that in order to run this correctly you would need at least 3 topics. Perhaps it would be worthwhile to tweak HDBSCAN a bit in order to create more topics, for example by lowering min_cluster_size. Having said that, with small datasets I would typically recommend something like k-Means instead, which allows you to set n_clusters, as that can capture clusters in smaller datasets a bit more straightforwardly.
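
For reference, a minimal sketch of the k-Means route (not from this thread; n_clusters is a placeholder value, and docs refers to the document list from the snippet above):

from sklearn.cluster import KMeans
from bertopic import BERTopic

# k-Means fixes the number of clusters/topics up front, which avoids
# ending up with too few topics on a small dataset.
cluster_model = KMeans(n_clusters=10)  # placeholder value, tune for your data

# BERTopic accepts a sklearn-style clustering model in place of HDBSCAN.
topic_model = BERTopic(hdbscan_model=cluster_model)
topics, _ = topic_model.fit_transform(docs)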

fabmeyer commented 12 months ago

@zhimin-z @MaartenGr could one of you resolve this issue? If yes, with which parameters?

I need a way to extract topics or keywords from short news headlines like

Is that even possible with BERTopic?

MaartenGr commented 12 months ago

@fabmeyer If you are running into the issue of having too few topics, then you can use the min_topic_size parameter for that. Reducing that value will increase the number of topics. If you are using a custom HDBSCAN model, then you can use min_cluster_size for that. Finally, if you are interested in extracting keywords without needing some overarching topics, you can use KeyBERT instead.
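
A minimal sketch of the two options mentioned above (the values are illustrative, not recommendations):

from bertopic import BERTopic
from hdbscan import HDBSCAN

# Option 1: keep the default HDBSCAN but lower the minimum topic size (BERTopic's default is 10).
topic_model = BERTopic(min_topic_size=5)

# Option 2: pass a custom HDBSCAN model and lower min_cluster_size directly.
hdbscan_model = HDBSCAN(min_cluster_size=5, metric='euclidean', prediction_data=True)
topic_model = BERTopic(hdbscan_model=hdbscan_model)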

fabmeyer commented 12 months ago

@MaartenGr Thanks for your fast reply, Maarten. I rather need something like overarching topics. I have seen that you also have a version that can run with LLMs. Which of your many libraries is the best for overarching topic extraction/mining? :D

MaartenGr commented 12 months ago

@fabmeyer No problem! It depends on the size of your data. If you just have a couple of documents (e.g., < 100) then it would make sense to either just label the documents yourself or use something like KeyBERT. For that amount of data, I'm not sure whether there is actually a use case for topic modeling. However, it could definitely still work with a clustering model like k-Means in BERTopic.

For larger datasets, BERTopic is definitely something that fits within most use cases due to its modular nature. You can simply pick and choose whichever algorithm suits your use case best.

Either way, for overarching topic extraction I would definitely go for BERTopic.
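
For the KeyBERT route mentioned above, a minimal sketch (the document text is a made-up placeholder):

from keybert import KeyBERT

kw_model = KeyBERT()

# Extract keywords from a single document in isolation.
doc = "Central bank raises interest rates amid rising inflation"  # placeholder headline
keywords = kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words="english")
print(keywords)  # list of (keyword, score) tuples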

fabmeyer commented 12 months ago

@MaartenGr The problem is that I need to extract the topics for every single news headline in isolation, i.e. summarise a news headline with just a few words, instead of topic mining over a large corpus...

MaartenGr commented 12 months ago

@fabmeyer If you just need to summarize a news headline in isolation, then there is no need to do topic mining at all. You can just ask an LLM to do that for you. Something like this:

from torch import bfloat16
from transformers import pipeline

# Load LLM
pipe = pipeline(
    "text-generation", 
    model="HuggingFaceH4/zephyr-7b-beta", 
    torch_dtype=bfloat16, 
    device_map="auto"
)

# Ask LLM to summarize a news headline
prompt = "Summarize this headline for me: [HEADLINE]."
outputs = pipe(prompt, 
    max_new_tokens=256, 
    do_sample=True, 
    temperature=0.1, 
    top_p=0.95
)
print(outputs[0]["generated_text"])

You could also use KeyBERT and its newly released KeyLLM to ask for keywords/summarization or anything else in isolation.
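
As a rough sketch of that KeyLLM route, reusing the pipeline from the snippet above (this assumes a recent KeyBERT release that ships the keybert.llm module; check the KeyBERT docs for the exact API):

from keybert import KeyLLM
from keybert.llm import TextGeneration

# Wrap the transformers pipeline so KeyLLM can prompt it.
llm = TextGeneration(pipe)
kw_model = KeyLLM(llm)

# Extract keywords for each headline in isolation (placeholder headline).
headlines = ["Central bank raises interest rates amid rising inflation"]
keywords = kw_model.extract_keywords(headlines)
print(keywords)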

fabmeyer commented 12 months ago

@MaartenGr Yeah, actually I am trying that out right now with KeyLLM + Mistral-7B. Thanks again.

wjx-alalala commented 7 months ago

Hello OP, I think we may be working on the same kind of experiment. I am using BERTopic to cluster topics in Chinese text. Could you share your wandb sweep code? Thank you very much~