MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
5.99k stars 752 forks

always crash on Mac #1014

Closed zhimin-z closed 1 year ago

zhimin-z commented 1 year ago

I am running BERTopic on my MacBook Air (M2, 16GB), but it crashes every time:

from contextualized_topic_models.evaluation.measures import InvertedRBO, TopicDiversity, CoherenceCV, CoherenceNPMI, CoherenceUMASS, CoherenceUCI
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import KeyBERTInspired
from bertopic import BERTopic
from hdbscan import HDBSCAN
from umap import UMAP
import pandas as pd
import os

# output the best topic model

# Step 1 - Extract embeddings
embedding_model = SentenceTransformer("all-mpnet-base-v2")

# Step 2 - Reduce dimensionality
umap_model = UMAP(n_neighbors=16, n_components=4,
                  metric='manhattan')

# Step 3 - Cluster reduced embeddings
hdbscan_model = HDBSCAN()

# Step 4 - Tokenize topics
vectorizer_model = TfidfVectorizer(stop_words="english", ngram_range=(1, 3))

# Step 5 - Create topic representation
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

# Step 6 - (Optional) Fine-tune topic representations with a `bertopic.representation` model
representation_model = KeyBERTInspired()

# All steps together
topic_model = BERTopic(
    embedding_model=embedding_model,            # Step 1 - Extract embeddings
    umap_model=umap_model,                      # Step 2 - Reduce dimensionality
    hdbscan_model=hdbscan_model,                # Step 3 - Cluster reduced embeddings
    vectorizer_model=vectorizer_model,          # Step 4 - Tokenize topics
    ctfidf_model=ctfidf_model,                  # Step 5 - Extract topic words
    representation_model=representation_model,  # Step 6 - (Optional) Fine-tune topic representations
    # verbose=True                              # Step 7 - Track model stages
)

df_issues = pd.read_json(os.path.join(
    path_labeling, 'issues_topic_modeling.json'))
docs = df_issues['Issue_original_content_gpt_summary'].tolist()

topic_model = topic_model.fit(docs)
topic_model.save(os.path.join(path_labeling_best, 'Topic model'))

fig = topic_model.visualize_topics()
fig.write_html(os.path.join(path_labeling_best, 'Topic visualization.html'))

fig = topic_model.visualize_barchart()
fig.write_html(os.path.join(path_labeling_best, 'Term visualization.html'))

fig = topic_model.visualize_heatmap()
fig.write_html(os.path.join(path_labeling_best,
               'Topic similarity visualization.html'))

fig = topic_model.visualize_term_rank()
fig.write_html(os.path.join(path_labeling_best,
               'Term score decline visualization.html'))

hierarchical_topics = topic_model.hierarchical_topics(docs)
fig = topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
fig.write_html(os.path.join(path_labeling_best,
               'Hierarchical clustering visualization.html'))

embeddings = embedding_model.encode(docs, show_progress_bar=False)
fig = topic_model.visualize_documents(docs, embeddings=embeddings)
fig.write_html(os.path.join(path_labeling_best,
               'Document visualization.html'))

info_df = topic_model.get_topic_info()
info_df

Here is the log message:

The Kernel crashed while executing code in the the current cell or a previous cell. Please review the code in the cell(s) to identify a possible cause of the failure. Click [here](https://aka.ms/vscodeJupyterKernelCrash) for more info. View Jupyter [log](command:jupyter.viewOutput) for further details.

Canceled future for execute_request message before replies were done

Does anyone have an idea why this always happens? I'd appreciate any help! By the way, the same code never crashes on my Windows 10 PC (i7-4790, 16GB, GTX 745). Does it have anything to do with the GPU/CUDA? @MaartenGr Do you have any idea?

MaartenGr commented 1 year ago

Could you track at what piece of code it crashes? That would make it much easier to figure out what is happening. Also, make sure to have a pytorch version installed that works for M2.
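One way to track down the crashing line is the standard-library `faulthandler` module: a kernel death like this is usually a hard native crash (e.g. a segfault in a compiled extension), and `faulthandler` can still print the Python traceback that triggered it. A minimal sketch, to be placed at the very top of the notebook or script:

```python
import faulthandler

# Enable the stdlib fault handler so that a hard native crash (segfault,
# illegal instruction) dumps the Python stack that triggered it to stderr
# instead of silently killing the Jupyter kernel.
faulthandler.enable()
print("faulthandler enabled:", faulthandler.is_enabled())
```

Running the failing cell with this enabled (or running the script with `python -X faulthandler main.py`) should reveal which call the crash originates from.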

zhimin-z commented 1 year ago

@MaartenGr I found out it crashes when I execute `embedding_model = SentenceTransformer("all-mpnet-base-v2")`. But when I tried:

import torch
x = torch.rand(5, 3)
print(x)

It indeed works as expected:

tensor([[0.9452, 0.9088, 0.9812],
        [0.1468, 0.4544, 0.8565],
        [0.6701, 0.7626, 0.1618],
        [0.1539, 0.1461, 0.1831],
        [0.7187, 0.3056, 0.4588]])

Also, when I remove `embedding_model` from BERTopic and fit on my data, it also crashes immediately. The crash happens even when I use BERTopic without any custom configuration: `BERTopic().fit(docs)`. My dataset is just 2 MB. I am using Python 3.9.13 and bertopic==0.14.0. I tried downgrading to bertopic==0.13.0, but it still crashed. Any other suggestions?
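When versions are in question like this, a small diagnostic that prints the interpreter and package versions in the exact environment the kernel uses can help pin things down. A sketch using only the standard library (the package list here is an assumption based on the imports above; missing packages are reported rather than raising):

```python
import sys
from importlib import metadata

# Collect the interpreter version plus the versions of the key packages
# used in the failing script; absent packages are flagged, not fatal.
versions = {"python": sys.version.split()[0]}
for pkg in ("bertopic", "torch", "sentence-transformers", "hdbscan", "umap-learn"):
    try:
        versions[pkg] = metadata.version(pkg)
    except metadata.PackageNotFoundError:
        versions[pkg] = "not installed"

for name, ver in versions.items():
    print(name, ver)
```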

zhimin-z commented 1 year ago

This is the log from Jupyter:

[W 01:34:00.620 NotebookApp] Unhandled error
warn 01:34:00.624: Error occurred while trying to start the kernel, options.disableUI=true Ap [Error]: 
    at new pn (/Users/jimmy/.vscode/extensions/ms-toolsai.jupyter-2023.1.2010391206/out/extension.node.js:2:1628691)
    at new Ap (/Users/jimmy/.vscode/extensions/ms-toolsai.jupyter-2023.1.2010391206/out/extension.node.js:17:127546)
    at /Users/jimmy/.vscode/extensions/ms-toolsai.jupyter-2023.1.2010391206/out/extension.node.js:17:278397
    at processTicksAndRejections (node:internal/process/task_queues:96:5)

Error: Unhandled error
    at Function.create (/Users/jimmy/.vscode/extensions/ms-toolsai.jupyter-2023.1.2010391206/out/extension.node.js:2:68927)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at Object.t.startSession (/Users/jimmy/.vscode/extensions/ms-toolsai.jupyter-2023.1.2010391206/out/extension.node.js:2:81011)
    at d.startNew (/Users/jimmy/.vscode/extensions/ms-toolsai.jupyter-2023.1.2010391206/out/extension.node.js:2:77195) {
  category: 'unknown',
  originalException: t [Error]: Unhandled error
      at Function.create (/Users/jimmy/.vscode/extensions/ms-toolsai.jupyter-2023.1.2010391206/out/extension.node.js:2:68927)
      at processTicksAndRejections (node:internal/process/task_queues:96:5)
      at Object.t.startSession (/Users/jimmy/.vscode/extensions/ms-toolsai.jupyter-2023.1.2010391206/out/extension.node.js:2:81011)
      at d.startNew (/Users/jimmy/.vscode/extensions/ms-toolsai.jupyter-2023.1.2010391206/out/extension.node.js:2:77195) {
    response: L [Response] {
      size: 0,
      timeout: 0,
      [Symbol(Body internals)]: [Object],
      [Symbol(Response internals)]: [Object]
    },
    traceback: '',
    vslsStack: [ CallSite {}, CallSite {}, CallSite {}, CallSite {} ]
  }
}

zhimin-z commented 1 year ago

I deployed the same script in Docker, and the crash did not happen:

FROM python:3.9.13

COPY . .

RUN pip3 install -r requirements.txt

CMD ["python", "main.py"] 

Since Docker's default OS is Linux rather than macOS, I wonder if this is due to an incompatibility caused by the M2 chip?
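If the M2 chip is the suspect, one quick thing worth ruling out is an architecture mismatch: on Apple Silicon, an x86_64 (Rosetta-emulated) Python combined with binary wheels built for the other architecture is a known source of hard crashes. A minimal stdlib check of what the running interpreter actually is:

```python
import platform

# On an M1/M2 Mac, platform.machine() reports "arm64" for a native Python
# build and "x86_64" when the interpreter runs under Rosetta emulation.
# platform.system() is "Darwin" on macOS and "Linux" inside the container.
print(platform.system(), platform.machine())
```

If this prints `Darwin x86_64` on the M2 machine, reinstalling a native arm64 Python (and reinstalling the packages under it) would be the first thing to try.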

MaartenGr commented 1 year ago

@zhimin-z The issue might indeed be caused by the M2 chip. However, I believe that there is now GPU support for ARM-based Mac processors (M1 & M2) in PyTorch, so making sure you have the right version of PyTorch here would be key.
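To verify that the installed PyTorch build actually supports the Apple GPU backend, a guarded sketch (it assumes nothing beyond the standard library when torch is absent; `torch.backends.mps` exists in PyTorch >= 1.12):

```python
import importlib.util

# Report whether torch is importable and, if so, whether the MPS
# (Apple Silicon GPU) backend is compiled into this wheel and usable
# on this machine.
if importlib.util.find_spec("torch") is None:
    print("torch is not installed")
else:
    import torch
    mps = getattr(torch.backends, "mps", None)
    print("MPS built:", bool(mps and torch.backends.mps.is_built()))
    print("MPS available:", bool(mps and torch.backends.mps.is_available()))
```

If MPS is not built or not available on the M2 machine, that points to the wrong PyTorch wheel being installed for this processor.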

MaartenGr commented 1 year ago

Due to inactivity, I'll be closing this issue. Let me know if you want me to re-open the issue!