MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Topic Modeling with LLMs #1659

Open AAA2023AAA opened 7 months ago

AAA2023AAA commented 7 months ago

Hello,

I'm Ahmad, who asked you on LinkedIn. Thank you for your response. I have Colab Pro+. These are the statistics of my dataset:

```
count    21006.000000
mean       490.140341
std        217.119297
min          1.000000
25%        340.000000
50%        471.000000
75%        610.000000
max       3163.000000
Name: article_body_length, dtype: float64
```

How can I solve the out-of-memory issues? Thank you in advance.

MaartenGr commented 7 months ago

It's difficult to say without seeing the full code. Could you share your full code? Also, which GPU did you select?

AAA2023AAA commented 7 months ago

I use an A100. I tried to use the same code you provided for the arXiv papers and Llama 2 (`meta-llama/Llama-2-13b-chat-hf`): https://towardsdatascience.com/topic-modeling-with-llama-2-85177d01e174

MaartenGr commented 7 months ago

You did not change a single thing? For the sake of simplicity, could you still directly copy-paste your code? You never know what the issue might be, even when preparing your dataset!

Lastly, when exactly do you get the out-of-memory error? Is that during .fit?

MaartenGr commented 7 months ago

By the way, it might also be worthwhile to check out the official documentation for a number of other examples of running LLMs with BERTopic. A couple of pre-quantized models are also explored there: https://maartengr.github.io/BERTopic/getting_started/representation/llm.html#zephyr-mistral-7b
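As a rough sketch of what that pre-quantized route looks like (based on that documentation page; the repo and file names are assumptions and should be verified on the Hugging Face Hub):

```python
from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer, pipeline

# Load a pre-quantized GGUF model through ctransformers with a transformers-compatible interface
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/zephyr-7B-alpha-GGUF",           # assumed repo name
    model_file="zephyr-7b-alpha.Q4_K_M.gguf",  # assumed quantized file name
    model_type="mistral",
    gpu_layers=50,
    hf=True,
)
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-alpha")

# Wrap it in a standard text-generation pipeline that BERTopic's TextGeneration can use
generator = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=50,
    repetition_penalty=1.1,
)
```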

AAA2023AAA commented 7 months ago

I have only changed the dataset cell:

```python
import pandas as pd

df = pd.read_csv("/content/drive/MyDrive/Article_body.csv")
docs = df["Article Body"].fillna('').astype(str)
```

MaartenGr commented 7 months ago

```python
docs = df["Article Body"].fillna('').astype(str)
```

First, make sure that your docs are a list of strings and not a pandas series. Second, when exactly do you get the out-of-memory error? Is that during .fit? If so, which steps were already completed before giving the error? You can see this in the logs when running .fit.
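In other words, something along these lines (the path and column name are taken from your snippet above):

```python
import pandas as pd

# Read the CSV and convert the column to a plain list of strings, not a pandas Series
df = pd.read_csv("/content/drive/MyDrive/Article_body.csv")
docs = df["Article Body"].fillna('').astype(str).tolist()
```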

AAA2023AAA commented 7 months ago

Yes, I use:

```python
docs = docs.tolist()
```

The error appeared during `.fit_transform`. This is the error:

```
2023-12-02 15:04:01,540 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2023-12-02 15:04:58,427 - BERTopic - Dimensionality - Completed ✓
2023-12-02 15:04:58,431 - BERTopic - Cluster - Start clustering the reduced embeddings
2023-12-02 15:05:06,908 - BERTopic - Cluster - Completed ✓
2023-12-02 15:05:06,910 - BERTopic - Representation - Extracting topics from clusters using representation models.
  0%|          | 0/23 [00:00<?, ?it/s]

OutOfMemoryError                          Traceback (most recent call last)
in <cell line: 21>()
     19
     20 # Train model
---> 21 topics, probs = topic_model.fit_transform(docs, embeddings)

28 frames
/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py in softmax(input, dim, _stacklevel, dtype)
   1856         ret = input.softmax(dim)
   1857     else:
-> 1858         ret = input.softmax(dim, dtype=dtype)
   1859     return ret
   1860

OutOfMemoryError: CUDA out of memory. Tried to allocate 3.05 GiB. GPU 0 has a total capacity of 15.77 GiB of which 2.49 GiB is free. Process 11204 has 13.28 GiB memory in use. Of the allocated memory 8.85 GiB is allocated by PyTorch, and 3.20 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```

MaartenGr commented 7 months ago

In that case, it might be worthwhile to try out a quantized model like the ones in the link I shared above. It seems the model needs more VRAM than anticipated for your use case, so either increasing VRAM (e.g., to 24GB) or using a smaller or quantized model would be the fix for you.
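For reference, a minimal sketch of loading the same Llama 2 model in 4-bit with bitsandbytes (the model id is the one from the Towards Data Science post linked above; the generation settings are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

model_id = "meta-llama/Llama-2-13b-chat-hf"

# 4-bit NF4 quantization so the 13B model uses far less VRAM than in fp16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
model.eval()

# Text-generation pipeline to pass to BERTopic's TextGeneration representation
generator = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=50,
    repetition_penalty=1.1,
)
```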

Virginie74 commented 6 months ago

Hi Maarten,

Thank you for the excellent work done with bertopic.

Until now, I had been using BERTopic v0.15 with Llama 2 for representation, and it was working very well. However, I decided to upgrade to v0.16 to test the new functionalities, such as zero-shot topic modeling.

Now, with the same code and data, I encounter an `OutOfMemoryError` (CUDA out of memory). The error message is as follows:

```
Tried to allocate 3.38 GiB. GPU 0 has a total capacity of 11.99 GiB, of which 1.02 GiB is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory, 8.93 GiB is allocated by PyTorch, and 57.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large, try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.
```

Do you have any idea what could be causing this change in memory usage between versions 0.15 and 0.16?

MaartenGr commented 6 months ago

@Virginie74 There were no changes related to memory usage between those two versions. My guess, especially with LLMs, is that the underlying package that loads the model was updated (or the LLM itself). For instance, newer versions of transformers, sentence-transformers, etc. may take up a bit more VRAM, which causes the OOM errors. I would advise using quantized models instead.

crookedreyes commented 6 months ago

I'm facing the same issue with Llama 2, using the same example from https://colab.research.google.com/drive/1QCERSMUjqGetGGujdrvv_6_EeoIcd_9M?usp=sharing with BERTopic 0.16.

But if I change

```
!pip install bertopic datasets accelerate bitsandbytes xformers adjustText
```

to

```
!pip install bertopic==0.15 datasets accelerate bitsandbytes xformers adjustText
```

it works without problems.

MaartenGr commented 6 months ago

@crookedreyes I just checked the changelog again and the only thing that changed which might affect CUDA memory is the automatic truncation of documents. If you set `doc_length=100` and `tokenizer="char"`, do you then still get this issue?

crookedreyes commented 6 months ago

> @crookedreyes I just checked the changelog again and the only thing that changed which might affect CUDA memory is the automatic truncation of documents. If you set `doc_length=100` and `tokenizer="char"`, do you then still get this issue?

@MaartenGr Thank you for your response and help.

It worked! I just added the parameters to `TextGeneration()`:

```python
llama2 = TextGeneration(generator, prompt=prompt, doc_length=10, tokenizer="char")
```
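For anyone running into this later, a minimal sketch of how such a representation model plugs into BERTopic (assuming `generator`, `prompt`, `docs`, and `embeddings` are defined as earlier in this thread; `doc_length=100` follows the suggestion above):

```python
from bertopic import BERTopic
from bertopic.representation import TextGeneration

# Truncate each document to 100 characters before it is inserted into the LLM prompt
llama2 = TextGeneration(generator, prompt=prompt, doc_length=100, tokenizer="char")

topic_model = BERTopic(representation_model=llama2, verbose=True)
topics, probs = topic_model.fit_transform(docs, embeddings)
```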