bitsandbytes-foundation / bitsandbytes

Accessible large language models via k-bit quantization for PyTorch.
https://huggingface.co/docs/bitsandbytes/main/en/index
MIT License

Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model #881

Closed. Excy-an closed this issue 10 months ago

Excy-an commented 12 months ago

ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set load_in_8bit_fp32_cpu_offload=True and pass a custom device_map to from_pretrained. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details.
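For reference, a minimal sketch of the offload workaround the error message points to. This assumes a recent transformers release, where the flag is exposed as llm_int8_enable_fp32_cpu_offload on BitsAndBytesConfig; the checkpoint name is only a placeholder, not a model from this thread:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Keep the quantized layers on the GPU; modules that do not fit are left
# in fp32 on the CPU instead of raising the ValueError above.
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",  # placeholder checkpoint
    quantization_config=quantization_config,
    device_map="auto",    # or a hand-written map that pins specific modules to "cpu"
)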

JacopoMereu commented 11 months ago

Hi. I have this problem too with the recent versions. I set both the flag to True and the device_map to "auto", but nothing changed. It works with older versions, so it would be nice to have it fixed in the latest versions as well.

robuno commented 11 months ago

Hi,

Are you sure your model is definitely not running on the CPU? If you are certain that the memory is sufficient, I cannot find another explanation for this ValueError.
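A quick way to check where the model actually landed, assuming it was loaded with a device_map so that hf_device_map is populated:

import torch

# Any "cpu" or "disk" entry here is what triggers the ValueError when
# loading in 8-bit without the fp32 CPU offload flag.
print(model.hf_device_map)

# Rough check of how much VRAM the weights take on GPU 0.
print(f"{torch.cuda.memory_allocated(0) / 1024**3:.1f} GiB allocated on cuda:0")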

github-actions[bot] commented 10 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Darren80 commented 10 months ago

Hi. I have this problem too with the recent versions. I set both the flag to True and the device_map to "auto", but nothing changed. It works with older versions, so it would be nice to have it fixed in the latest versions as well.

If you don't mind me asking, what version of bitsandbytes ended up working for you?

JacopoMereu commented 10 months ago

In a nutshell: 0.42.0 (runs ok), 0.39.0 and 0.41.2.post2 (they may work)

The long version: @Darren80 Hello! It's been a while since I last worked on that code; I was just experimenting with Colab notebooks. During that time, I decided to drop the idea of locally fine-tuning a LoRA model and opted for GPT instead.

I tested those notebooks using the latest version of bitsandbytes (0.42.0), and they seem to be running smoothly without any errors.

N.B. I have also noted some specific versions in comments in my code: 0.39.0 and 0.41.2.post2. Don't ask me whether they worked; it's been too long and I don't remember :(

In general, my approach involved searching for a YouTube video where someone used the same code I was working with. Assuming the code in the video ran without errors, and the person installed the latest package versions available at the time, I would note the video's date and then downgrade all my packages to versions predating that date.

2proveit commented 9 months ago

I have been using bitsandbytes 0.41.2.post2 and still get the same problem. I am using an RTX 4090 to load llama-2-7b for inference, with this bnb config:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=bnb_config,
)

I checked the GPU and CPU memory usage, and only a small portion (2.4 GB) is occupied. I also tried 0.42.0 and still hit the same problem.
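A 7B model in 4-bit should fit comfortably in 24 GB, so it is worth checking whether accelerate nevertheless spilled some modules off the GPU. A sketch, reusing the bnb_config and model_name above; the max_memory budgets are illustrative, not required values:

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=bnb_config,
    # Explicit budgets so accelerate does not under-estimate the available
    # VRAM and push layers to the CPU; 22GiB leaves headroom on a 24 GB card.
    max_memory={0: "22GiB", "cpu": "64GiB"},
)

# Any module mapped to "cpu" or "disk" here explains the ValueError.
print(model.hf_device_map)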

AkshataDM commented 9 months ago

@JacopoMereu can you share the youtube link?

shashnkvats commented 9 months ago

I am facing the same issue. I am trying it on Kaggle (which provides 2x T4 GPUs), but for some reason only one GPU seems to be utilized; the utilization on the other GPU stays at 0.
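If the second T4 stays idle, it can help to give accelerate an explicit per-GPU budget so it spreads the layers over both cards. A sketch with a placeholder model name and illustrative memory limits:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder checkpoint
    quantization_config=bnb_config,
    device_map="auto",
    # Budgets for both 16 GB T4s plus some CPU room as a fallback.
    max_memory={0: "13GiB", 1: "13GiB", "cpu": "24GiB"},
)

print(model.hf_device_map)  # should now show modules on both cuda:0 and cuda:1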

ikeatmck commented 8 months ago

I had the same issue, but before this warning appeared I would get a complete OOM error from CUDA and no output at all. I was running this without using the pipeline function from transformers.

This was after attempting to clean up memory. Avoid using device.reset(): it leaves you unable to execute your code again until the service has been restarted.
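A gentler cleanup than a device reset, assuming the model is held by ordinary Python references:

import gc
import torch

# Drop the references first, otherwise the allocator cannot release the memory.
del model
gc.collect()

# Return cached blocks to the driver so the next load starts from a clean slate.
torch.cuda.empty_cache()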

jaaferklila commented 1 month ago

It doesn't work for me either.

Falkensmaze0 commented 1 month ago

It seems like nothing's changed since Nov 2023. Getting the same error message: "ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set load_in_8bit_fp32_cpu_offload=True and pass a custom device_map to from_pretrained. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details."

Trying to load nvidia/NVLM-D-72B locally, downloaded from Hugging Face, running the following code:

import torch
from transformers import AutoModel, AutoTokenizer

path = "nvidia/NVLM-D-72B"
device_map = "auto"
model = AutoModel.from_pretrained(
    path,
    load_in_8bit=True,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=False,
    trust_remote_code=True,
    device_map=device_map).eval()

Specs:
bitsandbytes version: 0.43.3
OS: Ubuntu 24.04.1 LTS
RAM: 128 GB DDR5
GPU: 1x RTX 4090
CPU: Intel 14900K

Any help would be greatly appreciated.
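For rough orientation, a back-of-the-envelope check of why a 72B-parameter model triggers this error on a single 24 GB card (the parameter count is approximate):

params = 72e9                              # ~72B parameters (approximate)
int8_weights_gib = params * 1.0 / 1024**3  # 1 byte per weight in 8-bit
nf4_weights_gib = params * 0.5 / 1024**3   # ~0.5 bytes per weight in 4-bit
gpu_vram_gib = 24                          # RTX 4090

print(f"8-bit weights ~{int8_weights_gib:.0f} GiB, 4-bit ~{nf4_weights_gib:.0f} GiB, "
      f"VRAM {gpu_vram_gib} GiB")
# Both are far above 24 GiB, so accelerate has to dispatch most of the model
# to the CPU or disk, which is exactly the situation the ValueError reports.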