Hi. I have this problem too with the recent versions. I set both the flag to True and device_map to "auto", but nothing changed. It works with older versions, so it would be great to have it fixed in the latest versions as well.
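Roughly what I mean, as a minimal sketch (I believe the flag in question is llm_int8_enable_fp32_cpu_offload on BitsAndBytesConfig; the model id below is just a placeholder):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Keep the CPU/disk-offloaded modules in fp32 while the rest is quantized to 8-bit
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "your/model-id",  # placeholder
    quantization_config=bnb_config,
    device_map="auto",
)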
Hi,
Are you sure your model is not actually running on the CPU? If you are certain the GPU memory is sufficient, I can't find another explanation for this error being raised.
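One quick way to check where the weights actually ended up, as a minimal sketch (pass in whatever from_pretrained returned):

def report_placement(model):
    # accelerate records the per-module placement here; "cpu" or "disk" entries mean offloading happened
    print(getattr(model, "hf_device_map", None))
    # double-check which devices the parameters actually live on
    print({str(p.device) for p in model.parameters()})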
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
If you don't mind me asking, what version of bitsandbytes ended up working for you?
In a nutshell: 0.42.0 runs OK; 0.39.0 and 0.41.2.post2 may work.
Longer version: @Darren80 Hello! It's been a while since I last worked on that code; I was just experimenting with Colab notebooks. In the meantime I decided to drop the idea of locally fine-tuning a LoRA model and opted for GPT instead.
I tested those notebooks using the latest version of bitsandbytes (0.42.0), and they seem to be running smoothly without any errors.
N.B. I have also noted some specific versions in comments in my code: 0.39.0 and 0.41.2.post2. Don't ask me whether they worked; it's been too long and I don't remember :(
In general, my approach involved searching for a YouTube video where someone used the same code I was working with. Assuming the code in the video ran without errors, and the person installed the latest package versions available at the time, I would note the video's date and then downgrade all my packages to versions predating that date.
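For example, after downgrading, something like this (a minimal sketch) lets you confirm the installed versions against the video's date:

import torch
import transformers
import bitsandbytes

# Compare these against whatever was current when the video was published
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("bitsandbytes:", bitsandbytes.__version__)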
I've been using bitsandbytes 0.41.2.post2 and still get the same problem. I'm using an RTX 4090 to load llama-2-7b for inference with this bnb config:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=bnb_config,
)
I checked the GPU and CPU memory usage, and only a small portion (2.4 GB) is occupied. I also tried 0.42.0 and hit the same problem.
@JacopoMereu can you share the YouTube link?
I am facing the same issue. I am trying it on Kaggle (which gives 2x T4 GPUs), but for some reason only one GPU seems to be utilized; utilization on the other GPU stays at 0.
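One thing that may be worth trying (a minimal sketch; the model id and the 13GiB budgets are just placeholders for a 16 GB T4) is passing an explicit max_memory so the weights get spread across both GPUs:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "your/model-id",  # placeholder
    quantization_config=bnb_config,
    device_map="auto",
    max_memory={0: "13GiB", 1: "13GiB"},  # leave some headroom on each T4
)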
I had the same issue, but before this warning I would get a complete OOM error from CUDA and no output at all. I was running this without using the pipeline function from transformers.
This is after attempting to clean up memory. Avoid using device.reset(): it doesn't let you execute your code again until the service has been restarted, so it effectively forces a restart.
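A safer cleanup, as a minimal sketch: drop the Python references to the model and tensors, then collect and release the cached CUDA memory.

import gc
import torch

# After `del model` (and any other large objects), reclaim the memory
gc.collect()
torch.cuda.empty_cache()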
It doesn't work for me either.
It seems like nothing's changed since Nov 2023. Getting the same error message:
"""_ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set load_in_8bit_fp32_cpu_offload=True
and pass a custom device_map
to from_pretrained
. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details._"""
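For reference, the custom device_map the message refers to would look roughly like this (a minimal sketch; the flag lives on BitsAndBytesConfig as llm_int8_enable_fp32_cpu_offload in recent versions, and the module names below are purely illustrative):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,  # keep the offloaded modules in fp32
)

# Illustrative mapping: most of the model on GPU 0, the rest on CPU
device_map = {
    "model.embed_tokens": 0,
    "model.layers": 0,
    "model.norm": "cpu",
    "lm_head": "cpu",
}

model = AutoModelForCausalLM.from_pretrained(
    "your/model-id",  # placeholder
    quantization_config=bnb_config,
    device_map=device_map,
)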
Trying to load nvidia/NVLM-D-72B locally, downloaded from Hugging Face, running the following code:
import torch
from transformers import AutoModel, AutoTokenizer

path = "nvidia/NVLM-D-72B"
device_map = "auto"
model = AutoModel.from_pretrained(
    path,
    load_in_8bit=True,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=False,
    trust_remote_code=True,
    device_map=device_map,
).eval()
Specs: bitsandbytes 0.43.3, Ubuntu 24.04.1 LTS, 128 GB DDR5 RAM, 1x RTX 4090, Intel 14900K.
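For context, a rough back-of-the-envelope estimate of why this doesn't fit on the single 4090 even in 8-bit (weights only, assuming ~72B parameters at 1 byte each; activations and KV cache would add more):

params_billion = 72          # NVLM-D-72B
bytes_per_param_int8 = 1     # 8-bit quantized weights
weights_gb = params_billion * bytes_per_param_int8   # ~72 GB of weights
gpu_vram_gb = 24             # single RTX 4090
print(f"~{weights_gb} GB of 8-bit weights vs {gpu_vram_gb} GB of VRAM")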
Any help would be greatly appreciated.