bitsandbytes-foundation / bitsandbytes

Accessible large language models via k-bit quantization for PyTorch.
https://huggingface.co/docs/bitsandbytes/main/en/index
MIT License

Using nerdy rodent's dreamlab training, I have error on training about cuda. #52

Closed 311-code closed 1 year ago

311-code commented 1 year ago

I followed Nerdy Rodent's dreamlab local install video step by step, and at the end bitsandbytes gives an error. I tried reloading all the CUDA components and also tried the new CUDA 11.8, which differs from the version in the video, but I still get the same error:

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link

/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cuda_setup/paths.py:86: UserWarning: /home/user/anaconda3/envs/diffusers did not contain libcudart.so as expected! Searching further paths...
  warn(
/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cuda_setup/paths.py:20: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('CompVis/stable-diffusion-v1-4')}
  warn(
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: WARNING! libcuda.so not found! Do you have a CUDA driver installed? If you are on a cluster, make sure you are on a CUDA machine!
Traceback (most recent call last):
  File "/home/user/github/diffusers/examples/dreambooth/train_dreambooth.py", line 657, in <module>
    main()
  File "/home/user/github/diffusers/examples/dreambooth/train_dreambooth.py", line 446, in main
    import bitsandbytes as bnb
  File "/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/__init__.py", line 6, in <module>
    from .autograd._functions import (
  File "/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 5, in <module>
    import bitsandbytes.functional as F
  File "/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/functional.py", line 13, in <module>
    from .cextension import COMPILED_WITH_CUDA, lib
  File "/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cextension.py", line 41, in <module>
    lib = CUDALibrary_Singleton.get_instance().lib
  File "/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cextension.py", line 37, in get_instance
    cls._instance.initialize()
  File "/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cextension.py", line 15, in initialize
    binary_name = evaluate_cuda_setup()
  File "/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py", line 132, in evaluate_cuda_setup
    cc = get_compute_capability(cuda)
  File "/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py", line 105, in get_compute_capability
    ccs = get_compute_capabilities(cuda)
  File "/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py", line 83, in get_compute_capabilities
    check_cuda_result(cuda, cuda.cuDeviceGetCount(ctypes.byref(nGpus)))
AttributeError: 'NoneType' object has no attribute 'cuDeviceGetCount'
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/diffusers/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/commands/launch.py", line 837, in launch_command
    simple_launcher(args)
  File "/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/commands/launch.py", line 354, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/user/anaconda3/envs/diffusers/bin/python', 'train_dreambooth.py', '--pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4', '--use_auth_token', '--instance_data_dir=training', '--output_dir=classes', '--instance_prompt=A sks dog', '--resolution=512', '--center_crop', '--train_batch_size=1', '--mixed_precision=no', '--use_8bit_adam', '--gradient_accumulation_steps=1', '--learning_rate=5e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--sample_batch_size=4', '--max_train_steps=800']' returned non-zero exit status 1.

Thomas-MMJ commented 1 year ago

CUDA 11.8 isn't currently supported. You might try an older CUDA library version; I'd go with 11.6 or earlier.

ZeroCool22 commented 1 year ago

[image: Screenshot_5]

Same error, and I'm on 11.7:

[image: Screenshot_6]

GPU: 1080 Ti

How do I downgrade to 11.6? Do I just copy these commands:

[image: Screenshot_8]

and it will downgrade, or do I need to uninstall Ubuntu and start all over again?

Or do I need to delete everything CUDA-related with these commands?

Even with those commands, the issue wasn't solved.
In the end, the fastest fix on two machines that use a package manager was to purge all NVIDIA and CUDA packages, which I did with:

sudo apt-get remove --purge '^nvidia-.*'
sudo apt-get remove --purge '^libnvidia-.*'
sudo apt-get remove --purge '^cuda-.*'

ZeroCool22 commented 1 year ago

@brentjohnston

What GPU do you have, and what did you select in accelerate config when it asked [NO/fp16/bf16]?

P.S.: I tried different selections, but nothing changed.

ZeroCool22 commented 1 year ago

Can confirm that it works with CUDA 11.6, at least with a 1080 Ti.

[image: Screenshot_9]

WSL + Ubuntu DB working with CUDA 11.6!

Nerdy Rodent's guide uses 11.7 in the Pastebin, and the video shows 11.8, so neither of them will work. Anyone following that part will never get it working.

nerdyrodent commented 1 year ago

In the video, the pastebin, and on my system I use CUDA 11.7.1. Typically, Nvidia released an update the day after ;) You'll need to ensure your MS Windows system is up to date as well. If you have old Nvidia drivers in MS Windows, you may need to downgrade CUDA.

Where it says "CUDA SETUP: WARNING! libcuda.so not found! Do you have a CUDA driver installed? If you are on a cluster, make sure you are on a CUDA machine!", you need to reboot or add the line as stated in the video and shown in the pastebin file:

export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH
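
The AttributeError in the original trace ('NoneType' object has no attribute 'cuDeviceGetCount') is what you would expect when the driver library handle was never loaded. Below is a minimal sketch of that check, assuming only Python's standard ctypes module and the CUDA driver API functions cuInit and cuDeviceGetCount. The helper names load_cuda_driver and count_gpus are hypothetical, and this is not bitsandbytes' actual code.

```python
import ctypes


def load_cuda_driver(names=("libcuda.so", "libcuda.so.1")):
    """Return a ctypes handle to the CUDA driver library, or None if not found."""
    for name in names:
        try:
            return ctypes.CDLL(name)
        except OSError:
            # The dynamic loader could not find or open this name; try the next.
            continue
    return None


def count_gpus(cuda):
    """Ask the driver how many GPUs it sees, failing loudly if it is missing."""
    if cuda is None:
        raise RuntimeError(
            "libcuda.so could not be loaded; check that the driver is installed "
            "and that its directory is on LD_LIBRARY_PATH"
        )
    if cuda.cuInit(0) != 0:  # 0 is CUDA_SUCCESS
        raise RuntimeError("cuInit failed")
    n_gpus = ctypes.c_int()
    if cuda.cuDeviceGetCount(ctypes.byref(n_gpus)) != 0:
        raise RuntimeError("cuDeviceGetCount failed")
    return n_gpus.value
```

Calling count_gpus(load_cuda_driver()) in the same shell you train from should either return a GPU count or raise a readable error instead of the AttributeError above.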

ZeroCool22 commented 1 year ago

export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH

Correct, this was the main cause, not the CUDA version.

The line export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH needs to be in the config of the train file.

Even if you reboot, it will still not find CUDA if that line is not added.

But in your video you say "reboot or add this line", so people take that to mean that if they restart, they don't need to add the line. The line must be added permanently to the config.
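
To check whether the directory is actually on the search path and contains the driver library, something like the following could help. It is a rough sketch using only the standard library; search_dirs and find_libcuda are hypothetical helper names, not part of bitsandbytes.

```python
import os
from pathlib import Path


def search_dirs(env=None):
    """Split LD_LIBRARY_PATH into a list of directories (empty entries dropped)."""
    env = os.environ if env is None else env
    raw = env.get("LD_LIBRARY_PATH", "")
    return [p for p in raw.split(":") if p]


def find_libcuda(dirs):
    """Return paths of libcuda.so* files found directly inside the given directories."""
    hits = []
    for d in dirs:
        folder = Path(d)
        if folder.is_dir():
            # Matches libcuda.so and libcuda.so.1, but not libcudart.so.
            hits.extend(sorted(str(p) for p in folder.glob("libcuda.so*")))
    return hits
```

If find_libcuda(search_dirs()) comes back empty in the shell you launch training from, the export line has probably not taken effect in that shell.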

TimDettmers commented 1 year ago

This is super helpful — thank you, everyone! I will add CUDA 11.8 as soon as possible!

TimDettmers commented 1 year ago

CUDA 11.8 was added in the latest release. I also added code that gives some compilation and debugging instructions if the CUDA setup fails.

Spaceisprettybig commented 1 year ago

Sorry to bother, but for us tech newbies, how does one do that?

ZeroCool22 commented 1 year ago

In your train file:

export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH
export MODEL_NAME="darkstorm2150/Protogen_x3.4_Official_Release"
export INSTANCE_DIR="training"
export OUTPUT_DIR="my_model"

accelerate launch train_dreambooth.py \
  --pretrained_vae_name_or_path="stabilityai/sd-vae-ft-mse" \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --train_text_encoder \
  --instance_prompt="laarretaa" \
  --resolution=512 \
  --train_batch_size=1 \
  --learning_rate=1e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --gradient_accumulation_steps=2 --gradient_checkpointing \
  --use_8bit_adam \
  --save_interval=500 \
  --max_train_steps=4500
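
If you launch training from a Python wrapper instead of a shell script, the same idea can be sketched as below. On Linux with glibc the dynamic loader reads LD_LIBRARY_PATH when a process starts, so the variable has to be set in the child's environment rather than changed after Python is already running. The names env_with_wsl_libs and launch_training are hypothetical, and the argument list is left for you to fill in from the script above.

```python
import os
import subprocess

WSL_LIB_DIR = "/usr/lib/wsl/lib"  # directory named in the export line above


def env_with_wsl_libs(base_env=None, lib_dir=WSL_LIB_DIR):
    """Return a copy of the environment with lib_dir prepended to LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    parts = [p for p in env.get("LD_LIBRARY_PATH", "").split(":") if p]
    if lib_dir not in parts:
        parts.insert(0, lib_dir)
    env["LD_LIBRARY_PATH"] = ":".join(parts)
    return env


def launch_training(args, runner=subprocess.run):
    """Run `accelerate launch train_dreambooth.py <args>` with the patched environment."""
    cmd = ["accelerate", "launch", "train_dreambooth.py", *args]
    return runner(cmd, env=env_with_wsl_libs(), check=True)
```

For example, launch_training(["--use_8bit_adam", "--max_train_steps=4500"]) would start the run with the WSL library directory first on the search path.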

MikkoHaavisto commented 1 year ago

I have this issue following Nerdy Rodent's guide for oobabooga's text-generation-webui with the one-click installer on a GTX 1080 Ti in Windows. bitsandbytes cannot find CUDA. What is the solution there? Can I add that line somewhere?

adamsanders commented 1 year ago

See this post https://github.com/oobabooga/text-generation-webui/issues/20#issuecomment-1411650652 :)

caizhuoyue77 commented 6 months ago

Hi, I got the same error, but I don't have the folder "/usr/lib/wsl". Could you tell me what the problem might be? Much appreciated!
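
One possible explanation, offered as an assumption rather than a diagnosis: /usr/lib/wsl/lib is where WSL exposes the Windows driver libraries, so the folder would normally be absent on a native Linux install, where libcuda.so comes with the NVIDIA driver package instead. Below is a rough sketch for telling the two cases apart, using the common heuristic that WSL kernels report "microsoft" in /proc/version. The helpers looks_like_wsl and describe_machine are hypothetical.

```python
from pathlib import Path


def looks_like_wsl(version_text):
    """Heuristic: WSL kernels usually mention 'microsoft' in /proc/version."""
    return "microsoft" in version_text.lower()


def describe_machine(proc_version="/proc/version", wsl_lib="/usr/lib/wsl/lib"):
    """Summarise whether this looks like WSL and whether the WSL lib folder exists."""
    path = Path(proc_version)
    text = path.read_text() if path.is_file() else ""
    return {"wsl": looks_like_wsl(text), "wsl_lib_present": Path(wsl_lib).is_dir()}
```

If describe_machine() reports that this is not WSL, the export line from this thread likely does not apply, and the thing to look for is where the installed NVIDIA driver put libcuda.so.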