Open LIQUIDMIND111 opened 2 months ago
CUDA SETUP always fails.
Same issue for me. Looks like the owner either doesn't know how to fix this or isn't fussed anymore.
It's working now, but only after I changed the Google runtime from L4 to T4; yesterday I used an A100 with no issues. Maybe the error is on both sides: the Google GPU and the Colab page?
@TheLastBen I found the glitch: with an L4 GPU you get a CUDA SETUP error, while on A100 and T4 you don't. The downside is that we are paying for Google credits or Pro and cannot use the faster GPUs, because the A100 is not always available and costs 11.30 credits per hour, compared to 4 credits per hour for the L4. So when the A100 is not available we end up paying only for more time instead of a faster GPU, since the L4 gives the CUDA error.
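For anyone who wants to check which GPU the runtime handed out before starting a training run, here is a minimal sketch. The "failing" and "working" sets come only from the reports in this thread, not from any official compatibility list, and `classify_gpu` and `current_gpu_name` are hypothetical helper names. The exact name strings that `nvidia-smi` prints for each card are also an assumption.

```python
import shutil
import subprocess

# Assumption: these sets reflect only what was reported in this thread.
REPORTED_FAILING = {"L4"}
REPORTED_WORKING = {"T4", "A100"}


def classify_gpu(name: str) -> str:
    """Return 'failing', 'working', or 'unknown' for a GPU name string."""
    # Split on spaces and hyphens so a name like "A100-SXM4-40GB" yields "A100".
    tokens = set(name.upper().replace("-", " ").split())
    if tokens & REPORTED_FAILING:
        return "failing"
    if tokens & REPORTED_WORKING:
        return "working"
    return "unknown"


def current_gpu_name():
    """Ask nvidia-smi for the first GPU's name; return None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    lines = result.stdout.strip().splitlines()
    if result.returncode != 0 or not lines:
        return None
    return lines[0].strip()


if __name__ == "__main__":
    name = current_gpu_name()
    if name is None:
        print("No GPU visible to nvidia-smi")
    else:
        print(f"{name}: {classify_gpu(name)} (per reports in this thread)")
```

Running this in the first cell would at least let you switch runtime before spending credits on a session that may hit the CUDA SETUP error.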
Are you aware of this issue?
I'm aware, I'll try to find a fix
I get a good model for a day or two, then on the next training run I get this:
Traceback (most recent call last):
  File "/content/diffusers/examples/dreambooth/train_dreambooth.py", line 803, in <module>
    main()
  File "/content/diffusers/examples/dreambooth/train_dreambooth.py", line 535, in main
    import bitsandbytes as bnb
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/__init__.py", line 6, in <module>
    from .autograd._functions import (
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 5, in <module>
    import bitsandbytes.functional as F
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/functional.py", line 13, in <module>
    from .cextension import COMPILED_WITH_CUDA, lib
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/cextension.py", line 41, in <module>
    lib = CUDALibrary_Singleton.get_instance().lib
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/cextension.py", line 37, in get_instance
    cls._instance.initialize()
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/cextension.py", line 27, in initialize
    raise Exception('CUDA SETUP: Setup Failed!')
Exception: CUDA SETUP: Setup Failed!
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 837, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 354, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '/content/diffusers/examples/dreambooth/train_dreambooth.py', '--image_captions_filename', '--train_only_unet', '--save_starting_step=500', '--save_n_steps=0', '--Session_dir=/content/gdrive/MyDrive/Fast-Dreambooth/Sessions/NicoleTEST768-TEXT4NXI', '--pretrained_model_name_or_path=/content/stable-diffusion-v1-5', '--instance_data_dir=/content/gdrive/MyDrive/Fast-Dreambooth/Sessions/NicoleTEST768-TEXT4NXI/instance_images', '--output_dir=/content/models/NicoleTEST768-TEXT4NXI', '--captions_dir=/content/gdrive/MyDrive/Fast-Dreambooth/Sessions/NicoleTEST768-TEXT4NXI/captions', '--instance_prompt=', '--seed=869457', '--resolution=768', '--mixed_precision=fp16', '--train_batch_size=1', '--gradient_accumulation_steps=1', '--use_8bit_adam', '--learning_rate=2e-06', '--lr_scheduler=linear', '--lr_warmup_steps=0', '--max_train_steps=1500']' returned non-zero exit status 1.
Something went wrong
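The traceback shows the failure happens at `import bitsandbytes as bnb`, and the command was launched with `--use_8bit_adam`. One possible workaround, sketched below, is to probe the import first and fall back to a regular optimizer when it fails. This is only an illustration: `choose_optimizer` is a hypothetical helper, and nothing in this thread confirms that the notebook's training script runs correctly, or fits in GPU memory, without `--use_8bit_adam` (the standard AdamW keeps larger optimizer state than the 8-bit variant).

```python
import importlib


def choose_optimizer(module_name: str = "bitsandbytes") -> str:
    """Return 'AdamW8bit' if the module imports cleanly, else 'AdamW'."""
    try:
        importlib.import_module(module_name)
    except Exception as exc:
        # The traceback above raises a plain Exception, not ImportError,
        # so a broad except is needed to catch the CUDA setup failure.
        print(f"{module_name} unavailable ({exc}); falling back to AdamW")
        return "AdamW"
    return "AdamW8bit"


if __name__ == "__main__":
    # The result could be used to decide whether to pass --use_8bit_adam.
    print(choose_optimizer())
```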