TheLastBen / fast-stable-diffusion

fast-stable-diffusion + DreamBooth
MIT License
7.42k stars 1.28k forks

Google Colab works well for 2 days, then breaks. Why? #2827

Open LIQUIDMIND111 opened 2 months ago

LIQUIDMIND111 commented 2 months ago

I get a good model for a day or two, then on the next training run I get this:

Traceback (most recent call last):
  File "/content/diffusers/examples/dreambooth/train_dreambooth.py", line 803, in <module>
    main()
  File "/content/diffusers/examples/dreambooth/train_dreambooth.py", line 535, in main
    import bitsandbytes as bnb
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/__init__.py", line 6, in <module>
    from .autograd._functions import (
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 5, in <module>
    import bitsandbytes.functional as F
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/functional.py", line 13, in <module>
    from .cextension import COMPILED_WITH_CUDA, lib
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/cextension.py", line 41, in <module>
    lib = CUDALibrary_Singleton.get_instance().lib
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/cextension.py", line 37, in get_instance
    cls._instance.initialize()
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/cextension.py", line 27, in initialize
    raise Exception('CUDA SETUP: Setup Failed!')
Exception: CUDA SETUP: Setup Failed!

Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 837, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 354, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '/content/diffusers/examples/dreambooth/train_dreambooth.py', '--image_captions_filename', '--train_only_unet', '--save_starting_step=500', '--save_n_steps=0', '--Session_dir=/content/gdrive/MyDrive/Fast-Dreambooth/Sessions/NicoleTEST768-TEXT4NXI', '--pretrained_model_name_or_path=/content/stable-diffusion-v1-5', '--instance_data_dir=/content/gdrive/MyDrive/Fast-Dreambooth/Sessions/NicoleTEST768-TEXT4NXI/instance_images', '--output_dir=/content/models/NicoleTEST768-TEXT4NXI', '--captions_dir=/content/gdrive/MyDrive/Fast-Dreambooth/Sessions/NicoleTEST768-TEXT4NXI/captions', '--instance_prompt=', '--seed=869457', '--resolution=768', '--mixed_precision=fp16', '--train_batch_size=1', '--gradient_accumulation_steps=1', '--use_8bit_adam', '--learning_rate=2e-06', '--lr_scheduler=linear', '--lr_warmup_steps=0', '--max_train_steps=1500']' returned non-zero exit status 1.
Something went wrong
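
The first traceback fails at `import bitsandbytes`, before any training step runs, so the problem can probably be reproduced in a separate Colab cell without launching the full training command. A minimal sketch, assuming the same runtime as the failing session; the helper name `try_import` is hypothetical and not part of the notebook:

```python
import importlib


def try_import(module_name):
    """Try to import a module and report whether it succeeded.

    Hypothetical helper. It catches Exception broadly because, according to
    the traceback above, bitsandbytes raises a bare Exception
    ('CUDA SETUP: Setup Failed!') rather than an ImportError.
    """
    try:
        importlib.import_module(module_name)
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, "ok"


if __name__ == "__main__":
    ok, detail = try_import("bitsandbytes")
    print("bitsandbytes import:", "OK" if ok else "FAILED", "-", detail)
```

If this cell reports a failure on one GPU type and succeeds on another, that would point at the bitsandbytes CUDA setup rather than at the training script itself.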

LIQUIDMIND111 commented 2 months ago

The CUDA setup always fails.

LIQUIDMIND111 commented 2 months ago

> Same issue for me. Looks like the owner either doesn't know how to fix this or isn't fussed anymore

It's working now, but only after I changed the Google runtime from L4 to T4. Yesterday I used an A100 with no issues. Maybe it's an error on both sides: the Google GPU and the Colab page?
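
Since the result seems to depend on which GPU the runtime was assigned, it may help to check the GPU before starting a training run. A rough sketch, assuming `nvidia-smi` is on the PATH as it normally is on Colab GPU runtimes; the function names and the `REPORTED_FAILING` list are hypothetical and reflect only what is reported in this thread:

```python
import shutil
import subprocess

# GPU models reported in this thread as failing with "CUDA SETUP: Setup Failed!".
# Assumption: this reflects user reports only, not an official compatibility list.
REPORTED_FAILING = ("L4",)


def assigned_gpu_name():
    """Return the name of the first GPU listed by nvidia-smi, or None."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    lines = result.stdout.strip().splitlines()
    return lines[0].strip() if lines else None


def is_reported_failing(gpu_name):
    """True if a whole token of the GPU name matches a reported-failing model."""
    tokens = gpu_name.replace("-", " ").split()
    return any(model in tokens for model in REPORTED_FAILING)


if __name__ == "__main__":
    name = assigned_gpu_name()
    if name is None:
        print("No GPU detected (or nvidia-smi unavailable).")
    elif is_reported_failing(name):
        print(f"{name}: reported in this thread to fail CUDA setup.")
    else:
        print(f"{name}: no failure reported in this thread.")
```

Matching on whole tokens keeps a name such as "NVIDIA L40S" from being mistaken for an L4.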

LIQUIDMIND111 commented 2 months ago

@TheLastBen I found the glitch: using an L4 GPU gives a CUDA SETUP error, while the A100 and T4 do not. The downside is that we are paying for Google credits or Pro and cannot use the faster GPUs, because the A100 is NOT always available and costs 11.30 credits PER HOUR, compared to 4 credits per hour for the L4. So in the end, when the A100 is not available, we pay ONLY for MORE TIME instead of FASTER GPUs, since the L4 gives the CUDA error.

Are you aware of this issue?
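
To put the credit figures quoted above in perspective, here is a small arithmetic sketch. The two hourly rates are the ones given in this comment; the function names are hypothetical, and no training times are assumed because none are reported in the thread:

```python
# Hourly rates as quoted in the comment above (Colab compute credits per hour).
A100_CREDITS_PER_HOUR = 11.30
L4_CREDITS_PER_HOUR = 4.0


def job_cost(credits_per_hour, hours):
    """Credits spent on a job that runs for the given number of hours."""
    return credits_per_hour * hours


def break_even_speedup(fast_rate, slow_rate):
    """How many times faster the pricier GPU must finish to cost the same per job."""
    return fast_rate / slow_rate


if __name__ == "__main__":
    ratio = break_even_speedup(A100_CREDITS_PER_HOUR, L4_CREDITS_PER_HOUR)
    print(f"A100 must be about {ratio:.3f}x faster than L4 to cost the same per job.")
```

At these rates the A100 costs about 2.8 times as much per hour as the L4, so losing the L4 option removes the cheaper of the two faster GPUs.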

TheLastBen commented 2 months ago

I'm aware, I'll try to find a fix