ShivamShrirao / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch
https://huggingface.co/docs/diffusers
Apache License 2.0

Colab dreambooth notebook fail #252

Open andrewssdd opened 9 months ago

andrewssdd commented 9 months ago

Describe the bug

The Dreambooth Colab notebook fails at the training stage. Seems to be an issue with bitsandbytes.

Reproduction

Run the Dreambooth Colab notebook. It fails at training.

https://colab.research.google.com/github/ShivamShrirao/diffusers/blob/main/examples/dreambooth/DreamBooth_Stable_Diffusion.ipynb

Exception: CUDA SETUP: Setup Failed!

Logs

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
================================================================================
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/paths.py:105: UserWarning: /usr/lib64-nvidia did not contain libcudart.so as expected! Searching further paths...
  warn(
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/sys/fs/cgroup/memory.events /var/colab/cgroup/jupyter-children/memory.events')}
  warn(
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('8013'), PosixPath('//172.28.0.1'), PosixPath('http')}
  warn(
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('//colab.research.google.com/tun/m/cc48301118ce562b961b3c22d803539adc1e0c19/gpu-t4-s-3a2kk3hhilbsk --tunnel_background_save_delay=10s --tunnel_periodic_background_save_frequency=30m0s --enable_output_coalescing=true --output_coalescing_required=true'), PosixPath('--logtostderr --listen_host=172.28.0.12 --target_host=172.28.0.12 --tunnel_background_save_url=https')}
  warn(
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/datalab/web/pyright/typeshed-fallback/stdlib,/usr/local/lib/python3.10/dist-packages')}
  warn(
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/env/python')}
  warn(
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('//ipykernel.pylab.backend_inline'), PosixPath('module')}
  warn(
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 122
CUDA SETUP: TODO: compile library for specific version: libbitsandbytes_cuda122.so
CUDA SETUP: Defaulting to libbitsandbytes.so...
CUDA SETUP: CUDA detection failed. Either CUDA driver not installed, CUDA not installed, or you have multiple conflicting CUDA libraries!
CUDA SETUP: If you compiled from source, try again with `make CUDA_VERSION=DETECTED_CUDA_VERSION` for example, `make CUDA_VERSION=113`.
Traceback (most recent call last):
  File "/content/train_dreambooth.py", line 869, in <module>
    main(args)
  File "/content/train_dreambooth.py", line 571, in main
    import bitsandbytes as bnb
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/__init__.py", line 6, in <module>
    from .autograd._functions import (
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 5, in <module>
    import bitsandbytes.functional as F
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/functional.py", line 13, in <module>
    from .cextension import COMPILED_WITH_CUDA, lib
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/cextension.py", line 41, in <module>
    lib = CUDALibrary_Singleton.get_instance().lib
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/cextension.py", line 37, in get_instance
    cls._instance.initialize()
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/cextension.py", line 27, in initialize
    raise Exception('CUDA SETUP: Setup Failed!')
Exception: CUDA SETUP: Setup Failed!

System Info

Google Colab
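
For anyone debugging this, a quick way to see which CUDA runtime the Colab VM actually exposes is the small sketch below (my own diagnostic, not part of the notebook); the runtime version PyTorch reports usually matches what bitsandbytes detects in the log above.

# Minimal diagnostic sketch (mine, not from the notebook): check which CUDA
# runtime PyTorch sees; e.g. "12.2" corresponds to the libbitsandbytes_cuda122.so
# that bitsandbytes says it cannot find.
import torch

print("torch version :", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA runtime  :", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU           :", torch.cuda.get_device_name(0))        # Colab T4
    print("Compute cap.  :", torch.cuda.get_device_capability(0))  # T4 reports (7, 5)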

andrewssdd commented 9 months ago

Here's a requirements cell that works:

!wget -q https://github.com/ShivamShrirao/diffusers/raw/main/examples/dreambooth/train_dreambooth.py
!wget -q https://github.com/ShivamShrirao/diffusers/raw/main/scripts/convert_diffusers_to_original_stable_diffusion.py
%pip install git+https://github.com/ShivamShrirao/diffusers
%pip install -U --pre triton
%pip install transformers ftfy bitsandbytes gradio natsort safetensors xformers torch==2.1.0+cu121 accelerate
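
After the installs finish, a rough sanity check (my own addition, not a cell from the notebook) is to import bitsandbytes directly, since the import is what runs the CUDA setup that fails above:

# Sanity check (mine, not from the notebook): a clean import means bitsandbytes
# completed its CUDA setup, so training should get past "CUDA SETUP: Setup Failed!".
import torch
import bitsandbytes as bnb

print("bitsandbytes:", bnb.__version__)
print("torch CUDA  :", torch.version.cuda)
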
VladAdushev commented 9 months ago

Here's a requirements cell that works:

It returns an error.

TragicXxBoNeSxX commented 9 months ago

Still returns "Exception: CUDA SETUP: Setup Failed!"

vadar007 commented 9 months ago

I was able to execute it successfully with the modified code. However, when attempting to use the generated model with Stable Diffusion, I got the following error:

Error verifying pickled file from D:\l......\.ckpt ** The file may be malicious, so the program is not going to read it. You can skip this check with --disable-safe-unpickle commandline argument.

Adding the recommended command-line argument allowed Stable Diffusion to use the model.

TragicXxBoNeSxX commented 9 months ago

Changing this line: %pip install transformers ftfy bitsandbytes gradio natsort safetensors xformers torch==2.1.0+cu121 accelerate

To this: %pip install transformers ftfy bitsandbytes gradio natsort safetensors xformers torch==2.1.0+cu121 accelerate kaleido cohere openai tiktoken

Got it working for me again.

andrewssdd commented 9 months ago

I was able to execute it successfully with the modified code. However, when attempting to use the generated model with Stable Diffusion, I got the following error:

Error verifying pickled file from D:\l.......ckpt ** The file may be malicious, so the program is not going to read it. You can skip this check with --disable-safe-unpickle commandline argument.

Adding the recommended command-line argument allowed Stable Diffusion to use the model.

You need to save the checkpoint as safetensors.
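
If you already have a .ckpt and just want to load it without --disable-safe-unpickle, a rough conversion sketch is below. This is my own workaround, not something from the notebook, and it assumes a standard Stable Diffusion checkpoint whose weights sit under a "state_dict" key; the paths are placeholders.

# Rough sketch (mine): convert a pickled .ckpt into .safetensors so the web UI
# will load it without --disable-safe-unpickle. Assumes a standard SD checkpoint
# with its weights under "state_dict"; the paths below are placeholders.
import torch
from safetensors.torch import save_file

ckpt_path = "model.ckpt"        # placeholder
out_path = "model.safetensors"  # placeholder

state = torch.load(ckpt_path, map_location="cpu")
state_dict = state.get("state_dict", state)

# safetensors stores only tensors, so drop anything else and make them contiguous.
tensors = {k: v.contiguous() for k, v in state_dict.items() if isinstance(v, torch.Tensor)}
save_file(tensors, out_path)
print(f"wrote {len(tensors)} tensors to {out_path}")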

Al-Rien commented 9 months ago

To this: %pip install transformers ftfy bitsandbytes gradio natsort safetensors xformers torch==2.1.0+cu121 accelerate kaleido cohere openai tiktoken

Still returns CUDA SETUP: Setup Failed.

TianyiPeng commented 9 months ago

Does anyone get it to work now?

domingosl commented 9 months ago

Same issue here after trying all the suggestions: CUDA SETUP: Setup Failed.

abc123desygn commented 9 months ago

I have the same issue. Can you please fix it?

tibor commented 9 months ago

I'm not sure, but I think this is a problem with the bitsandbytes library. I have opened a ticket here: https://github.com/TimDettmers/bitsandbytes/issues/950

VladAdushev commented 8 months ago

Has anyone managed to get it running?

deveshruttala commented 8 months ago

Same issue for me.

tibor commented 8 months ago

Yeah, it works with the fix at https://github.com/TimDettmers/bitsandbytes/issues/950

Kategus commented 8 months ago

Good day. It's broken again. If someone has a working version, please share it.

jackiter commented 7 months ago

Good day. It's broken again. If someone has a working version, please share it.

Yes, please.

chchchadzilla commented 7 months ago

I'd even settle for someone just explaining to me why it's broken so I can try to fix it myself. I've gotten it to work several times by installing different versions of torch with CUDA, plus xformers, triton, torchtext, torchaudio, torchvision, and torchdata, and also by installing kaleido, pycairo, tiktoken, and openai. The problem is I was just throwing shit at a wall and hoping it stuck, since I fundamentally don't understand what's happening and just happened to hit pay dirt, so replicating it has proven difficult. Impossible, actually, in the last week specifically. Not sure if another update screwed the pooch on another module, but it's frustrating.

I've tried to learn kohya_ss and I'm very, very bad at it; regardless of following tutorials, it never works, or maybe I'm just stupid. Either way, there's no user-friendly (in the loosest sense of the word) choice except this Shivam colab, which in and of itself took an ungodly amount of trial and error to get to where it makes sense to me and gives me good results. Now, though, it seems like no one gives a crap because it's outdated technology, and with LoRA training, Stable Cascade, and Stable Diffusion 3 right around the corner for a public release, I'm afraid we won't see a fix.

It's just frustrating, as someone who does this as a hobbyist and not professionally, that all the talk about it doesn't produce clearly defined, easy-to-follow solutions. It's all assumptive and predicated on you already knowing what everyone is talking about, not the step-by-step, idiot-proof kind of guide that I feel so many of us need to get good results but are too embarrassed to ask for, because we feel like we'll be made fun of or reprimanded somehow for asking stupid questions. The whole thing is elitist, and it doesn't do anyone any good. It turns regular quasi-nerds like me off from diving into this world head first, and you never know: you could be turning away the next visionary who would have written code or developed something that changed the game. That's a long shot, but I think you get my point.

The solutions that are out there are half-assed, written with the assumption that you're already a Python developer, and we're not. We're regular dudes who use this for fun and hobbies; some of us used to make money training models for people, or used it to get work done for our day jobs. And look, I get it: it's forced evolution, right? Figure it out or stop complaining and stop using it. But for something that seems like it should be so damn easy to fix, I just don't understand why no one even wants to try to help. It's disheartening.

Sorry for the rant; tonight has been really frustrating and I'm no closer to getting pending work done. I've got new characters to train into a model that'll let me finish a comic book series I've had to back-burner for the last three months because of this, and I promise that if someone helps me fix it I'll never use the damn software again and stop bugging everyone. Thanks.

Kategus commented 7 months ago

Bravo, great speech. But I'm afraid it won't bear fruit. Personally, I don't think this is outdated technology; alongside LoRA, DreamBooth still gives very good results in terms of likeness, and so far I haven't seen the same likeness from the newer approaches. It seems the maintainer just got cut off from the Internet or was taken into the army :)

Olivier-aka-Raiden commented 6 months ago

If anyone is interested, I successfully trained my model by installing the requirements as follows:

%pip install transformers ftfy bitsandbytes gradio natsort safetensors xformers torch==2.2.1 accelerate kaleido cohere openai tiktoken
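
One thing worth checking after an install like this (my guess at why some pins work and others stop working, not something I have confirmed) is that the xformers wheel was built against the torch release that actually got installed:

# Quick check (my own addition): xformers wheels target a specific torch build,
# so a mismatch between these two versions is a common reason a pin breaks later.
import torch
import xformers

print("torch   :", torch.__version__)
print("xformers:", xformers.__version__)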

Kategus commented 6 months ago

If anyone is interested, I successfully trained my model by installing the requirements as follows: %pip install transformers ftfy bitsandbytes gradio natsort safetensors xformers torch==2.2.1 accelerate kaleido cohere openai tiktoken

Thanks for the hint! But this option did not last long; it's giving an error again. Maybe some expert will come along and fix it?

Baconwrappedfriedpickles commented 5 months ago

Thanks for the hint! But this option did not last long; it's giving an error again. Maybe some expert will come along and fix it?

Someone on another site suggested adding this to the requirements and it's working for me. Hope it helps.

%pip install "jax[cuda12_local]==0.4.23" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
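
I'm honestly not sure why this helps; my guess (an assumption on my part, not verified) is that the jax extra drags in CUDA-12-compatible packages that bitsandbytes can then pick up. If you want to see what that install actually changed, something like this lists the CUDA-related packages in the environment:

# My own debugging helper, not part of the suggestion above: list installed
# packages whose names mention cuda or nvidia, to see what the jax extra pulled in.
from importlib import metadata

for dist in metadata.distributions():
    name = (dist.metadata["Name"] or "").lower()
    if "cuda" in name or "nvidia" in name:
        print(dist.metadata["Name"], dist.version)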