d8ahazard / sd_dreambooth_extension


Xformers Not found on PaperSpace #1041

Closed Maki9009 closed 1 year ago

Maki9009 commented 1 year ago

Kindly read the entire form below and fill it out with the requested information.

Please find the following lines in the console and paste them below. If you do not provide this information, your issue will be automatically closed.

Python 3.9.12 (main, Apr 4 2022, 05:22:27) [MSC v.1916 64 bit (AMD64)]
Commit hash: 5be87ba63f62c228cf135425e21577f70c4e3351
Installing requirements for Web UI
Skipping dreambooth installation.
Dreambooth revision is
The latest Diffusers version is (I'd like to assume it's the latest also).
Torch version is torch-1.13.1+cu117.
Torch vision version is torchvision-0.14.1+cu117.

Have you read the Readme? yes

Have you completely restarted the stable-diffusion-webUI, not just reloaded the UI? yes

Have you updated Dreambooth to the latest revision? yes

Have you updated the Stable-Diffusion-WebUI to the latest version? yes

No, really. Please save us both some trouble and update the SD-WebUI and Extension and restart before posting this. Reply 'OK' Below to acknowledge that you did this.

Describe the bug

I'm trying to run the latest build of the Auto1111 Stable Diffusion web UI on PaperSpace with the latest version of this Dreambooth extension. Everything ran at first: it found the class images and was caching latents. Then it suddenly couldn't find xformers, even though xformers is already installed. I've run older commits of Dreambooth on PaperSpace with the same GPU and it worked, but the latest updates broke class image generation, so I updated. The class image issue seems to be fixed, but now it can't find xformers, so I don't know what's wrong.

Provide logs

Running training
 Num batches each epoch = 2
 Num Epochs = 150
 Batch Size Per Device = 13
 Gradient Accumulation steps = 1
 Total train batch size (w. parallel, distributed & accumulation) = 13
 Text Encoder Epochs: 112
 Total optimization steps = 1950
 Total training steps = 3900
 Resuming from checkpoint: False
 First resume epoch: 0
 First resume step: 0
 Lora: False, Optimizer: 8Bit Adam, Prec: bf16
 Gradient Checkpointing: False
 EMA: True
 UNET: True
 Freeze CLIP Normalization Layers: False
 LR: 2e-06
 V2: False
Steps: 0%| | 0/3900 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/storage/stable-diffusion/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/ui_functions.py", line 681, in start_training
    result = main(use_txt2img=use_txt2img)
  File "/storage/stable-diffusion/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/train_dreambooth.py", line 1148, in main
    return inner_loop()
  File "/storage/stable-diffusion/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/memory.py", line 123, in decorator
    return function(batch_size, grad_size, prof, *args, **kwargs)
  File "/storage/stable-diffusion/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/train_dreambooth.py", line 970, in inner_loop
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/utils/operations.py", line 489, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.9/dist-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/diffusers/models/unet_2d_condition.py", line 580, in forward
    sample, res_samples = downsample_block(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/diffusers/models/unet_2d_blocks.py", line 837, in forward
    hidden_states = attn(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/diffusers/models/transformer_2d.py", line 265, in forward
    hidden_states = block(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/diffusers/models/attention.py", line 291, in forward
    attn_output = self.attn1(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/diffusers/models/cross_attention.py", line 205, in forward
    return self.processor(
  File "/usr/local/lib/python3.9/dist-packages/diffusers/models/cross_attention.py", line 456, in __call__
    hidden_states = xformers.ops.memory_efficient_attention(
  File "/usr/local/lib/python3.9/dist-packages/xformers/ops/fmha/__init__.py", line 197, in memory_efficient_attention
    return _memory_efficient_attention(
  File "/usr/local/lib/python3.9/dist-packages/xformers/ops/fmha/__init__.py", line 298, in _memory_efficient_attention
    return _fMHA.apply(
  File "/usr/local/lib/python3.9/dist-packages/xformers/ops/fmha/__init__.py", line 43, in forward
    out, op_ctx = _memory_efficient_attention_forward_requires_grad(
  File "/usr/local/lib/python3.9/dist-packages/xformers/ops/fmha/__init__.py", line 326, in _memory_efficient_attention_forward_requires_grad
    out = op.apply(inp, needs_gradient=True)
  File "/usr/local/lib/python3.9/dist-packages/xformers/ops/fmha/flash.py", line 240, in apply
    out, softmax_lse = cls.OPERATOR(
  File "/usr/local/lib/python3.9/dist-packages/xformers/ops/common.py", line 13, in no_such_operator
    raise RuntimeError(
RuntimeError: No such operator xformers_flash::flash_fwd - did you forget to build xformers with `python setup.py develop`?
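A note on this error: it usually means the installed xformers wheel was built against a different Torch/CUDA combination than the one in the environment, so the compiled flash-attention operator is missing at call time even though `import xformers` succeeds; `python -m xformers.info` reports which operators the installed build actually provides. As a minimal illustration of why an import-level check is not enough (a hypothetical helper probed with stdlib modules, not xformers itself):

```python
import importlib.util

def module_importable(name: str) -> bool:
    # Hypothetical helper: True if the import system can locate `name`.
    # A completely missing xformers fails this probe, but a wheel built
    # for a mismatched Torch/CUDA stack (the error above) still imports
    # fine and only fails when the compiled operator is first called.
    return importlib.util.find_spec(name) is not None

print(module_importable("json"))                # True  (stdlib, present)
print(module_importable("no_such_module_xyz"))  # False (missing)
```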

Environment

What OS? Linux

If Windows - WSL or native?

What GPU are you using? A4000 16GB

Screenshots/Config If the issue is specific to an error while training, please provide a screenshot of training parameters or the db_config.json file from /models/dreambooth/MODELNAME/db_config.json

ArrowM commented 1 year ago

Thanks, can you please post this portion of your startup log: [image]

Maki9009 commented 1 year ago

> Thanks, can you please post this portion of your startup log: [image]

Dreambooth revision: 5be87ba63f62c228cf135425e21577f70c4e3351 SD-WebUI revision: 0cc0ee1bcb4c24a8c9715f66cede06601bfc00c8

Checking Dreambooth requirements...
Ignoring tensorflow-macos: markers 'sys_platform == "darwin" and platform_machine == "arm64"' don't match your environment
Ignoring mediapipe-silicon: markers 'sys_platform == "darwin"' don't match your environment
Collecting accelerate==0.16.0
  Downloading accelerate-0.16.0-py3-none-any.whl (199 kB)
Collecting bitsandbytes==0.35.4
  Downloading bitsandbytes-0.35.4-py3-none-any.whl (62.5 MB)
Collecting diffusers==0.13.1
  Downloading diffusers-0.13.1-py3-none-any.whl (716 kB)
Collecting gitpython~=3.1.31
  Downloading GitPython-3.1.31-py3-none-any.whl (184 kB)
Collecting mediapipe
  Downloading mediapipe-0.9.1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (33.0 MB)
Collecting transformers~=4.26.1
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting discord-webhook~=1.1.0
  Downloading discord_webhook-1.1.0-py3-none-any.whl (12 kB)
Collecting lion-pytorch~=0.0.7
  Downloading lion_pytorch-0.0.7-py3-none-any.whl (4.3 kB)
Collecting xformers==0.0.17.dev464
  Downloading xformers-0.0.17.dev464-cp39-cp39-manylinux2014_x86_64.whl (129.6 MB)
Collecting protobuf<3.20,>=3.9.2
  Downloading protobuf-3.19.6-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
Collecting pyre-extensions==0.0.23
  Downloading pyre_extensions-0.0.23-py3-none-any.whl (11 kB)
Collecting typing-inspect
  Downloading typing_inspect-0.8.0-py3-none-any.whl (8.7 kB)
Collecting attrs>=19.1.0
  Downloading attrs-22.2.0-py3-none-any.whl (60 kB)
Collecting opencv-contrib-python
  Downloading opencv_contrib_python-4.7.0.72-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (67.9 MB)
Collecting requests
  Downloading requests-2.28.2-py3-none-any.whl (62 kB)
Collecting mypy-extensions>=0.3.0
  Downloading mypy_extensions-1.0.0-py3-none-any.whl (4.7 kB)
Installing collected packages: bitsandbytes, requests, protobuf, opencv-contrib-python, mypy-extensions, attrs, typing-inspect, lion-pytorch, gitpython, discord-webhook, accelerate, transformers, pyre-extensions, mediapipe, diffusers, xformers
  Attempting uninstall: bitsandbytes
    Found existing installation: bitsandbytes 0.35.0
    Uninstalling bitsandbytes-0.35.0:
      Successfully uninstalled bitsandbytes-0.35.0
  Attempting uninstall: requests
    Found existing installation: requests 2.25.1
    Uninstalling requests-2.25.1:
      Successfully uninstalled requests-2.25.1
  Attempting uninstall: protobuf
    Found existing installation: protobuf 3.20.3
    Uninstalling protobuf-3.20.3:
      Successfully uninstalled protobuf-3.20.3
  Attempting uninstall: attrs
    Found existing installation: attrs 18.2.0
    Uninstalling attrs-18.2.0:
      Successfully uninstalled attrs-18.2.0
  Attempting uninstall: gitpython
    Found existing installation: GitPython 3.1.27
    Uninstalling GitPython-3.1.27:
      Successfully uninstalled GitPython-3.1.27
  Attempting uninstall: discord-webhook
    Found existing installation: discord-webhook 1.0.0
    Uninstalling discord-webhook-1.0.0:
      Successfully uninstalled discord-webhook-1.0.0
  Attempting uninstall: accelerate
    Found existing installation: accelerate 0.12.0
    Uninstalling accelerate-0.12.0:
      Successfully uninstalled accelerate-0.12.0
  Attempting uninstall: transformers
    Found existing installation: transformers 4.25.1
    Uninstalling transformers-4.25.1:
      Successfully uninstalled transformers-4.25.1
  Attempting uninstall: diffusers
    Found existing installation: diffusers 0.10.2
    Uninstalling diffusers-0.10.2:
      Successfully uninstalled diffusers-0.10.2
  Attempting uninstall: xformers
    Found existing installation: xformers 0.0.16+6f3c20f.d20230127
    Uninstalling xformers-0.0.16+6f3c20f.d20230127:
      Successfully uninstalled xformers-0.0.16+6f3c20f.d20230127
Successfully installed accelerate-0.16.0 attrs-22.2.0 bitsandbytes-0.35.4 diffusers-0.13.1 discord-webhook-1.1.0 gitpython-3.1.31 lion-pytorch-0.0.7 mediapipe-0.9.1.0 mypy-extensions-1.0.0 opencv-contrib-python-4.7.0.72 protobuf-3.19.6 pyre-extensions-0.0.23 requests-2.28.2 transformers-4.26.1 typing-inspect-0.8.0 xformers-0.0.17.dev464

[+] torch version 1.13.1+cu117 installed.
[+] torchvision version 0.14.1+cu117 installed.
[+] accelerate version 0.16.0 installed.
[+] bitsandbytes version 0.35.4 installed.
[+] diffusers version 0.13.1 installed.
[+] transformers version 4.26.1 installed.
[+] xformers version 0.0.17.dev464 installed.

Maki9009 commented 1 year ago

I've got it to train once I turned off xformers, but it took three loops of caching latents and running out of memory before it started training. Any way to get xformers working again?
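For context, the repeated "cache latents, run out of memory, retry" loops match the automatic batch-size backoff in the extension's memory.py (the traceback later in this thread shows its decorator raising "No executable batch size found, reached zero."). A minimal sketch of that retry pattern, with hypothetical names and `MemoryError` standing in for `torch.cuda.OutOfMemoryError` so it stays stdlib-only; this is not the extension's actual code:

```python
def find_executable_batch_size(train_step, starting_batch_size):
    # Retry the step with a halved batch size whenever it "runs out of
    # memory"; give up once the batch size reaches zero.
    batch_size = starting_batch_size
    while batch_size > 0:
        try:
            return batch_size, train_step(batch_size)
        except MemoryError:
            batch_size //= 2  # halve and retry, as the repeated loops suggest
    raise RuntimeError("No executable batch size found, reached zero.")

def fake_step(bs):
    # Simulated training step that only "fits in VRAM" at batch size <= 4.
    if bs > 4:
        raise MemoryError
    return "trained"

print(find_executable_batch_size(fake_step, 13))  # (3, 'trained'): 13 -> 6 -> 3
```

Hugging Face Accelerate ships a production version of this idea as `accelerate.utils.find_executable_batch_size`.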

ArrowM commented 1 year ago

It looks like it just updated all the libraries, is xformers working now?

Maki9009 commented 1 year ago

No, it's still the same error. I can train without xformers, though. I don't know what's wrong; I even restarted the whole notebook and it's still the same error.

Sometimes there isn't enough VRAM to train without xformers, though...

Now when I train without xformers I get:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 15.74 GiB total capacity; 14.06 GiB already allocated; 179.56 MiB free; 14.34 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Steps: 0%| | 1/3900 [00:00<57:28, 1.13it/s, inst_loss=0.00946, loss=0.00946]
Traceback (most recent call last):
  File "/storage/stable-diffusion/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/ui_functions.py", line 681, in start_training
    result = main(use_txt2img=use_txt2img)
  File "/storage/stable-diffusion/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/train_dreambooth.py", line 1148, in main
    return inner_loop()
  File "/storage/stable-diffusion/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/memory.py", line 121, in decorator
    raise RuntimeError("No executable batch size found, reached zero.")
RuntimeError: No executable batch size found, reached zero.
Restored system models.
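The OOM message points at `max_split_size_mb`, which PyTorch reads from the `PYTORCH_CUDA_ALLOC_CONF` environment variable; it only takes effect if set before the CUDA caching allocator is initialized, i.e. before launching the web UI. A minimal sketch; 512 is an illustrative value, not a recommendation from this thread:

```python
import os

# Must be set before importing torch / launching webui.py; once the CUDA
# caching allocator is initialized, changing this has no effect.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])  # max_split_size_mb:512
```

From a shell, the equivalent is `export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512` before starting the notebook or web UI process.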

But if I enable Gradient Checkpointing, it starts to train.

Maki9009 commented 1 year ago

Also, a new issue I found: my number of steps is double what I wanted. For 13 images I set it to 150 per image, which should be 1950, yet it always sets it to 3900.

ArrowM commented 1 year ago

I think we fixed the xformers issue in dev. Should get merged sometime soon. The step count will be doubled if you are using class pics.
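The numbers in the training log are consistent with this explanation: 13 instance images × 150 steps per image = 1950 optimization steps, reported as 3900 total training steps once class (prior-preservation) images are counted. A hypothetical reconstruction of the arithmetic, not the extension's actual code:

```python
def total_training_steps(num_images, steps_per_image, use_class_images):
    # With prior preservation enabled, each optimization step also
    # processes a class image, so the reported total doubles.
    steps = num_images * steps_per_image
    return steps * 2 if use_class_images else steps

print(total_training_steps(13, 150, use_class_images=False))  # 1950
print(total_training_steps(13, 150, use_class_images=True))   # 3900
```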

Maki9009 commented 1 year ago

Idk if the fix was included in the last merge commit, but the issue is still there.

raymondgp commented 1 year ago

If you try precision FP instead of BF, does it work? This was happening to me until I selected FP...

ArrowM commented 1 year ago

> If you try precision FP instead of BF, does it work? This was happening to me until I selected FP...

Precision shouldn't impact it.

> Idk if the fix was included in the last merge commit, but the issue is still there

Alright, pushed another fix. Could you please try again?

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 5 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.