AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI

[Bug]: torch.cuda.OutOfMemoryError: HIP out of memory. When training embeddings #6460

Open elen07zz opened 1 year ago

elen07zz commented 1 year ago

Is there an existing issue for this?

What happened?

I'm trying to train an embedding but I'm getting this error. I'm running the webui with these settings: python3 launch.py --precision full --no-half --opt-split-attention

100%|█████████████████████████████████████████| 616/616 [01:20<00:00, 7.67it/s]
  0%| | 0/3000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/user/stable-diffusion-webui/modules/textual_inversion/textual_inversion.py", line 395, in train_embedding
    scaler.scale(loss).backward()
  File "/home/user/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/user/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/user/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/autograd/function.py", line 267, in apply
    return user_fn(self, *args)
  File "/home/user/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 157, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/home/user/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 512.00 MiB (GPU 0; 9.98 GiB total capacity; 8.51 GiB already allocated; 742.00 MiB free; 9.13 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF

Steps to reproduce the problem

  1. I get the HIP out-of-memory error above when I run the webui with python3 launch.py --precision full --no-half --opt-split-attention
  2. But if I instead run it with python3 launch.py --precision full --no-half --opt-split-attention --medvram
  3. I receive this error: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0!

    0%| | 0/3000 [00:00<?, ?it/s]
    Traceback (most recent call last):
      File "/home/akairax/stable-diffusion-webui/modules/textual_inversion/textual_inversion.py", line 395, in train_embedding
        scaler.scale(loss).backward()
      File "/home/akairax/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
        torch.autograd.backward(
      File "/home/akairax/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
        Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument weight in method wrapper__native_layer_norm_backward)

What should have happened?

Training should just run.

Commit where the problem happens

874b975bf8438b2b5ee6d8540d63b2e2da6b8dbd

What platforms do you use to access UI ?

Linux

What browsers do you use to access the UI ?

Mozilla Firefox

Command Line Arguments

python3 launch.py --precision full --no-half --opt-split-attention
python3 launch.py --precision full --no-half --opt-split-attention --medvram

Additional information, context and logs

Running Ubuntu 22.04

leohu1 commented 1 year ago

I think this is because your GPU memory is too low.

elen07zz commented 1 year ago

I think this is because your GPU memory is too low.

What is the minimum I need, even with optimizations enabled?

HiroseKoichi commented 1 year ago

Try this:

For AMD PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python launch.py --precision full --no-half --opt-sub-quad-attention

For Nvidia PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python launch.py --xformers

In my experience --opt-sub-quad-attention is the best VRAM optimizer for AMD cards and --xformers is the best for Nvidia, so don't use --medvram or --lowvram unless neither of those works for you. Also don't combine them, like '--opt-sub-quad-attention --medvram' or '--xformers --lowvram'; in my testing that increased VRAM usage and made image generation slower. Only use one VRAM optimizer at a time.

I'm also getting the 'RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!' error, but it won't affect training in any way. It just means you won't be able to see the preview images being generated in the webui; you can still view them by going to /stable-diffusion-webui/textual_inversion/
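For reference, the VAR=value prefix in the commands above only applies the allocator setting to that single launch. A minimal sketch of exporting it for the whole shell session instead, reusing the AMD flags from above:

export PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512
python launch.py --precision full --no-half --opt-sub-quad-attention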

4xxFallacy commented 1 year ago

For Nvidia PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python launch.py --xformers

I'm having the same issue. Where would I set this?

HiroseKoichi commented 1 year ago

For Windows: put --xformers into your webui-user.bat in the command-line arguments section, then open the webui directory in cmd and run this command: PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 webui-user.bat

For Linux: open a terminal in the webui directory and run the command: PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python launch.py --xformers
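If you don't want to type the variable on every launch, a sketch for Linux (assuming the stock webui-user.sh, which webui.sh sources on startup) is to export it there instead:

# added to webui-user.sh
export PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512

webui-user.bat works analogously on Windows with a set line.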

Though I recommend switching to a container: I started using one (with Podman instead of Docker) a little less than a week ago and I no longer have the issue when training.

Also, I forgot to mention that you'll want to check 'enable cross attention optimizations when training' in the settings; this will reduce your VRAM usage while training by a lot.

4xxFallacy commented 1 year ago

Thanks! I managed to add them manually, directly to webui.bat. I think (extreme emphasis on 'think') adding it there sets the PyTorch environment variable for the venv during its activation. Although I'm sure xformers is now doing its job and I'm able to train, I'm not sure setting the PyTorch variable the way I did actually works. Also, because I'm on Windows and nvidia-smi won't show me VRAM usage for my 3080, I only know how well it's running when it dies and throws errors my way, which is not great.

I'd try the Docker route to avoid issues, but I've fought with containers in the past, having problems with virtualization and such.

Thanks again!

tnginako commented 1 year ago

PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python launch.py --precision full --no-half --opt-sub-quad-attention

Hi, I'm adding this just for future reference. I'm using a 6750 XT GPU, and this solved my HIP out-of-memory problem when generating large images (1024x1536 from hires. fix; I added --opt-sub-quad-attention to the launch command). However, since this GPU is not really "supported", HSA_OVERRIDE_GFX_VERSION=10.3.0 should also be set in order to avoid a Segmentation fault (core dumped) error. (Just in case someone else gets the same error: I'm using Linux Mint.)

Taken from a rentry troubleshooting page.

Segmentation fault (core dumped) "${python_cmd}" launch.py

You tried to force an incompatible binary with your gpu via the HSA_OVERRIDE_GFX_VERSION environment variable. Unset it via set -e HSA_OVERRIDE_GFX_VERSION and retry the command.
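Putting the pieces above together, a sketch of the launch described for an officially unsupported RDNA2 card such as the 6750 XT (the override value 10.3.0 is the one quoted in this thread):

export HSA_OVERRIDE_GFX_VERSION=10.3.0
export PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512
python launch.py --precision full --no-half --opt-sub-quad-attention

Note that set -e HSA_OVERRIDE_GFX_VERSION in the quoted troubleshooting note is fish-shell syntax; the bash equivalent is unset HSA_OVERRIDE_GFX_VERSION.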

MrLavender commented 1 year ago

Looking at your crash log, you have 10 GB of VRAM, so I'm guessing it's an RX 6700?

Try the new --upcast-sampling feature, which allows fp16 on AMD ROCm. Also use --opt-sub-quad-attention, because other cross attention layer optimizations may cause problems with --upcast-sampling.

python3 launch.py --upcast-sampling --opt-sub-quad-attention

In Settings -> Training enable "Move VAE and CLIP to RAM when training if possible" and "Use cross attention optimizations while training".

If using a SD 2.x model enable Settings -> Stable Diffusion -> "Upcast cross attention layer to float32".

With the above setup I'm able to train embeddings on an RX 5500 XT 8GB (for 1.5 models anyway; I haven't tried any 2.x training).
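A sketch of making those flags the default, assuming the stock webui-user.sh where COMMANDLINE_ARGS is already present but commented out:

# webui-user.sh
export COMMANDLINE_ARGS="--upcast-sampling --opt-sub-quad-attention"

The two training options and the SD 2.x upcast setting still have to be enabled in the web UI itself.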

mashiq3 commented 1 year ago

Try this:

For AMD PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python launch.py --precision full --no-half --opt-sub-quad-attention

For Nvidia PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python launch.py --xformers

In my experience --opt-sub-quad-attention is the best VRAM optimizer for AMD cards and --xformers is the best for Nvidia, so don't use --medvram or --lowvram unless neither of those works for you. Also don't combine them, like '--opt-sub-quad-attention --medvram' or '--xformers --lowvram'; in my testing that increased VRAM usage and made image generation slower. Only use one VRAM optimizer at a time.

I'm also getting the 'RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!' error, but it won't affect training in any way. It just means you won't be able to see the preview images being generated in the webui; you can still view them by going to /stable-diffusion-webui/textual_inversion/

I need help doing this; can we do a screen share?

Yama-K commented 1 year ago

PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python launch.py --precision full --no-half --opt-sub-quad-attention

This results in an unstable system for me; adding --opt-sub-quad-attention to the launch args alone fixes the problem. Thank you.

YabbaYabbaYabba commented 1 year ago

Looking at your crash log, you have 10 GB of VRAM, so I'm guessing it's an RX 6700?

Try the new --upcast-sampling feature, which allows fp16 on AMD ROCm. Also use --opt-sub-quad-attention, because other cross attention layer optimizations may cause problems with --upcast-sampling.

python3 launch.py --upcast-sampling --opt-sub-quad-attention

In Settings -> Training enable "Move VAE and CLIP to RAM when training if possible" and "Use cross attention optimizations while training".

If using a SD 2.x model enable Settings -> Stable Diffusion -> "Upcast cross attention layer to float32".

With the above setup I'm able to train embeddings on an RX 5500 XT 8GB (for 1.5 models anyway; I haven't tried any 2.x training).

Just wanted to say thank you so much! I was not able to run SDXL in A1111 on my AMD 6700 XT at all, but after your suggestion it's running fantastically: no more out-of-memory errors, and it's faster than before. Running at 3.74s/it now. A game changer, at least for me.