Open elen07zz opened 1 year ago
I think this is because your GPU memory is too low.
I think this is because your GPU memory is too low.
what is the minimum I need, even with optimizations enabled?
Try this:
For AMD PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python launch.py --precision full --no-half --opt-sub-quad-attention
For Nvidia PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python launch.py --xformers
In my experience --opt-sub-quad-attention is the best VRAM optimizer for AMD cards and --xformers is the best for NVIDIA, so don't use --medvram or --lowvram unless those don't work for you. Also, don't combine them like '--opt-sub-quad-attention --medvram' or '--xformers --lowvram'; in my testing that increased VRAM usage and made image generation slower, so only use one VRAM optimizer at a time.
I'm also getting the 'RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!' error, but it won't affect training in any way; it just means you won't be able to see the preview images in the webui. You can still view them by going to /stable-diffusion-webui/textual_inversion/
For Nvidia PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python launch.py --xformers
I'm having the same issue, Where would I set this?
For Windows: put --xformers into your webui-user.bat in the command-line arguments section (see the webui-user.bat sketch below), then open the webui directory in cmd and run this command: PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 webui-user.bat
For Linux: open a terminal in the webui directory and run the command: PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python launch.py --xformers
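If you'd rather not type the variable every time, you can also set it inside the launcher script itself. A rough sketch of what webui-user.bat could look like with both changes (the PYTORCH_CUDA_ALLOC_CONF line is my own addition to the stock file, so adjust as needed):
@echo off
set PYTHON=
set GIT=
set VENV_DIR=
set COMMANDLINE_ARGS=--xformers
rem my addition: allocator tuning to reduce fragmentation
set PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512
call webui.bat
On Linux the rough equivalent is exporting the same things in webui-user.sh, e.g. export COMMANDLINE_ARGS="--xformers" and export PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512, then launching with ./webui.sh as usual.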
Though I recommend switching to a docker container. I started using one (with podman instead of docker) a little less than a week ago and I no longer have the issue when training.
Also, I forgot to mention you'll want to check 'enable cross attention optimizations when training' in the settings; this will reduce your VRAM usage while training by a lot.
Thanks! I managed to add them manually directly to the webui.bat. I think (extreme emphasis on the think) adding it there sets the PyTorch environment variable for the venv during its activation, and although I'm sure xformers is now doing its job and I'm able to train, I'm not sure setting the PyTorch variable the way I did actually works. Also, because I'm on Windows and nvidia-smi won't actually show me VRAM usage for my 3080, I only know how well it's running when it dies and throws errors my way, which is not great.
I'd try the docker container to avoid issues, but I've fought with containers in the past, having problems with virtualization and such.
Thanks again!
PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python launch.py --precision full --no-half --opt-sub-quad-attention
Hi, I'm adding this just for future reference: I'm using a 6750 XT GPU and this solved my HIP out-of-memory problem when generating large images (1024x1536 from hires. fix; I added --opt-sub-quad-attention to the terminal command). However, since this GPU is not really "supported", HSA_OVERRIDE_GFX_VERSION=10.3.0 should also be set in order to avoid a Segmentation fault (core dumped) error (just in case someone else gets the same error; I'm using Linux Mint).
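In case it helps, the full launch line I ended up with looks roughly like this (a sketch based on the commands earlier in this thread; the flags and the 10.3.0 value are what worked for my card, adjust for yours):
HSA_OVERRIDE_GFX_VERSION=10.3.0 PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python launch.py --opt-sub-quad-attention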
Taken from a rentry troubleshooting page.
Segmentation fault (core dumped) "${python_cmd}" launch.py
You tried to force an incompatible binary with your gpu via the HSA_OVERRIDE_GFX_VERSION environment variable. Unset it via set -e HSA_OVERRIDE_GFX_VERSION and retry the command.
Looking at your crash log you have 10GB vram so I'm guessing it's a RX 6700?
Try using the new --upcast-sampling feature, which allows fp16 on AMD ROCm. Also --opt-sub-quad-attention, because other cross attention layer optimizations may cause problems with --upcast-sampling.
python3 launch.py --upcast-sampling --opt-sub-quad-attention
In Settings -> Training enable "Move VAE and CLIP to RAM when training if possible" and "Use cross attention optimizations while training".
If using a SD 2.x model enable Settings -> Stable Diffusion -> "Upcast cross attention layer to float32".
With the above setup I'm able to train embeddings on a RX 5500XT 8GB (for 1.5 models anyway, haven't tried any 2.x training).
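If you launch through webui.sh rather than calling launch.py directly, the same flags can go in webui-user.sh instead; roughly (a sketch, not the stock file contents):
export COMMANDLINE_ARGS="--upcast-sampling --opt-sub-quad-attention"
webui.sh will pick that up on the next start.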
Try this:
For AMD PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python launch.py --precision full --no-half --opt-sub-quad-attention
For Nvidia PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python launch.py --xformers
In my experience --opt-sub-quad-attention is the best VRAM optimizer for AMD cards and --xformers is the best for NVIDIA, so don't use --medvram or --lowvram unless those don't work for you. Also, don't combine them like '--opt-sub-quad-attention --medvram' or '--xformers --lowvram'; in my testing that increased VRAM usage and made image generation slower, so only use one VRAM optimizer at a time.
I'm also getting the 'RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!' error, but it won't affect training in any way; it just means you won't be able to see the preview images in the webui. You can still view them by going to /stable-diffusion-webui/textual_inversion/
I need help doing this. Can we do a screenshare?
PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python launch.py --precision full --no-half --opt-sub-quad-attention
That results in an unstable system; adding --opt-sub-quad-attention to the launch args on its own fixes the problem. Thank you.
Looking at your crash log you have 10GB vram so I'm guessing it's a RX 6700?
Try using the new --upcast-sampling feature, which allows fp16 on AMD ROCm. Also --opt-sub-quad-attention, because other cross attention layer optimizations may cause problems with --upcast-sampling.
python3 launch.py --upcast-sampling --opt-sub-quad-attention
In Settings -> Training enable "Move VAE and CLIP to RAM when training if possible" and "Use cross attention optimizations while training".
If using a SD 2.x model enable Settings -> Stable Diffusion -> "Upcast cross attention layer to float32".
With the above setup I'm able to train embeddings on a RX 5500XT 8GB (for 1.5 models anyway, haven't tried any 2.x training).
Just wanted to say thank you so much! I was not able to run SDXL in A1111 on my AMD 6700 XT at all, but after your suggestion it's running fantastically: no more out-of-memory errors, and it's faster than before. Running at 3.74s/it now. Game changer, at least for me.
Is there an existing issue for this?
What happened?
I'm trying to train an embedding but I'm getting this error. Running webui with these settings: python3 launch.py --precision full --no-half --opt-split-attention
100%|█████████████████████████████████████████| 616/616 [01:20<00:00, 7.67it/s]
0%| | 0/3000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/user/stable-diffusion-webui/modules/textual_inversion/textual_inversion.py", line 395, in train_embedding
    scaler.scale(loss).backward()
  File "/home/user/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/user/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/user/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/autograd/function.py", line 267, in apply
    return user_fn(self, *args)
  File "/home/user/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 157, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/home/user/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 512.00 MiB (GPU 0; 9.98 GiB total capacity; 8.51 GiB already allocated; 742.00 MiB free; 9.13 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF
Steps to reproduce the problem
I receive this error: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0!
0%| | 0/3000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/akairax/stable-diffusion-webui/modules/textual_inversion/textual_inversion.py", line 395, in train_embedding
    scaler.scale(loss).backward()
  File "/home/akairax/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/akairax/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument weight in method wrapper__native_layer_norm_backward)
What should have happened?
It should just run.
Commit where the problem happens
874b975bf8438b2b5ee6d8540d63b2e2da6b8dbd
What platforms do you use to access UI ?
Linux
What browsers do you use to access the UI ?
Mozilla Firefox
Command Line Arguments
Additional information, context and logs
Running Ubuntu 22.04.