cloneofsimo / lora

Using Low-rank adaptation to quickly fine-tune diffusion models.
https://arxiv.org/abs/2106.09685
Apache License 2.0

training issue (CUDA out of memory) #112

Closed kotaxyz closed 1 year ago

kotaxyz commented 1 year ago

When I try to start training, it gives me this error. Note that I'm using a GTX 1070 8 GB.

```
***** Running training *****
 Instance Images: 9
 Class Images: 0
 Total Examples: 9
 Num batches each epoch = 9
 Num Epochs = 100
 Batch Size Per Device = 1
 Gradient Accumulation steps = 1
 Total train batch size (w. parallel, distributed & accumulation) = 9
 Total optimization steps = 900
 Total training steps = 900
 Resuming from checkpoint: False
 First resume epoch: 0
 First resume step: 0
 Lora: True, Adam: True, Prec: fp16
 Gradient Checkpointing: True, Text Enc Steps: -1.0
 EMA: False
 LR: 2e-06)
Steps: 0%| | 0/900 [00:00<?, ?it/s]OOM Detected, reducing batch/grad size to 0/1.
Traceback (most recent call last):
  File "I:\stablediffusion\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\memory.py", line 86, in decorator
    return function(batch_size, grad_size, *args, **kwargs)
  File "I:\stablediffusion\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\train_dreambooth.py", line 904, in inner_loop
    accelerator.backward(loss)
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\accelerate\accelerator.py", line 1314, in backward
    self.scaler.scale(loss).backward(**kwargs)
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\torch\_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\function.py", line 253, in apply
    return user_fn(self, *args)
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\torch\utils\checkpoint.py", line 146, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 8.00 GiB total capacity; 7.23 GiB already allocated; 0 bytes free; 7.34 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Steps: 0%| | 0/900 [00:14<?, ?it/s]
Traceback (most recent call last):
  File "I:\stablediffusion\stable-diffusion-webui\extensions\sd_dreambooth_extension\scripts\dreambooth.py", line 569, in start_training
    result = main(config, use_subdir=use_subdir, lora_model=lora_model_name,
  File "I:\stablediffusion\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\train_dreambooth.py", line 1024, in main
    return inner_loop()
  File "I:\stablediffusion\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\memory.py", line 84, in decorator
    raise RuntimeError("No executable batch size found, reached zero.")
RuntimeError: No executable batch size found, reached zero.
Training completed, reloading SD Model.
Restored system models.
Returning result: Exception training model: No executable batch size found, reached zero.
```
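
For reference, the allocator hint at the end of that error (`max_split_size_mb`) can be tried by setting `PYTORCH_CUDA_ALLOC_CONF` before anything initializes CUDA. A minimal sketch, assuming the training process inherits the variable; the value 128 is just an example, and this only helps with fragmentation (reserved >> allocated), it does not free up more VRAM:

```python
# Hedged sketch: configure the CUDA caching allocator before CUDA is initialized.
# "max_split_size_mb:128" is an example value, not a recommendation from this repo.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # import (and thus CUDA init) happens after the env var is set
print(torch.cuda.get_device_properties(0).total_memory // 2**20, "MiB total")
```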

FunWithFaces commented 1 year ago

Hello. I am getting a similar error when attempting to reproduce the Wednesday Addams example using the Pivotal Tuning script. Though to be fair, I have no idea if it is supposed to run with only 8 GB of VRAM.

I should note that I'm running on ROCm using a 6600 xt -- though I've had no trouble running automatic1111 or the kohya_ss stuff.

```
Traceback (most recent call last):
  File "/home/A_User/LORA/lora/train_lora_w_ti.py", line 1164, in <module>
    main(args)
  File "/home/A_User/LORA/lora/train_lora_w_ti.py", line 995, in main
    model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
  File "/home/A_User/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/A_User/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 490, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/A_User/.local/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/A_User/.local/lib/python3.10/site-packages/diffusers/models/unet_2d_condition.py", line 407, in forward
    sample = upsample_block(
  File "/home/A_User/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/A_User/.local/lib/python3.10/site-packages/diffusers/models/unet_2d_blocks.py", line 1203, in forward
    hidden_states = attn(hidden_states, encoder_hidden_states=encoder_hidden_states).sample
  File "/home/A_User/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/A_User/.local/lib/python3.10/site-packages/diffusers/models/attention.py", line 216, in forward
    hidden_states = block(hidden_states, context=encoder_hidden_states, timestep=timestep)
  File "/home/A_User/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/A_User/.local/lib/python3.10/site-packages/diffusers/models/attention.py", line 484, in forward
    hidden_states = self.attn1(norm_hidden_states) + hidden_states
  File "/home/A_User/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/A_User/.local/lib/python3.10/site-packages/diffusers/models/attention.py", line 594, in forward
    hidden_states = self.to_out[0](hidden_states)
  File "/home/A_User/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/A_User/LORA/lora/lora_diffusion/lora.py", line 30, in forward
    return self.linear(input) + self.lora_up(self.lora_down(input)) * self.scale
  File "/home/A_User/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/A_User/.local/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 20.00 MiB (GPU 0; 7.98 GiB total capacity; 7.62 GiB already allocated; 292.00 MiB free; 7.64 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF
Steps: 0%| | 0/3000 [00:16<?, ?it/s]
Traceback (most recent call last):
  File "/home/A_User/.local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/A_User/.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/A_User/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1104, in launch_command
    simple_launcher(args)
  File "/home/A_User/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 567, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'train_lora_w_ti.py', '--pretrained_model_name_or_path=./SD15', '--instance_data_dir=./data_example_text', '--output_dir=./output_example_lorpt', '--train_text_encoder', '--resolution=512', '--train_batch_size=1', '--gradient_accumulation_steps=1', '--learning_rate=1e-5', '--learning_rate_text=1e-5', '--learning_rate_ti=5e-4', '--color_jitter', '--lr_scheduler=constant', '--lr_warmup_steps=100', '--max_train_steps=3000', '--placeholder_token=', '--learnable_property=object', '--initializer_token=woman', '--save_steps=500', '--unfreeze_lora_step=2000', '--stochastic_attribute=realistic, dark hair,cute,4k,highres']' returned non-zero exit status 1.
```
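
(For anyone else landing here: the usual VRAM levers for a diffusers UNet training loop look roughly like the sketch below. Whether they can be wired into train_lora_w_ti.py as-is is an assumption on my part, and xformers is generally not available on ROCm.)

```python
# Hedged sketch of common VRAM reductions when training a diffusers UNet with LoRA.
# These calls exist in diffusers/bitsandbytes, but fitting them into this repo's
# training script unchanged is an assumption.
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

# Trade compute for memory: recompute activations during the backward pass.
unet.enable_gradient_checkpointing()

# Memory-efficient attention (CUDA + xformers only; typically unavailable on ROCm).
try:
    unet.enable_xformers_memory_efficient_attention()
except Exception:
    pass

# 8-bit Adam keeps optimizer state small (bitsandbytes; CUDA only).
try:
    import bitsandbytes as bnb
    optimizer_cls = bnb.optim.AdamW8bit
except ImportError:
    optimizer_cls = torch.optim.AdamW
```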

cloneofsimo commented 1 year ago

These are OOM errors. TI might require more VRAM than UNet-only training.
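
Roughly speaking, LoRA-only training can keep the text encoder completely frozen, while TI/pivotal tuning has to keep its embedding table trainable, so the text encoder's activations are held for the backward pass. A toy sketch of that difference (illustrative only, not the actual training code):

```python
# Hedged sketch: why TI/pivotal tuning pulls the text encoder into the backward graph.
# Model id and setup follow the usual diffusers/transformers SD 1.5 layout.
from transformers import CLIPTextModel

text_encoder = CLIPTextModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="text_encoder"
)

# LoRA-only training can freeze the whole text encoder...
text_encoder.requires_grad_(False)

# ...but textual inversion must keep the input embedding table trainable, so the
# text encoder's activations are stored for backward, costing extra VRAM.
text_encoder.get_input_embeddings().weight.requires_grad_(True)

trainable = sum(p.numel() for p in text_encoder.parameters() if p.requires_grad)
print(f"trainable text-encoder params with TI: {trainable:,}")
```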

FunWithFaces commented 1 year ago

Ah, that's what I was afraid of, thanks. I have seen similar errors arise in other repositories due to bugs that were later resolved, so I figured it was worth asking.

FunWithFaces commented 1 year ago

Sorry to bug you -- just one follow-up, if possible: how much VRAM is typically required? I am planning on getting new hardware due to the general support issues with ROCm, so that would be very useful information to have.

kotaxyz commented 1 year ago

So how are some people able to run training not just with 8 GB but with 6 GB? I'm confused about what the problem is.

FunWithFaces commented 1 year ago

The thing that definitely works with <12 GB is LoRA, without any bells and whistles. I think you are trying to use Dreambooth sans LoRA, which is definitely a no-go. I was hoping that pivotal tuning would be low on VRAM like LoRA, but for now it seems that anything involving textual inversion is going to be a no for lower-VRAM cards.
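
To put rough numbers on "LoRA without any bells and whistles": the trainable parameter count is tiny compared with full Dreambooth fine-tuning. A sketch using this repo's inject_trainable_lora (the r=4 default and exact behaviour are assumptions on my part):

```python
# Hedged sketch: trainable parameters for full UNet fine-tuning vs. LoRA injection.
# Assumes lora_diffusion.inject_trainable_lora(unet, r=4) as in this repo; exact
# defaults may differ between versions.
from diffusers import UNet2DConditionModel
from lora_diffusion import inject_trainable_lora

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
full = sum(p.numel() for p in unet.parameters())  # what plain Dreambooth optimizes

unet.requires_grad_(False)
inject_trainable_lora(unet, r=4)  # only the injected low-rank matrices require grad
lora = sum(p.numel() for p in unet.parameters() if p.requires_grad)

print(f"full fine-tune: {full / 1e6:.1f}M params, LoRA r=4: {lora / 1e6:.2f}M params")
```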

cloneofsimo commented 1 year ago

Yes, PTI uses more VRAM. But it is certainly possible, in theory, for it to use less memory in the future.

kotaxyz commented 1 year ago

Hmm, actually I do use LoRA. These are the settings I use to run the training:

lora dreambooth

pearswick commented 1 year ago

> The thing that definitely works with <12 GB is LoRA, without any bells and whistles. I think you are trying to use Dreambooth sans LoRA, which is definitely a no-go. I was hoping that pivotal tuning would be low on VRAM like LoRA, but for now it seems that anything involving textual inversion is going to be a no for lower-VRAM cards.

I had a similar issue and it was just this -- if you're using Kohya, make sure you're in the Dreambooth LoRA tab.