Hello. I am getting a similar error when attempting to reproduce the Wednesday Addams example using the Pivotal Tuning script. Though to be fair, I have no idea if it is supposed to run with only 8 GB of VRAM?
I should note that I'm running on ROCm with a 6600 XT -- though I've had no trouble running automatic1111 or the kohya_ss stuff.
Traceback (most recent call last):
File "/home/A_User/LORA/lora/train_lora_w_ti.py", line 1164, in
These are OOM errors. TI might require more VRAM than UNet-only training.
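For intuition on why that is: in a typical textual-inversion setup the text encoder's token-embedding table becomes a trainable parameter, so the text encoder has to stay in the autograd graph alongside the UNet and its activations are kept around for the backward pass. A rough sketch of that pattern with the standard transformers API (this is not the actual train_lora_w_ti.py code, and the placeholder token name is made up):

```python
# Sketch only: why textual inversion keeps the text encoder in the training graph.
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Register a new placeholder token (hypothetical name) and grow the embedding table.
tokenizer.add_tokens(["<wednesday>"])
text_encoder.resize_token_embeddings(len(tokenizer))

# Freeze everything except the input embedding table.
text_encoder.requires_grad_(False)
embeddings = text_encoder.get_input_embeddings()
embeddings.weight.requires_grad_(True)

# Because the embedding table is being optimized, the text encoder's forward
# activations (not just the UNet's) must be retained for backward -- extra VRAM.
optimizer = torch.optim.AdamW([embeddings.weight], lr=5e-4)
```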
Ahh, that's what I was afraid of, thanks -- I have seen similar errors arise from the other repositories due to bugs that were later resolved, so I figured it was worth asking.
Sorry to bug you -- just one follow-up if possible: would it be possible to know how much VRAM is typically required? I am planning on getting new hardware because of the broader support issues with ROCm, so that would be very useful information to have.
So how are some people able to run training not just with 8 GB but even with 6 GB? I'm confused about what the problem is.
The thing that definitely works with <12 GB is LoRA, without any bells and whistles. I think you are trying to use Dreambooth sans LoRA, which is definitely a no-go. I was hoping that pivotal tuning would be low-VRAM like LoRA, but for now it seems that anything involving textual inversion is going to be a no with lower-VRAM cards.
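For reference, here is a minimal sketch of why plain LoRA is so much lighter: the frozen base weight carries no gradients and no Adam state, only the two small low-rank matrices do. This is illustrative only, not this repo's actual implementation:

```python
# Minimal LoRA-style linear layer (illustrative, not this repo's code).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        self.base.requires_grad_(False)              # frozen pretrained weight (and bias)
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.normal_(self.down.weight, std=1.0 / rank)
        nn.init.zeros_(self.up.weight)               # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.up(self.down(x)) * self.scale

layer = LoRALinear(nn.Linear(768, 768))
trainable = [p for p in layer.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)    # ~6k trainable params vs ~590k in the base layer
```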
Yes, PTI uses more VRAM. But it is certainly possible, in theory, for it to use less memory in the future.
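The usual memory reducers a script like this could adopt are gradient checkpointing and 8-bit Adam. A hedged sketch with the standard diffusers/transformers/bitsandbytes APIs (whether the PTI script already exposes these is a separate question; the model ID is just the common SD 1.5 checkpoint used as an example):

```python
# Sketch of common VRAM reducers; not a statement about what train_lora_w_ti.py supports.
import bitsandbytes as bnb
from diffusers import UNet2DConditionModel
from transformers import CLIPTextModel

model_id = "runwayml/stable-diffusion-v1-5"  # example base model, swap in whatever you train
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")

# Trade compute for memory by recomputing activations during backward.
unet.enable_gradient_checkpointing()
text_encoder.gradient_checkpointing_enable()

# Keep optimizer state in 8-bit instead of fp32.
optimizer = bnb.optim.AdamW8bit(unet.parameters(), lr=1e-4)
```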
Hmm, actually I use LoRA. These are the settings I use to run the training.
I had a similar issue and it was just this -- if you're using Kohya, make sure you're in the Dreambooth LoRA tab.
When I try to start training it gives me this error. Note that I'm using a GTX 1070 with 8 GB:
***** Running training *****
Instance Images: 9
Class Images: 0
Total Examples: 9
Num batches each epoch = 9
Num Epochs = 100
Batch Size Per Device = 1
Gradient Accumulation steps = 1
Total train batch size (w. parallel, distributed & accumulation) = 9
Total optimization steps = 900
Total training steps = 900
Resuming from checkpoint: False
First resume epoch: 0
First resume step: 0
Lora: True, Adam: True, Prec: fp16
Gradient Checkpointing: True, Text Enc Steps: -1.0
EMA: False
LR: 2e-06)
Steps: 0%| | 0/900 [00:00<?, ?it/s]OOM Detected, reducing batch/grad size to 0/1.
Traceback (most recent call last):
  File "I:\stablediffusion\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\memory.py", line 86, in decorator
    return function(batch_size, grad_size, *args, **kwargs)
  File "I:\stablediffusion\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\train_dreambooth.py", line 904, in inner_loop
    accelerator.backward(loss)
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\accelerate\accelerator.py", line 1314, in backward
    self.scaler.scale(loss).backward(**kwargs)
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\torch\_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\function.py", line 253, in apply
    return user_fn(self, *args)
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\torch\utils\checkpoint.py", line 146, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "I:\stablediffusion\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 8.00 GiB total capacity; 7.23 GiB already allocated; 0 bytes free; 7.34 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Steps: 0%| | 0/900 [00:14<?, ?it/s]
Traceback (most recent call last):
  File "I:\stablediffusion\stable-diffusion-webui\extensions\sd_dreambooth_extension\scripts\dreambooth.py", line 569, in start_training
    result = main(config, use_subdir=use_subdir, lora_model=lora_model_name,
  File "I:\stablediffusion\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\train_dreambooth.py", line 1024, in main
    return inner_loop()
  File "I:\stablediffusion\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\memory.py", line 84, in decorator
    raise RuntimeError("No executable batch size found, reached zero.")
RuntimeError: No executable batch size found, reached zero.
Training completed, reloading SD Model.
Restored system models.
Returning result: Exception training model: No executable batch size found, reached zero.
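As a side note, the error text itself suggests trying max_split_size_mb. That setting only mitigates allocator fragmentation and will not conjure up missing VRAM -- with 7.23 GiB of 8 GiB already allocated it is unlikely to be enough here -- but for completeness, one way to set it in Python is shown below (the value 128 is just a commonly tried starting point, not a recommendation from this repo; it can equally be set as an environment variable before launching the webui).

```python
# Set the allocator option mentioned in the OOM message before any CUDA allocation
# happens (easiest: before importing torch). Helps only with fragmentation.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # 128 is just an example value

import torch
```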