Zphyr00 opened 1 year ago
What resolution are you training at?
768 on one and 1024 on the other, both on the 2.1 768 base model.
Same problem here. Using free Colab, training at 640x640 on the SD 2.1 512px base model. The error appears during the text_encoder training stage.
Progress:| | 0% 1/1915 [00:08<4:42:14, 8.85s/it, loss=0.0194, lr=6e-7]
DmRs
Traceback (most recent call last):
  File "/content/diffusers/examples/dreambooth/train_dreambooth.py", line 803, in <module>
    main()
  File "/content/diffusers/examples/dreambooth/train_dreambooth.py", line 690, in main
    model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/utils/operations.py", line 507, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.9/dist-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/diffusers/models/unet_2d_condition.py", line 632, in forward
    sample = upsample_block(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/diffusers/models/unet_2d_blocks.py", line 1805, in forward
    hidden_states = torch.utils.checkpoint.checkpoint(
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/usr/local/lib/python3.9/dist-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs) # type: ignore[misc]
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/usr/local/lib/python3.9/dist-packages/diffusers/models/unet_2d_blocks.py", line 1798, in custom_forward
    return module(*inputs, return_dict=return_dict)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/diffusers/models/transformer_2d.py", line 265, in forward
    hidden_states = block(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/diffusers/models/attention.py", line 324, in forward
    ff_output = self.ff(norm_hidden_states)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/diffusers/models/attention.py", line 382, in forward
    hidden_states = module(hidden_states)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/diffusers/models/attention.py", line 429, in forward
    return hidden_states * self.gelu(gate)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 14.75 GiB total capacity; 13.33 GiB already allocated; 6.81 MiB free; 13.38 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Progress:| | 0% 1/1915 [00:09<5:06:52, 9.62s/it, loss=0.0194, lr=6e-7]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py", line 837, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py", line 354, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '/content/diffusers/examples/dreambooth/train_dreambooth.py', '--train_only_text_encoder', '--image_captions_filename', '--train_text_encoder', '--dump_only_text_encoder', '--pretrained_model_name_or_path=/content/stable-diffusion-v2-512', '--instance_data_dir=/content/gdrive/MyDrive/Fast-Dreambooth/Sessions/DemiRose_640_v21_fast/instance_images', '--output_dir=/content/models/DemiRose_640_v21_fast', '--captions_dir=/content/gdrive/MyDrive/Fast-Dreambooth/Sessions/DemiRose_640_v21_fast/captions', '--instance_prompt=', '--seed=229081', '--resolution=640', '--mixed_precision=fp16', '--train_batch_size=1', '--gradient_accumulation_steps=1', '--gradient_checkpointing', '--use_8bit_adam', '--learning_rate=6e-07', '--lr_scheduler=linear', '--lr_warmup_steps=0', '--max_train_steps=1915']' returned non-zero exit status 1.
Something went wrong
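For what it's worth, the OOM message itself suggests trying `max_split_size_mb` via `PYTORCH_CUDA_ALLOC_CONF`. Below is a minimal sketch, assuming the training command is launched from the same Colab runtime; the value 128 is only illustrative, and note that this mainly helps when reserved memory is much larger than allocated memory, which is not obviously the case in the trace above.

```python
# Minimal sketch: set the allocator option suggested by the OOM message before
# the cell that runs accelerate launch. Environment variables set here are
# inherited by shell commands started from the same Colab runtime.
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # 128 MiB is an illustrative value, not a tuned one

# Optional: check how much VRAM is actually free before training starts.
import torch

free_b, total_b = torch.cuda.mem_get_info()
print(f"free: {free_b / 1024**3:.2f} GiB / total: {total_b / 1024**3:.2f} GiB")
```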
Fixed now, use the latest notebook.
Two and a half days ago, a CUDA out of memory error started appearing simultaneously on two accounts, in two different training runs with different pictures. And yes, I checked the resolution of each picture. The problem occurs only in training; AUTOMATIC1111 handles even complex tasks without problems.
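In case it helps others double-check the same thing, here is a small sketch for verifying that every instance image has the expected resolution. The folder path is a placeholder (not from the notebook) and the 640x640 target is taken from the command in the report above; adjust both to your session.

```python
# Sketch: list any instance image whose size differs from the training resolution.
from pathlib import Path
from PIL import Image

instance_dir = Path("/content/gdrive/MyDrive/Fast-Dreambooth/Sessions/<SESSION>/instance_images")  # hypothetical path
expected = (640, 640)  # resolution used in the command above

for img_path in sorted(instance_dir.iterdir()):
    if img_path.suffix.lower() not in {".png", ".jpg", ".jpeg", ".webp"}:
        continue
    with Image.open(img_path) as img:
        size = img.size
    if size != expected:
        print(f"{img_path.name}: {size[0]}x{size[1]}")
```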