Open: dushwe opened this issue 4 months ago
Same problem here. I train a ControlNet with batch size 1 and get a CUDA out-of-memory error with 80 GB of VRAM.
lr: 1.0e-4
batch_size: 1
image_size: 768
grad_accum_steps: 1
updates: 100000
backup_every: 20000
save_every: 2000
warmup_updates: 1
use_fsdp: False
adaptive_loss_weight: True
print('1-load models start:', torch.cuda.memory_allocated())
models = self.setup_models(extras)
print('2-load models end:', torch.cuda.memory_allocated())

Output:
1-load models start: 0
2-load models end: 18517808640  (~17.2 GiB)
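The ~17 GiB allocated right after model setup suggests the weights are held in fp32. A minimal sketch (using a hypothetical `nn.Linear` as a stand-in for the actual models) of how casting frozen modules to bfloat16 halves their footprint:

```python
import torch

# Stand-in module; in the repo this would be the frozen parts of the
# pipeline (text encoder, EfficientNet encoder, previewer).
model = torch.nn.Linear(4096, 4096)
fp32_bytes = sum(p.numel() * p.element_size() for p in model.parameters())

model = model.to(torch.bfloat16)  # 2 bytes/element instead of 4
bf16_bytes = sum(p.numel() * p.element_size() for p in model.parameters())

print(fp32_bytes / bf16_bytes)  # 2.0
```

This only helps for modules that are not being updated by the optimizer; trainable weights usually stay in fp32 for stable updates.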
print('3-optimizers start:', torch.cuda.memory_allocated())
optimizers = self.setup_optimizers(extras, models)
print('4-optimizers end:', torch.cuda.memory_allocated())

Output:
3-optimizers start: 18517808640  (~17.2 GiB)
4-optimizers end: 47230638592  (~44.0 GiB)
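The ~27 GiB jump at optimizer setup is expected for Adam-style optimizers: they keep two fp32 state tensors (`exp_avg`, `exp_avg_sq`) per trainable parameter, roughly doubling parameter memory again. A small sketch with a hypothetical model:

```python
import torch

model = torch.nn.Linear(1024, 1024)  # stand-in for the trainable module
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())

# Optimizer state is allocated lazily on the first step.
model(torch.randn(8, 1024)).sum().backward()
opt.step()

state_bytes = sum(
    t.numel() * t.element_size()
    for s in opt.state.values()
    for t in s.values()
    if torch.is_tensor(t)
)
print(state_bytes / param_bytes)  # ~2.0: exp_avg + exp_avg_sq
```

So for this run, roughly params + grads + 2x optimizer state must fit before a single activation is allocated, which is why an 8-bit optimizer (e.g. bitsandbytes) or freezing more of the model is often the first lever to pull.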
conditions = self.get_conditions(batch, models, extras)
print('11-conditions:', torch.cuda.memory_allocated())
latents = self.encode_latents(batch, models, extras)
print('12-encode-latents:', torch.cuda.memory_allocated())

Output:
11-conditions: 47248081920  (~44.0 GiB)
12-encode-latents: 47248118784  (~44.0 GiB)
with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    pred = models.generator(noised, noise_cond, **conditions)
print('13-models-generator:', torch.cuda.memory_allocated())

Output:
13-models-generator: 60122718720  (~56.0 GiB)
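The ~12 GiB added by the generator forward pass is activation memory saved for backward; autocast only reduces its precision, not the need to keep it. Gradient checkpointing trades recomputation for memory by discarding most activations and recomputing them during backward. A sketch with a hypothetical stack of blocks standing in for the generator:

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# Hypothetical stand-in for the generator's block stack.
blocks = torch.nn.Sequential(
    *[torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.GELU())
      for _ in range(8)]
)
x = torch.randn(4, 512, requires_grad=True)

# Only the 4 segment boundaries keep activations; block interiors are
# recomputed on the fly during the backward pass.
out = checkpoint_sequential(blocks, 4, x, use_reentrant=False)
out.sum().backward()
```

Whether this is easy to wire into the repo's generator depends on its forward structure, but it is the standard way to fit a large model's training step into fixed VRAM at the cost of extra compute.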
loss, loss_adjusted = self.forward_pass(data, extras, models)
print('14-forward_pass:', torch.cuda.memory_allocated())

# BACKWARD PASS
grad_norm = self.backward_pass(
    i % self.config.grad_accum_steps == 0 or i == max_iters, loss, loss_adjusted,
    models, optimizers, schedulers
)
print('15-backward_pass:', torch.cuda.memory_allocated())

Output:
14-forward_pass: 59979117056  (~55.9 GiB)
15-backward_pass: 47247679488  (~44.0 GiB)
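The drop back to ~44 GiB after backward is the activation memory being freed once gradients are computed. The `i % grad_accum_steps == 0` flag passed to `backward_pass` above is the usual gradient-accumulation pattern; a minimal sketch of it (hypothetical model, not the repo's trainer):

```python
import torch

model = torch.nn.Linear(16, 1)  # stand-in for the trainable module
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
grad_accum_steps = 4

for i in range(1, 9):
    # Scale the loss so the accumulated gradient matches one large batch.
    loss = model(torch.randn(2, 16)).pow(2).mean() / grad_accum_steps
    loss.backward()  # gradients accumulate into p.grad across iterations
    if i % grad_accum_steps == 0:
        opt.step()
        opt.zero_grad(set_to_none=True)  # frees grad tensors entirely
```

Note that accumulation raises the effective batch size without raising peak activation memory, but it does nothing for the fixed params + optimizer-state floor, which in this run is already ~44 GiB.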
Truly wild that the maximum is batch size 1 on 80 GB of VRAM; something is definitely wrong here. It's a shame, too, since it seemed like someone made a real effort to document the repo and make it usable.
Perhaps they assume you're using multiple GPUs and FSDP when finetuning the big models?
Furthermore, since distributed training is essential when training large models from scratch or doing large finetunes, we have an option to use PyTorch's Fully Sharded Data Parallel (FSDP). You can use it by setting use_fsdp: True. Note that you will need multiple GPUs for FSDP. However, as mentioned above, this is only needed for large runs. You can still train and finetune our largest models on a powerful single machine.
Update: just tried FSDP with 2 A100s to see if that would help. Now all that happens is my CPU works very hard, and nothing else. I think this repo makes a lot of assumptions about your setup: FSDP, multi-GPU, Slurm, etc. https://github.com/Stability-AI/StableCascade/issues/71#issuecomment-1974039472
What is the maximum batch size on an 80 GB A100?
With a batch size of 1, it seems to peak at 75457 MiB of VRAM according to nvidia-smi on an A100 with 80 GB of VRAM.