Stability-AI / StableCascade

Official Code for Stable Cascade

stage_c_3b_finetuning #92

Open dushwe opened 4 months ago

dushwe commented 4 months ago

What is the max batch size on an A100 with 80 GB of VRAM?

With a batch size of 1, it seems to peak at 75457 MiB of VRAM according to nvidia-smi on an A100 with 80 GB of VRAM.

universewill commented 4 months ago

Same problem here. I trained a ControlNet with a batch size of 1 and got a CUDA out-of-memory error with 80 GB of VRAM.

dushwe commented 4 months ago

CUDA memory usage

Config:

lr: 1.0e-4
batch_size: 1
image_size: 768
multi_aspect_ratio: [1/1, 1/2, 1/3, 2/3, 3/4, 1/5, 2/5, 3/5, 4/5, 1/6, 5/6, 9/16]
grad_accum_steps: 1
updates: 100000
backup_every: 20000
save_every: 2000
warmup_updates: 1
use_fsdp: False
adaptive_loss_weight: True

I inserted torch.cuda.memory_allocated() calls at each step to track memory usage:

print('1-load models start:',torch.cuda.memory_allocated())
models = self.setup_models(extras)
print('2-load models end:',torch.cuda.memory_allocated())

1-load models start: 0
2-load models end: 18517808640
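For scale, 18517808640 bytes is about 17.2 GiB held just by the models (presumably the 3.6B Stage C generator plus the frozen encoders), before any optimizer state exists:

# Quick conversion of the number printed above (bytes -> GiB).
print(18517808640 / 1024**3)  # ≈ 17.2 GiB after setup_models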

print('3-optimizers start:',torch.cuda.memory_allocated())
optimizers = self.setup_optimizers(extras, models)
print('4-optimizers end:',torch.cuda.memory_allocated())

3-optimizers start: 18517808640
4-optimizers end: 47230638592
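The ~26.7 GiB jump during setup_optimizers is roughly what you'd expect from an Adam-style optimizer that keeps two fp32 state tensors (exp_avg and exp_avg_sq) per trainable parameter; a back-of-the-envelope check, assuming ~3.6B trainable parameters for the Stage C 3B model:

# Rough estimate only; assumes ~3.6e9 trainable params and AdamW-style state.
observed = 47230638592 - 18517808640            # bytes added by setup_optimizers
estimated = 3.6e9 * 4 * 2                       # exp_avg + exp_avg_sq in fp32
print(observed / 1024**3, estimated / 1024**3)  # ≈ 26.7 GiB vs ≈ 26.8 GiB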

conditions = self.get_conditions(batch, models, extras)
print('11-conditons:',torch.cuda.memory_allocated())
latents = self.encode_latents(batch, models, extras)
print('12-encode-latents:',torch.cuda.memory_allocated())

11-conditons: 47248081920
12-encode-latents: 47248118784

with torch.cuda.amp.autocast(dtype=torch.bfloat16):
            pred = models.generator(noised, noise_cond, **conditions)
            print("13-models-generator:",torch.cuda.memory_allocated())

13-models-generator: 60122718720
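So the generator forward pass alone adds roughly 12 GiB of activations on top of that, even at batch size 1 with 768px images:

# Activation memory added by the forward pass (bytes -> GiB).
print((60122718720 - 47248118784) / 1024**3)  # ≈ 12.0 GiB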


            loss, loss_adjusted = self.forward_pass(data, extras, models)
            print("14-forward_pass:",torch.cuda.memory_allocated())

            # BACKWARD PASS
            grad_norm = self.backward_pass(
                i % self.config.grad_accum_steps == 0 or i == max_iters, loss, loss_adjusted,
                models, optimizers, schedulers
            )
            print("15-backward_pass:",torch.cuda.memory_allocated())

14-forward_pass: 59979117056
15-backward_pass: 47247679488
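Note that torch.cuda.memory_allocated() only counts live tensors; nvidia-smi also sees PyTorch's caching-allocator pool and the CUDA context, which is why it reports ~75 GB while the peak above is ~60 GB. A small helper like this (hypothetical, not part of the repo) makes the checkpoints easier to compare:

import torch

def log_cuda_mem(tag: str) -> None:
    # memory_allocated(): bytes held by live tensors.
    # memory_reserved(): bytes held by the caching allocator,
    # which is much closer to what nvidia-smi reports.
    gib = 1024 ** 3
    print(f"{tag}: allocated={torch.cuda.memory_allocated() / gib:.2f} GiB, "
          f"reserved={torch.cuda.memory_reserved() / gib:.2f} GiB, "
          f"peak={torch.cuda.max_memory_allocated() / gib:.2f} GiB")

log_cuda_mem("after backward_pass")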

heyalexchoi commented 4 months ago

Truly wild to max out at a batch size of 1 on 80 GB of VRAM; something is definitely wrong here. It's a shame, too, since it seemed like someone made an effort to document the repo and make it usable.

Perhaps they assume you're using multiple GPUs and FSDP if you're fine-tuning the big models?

Furthermore, since distributed training is essential when training large models from scratch or doing large finetunes, we have an option to use PyTorch's Fully Sharded Data Parallel (FSDP). You can use it by setting use_fsdp: True. Note that you will need multiple GPUs for FSDP. However, as mentioned above, this is only needed for large runs. You can still train and finetune our largest models on a powerful single machine.
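For context, FSDP shards parameters, gradients, and optimizer state across GPUs, which is exactly where the memory above is going. A minimal generic PyTorch sketch (not this repo's actual wiring, which is toggled through use_fsdp: True in the config; build_generator() here is just a placeholder) looks roughly like:

import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Generic sketch; assumes a torchrun launch so LOCAL_RANK is set,
# and build_generator() stands in for the Stage C model.
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = FSDP(build_generator().to(local_rank))  # shards params/grads/optimizer state
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)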

Update: Just tried FSDP with 2 A100s to see if that would help. Now all that happens is my CPU works very hard, and that's it. I think this repo makes a lot of assumptions about your setup: FSDP, multi-GPU, SLURM, etc. https://github.com/Stability-AI/StableCascade/issues/71#issuecomment-1974039472