Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Cannot save checkpoint when using deepspeed #18878

Closed tellurion-kanata closed 11 months ago

tellurion-kanata commented 1 year ago

Bug description

Hi, I tried to train my model (very similar to the official implementation of Stable Diffusion v2.1) with the "deepspeed_stage_2" strategy in the Lightning Trainer, but something goes wrong: I cannot save model weights via trainer.save_checkpoint when training on multiple GPUs. There are no error logs and no output; the process just hangs without any hint.

I'd like to know whether there is a solution to this bug, or an alternative way to save the model weights. Thank you!

What version are you seeing the problem on?

v1.9, v2.0, v2.1

How to reproduce the bug

trainer = pl.Trainer(
        max_epochs              = opt.niter + opt.niter_decay,
        devices                 = opt.gpus,
        default_root_dir        = opt.ckpt_path,
        callbacks               = [ModelCheckpoint(opt.ckpt_path, opt.ckpt_path+"latest", every_n_train_steps=1)],
        precision               = '16',
        enable_checkpointing    = True,
        strategy                = "deepspeed_stage_2",
    )

Error messages and logs


Environment

device: DGX Station A100 (A100 SXM 40G x4)
cuda 12.2
lightning version:  2.0.1 (I previously tried the latest 2.1.0, but found that I could not save models even with normal DDP training)
torch version:  2.1.0+cu121
deepspeed version: 0.11.1

More info

No response

ZwormZ commented 1 year ago

I've encountered the same problem. Have you resolved it?

tellurion-kanata commented 1 year ago

I also encountered the same problem, have you resolved it?

No. I'm still waiting for a solution.

wangyifan commented 11 months ago

Does this happen only in a multi-GPU setting?

tellurion-kanata commented 11 months ago

Does this happen only in a multi-GPU setting?

Yes. I have switched to the DeepSpeed plugin in Hugging Face's accelerate.

RenShuhuai-Andy commented 10 months ago

@ydk-tellurion hi, I encountered the same problem using Hugging Face's accelerate: it runs well on one GPU, but hangs on multiple GPUs when it goes through the save_checkpoint function https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/engine.py#L3050C9-L3050C24

do you have any suggestions?

tellurion-kanata commented 10 months ago

@ydk-tellurion hi, I encountered the same problem using Hugging Face's accelerate: it runs well on one GPU, but hangs on multiple GPUs when it goes through the save_checkpoint function https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/engine.py#L3050C9-L3050C24

do you have any suggestions?

Hi, I'm using the Hugging Face Accelerator to save the weights, if that helps. Script like this (note that wait_for_everyone() must be called by every process, not only the main one, otherwise the barrier deadlocks):

    # all ranks must reach the barrier before the main process saves
    accelerate.wait_for_everyone()
    if accelerate.is_local_main_process:
        accelerate.save(model.state_dict(), filename, safe_serialization=True)
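The multi-GPU hang described in this thread is consistent with a classic collective-op deadlock: a barrier such as wait_for_everyone() must be entered by every rank, so guarding it behind is_local_main_process leaves the main rank waiting forever for peers that never arrive. A minimal pure-Python sketch of the pattern (using threads and threading.Barrier as hypothetical stand-ins for ranks; this is not the actual DeepSpeed/Accelerate code):

```python
import threading

def run_ranks(num_ranks, barrier_on_all_ranks):
    """Simulate ranks that may or may not all enter a collective barrier."""
    # timeout stands in for "the job hangs"; a real barrier would wait forever
    barrier = threading.Barrier(num_ranks, timeout=0.5)
    results = []

    def worker(rank):
        try:
            if barrier_on_all_ranks or rank == 0:
                barrier.wait()  # collective op: needs all participants
            results.append((rank, "ok"))
        except threading.BrokenBarrierError:
            results.append((rank, "deadlock"))

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dict(results)

# Correct pattern: every rank enters the barrier, so all proceed.
print(run_ranks(4, barrier_on_all_ranks=True))
# Buggy pattern: only rank 0 enters the barrier, so rank 0 stalls.
print(run_ranks(4, barrier_on_all_ranks=False))
```

With barrier_on_all_ranks=False only rank 0 waits at the barrier and times out, mirroring the silent hang reported above when a save/barrier is reached by a single process.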

gaozhangyang commented 3 months ago

I'm facing the same problem. Can anyone help?