Closed: tellurion-kanata closed this issue 11 months ago.
I also encountered the same problem. Have you resolved it?
No. I'm still waiting for a solution.
Does this happen only in a multi-GPU setting?
Yes. I have switched to the deepspeed plugin in huggingface's accelerate.
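For anyone considering the same route, here is a minimal sketch of what switching to the DeepSpeed plugin in Hugging Face's Accelerate can look like. The model, optimizer, data, and output filename below are placeholders, not the original poster's actual configuration, and ZeRO stage 2 is assumed to mirror the "deepspeed_stage_2" strategy discussed in this issue:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# ZeRO stage 2, matching the "deepspeed_stage_2" strategy used with Lightning
deepspeed_plugin = DeepSpeedPlugin(zero_stage=2, gradient_accumulation_steps=1)
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)

model = torch.nn.Linear(32, 2)                                  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
dataset = TensorDataset(torch.randn(64, 32), torch.randn(64, 2))  # placeholder data
dataloader = DataLoader(dataset, batch_size=8)

# DeepSpeed needs a dataloader in prepare() to infer the micro batch size
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

# ... training loop ...

# every rank must reach the barrier before the main process writes the weights
accelerator.wait_for_everyone()
if accelerator.is_main_process:
    accelerator.save(
        accelerator.unwrap_model(model).state_dict(),
        "model.safetensors",
        safe_serialization=True,
    )
```

Launch the script with `accelerate launch` so that all ranks are started and reach the barrier.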
@ydk-tellurion hi, I encountered the same problem using Hugging Face's Accelerate: it runs well on one GPU, but hangs on multiple GPUs when it goes through the save_checkpoint function (https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/engine.py#L3050C9-L3050C24).
Do you have any suggestions?
Hi, I'm using the Hugging Face accelerator to save the weights, if this helps. The script looks like this:

```python
# every rank must hit the barrier; otherwise the main process waits forever
accelerate.wait_for_everyone()
if accelerate.is_local_main_process:
    accelerate.save(model.state_dict(), filename, safe_serialization=True)
```
I'm hitting the same problem. Can anyone help?
Bug description
Hi, I tried to train my model (very similar to the official implementation of Stable Diffusion v2.1) using the "deepspeed_stage_2" strategy in the Lightning Trainer, but something goes wrong: I cannot save model weights with trainer.save_checkpoint when training on multiple GPUs. There are no error logs and no outputs; it just gets stuck without any hints. I want to know whether there are any solutions to this bug, or alternative ways to save the model weights. Thank you!
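For reference, a minimal sketch of the setup described above; the toy model, data, and checkpoint path are placeholders, not the reporter's actual code. On multiple GPUs, the final save_checkpoint call is the one reported to hang:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    """Stand-in for the diffusion model; any LightningModule exercises the same call path."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    dataset = TensorDataset(torch.randn(64, 32), torch.randn(64, 2))
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=2,                     # the hang is reported only with more than one GPU
        strategy="deepspeed_stage_2",  # DeepSpeed ZeRO stage 2, as in the report
        max_epochs=1,
    )
    trainer.fit(ToyModel(), DataLoader(dataset, batch_size=8))
    trainer.save_checkpoint("model.ckpt")  # reported to get stuck here on multiple GPUs
```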
What version are you seeing the problem on?
v1.9, v2.0, v2.1
How to reproduce the bug
Error messages and logs
Environment
More info
No response