Open yocontra opened 2 months ago
I think it's worth mentioning that while this patch worked for me on medium models, deadlocks were still so prevalent when training large models that I had to rewrite the trainer in my fork with PyTorch Lightning to make them go away.
So I think there may be some other issues at play here.
@nateraw Hmm, interesting. I've done a couple of training runs on the large models (using dora) and didn't have any issues with deadlocks.
This may have been hardware-specific: I had issues on H100s, but not on A100s.
I'm training on A40s with no issues, so it may be hardware-specific. I did see some other people post issues about H100s.
This commit (https://github.com/pytorch/pytorch/commit/a8329676273ac12f1fadfbcdd19c500d84998345), released in torch 2.1.0, breaks this code: https://github.com/facebookresearch/audiocraft/blob/main/audiocraft/optim/fsdp.py#L126
This ticket (https://github.com/facebookresearch/audiocraft/issues/358) has an implementation that works on torch 2.1.0 - but there should probably be some backward compatibility for a final implementation.
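For the backward-compatibility part, one option is to gate on the installed torch version and keep both code paths. Below is a minimal sketch of that version check only; the helper names `parse_release` and `needs_new_fsdp_path` are hypothetical and not part of audiocraft or torch, and the 2.1.0 cutoff is taken from the comment above rather than verified against every nightly build.

```python
import re


def parse_release(version: str) -> tuple:
    """Extract (major, minor, patch) from strings like '2.1.0+cu118'.

    Local build tags and pre-release suffixes are ignored; a missing
    patch component is treated as 0.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part or 0) for part in match.groups())


def needs_new_fsdp_path(version: str) -> bool:
    """Return True when the version is at or above the assumed 2.1.0 cutoff.

    Assumption: the breaking change applies to every build numbered 2.1.0
    or later. Pre-release builds numbered 2.1.0 (e.g. '2.1.0a0') are
    treated as new, which may be wrong for nightlies cut before the commit.
    """
    return parse_release(version) >= (2, 1, 0)


# Hypothetical usage inside the FSDP helper:
#   import torch
#   if needs_new_fsdp_path(torch.__version__):
#       ...  # implementation from issue #358
#   else:
#       ...  # existing implementation
```

Checking the version number is the simplest approach; feature detection on the FSDP internals would be more robust against backports, but it depends on private attributes that can change between releases.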