FSDP broken on newer versions of pytorch

facebookresearch / audiocraft

Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.

MIT License


FSDP broken on newer versions of pytorch #450

Open yocontra opened 2 months ago

yocontra commented 2 months ago

This commit, released in torch 2.1.0: https://github.com/pytorch/pytorch/commit/a8329676273ac12f1fadfbcdd19c500d84998345 breaks this line: https://github.com/facebookresearch/audiocraft/blob/main/audiocraft/optim/fsdp.py#L126

This ticket (https://github.com/facebookresearch/audiocraft/issues/358) has an implementation that works on torch 2.1.0, but a final implementation should probably stay backward compatible with older torch versions.
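One way to keep backward compatibility would be to gate the patched code path on the installed torch version. The snippet below is only a rough sketch of that idea, not the fix from #358: the helper names `parse_version` and `needs_new_fsdp_path` are hypothetical, and the 2.1.0 cutoff is taken from the release mentioned above. Nightly or dev builds that report 2.1.0 but predate the commit would be misclassified by a check like this.

```python
import re


def parse_version(version: str) -> tuple:
    """Extract (major, minor, patch) from strings like '2.1.0+cu118'.

    Local build suffixes ('+cu118') and dev tags are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def needs_new_fsdp_path(torch_version: str) -> bool:
    """Hypothetical helper: True when torch is 2.1.0 or newer.

    Assumption: the breaking commit first shipped in 2.1.0, as stated
    in this issue.
    """
    return parse_version(torch_version) >= (2, 1, 0)


# Intended use (left as comments so the sketch runs without torch installed):
# import torch
# if needs_new_fsdp_path(torch.__version__):
#     ...  # code path that works on torch >= 2.1.0
# else:
#     ...  # existing code path for older torch
```

The version string is passed in as an argument instead of being read inside the helper, so the check can be exercised without importing torch.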

nateraw commented 2 months ago

I think it's worth mentioning that while this patch did work for me for medium models, deadlocks were still so prevalent when trying to train large models that I had to rewrite the trainer in my fork with PyTorch Lightning to make them go away.

So I think there may be some other issues at play here.

yocontra commented 2 months ago

@nateraw Hmm, interesting. I've done a couple of training runs on the large models (using dora) and didn't have any issues with deadlocks.

nateraw commented 2 months ago

This may have been hardware specific: I had issues on H100s, but not on A100s.

yocontra commented 2 months ago

I'm training on A40s with no issues, so it may be hardware specific. I did see some other people post issues about H100s.