yangdongchao opened this issue 2 years ago

❓ Questions

Hi, when I try to reproduce the training code based on the released parts, I run into a problem with multi-GPU training: https://github.com/facebookresearch/encodec/blob/main/encodec/quantization/core_vq.py#L150 and https://github.com/facebookresearch/encodec/blob/main/encodec/quantization/core_vq.py#L168 cause DDP training to stall, because these lines make the multiple GPUs wait for each other. I therefore deleted these lines, and the model now trains with torch DDP. But I don't know whether removing them will influence performance. Can you give me some advice on whether these lines can be deleted?
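For illustration: if the stall comes from ranks diverging on whether they reach the collective call behind those lines (e.g. a broadcast of the codebook buffers after replacing expired codes), a common alternative to deleting the call is to all-reduce the branch decision first, so every rank enters or skips the collective together. This is a minimal sketch of that general pattern, not the authors' fix; `synchronized_any` and the usage comment are illustrative:

```python
import torch
import torch.distributed as dist

def synchronized_any(local_flag: torch.Tensor) -> bool:
    """Return the same True/False decision on every rank.

    All-reduces a local 0/1 flag with MAX, so that if any rank needs to
    run the collective path (e.g. broadcast updated codebook buffers),
    every rank runs it too, and no rank blocks waiting for the others.
    """
    flag = local_flag.detach().float()
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(flag, op=dist.ReduceOp.MAX)
    return bool(flag.item())

# Hypothetical usage inside a code-expiry step (simplified):
#
#   expired = self.cluster_size < self.threshold_ema_dead_code
#   if not synchronized_any(expired.any()):
#       return                             # all ranks skip together
#   self.replace_(batch_samples, mask=expired)
#   broadcast_tensors(self.buffers())      # all ranks now participate
```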
Good point, we actually did not use DDP for the training but custom distributed routines. We perform manual averaging of the gradients and the model buffers after the backward call using all-reduce operators provided by torch.distributed. See encodec/distrib.py, in particular sync_grad and sync_buffers.
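For readers without the repo at hand, a minimal sketch of what such manual averaging can look like; the real implementations live in encodec/distrib.py and may differ in detail:

```python
import torch
import torch.distributed as dist

def sync_grad(params):
    # Average gradients across workers after backward(), replacing what
    # DDP would otherwise do automatically.
    if not (dist.is_available() and dist.is_initialized()):
        return
    world_size = dist.get_world_size()
    for p in params:
        if p.grad is not None:
            dist.all_reduce(p.grad.data, op=dist.ReduceOp.SUM)
            p.grad.data /= world_size

def sync_buffers(model):
    # Average floating-point buffers (EMA stats, codebooks, ...) so all
    # workers keep identical model state.
    if not (dist.is_available() and dist.is_initialized()):
        return
    world_size = dist.get_world_size()
    for b in model.buffers():
        if b.is_floating_point():
            dist.all_reduce(b.data, op=dist.ReduceOp.SUM)
            b.data /= world_size
```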
@yangdongchao did you succeed in training the model?
Yes, I succeeded in training the model.
@yangdongchao can you share the code if possible?
@yangdongchao can you share the code? Thank you very much.
Hi, where do we put sync_grad and sync_buffers?
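For context, given the description above (the averaging happens "after the backward call"), here is a hedged sketch of where such calls typically sit in a training step, using the sketch helpers above; the exact signatures in encodec/distrib.py may differ:

```python
def train_step(model, batch, optimizer):
    # Placement sketch: gradients are averaged between backward() and
    # optimizer.step(); buffers are re-synced after the parameter update.
    optimizer.zero_grad()
    loss = model(batch)
    loss.backward()
    sync_grad(model.parameters())   # all-reduce / average gradients
    optimizer.step()
    sync_buffers(model)             # keep buffers identical across workers
    return loss
```

Whether the buffer sync needs to run every step or only periodically depends on how fast the buffers drift apart; the routines in encodec/distrib.py are the authority here.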