igor-krawczuk closed this issue 1 year ago.
Hi. CGX does not support more than one CGXState, and the state is only used in the allreduce hook. I don't think several allreduce hooks make sense in a DDP context; we developed CGX for DDP-AllReduce-like workloads, so we expect that only gradients are synchronized.
If you don't need to break the communicated buckets into separate layers and filter out small layers, you can avoid using CGXState and control compression with environment variables. If you don't want to compress some part of your communication, you can create another torch.distributed ProcessGroup (e.g. with the nccl backend) and use it for that type of communication, as in the sketch below.
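A minimal sketch of that suggestion, assuming torch_cgx registers a "cgx" backend on import (as in its README); the environment variable name in the comment is a placeholder, so check the CGX README for the exact knobs your version supports:

```python
import torch
import torch.distributed as dist
import torch_cgx  # noqa: F401  (assumed to register the "cgx" backend on import)

# Compression is controlled via environment variables exported before launch,
# e.g. (placeholder name): export CGX_COMPRESSION_QUANTIZATION_BITS=4

# Default process group on the CGX backend: all_reduce calls issued on it
# (including the ones DDP issues for gradient buckets) are compressed.
dist.init_process_group(backend="cgx", init_method="env://")

# Separate NCCL process group for traffic that should stay uncompressed
# (metrics, running statistics, etc.).
nccl_group = dist.new_group(backend="nccl")

t = torch.randn(1024, device="cuda")
dist.all_reduce(t)                    # goes over the default CGX group, compressed
dist.all_reduce(t, group=nccl_group)  # goes over plain NCCL, uncompressed
```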
Hi, okay, so if I want to compress discriminator gradients and generator gradients one after the other, it should work without using the state? Or would it still expect all tensors to be sent every time and possibly error out if it only encounters, e.g., the discriminator backward call?
Without CGXState, CGX will compress all tensors that are allreduced (i.e. for which torch.distributed.all_reduce is called), except very small ones with fewer than 16 elements.
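For the alternating generator/discriminator case, a minimal sketch (launch with torchrun) of what this implies, again assuming the "cgx" backend name from the README; no CGXState or comm hook is registered, so every gradient bucket that DDP all-reduces is compressed:

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
import torch_cgx  # noqa: F401  (assumed to register the "cgx" backend on import)

dist.init_process_group(backend="cgx", init_method="env://")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Toy generator/discriminator standing in for the real models.
gen = DDP(nn.Linear(64, 128).cuda(), device_ids=[local_rank])
disc = DDP(nn.Linear(128, 1).cuda(), device_ids=[local_rank])

z = torch.randn(8, 64, device="cuda")

# Discriminator step: the generator runs under no_grad, so only the
# discriminator's gradient buckets are all-reduced (and compressed) here.
with torch.no_grad():
    fake = gen(z)
disc(fake).mean().backward()

gen.zero_grad(); disc.zero_grad()

# Generator step: backward flows through both models, so both DDP instances
# all-reduce their buckets; all of those all_reduce calls are compressed.
(-disc(gen(z)).mean()).backward()
```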
Thanks, removing the state and using the environment variables did the trick
Hi, we had previously run experiments with the August 2022 version of the code, which is now offline. We are training a WGAN-GP with simultaneous ExtraAdam, and I was wondering what the correct usage of the new API would be. I'm currently trying with
However, we continuously get the error