I'm facing an issue with gradient overflow when training a model using specific combinations of GPUs in a multi-GPU setup with 4 identical NVIDIA RTX 3090 GPUs on a single machine.
The issue occurs only when GPU 2 and GPU 3 are used simultaneously. Training works fine with GPUs 0, 1, and 2, or with GPUs 0, 1, and 3. However, whenever GPUs 2 and 3 are used together, I consistently hit "Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to..." right at the start of training.
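For context, the failing combination is selected roughly like this (a simplified sketch, not my actual training script: the model, optimizer, and batch are placeholders, and the use of apex.amp is an assumption based on the "loss scaler" wording of the message):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"   # the failing pair; "0,1,2" and "0,1,3" train fine

import torch
import torch.nn as nn
from apex import amp   # assumption: the overflow message comes from apex's dynamic loss scaler

model = nn.Linear(512, 512).cuda()                         # placeholder for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)   # placeholder optimizer
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
model = nn.DataParallel(model)                             # gradients are reduced across the visible GPUs

x = torch.randn(64, 512).cuda()                            # placeholder batch
loss = model(x).sum()
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()   # "Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to..." appears here
optimizer.step()

The same code with CUDA_VISIBLE_DEVICES set to any combination that avoids pairing GPUs 2 and 3 runs without the overflow message.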
!nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 CPU Affinity NUMA Affinity
GPU0 X PXB PXB PXB 0-25,52-77 0
GPU1 PXB X PXB PXB 0-25,52-77 0
GPU2 PXB PXB X PIX 0-25,52-77 0
GPU3 PXB PXB PIX X 0-25,52-77 0
Changing the GPU combinations to avoid using GPUs 2 and 3 together resolves the issue. In the topology above, GPUs 2 and 3 are the only pair connected via PIX (traversing at most a single PCIe bridge), while every other pair communicates over PXB (multiple PCIe bridges), so the problem seems tied to that direct PCIe link between GPU 2 and GPU 3. I'm not sure how to address it further.
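One diagnostic I can run is to exercise the peer-to-peer path between those two devices directly (a sketch; torch.cuda.can_device_access_peer is the standard PyTorch query for P2P capability, and the indices below assume CUDA_VISIBLE_DEVICES is unset so device numbering matches nvidia-smi):

import torch

# Ask the driver whether GPUs 2 and 3 report peer-to-peer access to each other.
print(torch.cuda.can_device_access_peer(2, 3))
print(torch.cuda.can_device_access_peer(3, 2))

# Copy a tensor across the suspect PIX link and verify it arrives intact;
# a faulty P2P path can corrupt data without raising an error.
a = torch.randn(1000, device="cuda:2")
b = a.to("cuda:3")
print(torch.allclose(a.cpu(), b.cpu()))

If the training uses NCCL for gradient reduction, launching it with NCCL_P2P_DISABLE=1 (a standard NCCL environment variable that forces traffic through host memory instead of the direct PCIe peer-to-peer path) would be another way to confirm whether that link is the culprit.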