Description
An NCCL operation in PyTorch's distributed package needs to set up an NCCL communicator so that ranks can talk to one another. Setting up the communicator requires consulting the c10d key-value store, which is a blocking call. This blocks asyncio's event loop and prevents it from scheduling other coroutines. The issue is mitigated by offloading the blocking call with run_in_executor().
Note that this does not appear to be a permanent fix. Depending on timing, blocking still occurs from time to time and leads to an exception such as: "torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Socket Timeout".
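The mitigation can be sketched as follows. Here `blocking_store_get` is a hypothetical stand-in for the blocking c10d store lookup; in the real code path the blocking work happens inside the first collective issued on a fresh communicator, so that collective call is what gets wrapped in run_in_executor():

```python
import asyncio
import time


def blocking_store_get(key: str) -> str:
    """Hypothetical stand-in for the blocking c10d store->get() that
    NCCL communicator setup performs under the hood."""
    time.sleep(0.1)  # simulate the blocking socket round-trip
    return f"ncclUniqueId-for-{key}"


async def setup_communicator(key: str) -> str:
    loop = asyncio.get_running_loop()
    # Dispatch the blocking call to the default ThreadPoolExecutor
    # (first argument None) so the event loop keeps scheduling other
    # coroutines while this one waits on the store.
    return await loop.run_in_executor(None, blocking_store_get, key)
```

Because run_in_executor() only moves the wait off the loop thread, it does not remove the underlying store timeout; if the peer rank never publishes its key, the Socket Timeout above can still surface.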
Type of Change
Checklist