cisco-open / pymultiworld

A framework for PyTorch to enable fault management for collective communication libraries (CCL) such as NCCL
Apache License 2.0
15 stars 4 forks

fix: asyncio-friendly nccl operations #52

Closed myungjin closed 2 months ago

myungjin commented 2 months ago

Description

An NCCL operation in PyTorch's distributed package must first set up an NCCL communicator so that ranks can talk to one another. Setting up the communicator requires consulting the c10d key-value store. That lookup is a blocking call, so it blocks asyncio's event loop and prevents the loop from scheduling other coroutines. This change mitigates the issue by running the blocking call through run_in_executor().

Note that this doesn't seem to be a permanent fix. Depending on timing, blocking still appears from time to time and leads to an exception that may look like "torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Socket Timeout".
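
For illustration, the mitigation described above can be sketched without PyTorch or NCCL. This is a minimal sketch, not the code in this PR: `blocking_store_get` is a hypothetical stand-in that uses `time.sleep` to imitate a blocking key-value store lookup, and `heartbeat` is a hypothetical coroutine used only to show that the event loop keeps scheduling other work while the blocking call runs in a worker thread.

```python
import asyncio
import time


def blocking_store_get(key: str) -> str:
    """Hypothetical stand-in for a blocking key-value store lookup."""
    time.sleep(0.3)  # imitates a call that would otherwise stall the event loop
    return f"value-for-{key}"


async def heartbeat(ticks: list) -> None:
    """Append a tick every 20 ms; this only progresses while the loop is free."""
    for i in range(5):
        ticks.append(i)
        await asyncio.sleep(0.02)


async def main() -> tuple:
    loop = asyncio.get_running_loop()
    ticks = []
    hb = asyncio.create_task(heartbeat(ticks))

    # Passing None selects the loop's default ThreadPoolExecutor, so the
    # blocking function runs in a worker thread instead of on the loop.
    value = await loop.run_in_executor(None, blocking_store_get, "0:1")

    # Ticks recorded while the blocking call was in flight.
    ticks_during_call = len(ticks)
    await hb
    return value, ticks_during_call


value, ticks_during_call = asyncio.run(main())
print(value, ticks_during_call)
```

Calling `blocking_store_get("0:1")` directly inside `main()` would instead hold the loop for the full duration of the call, and `heartbeat` would record nothing until it returned. Offloading to an executor keeps the loop responsive, but it does not shorten the blocking call itself, which is consistent with the note above that timeouts can still occur.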

Type of Change

Checklist