cisco-open / pymultiworld

A framework for PyTorch to enable fault management for collective communication libraries (CCL) such as NCCL
Apache License 2.0

feat: concurrent world initialization #21

Closed myungjin closed 4 months ago

myungjin commented 4 months ago

Description

This change enables concurrent world initialization by dispatching each world's initialization to a concurrent.futures.ThreadPoolExecutor through asyncio's run_in_executor.
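
For readers unfamiliar with the pattern, here is a minimal sketch of how run_in_executor and ThreadPoolExecutor can combine to run several blocking initializations at once. It is not the code from this PR: init_world, init_worlds_concurrently, and the world names are hypothetical stand-ins, and the blocking step is simulated with time.sleep rather than a real process-group setup.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def init_world(name: str, delay: float = 0.2) -> str:
    """Hypothetical stand-in for a blocking world-initialization call."""
    time.sleep(delay)  # simulates a blocking setup step
    return f"{name}:ready"


async def init_worlds_concurrently(names: list[str]) -> list[str]:
    """Run one blocking init per world in a thread pool and await them all."""
    loop = asyncio.get_running_loop()
    # One worker per world so no initialization waits behind another.
    with ThreadPoolExecutor(max_workers=max(1, len(names))) as pool:
        futures = [loop.run_in_executor(pool, init_world, n) for n in names]
        # gather preserves the order of the submitted futures.
        return await asyncio.gather(*futures)


if __name__ == "__main__":
    start = time.perf_counter()
    results = asyncio.run(init_worlds_concurrently(["world1", "world2", "world3"]))
    print(results, f"{time.perf_counter() - start:.2f}s")
```

Because each call runs in its own worker thread, the event loop stays free while the blocking calls overlap, so the total wait should be close to that of the slowest single initialization rather than the sum of all of them.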

Type of Change

Checklist