aws / aws-ofi-nccl

This is a plugin that lets EC2 developers use libfabric as a network provider while running NCCL applications.
Apache License 2.0

Simplify locking and enable FI_THREAD_DOMAIN #666

Closed: bwbarrett closed this pull request 1 month ago

bwbarrett commented 1 month ago

This patch series switches from a fine-grained locking scheme (which creates thousands of locks in the rdma transport case) to a simple device and domain locking scheme, where the lock is held for the majority of each communication call. Because the plugin now serializes access to everything under a domain itself, it can request FI_THREAD_DOMAIN, which allows Libfabric to disable most of its internal locking.
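As a rough illustration of the domain locking idea (not the code in this series), here is a minimal pthread-only sketch. All names (`sketch_domain`, `sketch_comm`, `sketch_send`, `sketch_run`) are hypothetical and do not come from the plugin. It assumes one mutex per domain that is held across the whole communication call, which is the serialization guarantee an application takes on when it asks for FI_THREAD_DOMAIN (in Libfabric, typically by setting `hints->domain_attr->threading = FI_THREAD_DOMAIN` before `fi_getinfo()`).

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the plugin's real types. */
struct sketch_domain {
    pthread_mutex_t lock; /* one coarse lock covering everything under the domain */
    int posted_ops;       /* stand-in for state shared by endpoints and CQs */
};

struct sketch_comm {
    struct sketch_domain *domain; /* every communicator points at its domain */
    int sent;
};

struct sketch_worker_arg {
    struct sketch_comm comm;
    int iters;
};

static int sketch_domain_init(struct sketch_domain *d)
{
    d->posted_ops = 0;
    return pthread_mutex_init(&d->lock, NULL);
}

/* Hold the domain lock for the whole call instead of taking many
 * per-object locks. In a real transport, the Libfabric calls made on
 * behalf of this communicator would sit inside this critical section. */
static int sketch_send(struct sketch_comm *c)
{
    int rc = pthread_mutex_lock(&c->domain->lock);
    if (rc != 0)
        return rc;
    c->domain->posted_ops++;
    c->sent++;
    return pthread_mutex_unlock(&c->domain->lock);
}

static void *sketch_worker(void *p)
{
    struct sketch_worker_arg *a = p;
    for (int i = 0; i < a->iters; i++)
        sketch_send(&a->comm);
    return NULL;
}

#define SKETCH_MAX_THREADS 16

/* Run nthreads communicators against one shared domain and return the
 * number of operations recorded on the domain, or -1 on setup failure. */
int sketch_run(int nthreads, int iters)
{
    struct sketch_domain dom;
    struct sketch_worker_arg args[SKETCH_MAX_THREADS];
    pthread_t tids[SKETCH_MAX_THREADS];
    int started = 0;

    assert(nthreads > 0 && nthreads <= SKETCH_MAX_THREADS);
    if (sketch_domain_init(&dom) != 0)
        return -1;

    for (int i = 0; i < nthreads; i++) {
        args[i].comm.domain = &dom;
        args[i].comm.sent = 0;
        args[i].iters = iters;
        if (pthread_create(&tids[i], NULL, sketch_worker, &args[i]) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    int total = (started == nthreads) ? dom.posted_ops : -1;
    pthread_mutex_destroy(&dom.lock);
    return total;
}
```

Calling `sketch_run(4, 1000)` drives four threads through the same domain lock. The trade-off in this sketch is one contended mutex per domain in exchange for far fewer lock objects and no nested lock ordering to reason about.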

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

bwbarrett commented 1 month ago

This PR depends on https://github.com/aws/aws-ofi-nccl/pull/665. I also need to re-run performance tests.