aws / aws-ofi-nccl

This is a plugin which lets EC2 developers use libfabric as network provider while running NCCL applications.
Apache License 2.0

prefer spinlocks where possible #455

Closed aws-nslick closed 2 days ago

aws-nslick commented 2 weeks ago

Issue #, if available:

Description of changes:

pthread_mutex_t is 10x larger than pthread_spinlock_t and pushes several of our structs over the threshold for an additional cache line. We don't expect these locks to be contended under normal usage, and the cache impact is not worth the extra features of a full mutex. In the rare case that one of these locks is contended, spinning is probably the right thing to do anyway.
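To make the size argument concrete, here is a minimal sketch that compares the two lock types and counts how many cache lines a struct spans with each. The struct layouts, the `demo_` names, and the 64-byte cache line are assumptions for illustration and are not taken from the plugin. The sizes themselves are implementation-defined; on x86-64 Linux with glibc they are commonly 40 and 4 bytes, which matches the 10x figure.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdio.h>

/* Assumption: 64-byte cache lines, typical for x86-64 but not universal. */
#define DEMO_CACHE_LINE 64

/* Hypothetical queue-like struct guarded by a full mutex. */
struct demo_queue_mutex {
    pthread_mutex_t lock;
    void *head;
    void *tail;
    size_t len;
    size_t cap;
};

/* The same hypothetical struct guarded by a spinlock instead. */
struct demo_queue_spin {
    pthread_spinlock_t lock;
    void *head;
    void *tail;
    size_t len;
    size_t cap;
};

/* Number of cache lines an object of nbytes occupies, assuming it
 * starts on a cache-line boundary. */
size_t demo_cache_lines(size_t nbytes)
{
    assert(nbytes > 0);
    return (nbytes + DEMO_CACHE_LINE - 1) / DEMO_CACHE_LINE;
}

/* Print the sizes for the current platform. */
void demo_report(FILE *out)
{
    fprintf(out, "pthread_mutex_t:    %zu bytes\n", sizeof(pthread_mutex_t));
    fprintf(out, "pthread_spinlock_t: %zu bytes\n", sizeof(pthread_spinlock_t));
    fprintf(out, "struct with mutex:    %zu bytes, %zu cache line(s)\n",
            sizeof(struct demo_queue_mutex),
            demo_cache_lines(sizeof(struct demo_queue_mutex)));
    fprintf(out, "struct with spinlock: %zu bytes, %zu cache line(s)\n",
            sizeof(struct demo_queue_spin),
            demo_cache_lines(sizeof(struct demo_queue_spin)));
}
```

Calling `demo_report(stdout)` from a `main` function prints the figures for the local toolchain. Whether a given struct actually crosses a cache-line boundary depends on its other members and its alignment.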

The first commit reworks the lock macros so that they accept either a pthread_mutex_t or a pthread_spinlock_t, and converts all datapath usages. The follow-up commit makes the actual changes in the structs; because the first commit makes the call sites generic, very few code changes are needed (the exception is changing the initialization parameters).
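One way to write lock macros that accept either type is C11 `_Generic` dispatch on the pointer type. This is only a sketch of the idea: the `DEMO_LOCK` and `DEMO_UNLOCK` macros, `struct demo_counter`, and `demo_increment` are hypothetical names, and the plugin's actual macros may be implemented differently.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical generic macros: pick the pthread call that matches the
 * static type of the lock pointer, so call sites stay unchanged when a
 * struct member switches between the two lock types. */
#define DEMO_LOCK(l) _Generic((l),                \
        pthread_mutex_t *: pthread_mutex_lock,    \
        pthread_spinlock_t *: pthread_spin_lock)(l)

#define DEMO_UNLOCK(l) _Generic((l),              \
        pthread_mutex_t *: pthread_mutex_unlock,  \
        pthread_spinlock_t *: pthread_spin_unlock)(l)

/* Hypothetical datapath struct. Changing `lock` to pthread_mutex_t
 * requires no change in demo_increment(), only in initialization. */
struct demo_counter {
    pthread_spinlock_t lock;
    long value;
};

/* Increment under the lock and return the new value. */
long demo_increment(struct demo_counter *c)
{
    long v;
    int rc = DEMO_LOCK(&c->lock);
    assert(rc == 0);
    v = ++c->value;
    rc = DEMO_UNLOCK(&c->lock);
    assert(rc == 0);
    (void)rc; /* silence unused-variable warnings when NDEBUG is set */
    return v;
}
```

Initialization is the part that cannot be made fully generic this way, because the second argument differs: `pthread_mutex_init(&m, NULL)` takes an attribute pointer, while `pthread_spin_init(&s, PTHREAD_PROCESS_PRIVATE)` takes a process-sharing flag.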

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

aws-nslick commented 2 weeks ago
nslick@ip-10-0-3-242:~/aws-ofi-nccl$ # stop 6923322 pthread: handle spinlocks
nslick@ip-10-0-3-242:~/aws-ofi-nccl$ (make clean 2>/dev/null && make -j 2>/dev/null ) > /dev/null && echo $? && llvm-nm -g -C ./src/.libs/nccl_ofi_deque.o
0
                 U free
                 U malloc
                 U nccl_net_ofi_mutex_destroy
                 U nccl_net_ofi_mutex_init
---------------- T nccl_ofi_deque_finalize
---------------- T nccl_ofi_deque_init
                 U ofi_log_function
nslick@ip-10-0-3-242:~/aws-ofi-nccl$ # (magit-rebase-continue)
nslick@ip-10-0-3-242:~/aws-ofi-nccl$ (make clean 2>/dev/null && make -j 2>/dev/null ) > /dev/null && echo $? && llvm-nm -g -C ./src/.libs/nccl_ofi_deque.o
0
                 U free
                 U malloc
                 U nccl_net_ofi_spin_destroy
                 U nccl_net_ofi_spin_init
---------------- T nccl_ofi_deque_finalize
---------------- T nccl_ofi_deque_init
                 U ofi_log_function
aws-nslick commented 2 days ago

Closed as WONTDO; we're going to tackle this in other ways.