Open jjsjann123 opened 1 year ago
With our recently added support for distributed primitives, we introduced a dependency on NCCL/UCC in our code base.
Currently, the dependency on PyTorch's distributed package (and therefore on UCC, NCCL and Gloo) is contained in multidevice/communicator.cpp.
- We should have our source and build files better organized and allow a build like USE_MULTIDEVICE=0 to avoid the dependency on NCCL/UCC and on future libraries required for multi-device support.
Sure. Currently USE_MULTIDEVICE would coincide with USE_DISTRIBUTED, so we don't strictly need it, but maybe at some point we will.
- Explicit documentation in the build guide and a refactor of the build system to allow easy installation of dependencies.
Let me know how I can help with this
With our recently added support for distributed primitives, we introduced a dependency on NCCL/UCC in our code base.
https://github.com/NVIDIA/Fuser/tree/main/csrc/multidevice https://github.com/NVIDIA/Fuser/blob/fb9845e728136bc2ee7fd5b924440896303c1334/CMakeLists.txt#L140-L144
This part is not well tested against various builds of PyTorch (PyTorch builds with USE_DISTRIBUTED=0 have been causing issues, e.g. https://github.com/NVIDIA/Fuser/pull/598#issuecomment-1639808010). Moreover, I believe the dependency on NCCL/UCC is currently required, which arguably isn't necessary.
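A couple of actionable items:
- We should have our source and build files better organized and allow a build like USE_MULTIDEVICE=0 to avoid the dependency on NCCL/UCC and on future libraries required for multi-device support.
- Explicit documentation in the build guide and a refactor of the build system to allow easy installation of dependencies.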
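To make the USE_MULTIDEVICE=0 idea concrete, here is a rough sketch of how the communicator could be guarded so that a build without the flag never touches NCCL/UCC headers. This is not the current nvFuser code: the class shape, the method names (`is_available`, `size`, `allreduce`) and the assumption that the build system passes `-DUSE_MULTIDEVICE` as a compile definition are all hypothetical, for illustration only.

```cpp
// Hypothetical sketch; names are illustrative, not nvFuser's actual API.
#include <stdexcept>
#include <vector>

// Assumption: the build system defines USE_MULTIDEVICE only when
// multi-device support is requested (e.g. -DUSE_MULTIDEVICE).
#ifdef USE_MULTIDEVICE
// A real build would include the PyTorch distributed / NCCL / UCC
// headers here, and only here.
constexpr bool kMultiDeviceEnabled = true;
#else
constexpr bool kMultiDeviceEnabled = false;
#endif

class Communicator {
 public:
  // True only when the build includes a distributed backend.
  bool is_available() const { return kMultiDeviceEnabled; }

  // This sketch always reports a single process; a real multi-device
  // build would query the backend for the world size instead.
  int size() const { return 1; }

  // Collective entry point; fails loudly when support is compiled out.
  void allreduce(std::vector<float>& buffer) const {
    (void)buffer;  // unused in this sketch
#ifndef USE_MULTIDEVICE
    throw std::runtime_error(
        "allreduce is unavailable: built with USE_MULTIDEVICE=0");
#endif
    // With USE_MULTIDEVICE defined, the call would be forwarded to the
    // backend here; that part is omitted from the sketch.
  }
};
```

With a stub like this, the rest of the code base can keep referring to the communicator type unconditionally, and only the guarded sections would need the external libraries at build time.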