NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

Dependency on NCCL/UCC is implicit in the build #608

Open jjsjann123 opened 1 year ago

jjsjann123 commented 1 year ago

With our recently added support for distributed primitives, we introduced a dependency on NCCL/UCC in our code base.

https://github.com/NVIDIA/Fuser/tree/main/csrc/multidevice
https://github.com/NVIDIA/Fuser/blob/fb9845e728136bc2ee7fd5b924440896303c1334/CMakeLists.txt#L140-L144

This part is not well tested against the various builds of PyTorch. PyTorch builds with USE_DISTRIBUTED=0 have been causing issues, e.g. https://github.com/NVIDIA/Fuser/pull/598#issuecomment-1639808010.

Moreover, I believe the dependency on NCCL/UCC is currently mandatory, which arguably isn't necessary.

A couple of actionable items:

  1. We should organize our source and build files better and allow a build option such as USE_MULTIDEVICE=0, which avoids the dependency on NCCL/UCC and on any future libraries required for multi-device support.
  2. We should document the dependencies explicitly in the build guide and refactor the build system so that they are easy to install.
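
As a rough illustration of item 1 on the source side, the sketch below guards the distributed-dependent code behind a compile-time flag. It assumes the build system would pass `-DUSE_MULTIDEVICE=1` when multi-device support is wanted. The macro as a preprocessor definition, the `CommunicatorSketch` class, and the `"nccl"` placeholder are all hypothetical and are not nvFuser's actual API.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical macro: assumed to be set by the build system
// (e.g. -DUSE_MULTIDEVICE=1). Defaults to off so this file builds alone.
#ifndef USE_MULTIDEVICE
#define USE_MULTIDEVICE 0
#endif

#if USE_MULTIDEVICE
// A real build would include the PyTorch distributed headers here; that
// include is what brings in NCCL/UCC. Left out of this sketch on purpose.
#endif

// Hypothetical stand-in for a communicator class, for illustration only.
class CommunicatorSketch {
 public:
  // Reports at compile time whether multi-device support was built in.
  static constexpr bool isAvailable() { return USE_MULTIDEVICE != 0; }

  // Returns a placeholder backend name, or throws when the library was
  // built without multi-device support.
  std::string backendName() const {
#if USE_MULTIDEVICE
    return "nccl";  // placeholder value, not queried from any real backend
#else
    throw std::runtime_error(
        "built with USE_MULTIDEVICE=0: no distributed backend available");
#endif
  }
};
```

With a guard like this, callers can check `CommunicatorSketch::isAvailable()` before touching any distributed code path, and a build with the flag off never needs the NCCL/UCC headers or libraries.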

samnordmann commented 1 year ago

> With our recently added support for distributed primitives, we introduced a dependency on NCCL/UCC in our code base.

Currently, the dependency on PyTorch's distributed package (and therefore on UCC, NCCL and GLOO) is contained in multidevice/communicator.cpp.

> 1. We should organize our source and build files better and allow a build option such as USE_MULTIDEVICE=0, which avoids the dependency on NCCL/UCC and on any future libraries required for multi-device support.

Sure. Currently USE_MULTIDEVICE would coincide with USE_DISTRIBUTED, so we don't strictly need it, but maybe at some point we will.

> 2. We should document the dependencies explicitly in the build guide and refactor the build system so that they are easy to install.

Let me know how I can help with this.