NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Makefile: Add nccl_common.h and nccl_tuner.h to INCEXPORTS #1501

Closed: martin-belanger closed this 3 weeks ago

martin-belanger commented 3 weeks ago

To build an out-of-tree Tuner plugin, we need to have access to definitions from nccl_common.h and nccl_tuner.h. These header files will be installed in /usr/local/include/ by default, which is one of GCC's default search paths for header files.

gcongiu commented 3 weeks ago

The recommended way to write a tuner plugin is to copy nccl_tuner.h, along with the definitions it needs from nccl_common.h, from the NCCL source tree into nccl/tuner.h and nccl/common.h in your plugin's source tree. Please take a look at the tuner plugin example: https://github.com/NVIDIA/nccl/tree/master/ext-tuner/example. The process for writing external plugins is described here (Headers Management section) for the network plugin, but it also applies to all other plugins, the tuner included.

martin-belanger commented 3 weeks ago

Hi @gcongiu - Thanks for the pointers. I was already familiar with the example code, but I had not noticed the "Headers Management" section in the documentation.

I'm just wondering, however, whether duplicating definitions risks an out-of-tree plugin project not noticing when those definitions change in upstream NCCL. We almost need the equivalent of a "nccl-dev" package so that one can install development headers w/o having to duplicate stuff here and there. Just a thought... :wink:

gcongiu commented 3 weeks ago

Hi @martin-belanger, NCCL plugin APIs are all versioned and backward compatible. Even in the eventuality NCCL internal versions get bumped up, your older tuner plugin remains compatible and functional.

This is guaranteed by a compatibility layer in the tuner code: https://github.com/NVIDIA/nccl/blob/master/src/misc/tuner.cc. Look for functions named ncclTuner_vX_as_vY_* (where X < Y). Tuner API version 3 is backward compatible with version 2. Tuner API version 1 has been deprecated, though, and is no longer supported. Thus, if your tuner plugin uses that version, you should re-implement it to support at least version 2.
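To illustrate the general idea behind a vX_as_vY adapter, here is a minimal sketch in C. It is not taken from tuner.cc: the toy_tuner_v2_t and toy_tuner_v3_t structs, their getCollInfo signatures, and the "0 means no preference" convention are all hypothetical, simplified stand-ins rather than NCCL's real definitions. The sketch only shows the pattern, where the host wraps an older plugin in a function with the newer signature and forwards the call.

```c
#include <assert.h>
#include <stddef.h>

/* HYPOTHETICAL, simplified stand-ins; these are NOT NCCL's real tuner structs. */
typedef struct {
  const char *name;
  /* Older interface: picks an algorithm id for a given message size. */
  int (*getCollInfo)(size_t nBytes, int *algo);
} toy_tuner_v2_t;

typedef struct {
  const char *name;
  /* Newer interface: additionally reports a channel count. */
  int (*getCollInfo)(size_t nBytes, int *algo, int *nChannels);
} toy_tuner_v3_t;

/* The v2 plugin the host has loaded; the adapter forwards to it. */
static const toy_tuner_v2_t *loaded_v2 = NULL;

/* Adapter with the v3 signature that delegates to the v2 callback. */
static int toy_v2_as_v3_getCollInfo(size_t nBytes, int *algo, int *nChannels) {
  assert(loaded_v2 != NULL);
  *nChannels = 0; /* 0 means "no preference" in this toy model (assumption) */
  return loaded_v2->getCollInfo(nBytes, algo);
}

static toy_tuner_v3_t toy_v2_as_v3 = { "v2-as-v3", toy_v2_as_v3_getCollInfo };

/* Host-side helper: wrap a v2 plugin so the rest of the host only sees v3. */
static const toy_tuner_v3_t *toy_wrap_v2(const toy_tuner_v2_t *plugin) {
  loaded_v2 = plugin;
  toy_v2_as_v3.name = plugin->name;
  return &toy_v2_as_v3;
}

/* Example v2 plugin: algorithm 1 for messages of 1 MiB or more, else 0. */
static int example_v2_getCollInfo(size_t nBytes, int *algo) {
  *algo = (nBytes >= ((size_t)1 << 20)) ? 1 : 0;
  return 0;
}

static const toy_tuner_v2_t example_v2 = { "example", example_v2_getCollInfo };
```

In this toy model the host calls toy_wrap_v2(&example_v2) once at load time and afterwards uses only the returned v3-style table, so the older plugin needs no changes.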

In general, you are encouraged to write multiple versions of your plugin. Supporting several versions of the API ensures that your plugin also works with different (older) NCCL versions. Hope this helps.
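As a rough sketch of why shipping several versions helps, the C snippet below models a host that probes a plugin for its newest supported entry point and falls back to older ones. Everything here is an assumption made for illustration: the toyTunerPlugin_vN symbol names, the export table, and toy_find_symbol (a stand-in for a dlsym()-style lookup) are hypothetical and do not describe how NCCL actually loads plugins.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* HYPOTHETICAL stand-in for the symbols a plugin library exports. */
typedef struct {
  const char *symbol;
  int apiVersion;
} toy_export_t;

/* This toy plugin ships two entry points, one per supported API version. */
static const toy_export_t toy_plugin_exports[] = {
  { "toyTunerPlugin_v3", 3 },
  { "toyTunerPlugin_v2", 2 },
};

/* Stand-in for a dlsym()-style lookup: find an export by name, or NULL. */
static const toy_export_t *toy_find_symbol(const char *symbol) {
  size_t n = sizeof(toy_plugin_exports) / sizeof(toy_plugin_exports[0]);
  for (size_t i = 0; i < n; i++) {
    if (strcmp(toy_plugin_exports[i].symbol, symbol) == 0) {
      return &toy_plugin_exports[i];
    }
  }
  return NULL;
}

/* Host side: probe from the newest API version the host understands down to
 * the oldest. Returns the version found, or -1 if the plugin offers none. */
static int toy_select_version(int hostMax, int hostMin) {
  char symbol[32];
  for (int v = hostMax; v >= hostMin; v--) {
    snprintf(symbol, sizeof(symbol), "toyTunerPlugin_v%d", v);
    const toy_export_t *e = toy_find_symbol(symbol);
    if (e != NULL) return e->apiVersion;
  }
  return -1;
}
```

Under these assumptions, a host that only understands version 2 still finds a usable entry point, while a host that understands version 3 picks the newer one from the same plugin binary.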

martin-belanger commented 3 weeks ago

> NCCL plugin APIs are all versioned and backward compatible. Even in the eventuality NCCL internal versions get bumped up, your older tuner plugin remains compatible and functional.

Awesome! I am therefore closing this pull-request.