Closed by tmh97 10 months ago
Hi Thomas,
I'm a bit unclear on what you mean by "make a difference to the config scripts". The plugin sets the option FI_OPT_CUDA_API_PERMITTED to false by using the fi_setopt call to Libfabric (v1.18 and later) at runtime. Which config scripts are you referring to?
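For reference, a minimal sketch of what such a runtime call can look like. The `fi_setopt` call and the `FI_OPT_CUDA_API_PERMITTED` option are Libfabric's own. The helper names `forbid_cuda_api` and `classify_setopt`, the `cuda_opt_status` enum, and the `HAVE_LIBFABRIC` build flag are hypothetical, made up for this illustration. The reading of `-FI_EOPNOTSUPP` (the provider's CUDA support depends on calling the CUDA API) follows my understanding of the fi_endpoint(3) man page and should be checked against the Libfabric version in use. This is not the plugin's actual code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#ifdef HAVE_LIBFABRIC /* hypothetical flag: define it when Libfabric >= 1.18 headers are installed */
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>

/* Ask the provider behind `ep` not to call the CUDA API.
 * Returns 0 on success or a negative fi_errno value. */
int forbid_cuda_api(struct fid_ep *ep)
{
    bool permitted = false;
    return fi_setopt(&ep->fid, FI_OPT_ENDPOINT, FI_OPT_CUDA_API_PERMITTED,
                     &permitted, sizeof(permitted));
}
#endif

/* Hypothetical status codes for this sketch. */
enum cuda_opt_status {
    CUDA_API_FORBIDDEN,      /* option accepted: provider will not call CUDA */
    PROVIDER_NEEDS_CUDA_API, /* provider cannot honor the option */
    SETOPT_ERROR             /* any other failure */
};

/* Interpret the return value of the fi_setopt call above.
 * Assumption: FI_EOPNOTSUPP has the same value as errno's EOPNOTSUPP. */
enum cuda_opt_status classify_setopt(int ret)
{
    assert(ret <= 0); /* Libfabric calls return 0 or a negative error code */
    if (ret == 0)
        return CUDA_API_FORBIDDEN;
    if (ret == -EOPNOTSUPP)
        return PROVIDER_NEEDS_CUDA_API;
    return SETOPT_ERROR;
}
```

With Libfabric available, the two pieces combine as `classify_setopt(forbid_cuda_api(ep))`. A result of `PROVIDER_NEEDS_CUDA_API` would suggest the provider's FI_HMEM path cannot be used under NCCL's no-CUDA-calls rule.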
FI_HMEM indicates support for Libfabric directly accessing device memory. The plugin by default first tries to find an HMEM-capable provider (see https://github.com/aws/aws-ofi-nccl/blob/b6b76e003d6231de65a4d0e6e3a5a37202bc18ac/src/nccl_ofi_net.c#L384), so it should find one if available. What problem did you encounter in using your FI_HMEM implementation?
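To make the "HMEM-capable provider first" idea concrete, here is a simplified, self-contained model of that selection order. It does not use Libfabric at all: `struct provider`, `CAP_HMEM`, and `select_provider` are hypothetical stand-ins. In real code the candidates would normally come from `fi_getinfo` with FI_HMEM requested in the hints. The fallback to a non-HMEM provider in the second pass is an assumption drawn from the word "first" above, not something confirmed from the plugin source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capability bit standing in for Libfabric's FI_HMEM. */
#define CAP_HMEM (1u << 0)

/* Hypothetical stand-in for one provider entry. */
struct provider {
    const char *name;
    uint32_t caps;
};

/* Pass 1: return the first provider that advertises HMEM.
 * Pass 2 (assumed fallback): return the first provider of any kind.
 * Returns NULL when the list is empty. */
const struct provider *select_provider(const struct provider *list, size_t n)
{
    assert(list != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (list[i].caps & CAP_HMEM)
            return &list[i];
    }
    return n > 0 ? &list[0] : NULL;
}
```

Under this model, a provider that sets the HMEM bit is picked even when a non-HMEM provider appears earlier in the list. If a provider that advertises FI_HMEM is not chosen in practice, the hints it fails to match are the first thing to check.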
Attempting to answer your last question: FI_OPT_CUDA_API_PERMITTED prohibits Libfabric from making calls to the CUDA API, which NCCL forbids. FI_HMEM indicates support for data transfer to/from device memory, i.e., GPUDirect for GPUs. (A provider used with NCCL can't use the CUDA API, even if supporting FI_HMEM.) Finally, gdrcopy is an optional NVIDIA library to improve the performance of GPU memory copies. The Libfabric EFA provider will use it if available.
@rauteric Thanks a million for the clarification, this was extremely helpful!
If you would be so kind as to answer a few follow-ups:
Thanks for your time!
Hello. If I understand correctly, your Libfabric provider supports FI_HMEM (i.e., using GPU memory directly in Libfabric APIs) but does not support GPUDirect (i.e., the network device writing directly to GPU memory). In this case, as long as the provider does not make any CUDA calls, the plugin should be able to use the FI_HMEM implementation of your provider. I'm also not aware of any difference between the unit tests and NCCL/nccl-tests in this regard.
If you run nccl-tests with NCCL_DEBUG=TRACE, it should give some helpful info to determine why the plugin is not choosing the FI_HMEM implementation of your provider.
@rauteric Thanks for taking the time to answer my questions, Eric. This has been quite helpful.
I would greatly appreciate some clarity about how FI_OPT_CUDA_API_PERMITTED is used, its relationship to FI_HMEM, and its relationship to GPUDirect/GDRCopy, if any.