aws / aws-ofi-nccl

This is a plugin which lets EC2 developers use libfabric as a network provider while running NCCL applications.
Apache License 2.0

Unable to force FI_HMEM to be used and FI_OPT_CUDA_API_PERMITTED is not respected by config scripts #278

Closed. tmh97 closed this issue 10 months ago.

tmh97 commented 11 months ago
  1. Setting FI_OPT_CUDA_API_PERMITTED to false or 0 doesn't seem to make a difference to the config scripts; they still behave as if FI_OPT_CUDA_API_PERMITTED were set to 1.
  2. I am unable to force usage of the FI_HMEM implementation of my libfabric provider.

I would greatly appreciate some clarity about how FI_OPT_CUDA_API_PERMITTED is used, its relationship to FI_HMEM, and its relationship to GPUDirect/GDRCopy, if any.

rauteric commented 11 months ago

Hi Thomas,

  1. I'm a bit unclear on what you mean by "make a difference to the config scripts". The plugin sets the option FI_OPT_CUDA_API_PERMITTED to false at runtime by calling fi_setopt in Libfabric (v1.18 and later); see the sketch after this list. Which config scripts are you referring to?

  2. FI_HMEM indicates support for Libfabric directly accessing device memory. The plugin by default first tries to find an HMEM-capable provider (see https://github.com/aws/aws-ofi-nccl/blob/b6b76e003d6231de65a4d0e6e3a5a37202bc18ac/src/nccl_ofi_net.c#L384), so it should find one if available. What problem did you encounter in using your FI_HMEM implementation?
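
For reference, here is a minimal sketch of what those two steps look like at the Libfabric API level. It is illustrative only, not the plugin's exact code: error handling is trimmed, the capability/MR-mode bits are an assumed example set, and the function names are invented.

```c
/*
 * Illustrative sketch only (not the plugin's exact code): request an
 * HMEM-capable provider, then forbid CUDA API usage on an endpoint.
 * Error handling is trimmed and names are invented for brevity.
 */
#include <stdbool.h>
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_errno.h>

/* Step 1: ask fi_getinfo() for a provider that advertises FI_HMEM. */
int find_hmem_provider(struct fi_info **prov_info)
{
	struct fi_info *hints = fi_allocinfo();
	if (!hints)
		return -FI_ENOMEM;

	/* Example capability and MR-mode bits; a real application would
	 * request whatever it actually needs. */
	hints->caps = FI_MSG | FI_RMA | FI_HMEM;
	hints->domain_attr->mr_mode = FI_MR_LOCAL | FI_MR_HMEM |
				      FI_MR_VIRT_ADDR | FI_MR_ALLOCATED |
				      FI_MR_PROV_KEY;

	int ret = fi_getinfo(FI_VERSION(1, 18), NULL, NULL, 0, hints, prov_info);
	fi_freeinfo(hints);
	return ret;  /* nonzero: no FI_HMEM-capable provider was found */
}

/*
 * Step 2: after an endpoint has been created from the returned fi_info,
 * tell the provider it must not call the CUDA API itself
 * (requires Libfabric >= 1.18).
 */
int forbid_cuda_api(struct fid_ep *ep)
{
	bool permitted = false;

	return fi_setopt(&ep->fid, FI_OPT_ENDPOINT, FI_OPT_CUDA_API_PERMITTED,
			 &permitted, sizeof(permitted));
}
```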

Attempting to answer your last question: FI_OPT_CUDA_API_PERMITTED prohibits Libfabric from making calls to the CUDA API, which NCCL forbids. FI_HMEM indicates support for data transfer to/from device memory, i.e., GPUDirect for GPUs. (A provider used with NCCL can't use the CUDA API, even if it supports FI_HMEM.) Finally, gdrcopy is an optional NVIDIA library to improve the performance of GPU memory copies. The Libfabric EFA provider will use it if available.

tmh97 commented 11 months ago

@rauteric Thanks a million for the clarification, this was extremely helpful!

If you would be so kind as to answer a few follow-ups:

Thanks for your time!

rauteric commented 11 months ago

Hello. If I understand correctly, your Libfabric provider supports FI_HMEM (i.e., using GPU memory directly in Libfabric APIs) but does not support GPUDirect (i.e., the network device writing directly to GPU memory). In this case, as long as the provider does not make any CUDA calls, the plugin should be able to use the FI_HMEM implementation of your provider. I'm also not aware of any difference between the unit tests and NCCL/nccl-tests in this regard.

If you run nccl-tests with NCCL_DEBUG=TRACE, it should give some helpful info to determine why the plugin is not choosing the FI_HMEM implementation of your provider.
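
Independent of NCCL, a quick way to check what Libfabric itself reports on the node is a small standalone probe like the one below. This is only a suggestion of mine, not part of the plugin or nccl-tests; it assumes the Libfabric headers and library are installed and simply lists every provider that advertises FI_HMEM.

```c
/*
 * Standalone diagnostic: list which Libfabric providers on this node
 * report the FI_HMEM capability. Build with something like:
 *   cc probe_hmem.c -lfabric -o probe_hmem
 */
#include <stdio.h>
#include <rdma/fabric.h>
#include <rdma/fi_errno.h>

int main(void)
{
	struct fi_info *hints = fi_allocinfo();
	struct fi_info *info = NULL, *cur;

	if (!hints)
		return 1;
	hints->caps = FI_HMEM;   /* only return HMEM-capable providers */

	int ret = fi_getinfo(FI_VERSION(1, 18), NULL, NULL, 0, hints, &info);
	if (ret != 0) {
		fprintf(stderr, "no FI_HMEM-capable provider found: %s\n",
			fi_strerror(-ret));
	} else {
		for (cur = info; cur != NULL; cur = cur->next)
			printf("FI_HMEM-capable provider: %s\n",
			       cur->fabric_attr->prov_name);
		fi_freeinfo(info);
	}
	fi_freeinfo(hints);
	return ret != 0;
}
```

If your provider does not show up there, the issue is on the Libfabric side rather than in the plugin's selection logic.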

tmh97 commented 10 months ago

@rauteric Thanks for taking the time to answer my questions, Eric; this has been quite helpful.