aws / aws-ofi-nccl

This is a plugin which lets EC2 developers use libfabric as a network provider while running NCCL applications.
Apache License 2.0

GPU direct #313

Open tks2004 opened 7 months ago

tks2004 commented 7 months ago

If we need to enable GPU direct, is there any FI environment variable that must be set to use that feature?

rauteric commented 7 months ago

Applications request GPU direct capability from Libfabric by adding the FI_HMEM flag when calling fi_getinfo, as the plugin does here: https://github.com/aws/aws-ofi-nccl/blob/e704fd9dbd0620905a1b900d3d280f9e50daee10/src/nccl_ofi_net.c#L374

Before Libfabric 1.18, the Libfabric EFA provider also required the environment variable FI_EFA_USE_DEVICE_RDMA=1 to enable GPU direct. With Libfabric 1.18+ and aws-ofi-nccl 1.7.0+, this is no longer required. See also https://github.com/aws/aws-ofi-nccl/blob/master/doc/efa-env-var.md, which is mostly relevant to the EFA provider.
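
As a rough illustration of the version rule above, here is a minimal C sketch that a launcher could use to decide whether to export FI_EFA_USE_DEVICE_RDMA=1 before starting an NCCL job on the EFA provider. The helper names (version_at_least, needs_device_rdma_env, configure_gpu_direct_env) are hypothetical and not part of Libfabric or aws-ofi-nccl, and the caller is assumed to already know both version numbers. Treating "either component older than the stated version" as still needing the variable is an assumption based on the wording above.

```c
#define _POSIX_C_SOURCE 200112L  /* for setenv() */
#include <stdlib.h>

/* Returns 1 if (major, minor) is at least (req_major, req_minor). */
static int version_at_least(int major, int minor, int req_major, int req_minor)
{
    return major > req_major || (major == req_major && minor >= req_minor);
}

/*
 * Hypothetical helper: returns 1 if FI_EFA_USE_DEVICE_RDMA=1 should be set.
 * Assumption: the variable is unnecessary only when Libfabric is 1.18+
 * AND aws-ofi-nccl is 1.7+; in every other case it is still set.
 */
static int needs_device_rdma_env(int fab_major, int fab_minor,
                                 int plugin_major, int plugin_minor)
{
    return !(version_at_least(fab_major, fab_minor, 1, 18) &&
             version_at_least(plugin_major, plugin_minor, 1, 7));
}

/*
 * Hypothetical helper: sets the variable when needed.
 * Returns 0 on success (or when nothing had to be done), -1 on failure.
 */
static int configure_gpu_direct_env(int fab_major, int fab_minor,
                                    int plugin_major, int plugin_minor)
{
    if (!needs_device_rdma_env(fab_major, fab_minor,
                               plugin_major, plugin_minor))
        return 0;
    /* overwrite = 0 keeps any value the user has already exported */
    return setenv("FI_EFA_USE_DEVICE_RDMA", "1", 0);
}
```

For example, configure_gpu_direct_env(1, 17, 1, 6) would set the variable, while configure_gpu_direct_env(1, 18, 1, 7) would leave the environment untouched. The call has to happen before Libfabric is initialized in the process for the setting to have any effect.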