aws / aws-ofi-nccl

This is a plugin which lets EC2 developers use libfabric as network provider while running NCCL applications.
Apache License 2.0
147 stars 56 forks source link

platform-aws: bump libfabric requirement to 1.22 #713

Closed aws-nslick closed 1 day ago

aws-nslick commented 1 day ago

p5en bring-up has exposed a set of interconnected bugs across libfabric and the plugin to the point that it's become difficult to reason about how fixes apply retroactively to older libfabric releases. Bump the required version to 1.22 when building with platform-aws for our 1.13.x series to enforce that users aren't running a mixture of versions that developers haven't tested.

Signed-off-by: Nicholas Sielicki nslick@amazon.com

rauteric commented 1 day ago

We have a corresponding runtime check: https://github.com/aws/aws-ofi-nccl/blob/ebcf82b69d0616a2bc882412ca0154b824665870/src/nccl_ofi_net.c#L611-L622 -- Shouldn't we update that as well (in case someone builds with a newer Libfabric and runs with an older version.)

aws-nslick commented 1 day ago

I think it's reasonable to add a function that is just purely enforcing a minimum fi version at runtime, but I don't think that's the role that this specific check actually intended to enforce, if that makes sense.

My read of that check is that it's checking for 1.18 because it needs it for FI_OPT_CUDA_API_PERMITTED, and the print really means to say, "EFA provider on AWS platforms requires FI_OPT_CUDA_API_PERMITTED". If you buy that argument, then we should leave it alone so that we don't find ourselves at some point in the future with code that associates FI_OPT_CUDA_API_PERMITTED with 1.22 when the two are largely unrelated.

aws-nslick commented 1 day ago

you know what, I take that back -- the code around FI_OPT_CUDA_API_PERMITTED is also guarded with version checks, so it's not unfair for this to read as an independent runtime check unrelated to the feature. I'll amend, thanks.