aws / aws-ofi-nccl

This is a plugin which lets EC2 developers use libfabric as the network provider while running NCCL applications.
Apache License 2.0

fix(cuda): avoid loading stub #581

Closed aws-nslick closed 1 month ago

aws-nslick commented 2 months ago

The CUDA Toolkit (CTK) ships `stubs/libcuda.so`, which dlopen may find depending on the environment. The stub exists only so that software can resolve the user-mode driver (UMD) library at build time without having any particular driver installed.

In our usage, it is better to explicitly request `libcuda.so.1` to avoid ever loading the stub.
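As a rough illustration of the idea (not the actual patch), the sketch below asks dlopen for the versioned SONAME instead of the unversioned development name. The helper names `open_by_soname` and `open_cuda_driver` are hypothetical and do not come from the plugin; the flags `RTLD_NOW | RTLD_LOCAL` are an assumption, and on older glibc the program may need to be linked with `-ldl`.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper (not from aws-ofi-nccl): open a shared library by
 * its versioned SONAME and report the loader error on failure. */
void *open_by_soname(const char *soname)
{
    void *handle = dlopen(soname, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        fprintf(stderr, "dlopen(%s) failed: %s\n", soname, dlerror());
    }
    return handle;
}

/* Hypothetical helper: request "libcuda.so.1" rather than "libcuda.so".
 * The toolkit stub described above is named stubs/libcuda.so, so asking
 * for the SONAME avoids matching it by name. Returns NULL when no
 * driver library can be found. */
void *open_cuda_driver(void)
{
    return open_by_soname("libcuda.so.1");
}
```

A caller would treat a NULL return as "no CUDA driver available" and would release a non-NULL handle with `dlclose` once it is done resolving symbols.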

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

aws-nslick commented 2 months ago

Frustrating response; this just aligns us with the rest of the world.

With pytorch: link

With triton: link

And with the library itself:

$ readelf -d /usr/lib/x86_64-linux-gnu/libcuda.so.550.107.02 | grep -i soname
 0x000000000000000e (SONAME)             Library soname: [libcuda.so.1]

This is explained very well by @Artem-B here on the cmake forums.

> leaves us open to more problems down the line

Can you elaborate on this? What problems, exactly?

> I'm fine if you want to sync with what latest NCCL does, but let's not fix things that aren't actually broken.

NCCL statically links the CUDA runtime and resolves all driver functions through it; swapping that out is a much larger change for us. I don't see why you would want to block taking this simple fix on doing that swap.

aws-nslick commented 2 months ago

bot:aws:retest

aws-nslick commented 2 months ago

Waiting to merge this and https://github.com/ofiwg/libfabric/pull/10365 until @bwbarrett acks

aws-nslick commented 1 month ago

Closing this as it's supplanted by https://github.com/aws/aws-ofi-nccl/commit/d0040f97669fe6b9e20fb64c3c6ccd38b313154a

included in pr: https://github.com/aws/aws-ofi-nccl/pull/618