Closed aws-nslick closed 1 month ago
Frustrating response, this just aligns us with the rest of the world.
With PyTorch: link
With Triton: link
And with the library itself:
$ readelf -d /usr/lib/x86_64-linux-gnu/libcuda.so.550.107.02 | grep -i soname
0x000000000000000e (SONAME) Library soname: [libcuda.so.1]
This is explained very well by @Artem-B here on the CMake forums.
> leaves us open to more problems down the line
Can you elaborate on this? What problems, exactly?
I'm fine if you want to sync with what the latest NCCL does, but let's not fix things that aren't actually broken.
NCCL statically links the CUDA runtime and resolves all driver functions through it; swapping that out is a much larger change for us. I don't see why you would want to block this simple fix on doing that swap.
bot:aws:retest
Waiting to merge this and https://github.com/ofiwg/libfabric/pull/10365 until @bwbarrett acks
Closing this as it's supplanted by https://github.com/aws/aws-ofi-nccl/commit/d0040f97669fe6b9e20fb64c3c6ccd38b313154a
Included in PR: https://github.com/aws/aws-ofi-nccl/pull/618
The CUDA Toolkit (CTK) ships `stubs/libcuda.so`, which dlopen may find depending on the environment. The stub exists only so that software can resolve the user-mode driver (UMD) library at build time without having any particular driver installed.
In our usage, it is better to explicitly request `libcuda.so.1` so that the stub is never loaded.
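To make the reasoning above concrete, here is a minimal sketch of loading the driver by its SONAME and double-checking that the stub was not picked up. It is not the code from this PR; `open_cuda_driver` and `path_looks_like_stub` are hypothetical helper names, the `/stubs/` substring check is an assumed heuristic based on the toolkit's directory layout, and `dlinfo` with `RTLD_DI_LINKMAP` is a glibc extension.

```c
#define _GNU_SOURCE /* for dlinfo() and RTLD_DI_LINKMAP on glibc */
#include <dlfcn.h>
#include <link.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if the path appears to live in a
 * toolkit "stubs" directory, 0 otherwise. The substring check is an
 * assumption about the CTK layout, not an official rule.
 */
static int path_looks_like_stub(const char *path)
{
    return path != NULL && strstr(path, "/stubs/") != NULL;
}

/*
 * Hypothetical helper: dlopen the driver by the given name (intended
 * to be the SONAME "libcuda.so.1") and refuse the handle if the
 * dynamic loader resolved it to a stub directory.
 * Returns a dlopen handle, or NULL on failure.
 */
static void *open_cuda_driver(const char *soname)
{
    void *handle = dlopen(soname, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        fprintf(stderr, "dlopen(%s) failed: %s\n", soname, dlerror());
        return NULL;
    }

    /* Ask the loader which file it actually mapped for this handle. */
    struct link_map *map = NULL;
    if (dlinfo(handle, RTLD_DI_LINKMAP, &map) == 0 && map != NULL &&
        path_looks_like_stub(map->l_name)) {
        fprintf(stderr, "refusing stub library at %s\n", map->l_name);
        dlclose(handle);
        return NULL;
    }
    return handle;
}
```

A caller would do something like `void *h = open_cuda_driver("libcuda.so.1");` and then look up entry points such as `cuInit` with `dlsym(h, "cuInit")`. On older glibc versions this needs `-ldl` at link time. The path check is a belt-and-braces step: requesting the versioned name is already what keeps the unversioned `stubs/libcuda.so` from matching.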
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.