Open meditans opened 7 months ago
We may be able to shortcut the whole thing by landing https://github.com/NixOS/nixpkgs/pull/285249 (currently a draft), which presumably won't require this patching.
Because the new version of `torch` doesn't have the dependency on `openai-triton`?
I'd still add a patch to test if `addDriverRunpath.driverLink` exists, in addition to checking the environment variable. Otherwise somebody needs to set the variable, which we can't conveniently do in a Python module.
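A rough sketch of what such a patched lookup could look like, not the actual triton code: the function name `find_libcuda_dir` is hypothetical, the variable is assumed to hold a directory, and `/run/opengl-driver` is assumed to be the value of `addDriverRunpath.driverLink` (a real patch would substitute it at build time).

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Assumption: the value of addDriverRunpath.driverLink on NixOS.
# A nixpkgs patch would substitute the real value at build time.
DEFAULT_DRIVER_LINK = "/run/opengl-driver"


def find_libcuda_dir(
    env: Optional[Mapping[str, str]] = None,
    driver_link: str = DEFAULT_DRIVER_LINK,
) -> Optional[str]:
    """Return a directory expected to contain libcuda, or None.

    Hypothetical helper: the environment variable wins, then the
    driver link is probed, so nobody has to set the variable by hand.
    """
    env = os.environ if env is None else env

    # 1. Explicit override (assumed here to name a directory).
    override = env.get("TRITON_LIBCUDA_PATH")
    if override:
        return override

    # 2. Fall back to the driver link if it actually provides libcuda.
    candidate = Path(driver_link) / "lib"
    for name in ("libcuda.so.1", "libcuda.so"):
        if (candidate / name).exists():
            return str(candidate)

    # 3. Nothing found; the caller decides how to report it.
    return None
```

With this ordering, a user-supplied `TRITON_LIBCUDA_PATH` still takes precedence, while a stock NixOS system with the driver installed would need no configuration.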
As for upstream, we could consider opening an issue suggesting that they use `dlopen()` + `dlinfo()` instead of `ldconfig`, and that maybe they eventually transition to relying on nvidia-container-toolkit/CDI (which we/nixpkgs should probably support as the default means of discovering the driver, assuming that NVIDIA eventually removes their `ldconfig` hacks too).
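To illustrate the `dlopen()` + `dlinfo()` idea, here is a minimal sketch using `ctypes`. It is not something upstream ships: `resolve_library_path` is a hypothetical name, and the `RTLD_DI_LINKMAP` value and the leading fields of `struct link_map` are taken from glibc's headers, so treat it as Linux/glibc-specific.

```python
import ctypes
import os
from typing import Optional

# Assumption: glibc's value for RTLD_DI_LINKMAP (see <dlfcn.h>).
RTLD_DI_LINKMAP = 2


class LinkMap(ctypes.Structure):
    # Only the first two fields of glibc's `struct link_map` are
    # declared; that is enough to read l_name through a pointer.
    _fields_ = [
        ("l_addr", ctypes.c_void_p),
        ("l_name", ctypes.c_char_p),
    ]


def resolve_library_path(soname: str) -> Optional[str]:
    """Ask the dynamic loader where `soname` lives, without ldconfig."""
    try:
        lib = ctypes.CDLL(soname)  # dlopen() under the hood
    except OSError:
        return None  # the loader could not find or load the library

    try:
        dlinfo = ctypes.CDLL(None).dlinfo
    except (AttributeError, OSError):
        return None  # dlinfo() is not available on this platform

    dlinfo.argtypes = [ctypes.c_void_p, ctypes.c_int, ctypes.c_void_p]
    dlinfo.restype = ctypes.c_int

    link_map = ctypes.POINTER(LinkMap)()
    if dlinfo(lib._handle, RTLD_DI_LINKMAP, ctypes.byref(link_map)) != 0:
        return None
    if not link_map:
        return None

    name = link_map.contents.l_name
    return os.fsdecode(name) if name else None
```

Calling `resolve_library_path("libcuda.so.1")` would go through the same search the loader uses for any other library (RUNPATH, `LD_LIBRARY_PATH`, and so on), which is why it does not depend on `ld.so.cache` being present.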
Describe the bug
In the current version of `openai-triton`, `v2.1.0`, which is used to build pytorch, there's a function that calls `ldconfig -p`; on NixOS that means trying to open a cache file like `/nix/store/7jiqcrg061xi5clniy7z5pvkc4jiaqav-glibc-2.38-27/etc/ld.so.cache` and crashing. I first encountered this behavior when calling a different Python library which uses `openai-triton`.
Steps To Reproduce
You can reproduce the behavior using this flake:
after which, you can have this interaction in the python repl:
Expected behavior
The Python process shouldn't crash. The `openai-triton` library should be able to find the right CUDA libraries.
Additional context
I noticed that some commits alleviating the issue were made in `openai-triton`, but they landed after the `v2.1.0` release. With them, when the environment variable `TRITON_LIBCUDA_PATH` is defined, the content of that variable is used. I created a flake that patches `openai-triton` with these two commits and builds `torch` on top of the modified `openai-triton`. With this modification, I am able to use the library as intended, but I think a more permanent fix should be included in this and `torch-bin`, because compiling `torch` + `openai-triton` is incredibly time-consuming.
Notify maintainers
@NixOS/cuda-maintainers @SomeoneSerge @Madouura
Metadata
Please run `nix-shell -p nix-info --run "nix-info -m"` and paste the result.
Add a :+1: reaction to issues you find important.