NixOS / nixpkgs

openai-triton crashes trying to use `ldconfig -p` #285307

Open meditans opened 7 months ago

meditans commented 7 months ago

Describe the bug

In the current version of openai-triton, v2.1.0, which is used to build pytorch, there's a function that calls `ldconfig -p`; on NixOS that means trying to open a cache file like /nix/store/7jiqcrg061xi5clniy7z5pvkc4jiaqav-glibc-2.38-27/etc/ld.so.cache, which does not exist, so the call crashes. I first encountered this behavior while calling a different Python library that uses openai-triton.
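
For reference, the failing lookup in triton/common/build.py boils down to something like this (a paraphrase: the subprocess call matches the traceback below, while the parsing of the ldconfig output is approximate):

import os
import subprocess

def libcuda_dirs():
    # triton shells out to ldconfig to locate libcuda.so. On NixOS the
    # ldconfig on PATH belongs to a glibc store path whose
    # /etc/ld.so.cache was never generated, so the command exits non-zero
    # and check_output raises CalledProcessError.
    libs = subprocess.check_output(["ldconfig", "-p"]).decode()
    # Scan the output for entries mentioning libcuda.so and return their
    # directories (approximate; the real parsing differs in detail).
    return [os.path.dirname(line.split()[-1])
            for line in libs.splitlines() if "libcuda.so" in line]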

Steps To Reproduce

You can reproduce the behavior using this flake:

{
  description =
    "Trying to install torch and openai, as a first step to patching openai";

  inputs.nixpkgs.url = "nixpkgs/nixos-unstable";

  outputs = { self, nixpkgs }:
    let
      pkgs = (import nixpkgs {
        system = "x86_64-linux";
        config = {
          cudaSupport = true; # build torch with CUDA support
          allowUnfree = true; # the CUDA toolkit is unfree
        };
      });

    in {
      devShells.x86_64-linux.default = with pkgs;
        mkShell {
          buildInputs =
            [ (python3.withPackages (ps: with ps; [ torch openai-triton ])) ];
        };
    };
}

After entering the dev shell with `nix develop`, you can reproduce the failure in the Python REPL:

$ python
Python 3.11.7 (main, Dec  4 2023, 18:10:11) [GCC 13.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import triton
>>> triton.common.libcuda_dirs()
ldconfig: Can't open cache file /nix/store/7jiqcrg061xi5clniy7z5pvkc4jiaqav-glibc-2.38-27/etc/ld.so.cache
: No such file or directory
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/nix/store/yxnn9wngdiyym6i3ch2gznv3069aj25k-python3-3.11.7-env/lib/python3.11/site-packages/triton/common/build.py", line 21, in libcuda_dirs
    libs = subprocess.check_output(["ldconfig", "-p"]).decode()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/w4fvvhkzb0ssv0fw5j34pw09f0qw84w8-python3-3.11.7/lib/python3.11/subprocess.py", line 466, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/w4fvvhkzb0ssv0fw5j34pw09f0qw84w8-python3-3.11.7/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ldconfig', '-p']' returned non-zero exit status 1.

Expected behavior

The Python process shouldn't crash: the openai-triton library should be able to find the right CUDA libraries.

Additional context

I noticed that some commits in openai-triton alleviate the issue, but they were made after the v2.1.0 release. They make it so that, when the environment variable TRITON_LIBCUDA_PATH is defined, its value is used instead of the ldconfig lookup. I created a flake that patches openai-triton with these two commits and builds torch on top of the modified openai-triton; the sketch below shows roughly what the patched lookup does, and the flake itself follows.
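
A paraphrase of the patched behavior (the environment-variable branch is what the commits above add; the fallback parsing is approximate):

import os
import subprocess

def libcuda_dirs():
    # After the patches: an explicitly set TRITON_LIBCUDA_PATH wins, so
    # NixOS users can point triton at the driver libraries directly.
    env_path = os.environ.get("TRITON_LIBCUDA_PATH")
    if env_path is not None:
        return [env_path]
    # Otherwise fall back to the old ldconfig scan, which fails on NixOS.
    libs = subprocess.check_output(["ldconfig", "-p"]).decode()
    return [os.path.dirname(line.split()[-1])
            for line in libs.splitlines() if "libcuda.so" in line]

The flake: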

{
  description = "Installing torch and openai, with a patched openai";

  inputs.nixpkgs.url = "nixpkgs/nixos-unstable";

  outputs = { self, nixpkgs }:
    let
      pkgs = (import nixpkgs {
        system = "x86_64-linux";
        config = {
          cudaSupport = true;
          allowUnfree = true;
        };
        overlays = [
          # Override python3's package set so that torch (and anything
          # else in it) is built against the patched openai-triton.
          (final0: prev0: rec {
            python3 = prev0.python3.override {
              packageOverrides = final: prev: rec {
                openai-triton = prev.openai-triton.overrideAttrs (oldAttrs: {
                  patches = oldAttrs.patches ++ [
                    (prev0.fetchpatch {
                      url =
                        "https://github.com/openai/triton/commit/871ec2ad37c1e521b0b6b43555e99c7702638976.patch";
                      sha256 =
                        "sha256-557jk38vY0S1ozL2hN67LsvwccDsM1hcTEjtPU/vC/8=";
                    })
                    (prev0.fetchpatch {
                      url =
                        "https://github.com/openai/triton/commit/46452fae3bb072b9b8da4d1529a0af7c8f233de5.patch";
                      sha256 =
                        "sha256-if1lewXY+uzZ8D9//TSo4GZ9XFI9c2+UtcSEtVNzTrQ=";
                    })
                  ];
                });
              };
            };
          })
        ];
      });

    in {
      devShells.x86_64-linux.default = with pkgs;
        mkShell {
          buildInputs =
            [ (python3.withPackages (ps: with ps; [ torch openai-triton ])) ];
        };
    };
}

With this modification, I am able to use the library as intended, but I think a more permanent fix should be included in nixpkgs for both torch and torch-bin, because compiling torch plus openai-triton from source is incredibly time-consuming.
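
For example, with the patched package in place, the lookup can be steered like this (assuming the driver libraries are exposed at the standard NixOS location /run/opengl-driver/lib):

$ TRITON_LIBCUDA_PATH=/run/opengl-driver/lib python -c 'import triton; print(triton.common.libcuda_dirs())'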

Notify maintainers

@NixOS/cuda-maintainers @SomeoneSerge @Madouura

Metadata

Please run nix-shell -p nix-info --run "nix-info -m" and paste the result.

$ nix run nixpkgs#nix-info -- -m
 - system: `"x86_64-linux"`
 - host os: `Linux 5.10.205, NixOS, 24.05 (Uakari), 24.05.20231227.cfc3698`
 - multi-user?: `no`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.18.1`
 - nixpkgs: `not found`


samuela commented 7 months ago

We may be able to shortcut the whole thing by landing https://github.com/NixOS/nixpkgs/pull/285249 (currently a draft), which presumably won't require this patching.

meditans commented 7 months ago

Because the new version of torch doesn't have the dependency on openai-triton?

SomeoneSerge commented 7 months ago

I'd still add a patch to test whether addDriverRunpath.driverLink exists, in addition to checking the environment variable; otherwise somebody needs to set the variable, which we can't conveniently do in a Python module. A minimal sketch of that lookup order is below.
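
A minimal sketch, assuming driverLink resolves to /run/opengl-driver as it does on current NixOS (the lookup order and error handling here are hypothetical, not an actual patch):

import os

# Hypothetical order: explicit variable first, then the NixOS driver
# link; the ldconfig fallback is omitted here.
DRIVER_LINK = "/run/opengl-driver/lib"  # addDriverRunpath.driverLink + "/lib"

def libcuda_dirs():
    env_path = os.environ.get("TRITON_LIBCUDA_PATH")
    if env_path is not None:
        return [env_path]
    if os.path.isdir(DRIVER_LINK):
        return [DRIVER_LINK]
    raise RuntimeError("libcuda.so not found; set TRITON_LIBCUDA_PATH")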

As for upstream, we could consider opening an issue suggesting that they use dlopen()+dlinfo() instead of ldconfig, and that they eventually transition to relying on nvidia-container-toolkit/CDI (which we/nixpkgs should probably support as the default means of discovering the driver, assuming that NVIDIA eventually removes their ldconfig hacks too).
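
For illustration, the dlopen()+dlinfo() approach can be sketched from Python with ctypes (glibc-specific: RTLD_DI_LINKMAP and the link_map layout are glibc details, and it assumes the dynamic loader can already resolve libcuda.so.1):

import ctypes

# Partial glibc struct link_map: only the first two fields are needed
# to read l_name, the resolved path of the loaded object.
class LinkMap(ctypes.Structure):
    _fields_ = [("l_addr", ctypes.c_void_p), ("l_name", ctypes.c_char_p)]

RTLD_DI_LINKMAP = 2  # glibc request code for dlinfo()

def libcuda_path():
    # Let the dynamic loader find libcuda.so.1 (honoring RPATH,
    # LD_LIBRARY_PATH, the cache, ...), then ask where it found it.
    handle = ctypes.CDLL("libcuda.so.1")
    libc = ctypes.CDLL(None)  # dlinfo lives in libc/libdl
    lm = ctypes.POINTER(LinkMap)()
    libc.dlinfo(ctypes.c_void_p(handle._handle), RTLD_DI_LINKMAP,
                ctypes.byref(lm))
    return lm.contents.l_name.decode()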