NixOS / nixpkgs


Build failure: python311Packages.tensorflowWithCuda #317090

Open aryanjassal opened 2 months ago

aryanjassal commented 2 months ago

Steps To Reproduce

Steps to reproduce the behavior:

  1. build python311Packages.tensorflowWithCuda

Build log

https://gist.github.com/aryanjassal/cb19d4d335743504dd35404d153ed663

Additional context

My flake.nix file which compiles this as a dependency:

{
  description = "Object detection via image segmentation using ResNet-18";
  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs";
    flake-utils.url = "github:numtide/flake-utils";
  };
  outputs = { self, nixpkgs, flake-utils }:
    flake-utils.lib.eachDefaultSystem (system:
    let
      pkgs = import nixpkgs {
        inherit system;
        config.allowUnfree = true;
      };
      tensorflow = pkgs.python3Packages.tensorflowWithCuda.overrideAttrs(oldAttrs: {
        buildInputs = oldAttrs.buildInputs ++ [ pkgs.cudatoolkit pkgs.linuxPackages.nvidia_x11 ];
        preBuild = ''
          export CUDA_PATH=${pkgs.cudatoolkit}
          export EXTRA_LDFLAGS="-L/lib -L${pkgs.linuxPackages.nvidia_x11}/lib"
          export EXTRA_CCFLAGS="-I/usr/include -I${pkgs.cudatoolkit}/lib"
        '';
      });
      keras = with pkgs.python3Packages; buildPythonPackage rec {
        pname = "keras";
        version = "2.13.1";
        src = fetchPypi {
          inherit pname version;
          sha256 = "sha256-XfEswkGgFaEbZd20UsDusnRPziHZtUukjbh0klaMzGg=";
        };
        buildInputs = [
          numpy
          tensorflow
        ];
        doCheck = false;
      };
    in {
      devShells.default = pkgs.mkShell {
        name = "object-detector";
        propagatedNativeBuildInputs = [
          tensorflow
        ];
        buildInputs = with pkgs.python3Packages; with pkgs; [
          setuptools
          cudatoolkit
          pip
          tensorflow
          keras
          pandas
          numpy
          ipython
          opencv4
          matplotlib
          h5py
          jsonschema
          scikit-image
        ];
      };
    }
  );
}

Previously, I was defining tensorflow as follows:

tensorflow = pkgs.python3Packages.tensorflowWithCuda;

But on some revisions it would error out with the following log:

nix-repl> :b nixpkgs.python3Packages.tensorflowWithCuda              
error: builder for '/nix/store/h5dq6brjf3hpzcm09zqaqspa6jqafhcm-nccl-2.20.5-1.drv' failed with exit code 2;
       last 10 log lines:
       > nvcc warning : incompatible redefinition for option 'compiler-bindir', the last value of this option was used
       > make[2]: Leaving directory '/build/source/src/device'
       > Linking    libnccl.so.2.20.5                   > /build/source/build/lib/libnccl.so.2.20.5
       > Archiving  libnccl_static.a                    > /build/source/build/lib/libnccl_static.a
       > /nix/store/hqvni28zpibl6jsqqimcvng6h6qm58xy-binutils-2.41/bin/ld: cannot find -lcudart_static: No such file or directory
       > collect2: error: ld returned 1 exit status
       > make[1]: *** [Makefile:79: /build/source/build/lib/libnccl.so.2.20.5] Error 1
       > make[1]: *** Waiting for unfinished jobs....
       > make[1]: Leaving directory '/build/source/src'
       > make: *** [Makefile:25: src.build] Error 2
       For full logs, run 'nix log /nix/store/h5dq6brjf3hpzcm09zqaqspa6jqafhcm-nccl-2.20.5-1.drv'.
error: 1 dependencies of derivation '/nix/store/77834w7g0r1xsixm3ninwvm4vakra96g-python3.11-tensorflow-gpu-2.13.0.drv' failed to build

I presumed the build could not find some CUDA libraries (the linker reports "cannot find -lcudart_static"), so I passed the cudatoolkit path via CUDA_PATH, EXTRA_CCFLAGS, and EXTRA_LDFLAGS, which let me avoid that error.

Here is the output of nix flake metadata, which shows the nixpkgs and flake-utils revisions:

[user@system:~]$ nix flake metadata
Resolved URL:  git+file:///mnt/root/object-detector
Locked URL:    git+file:///mnt/root/object-detector
Description:   Object detection via image segmentation using ResNet-18
Path:          /nix/store/cx7hfc6kysdlmar97i0i9bkgyvvw4y8v-source
Revision:      0e0f2c5f5719a0e02e9be935b0e0be211f6c4aab-dirty
Last modified: 2024-04-18 17:20:26
Inputs:
├───flake-utils: github:numtide/flake-utils/b1d9ab70662946ef0850d488da1c9019f3a9752a
│   └───systems: github:nix-systems/default/da67096a3b9bf56a91d16901293e51ba5b49a27e
└───nixpkgs: github:NixOS/nixpkgs/0b2a090503b08d27bc82f923eb562805f35eb498
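
For anyone trying to reproduce this, here is a minimal sketch of a flake that pins nixpkgs to the revision listed above and exposes only the failing package. The x86_64-linux system is taken from the nix-info output in this report; the description string and the output layout are illustrative assumptions, and this has not been verified to fail in the same way on every machine.

```nix
{
  description = "Sketch: reproduce the tensorflowWithCuda build failure";

  # Pin nixpkgs to the revision reported by `nix flake metadata` in this issue.
  inputs.nixpkgs.url =
    "github:NixOS/nixpkgs/0b2a090503b08d27bc82f923eb562805f35eb498";

  outputs = { self, nixpkgs }:
    let
      system = "x86_64-linux";
      pkgs = import nixpkgs {
        inherit system;
        config.allowUnfree = true; # CUDA packages are unfree
      };
    in {
      # Build only the package named in the issue title, with no overrides.
      packages.${system}.default = pkgs.python311Packages.tensorflowWithCuda;
    };
}
```

Running nix build in a directory containing this flake.nix should then attempt the same build as step 1 above.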

Notify maintainers

@abbradar

Metadata

Please run nix-shell -p nix-info --run "nix-info -m" and paste the result.

[user@system:~]$ nix-shell -p nix-info --run "nix-info -m"
 - system: `"x86_64-linux"`
 - host os: `Linux 6.6.25, Dell Precision 3480, noversion, nobuild`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.18.2`
 - nixpkgs: `/etc/nixpkgs`


CMCDragonkai commented 2 months ago

Can you check whether this applies:

You need to set cudaSupport to true to actually enable GPU support.
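
A minimal sketch of what that could look like, assuming the nixpkgs-wide config.cudaSupport option is what is meant here. With it set, the plain tensorflow attribute is expected to be built against CUDA, so no overrideAttrs or extra environment variables are involved. The description string and shell layout are illustrative, and whether this builds on the revision from the report is untested.

```nix
{
  description = "Sketch: enable CUDA support through the nixpkgs config";

  inputs.nixpkgs.url = "github:NixOS/nixpkgs";

  outputs = { self, nixpkgs }:
    let
      system = "x86_64-linux";
      pkgs = import nixpkgs {
        inherit system;
        config = {
          allowUnfree = true; # CUDA packages are unfree
          cudaSupport = true; # ask CUDA-aware packages to build with CUDA
        };
      };
    in {
      devShells.${system}.default = pkgs.mkShell {
        packages = [
          # Plain tensorflow; CUDA support comes from the config above.
          (pkgs.python3.withPackages (ps: [ ps.tensorflow ]))
        ];
      };
    };
}
```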

I suspect tensorflowWithCuda is just broken. A few years back there was no need to provide all those environment variables. The package should just work, and if it doesn't, I don't think the package maintainers have set it up correctly. I reckon there should be some sort of shell hook that automatically sets up everything that is required, so one can start hacking in a Python 3 shell immediately without messing with random env variables.
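
To illustrate the kind of shell hook meant here, below is a rough sketch of a shell.nix that uses the shellHook attribute of mkShell. It assumes a NixOS host where the NVIDIA driver libraries are exposed under /run/opengl-driver/lib; that path and the choice of python3Packages.tensorflow are assumptions for illustration, not something the package currently ships.

```nix
# shell.nix (hypothetical): enter with `nix-shell`
{ pkgs ? import <nixpkgs> {
    config = {
      allowUnfree = true; # CUDA packages are unfree
      cudaSupport = true;
    };
  }
}:

pkgs.mkShell {
  packages = [
    (pkgs.python3.withPackages (ps: [ ps.tensorflow ]))
  ];

  # Runs when the shell starts. Prepends the driver library directory
  # (assumed NixOS location) so libcuda.so can be found at runtime.
  shellHook = ''
    export LD_LIBRARY_PATH=/run/opengl-driver/lib''${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
  '';
}
```

The ''${ sequence is how a literal ${ is written inside a Nix indented string, so the parameter expansion is left for the shell to evaluate.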

aryanjassal commented 2 months ago

I have tried both tensorflowWithCuda and tensorflow-bin, and neither worked. tensorflowWithCuda failed to compile altogether, throwing a hash mismatch error when building through bazel, and tensorflow-bin just couldn't find or load the CUDA drivers.

I have ensured that my system is set up correctly, as running pytorchWithCuda works perfectly and can also detect my GPU without issues. I have also tried running a demo script and used nvidia-smi to confirm that my GPU is, in fact, being used.