NixOS / nixpkgs

Nix Packages collection & NixOS
MIT License

Build failure: python311Packages.cupy #324351

Open cfhammill opened 3 months ago

cfhammill commented 3 months ago

Steps To Reproduce

nix-shell -p '(import <nixpkgs> {
          inherit system;
          config.allowUnfree = true;
          config.cudaSupport = true;
          config.cudaCapabilities = [
            "7.5"
            "8.0"
          ];}).python311Packages.cupy'

Build log

full log: https://gist.github.com/cfhammill/22616c79dfe5a1d19755bf0eb51cfddf

Seemingly relevant sections include:

-------- Configuring Module: cusparselt --------
/build/tmp8adp7qy1/a.cpp:1:10: fatal error: cusparseLt.h: No such file or directory
    1 | #include <cusparseLt.h>
      |          ^~~~~~~~~~~~~~
compilation terminated.
command '/nix/store/mpm3i0sbqc9svfch6a17179fs64dz2kv-gcc-wrapper-13.3.0/bin/g++' failed with exit code 1

and

Exception: Could not find libcudart_static.a: /nix/store/cmy1ismvlzgw5qjxr6an84kysgfbc3yj-cudatoolkit-joined-11.8/lib64/libcudart_static.a does not exist

This is interesting, because replacing lib64 with lib in the path above does point to a static libcudart archive.
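
The lib versus lib64 observation can be checked mechanically. Below is a minimal sketch, assuming only that the toolkit root is a plain directory tree; `find_cudart_static` is a hypothetical helper name, not something provided by cupy or nixpkgs.

```shell
# Hypothetical helper, not part of cupy or nixpkgs.
# Prints the first libcudart_static.a found under <root>/lib64 or <root>/lib.
find_cudart_static() {
    cuda_root="$1"
    for libdir in lib64 lib; do
        candidate="$cuda_root/$libdir/libcudart_static.a"
        if [ -f "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    # Neither directory holds the archive: report on stderr and fail.
    printf 'libcudart_static.a not found under %s\n' "$cuda_root" >&2
    return 1
}
```

Called as `find_cudart_static /nix/store/cmy1ismvlzgw5qjxr6an84kysgfbc3yj-cudatoolkit-joined-11.8`, it should, going by the report above, print the path under lib rather than lib64.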

Notify maintainers

@samuela @SomeoneSerge @hyphon81

Metadata

Please run nix-shell -p nix-info --run "nix-info -m" and paste the result.

[user@system:~]$ nix-shell -p nix-info --run "nix-info -m"
 - system: `"x86_64-linux"`
 - host os: `Linux 5.15.0-101-generic, Ubuntu, 22.04.3 LTS (Jammy Jellyfish), nobuild`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.18.1`
 - nixpkgs: `/mnt/data/home/HammillC/.nix-defexpr/channels/nixpkgs`


cfhammill commented 3 months ago

Patching cupy to find libcudart_static.a in lib instead of lib64 allows the build to proceed, but it still does not succeed in building. I'm getting g++ compilation errors that I don't have context for.

cfhammill commented 3 months ago

Downgrading to cupy 12.3.0 builds successfully. 12.3.0 doesn't use the static lib, so the build failure above doesn't occur.

SomeoneSerge commented 3 months ago

"...succeed in building. I'm getting g++ compilation errors that I don't have context for."

Hi, could you post them in a gist?

cfhammill commented 3 months ago

They're pretty much all the same, but with different specific variables not in scope.

cupy_backends/cuda/libs/cutensor.cpp:7219:79: error: ‘cutensorPlan_t’ was not declared in this scope; did you mean ‘cutensorAlgo_t’?
 7219 |         __pyx_v_status = cutensorReduce(((cutensorHandle_t)__pyx_v_handle), ((cutensorPlan_t)__pyx_v_plan), ((void *)__pyx_v_alpha), ((void *)__pyx_v_A), ((void *)__pyx_v_beta), ((void *)__pyx_v_C), ((void *)__pyx_v_D), ((void *)__pyx_v_workspace), __pyx_v_workspaceSize, ((cudaStream_t)__pyx_v_stream));
      |                                                                               ^~~~~~~~~~~~~~
      |                                                                               cutensorAlgo_t
cupy_backends/cuda/libs/cutensor.cpp:7219:94: error: expected ‘)’ before ‘__pyx_v_plan’
 7219 |         __pyx_v_status = cutensorReduce(((cutensorHandle_t)__pyx_v_handle), ((cutensorPlan_t)__pyx_v_plan), ((void *)__pyx_v_alpha), ((void *)__pyx_v_A), ((void *)__pyx_v_beta), ((void *)__pyx_v_C), ((void *)__pyx_v_D), ((void *)__pyx_v_workspace), __pyx_v_workspaceSize, ((cudaStream_t)__pyx_v_stream));
      |                                                                             ~                ^~~~~~~~~~~~
      |                                                                                              )
cupy_backends/cuda/libs/cutensor.cpp: In function ‘PyObject* __pyx_f_13cupy_backends_4cuda_4libs_8cutensor_destroyOperationDescriptor(intptr_t, int)’:
cupy_backends/cuda/libs/cutensor.cpp:7482:63: error: ‘cutensorOperationDescriptor_t’ was not declared in this scope; did you mean ‘cutensorContractionDescriptor_t’?
 7482 |         __pyx_v_status = cutensorDestroyOperationDescriptor(((cutensorOperationDescriptor_t)__pyx_v_desc));
      |                                                               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                                               cutensorContractionDescriptor_t
cupy_backends/cuda/libs/cutensor.cpp:7482:93: error: expected ‘)’ before ‘__pyx_v_desc’
 7482 |         __pyx_v_status = cutensorDestroyOperationDescriptor(((cutensorOperationDescriptor_t)__pyx_v_desc));
      |                                                             ~                               ^~~~~~~~~~~~
      |                                                                                             )

My guess is that the packaged cutensor version is too old to declare these types.
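
One way to test that guess is to grep the cutensor header that the build actually sees for the type names the compiler complains about. This is a rough sketch: `missing_cutensor_types` is a hypothetical helper, and the header path is whatever your cutensor package provides (an assumption, not taken from the log).

```shell
# Hypothetical helper: list which of the given type names never appear
# as a whole word in the given header file.
# Returns 0 if all names are present, 1 if any is missing.
missing_cutensor_types() {
    header="$1"
    shift
    status=0
    for type_name in "$@"; do
        if ! grep -q -w -- "$type_name" "$header"; then
            printf 'missing: %s\n' "$type_name"
            status=1
        fi
    done
    return "$status"
}
```

For example, `missing_cutensor_types /path/to/include/cutensor.h cutensorPlan_t cutensorOperationDescriptor_t`, where `/path/to/include` is a placeholder for the real include directory. Any `missing:` line would be consistent with the header being older than the API that cupy expects.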

cfhammill commented 3 months ago

@SomeoneSerge updating to cutensor 2.0.2 fixed the build in combination with my cupy patch. Is there an established process for editing the manifest files at https://github.com/NixOS/nixpkgs/blob/master/pkgs/development/cuda-modules/cutensor/manifests/ to add the hashes/sizes for all archs? I hacked in linux-x86_64 by hand to get it to work.
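
For reference, the values hacked in by hand can be computed from a downloaded archive roughly as follows. This is only a sketch of the manual step, not the established nixpkgs process. It assumes each manifest entry records a SHA-256 digest and a byte size per archive, and `archive_hash_and_size` is a hypothetical helper name.

```shell
# Hypothetical helper: print the SHA-256 digest and the size in bytes
# of a downloaded archive, one labelled value per line.
archive_hash_and_size() {
    archive="$1"
    [ -f "$archive" ] || return 1
    # sha256sum prints "<digest>  <file>"; keep only the digest.
    digest=$(sha256sum "$archive" | cut -d ' ' -f 1) || return 1
    # wc -c may pad its output with spaces on some systems; strip them.
    size=$(wc -c < "$archive" | tr -d ' ') || return 1
    printf 'sha256: %s\nsize: %s\n' "$digest" "$size"
}
```

Running it once per platform archive gives the two values for each entry. Which other fields the manifest needs is not covered here.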

berquist commented 1 month ago

This doesn't seem to be an issue anymore on b4bc024641b3c877bd0ab7b45c34099da8279d53 for python310Packages.cupy, python311Packages.cupy, or python312Packages.cupy.