[Tracking] ROCm packages

Madouura commented 2 years ago

Tracking issue for ROCm derivations.

moar packages

Key

Package
- Dependencies

WIP

-

Ready

-

TODO

[ ] Add CUDA options to all derivations that can use a CUDA backend
[ ] Implement tests using a system like #200757
- Use the solution from #261155
[ ] Implement building any missed documentation
[ ] Implement ROCm into tensorflow
[ ] Add hipBLASLt (required by https://github.com/NixOS/nixpkgs/pull/288644#issuecomment-2045283055)
[ ] Optimize closures
- [ ] https://github.com/NixOS/nixpkgs/issues/276846
- [ ] https://github.com/NixOS/nixpkgs/issues/242401
[ ] https://github.com/NixOS/nixpkgs/issues/301937

Merged

[x] #197838
[x] #198770
[x] #199324
[x] #200705
[x] #199574
[x] #202373
[x] #202649
[x] #202685
[x] #203235
[x] #202476
[x] #203412
[x] #204378
[x] #206421
[x] #206995
[x] #213208
[x] #214339
[x] #214606
[x] #258328
[x] #260299
[x] #261180
[x] #261578
[x] #262823
[x] #262750
[x] #262798
[x] #274980

ROCm-related

#261155
#263048

Notes

Update command: nix-shell maintainers/scripts/update.nix --argstr commit true --argstr keep-going true --arg predicate '(path: pkg: builtins.elem (pkg.pname or null) [ "rocm-llvm-llvm" "rocm-core" "rocm-cmake" "rocm-thunk" "rocm-smi" "rocm-device-libs" "rocm-runtime" "rocm-comgr" "rocminfo" "clang-ocl" "rdc" "rocm-docs-core" "hip-common" "hipcc" "clr" "hipify" "rocprofiler" "roctracer" "rocgdb" "rocdbgapi" "rocr-debug-agent" "rocprim" "rocsparse" "rocthrust" "rocrand" "rocfft" "rccl" "hipcub" "hipsparse" "hipfort" "hipfft" "tensile" "rocblas" "rocsolver" "rocwmma" "rocalution" "rocmlir" "hipsolver" "hipblas" "miopengemm" "composable_kernel" "half" "miopen" "migraphx" "rpp-hip" "mivisionx-hip" "hsa-amd-aqlprofile-bin" ])'

Won't implement

ROCmValidationSuite
- Too many assumptions, not going to rewrite half the cmake files
rocm_bandwidth_test
- Not really needed, will implement on request
atmi
- Out-of-date
aomp
- We basically already do this
Implement strictDeps for all derivations
- Seems pointless now and I don't see many other derivations doing this

kurnevsky commented 1 year ago

I believe your GPU is GFX11?

Yes.

Madouura commented 1 year ago

New tensorflow-rocm WIP at https://github.com/Madouura/nixpkgs/commit/344aa780809455f545eae4895bed72e3e9af0de6. Current blocking factor is an LLVM mismatch. Most likely, tensorflow 2.13.0 isn't nearly up-to-date enough with rocm 5.7.1.

Madouura commented 1 year ago

@Flakebi I have some basic impureTests stuff at https://github.com/Madouura/nixpkgs/blob/pr/rocm/pkgs/development/rocm-modules/5/rocm-thunk/generic.nix as well as some other stuff. Tell me if you think this is the best way to go forward please.

Flakebi commented 1 year ago

Nice! I think we shouldn’t add anything to <package>.tests that is not also runnable as a (pure) nix test because these get parsed by scripts and bots. Why not set the testScript = "${rocmPackages_5.rocm-smi-variants.shared}/bin/rocm-smi"? That would make most tests easier to build :)

I think the rocminfo test can check the output that it actually detected something (like rocminfo | grep -E 'Device Type: +GPU' and rocm_agent_enumerator | grep -E 'gfx[^0]'). That makes sure we don’t ship something that’s unable to find GPUs.
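A minimal sketch of the two checks suggested above, written as filters on stdin so they can be tried against captured output without a GPU. The function names has_gpu_agent and has_gfx_target are hypothetical; only the two grep patterns come from the comment. The second pattern presumably exists to reject a lone gfx000 entry.

```shell
#!/bin/sh
# Succeeds when rocminfo-style text on stdin reports at least one GPU agent.
has_gpu_agent() {
  grep -Eq 'Device Type: +GPU'
}

# Succeeds when rocm_agent_enumerator-style text on stdin names a target whose
# first character after "gfx" is not 0, so a lone gfx000 entry fails the check.
has_gfx_target() {
  grep -Eq 'gfx[^0]'
}

# Intended use on a machine with a working GPU (impure, needs device access):
#   rocminfo | has_gpu_agent && rocm_agent_enumerator | has_gfx_target
```

Both functions exit non-zero when nothing matches, so a test script that calls them fails when no GPU is detected.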

Madouura commented 1 year ago

I'm going to take a bit of a break from ROCm and work on another project. I'll try to work on the major updates/upgrades here and there, but until early-mid next year the other project is going to be my focus. If there are any major issues or if you just need something explained, don't hesitate to ping me.

gjz010 commented 11 months ago

Hi. Thanks for maintaining rocm for nix!

When I try to use torchWithRocm I get the following error:

MIOpen(HIP): Error [Compile] 'hiprtcCompileProgram(prog.get(), c_options.size(), c_options.data())' naive_conv.cpp: HIPRTC_ERROR_COMPILATION (6)
MIOpen(HIP): Error [BuildHip] HIPRTC status = HIPRTC_ERROR_COMPILATION (6), source file: naive_conv.cpp
MIOpen(HIP): Warning [BuildHip] hip runtime failed to load.
Error: Please provide architecture for which code is to be generated.
MIOpen Error: /build/source/src/hipoc/hipoc_program.cpp:304: Code object build failed. Source: naive_conv.cpp

Any idea what should be in the environment? I tried adding recent meta.rocm-all but it didn't help.

Same problem here with same GPU (7900 XTX). After some strace on your minimal example I noticed that:

openat(AT_FDCWD, "/nix/store/mkih90ygzxczv4k0fn6gapgi7i7wy292-rocm-llvm-libunwind-5.7.1/lib/libamdhip64.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
...
openat(AT_FDCWD, "./libamdhip64.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "./libamdhip64.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
write(2, "MIOpen(HIP): Error [Compile] 'hi"..., 145MIOpen(HIP): Error [Compile] 'hiprtcCompileProgram(prog.get(), c_options.size(), c_options.data())' naive_conv.cpp: HIPRTC_ERROR_COMPILATION (6)
) = 145

It appears that libamdhip64.so is not on the runtime library path:

ls $(dirname $(nix-shell -p rocmPackages.meta.rocm-hip-runtime --run "which hipcc"))/../lib/libamdhip64.so
# /nix/store/09ic1qizx0aacml0vi83k9lgq23fz0wg-rocm-hip-runtime-meta/bin/../lib/libamdhip64.so

By setting environment variable manually:

export LD_LIBRARY_PATH=/nix/store/bz15zrilgr04ghdiz4cd73sam5wvmhhw-clr-5.7.1/lib/

The problem is temporarily fixed and I can now run Stable Diffusion WebUI.

Madouura commented 11 months ago

ROCm 6.0.0 has been released. rocmPackages_5 is now in maintenance-mode. I will eventually backport the changes I am making with rocmPackages_6 to rocmPackages_5, however it is not a high priority.

kurnevsky commented 11 months ago

By setting environment variable manually

Interesting - now pytorch works for me, but it doesn't seem to work correctly. I'm trying to generate an image from sdxl+lora with diffusers, and it generates an incorrect image...

I tried identical code and model with manually defined seeds in google colab with cuda - it works there. Also seems to work locally on cpu with f32 types.

(or it might be some problem in one of the libs, since locally I use all python libs from nix)

sersorrel commented 9 months ago

The export LD_LIBRARY_PATH=/nix/store/...-clr-5.7.1/lib solution fixed the same torchWithRocm problem for me, also with a 7900 XTX. I couldn't see how you got that path – it's returned by nix build --print-out-paths nixpkgs#rocmPackages.clr, right?
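A rough sketch of the same workaround with the store path looked up instead of hard-coded. The helper name prepend_ld_library_path is hypothetical. Whether nix build returns the same clr path that torch was built against depends on the nixpkgs revision in use, so treat the lookup as an assumption to verify.

```shell
#!/bin/sh
# Prepend a directory to LD_LIBRARY_PATH unless it is already listed.
prepend_ld_library_path() {
  dir=$1
  case ":${LD_LIBRARY_PATH:-}:" in
    *":$dir:"*) ;;  # already present, leave the variable unchanged
    *) LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
  esac
  export LD_LIBRARY_PATH
}

# Intended use (needs the flake-enabled nix CLI; the resulting path depends on
# which nixpkgs revision "nixpkgs" resolves to):
#   prepend_ld_library_path "$(nix build --no-link --print-out-paths nixpkgs#rocmPackages.clr)/lib"
```

This only changes the environment of the current shell and its children, so it remains a temporary workaround rather than a fix in the derivation.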

ScatteredRay commented 9 months ago

Hey, giving this a try. Still very much WIP, but it's working so far for my current project.

dwf commented 8 months ago

@Madouura First, thanks for all your work on this front.

You left a comment to the effect that rocBLASLt is "Very broken with Tensile at the moment, only supports GFX9". It looks like other platforms might be supported now, but I wondered if you might be able to elaborate with the "very broken with Tensile" part. I notice that they ship a vendored "Tensilelite", was that what you were trying to use?

Any pointers you have on how I might manage to build this would be useful. I'm currently eyeing the rocBLAS derivation as a potentially good starting point.

Edit: no longer a priority for me

yshui commented 7 months ago

pytorch now fails to build after the 5 -> 6 transition, because it depends on miopengemm, which was removed.

SomeoneSerge commented 7 months ago

I edited the description to add an entry for rocblaslt. It's, apparently, a dependency for zluda

jalil-salame commented 6 months ago

Apparently pytorch now requires hipBLASLt:

python3.11-torch> CMake Error at cmake/public/LoadHIP.cmake:37 (find_package):
python3.11-torch>   By not providing "Findhipblaslt.cmake" in CMAKE_MODULE_PATH this project
python3.11-torch>   has asked CMake to find a package configuration file provided by
python3.11-torch>   "hipblaslt", but CMake did not find one.
python3.11-torch>   Could not find a package configuration file provided by "hipblaslt" with
python3.11-torch>   any of the following names:
python3.11-torch>     hipblasltConfig.cmake
python3.11-torch>     hipblaslt-config.cmake
python3.11-torch>   Add the installation prefix of "hipblaslt" to CMAKE_PREFIX_PATH or set
python3.11-torch>   "hipblaslt_DIR" to a directory containing one of the above files.  If
python3.11-torch>   "hipblaslt" provides a separate development package or SDK, be sure it has
python3.11-torch>   been installed.
python3.11-torch> Call Stack (most recent call first):
python3.11-torch>   cmake/public/LoadHIP.cmake:160 (find_package_and_print_version)
python3.11-torch>   cmake/Dependencies.cmake:1258 (include)
python3.11-torch>   CMakeLists.txt:754 (include)
python3.11-torch>
python3.11-torch> -- Configuring incomplete, errors occurred!

ony commented 5 months ago

As per https://github.com/pytorch/pytorch/issues/119081#issuecomment-2166504992 in 2.4.0+ (future release) it should be possible to use something like:

  pythonPackagesExtensions = prev.pythonPackagesExtensions ++ [
    (python-final: python-prev: {
      torch = python-prev.torch.overrideDerivation (oldAttrs: {
        TORCH_BLAS_PREFER_HIPBLASLT = 0;  # not yet in nixpkgs
      });
    })
  ];

AngryLoki commented 5 months ago

@ony , TORCH_BLAS_PREFER_HIPBLASLT is a runtime environment variable; pytorch still links against and requires hipblaslt, even when it is unused. https://github.com/pytorch/pytorch/pull/120551 should help, but I have no idea whether or when it could be accepted.
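Since the variable is read at runtime, a sketch of setting it would wrap the process launch rather than the build. The wrapper name without_hipblaslt and the script name in the example are placeholders, and this does nothing about the link-time dependency on hipblaslt.

```shell
#!/bin/sh
# Run a command with TORCH_BLAS_PREFER_HIPBLASLT=0 set only for that process.
without_hipblaslt() {
  env TORCH_BLAS_PREFER_HIPBLASLT=0 "$@"
}

# Example (train.py is a placeholder script name):
#   without_hipblaslt python3 train.py
```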

By the way, hipblaslt is not difficult to build. Just don't build the 6.0 release; skip directly to 6.1. When I tried, the bundled Tensilelite in 6.0 generated a wall of unreadable errors, while 6.1 worked on the first attempt.

nixos-discourse commented 4 months ago

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/testing-gpu-compute-on-amd-apu-nixos/47060/4

DerDennisOP commented 1 month ago

I'm not able to build rocmlir-rock-6.0.2, when trying to install zluda.

FAILED: mlir/lib/Dialect/Rock/Transforms/CMakeFiles/obj.MLIRRockTransforms.dir/ViewToTransform.cpp.o
/nix/store/16pvlpl13g06f1rqxp7z0il9i4l9mlww-rocm-llvm-clang-wrapper-6.0.2/bin/clang++ -DGTEST_HAS_RTTI=0 -D_DEBUG -D_GLIBCXX_ASSERTIONS -D_LIBCPP_ENABLE_ASSERTIONS -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/build/source/build/mlir/lib/Dialect/Rock/Transforms -I/build/source/mlir/lib/Dialect/Rock/Transforms -I/build/source/external/llvm-project/llvm/include -I/build/source/build/external/llvm-project/llvm/include -I/build/source/external/llvm-project/mlir/include -I/build/source/build/external/llvm-project/llvm/tools/mlir/include -I/build/source/external/mlir-hal/mlir/include -I/build/source/build/external/mlir-hal/include -I/build/source/external/mlir-hal/include -I/build/source/mlir/include -I/build/source/build/mlir/include -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wc++98-compat-extra-semi -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wc++98-compat-extra-semi -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -Werror=global-constructors -O3 -DNDEBUG -std=gnu++17 -fPIC   -D_DEBUG -D_GLIBCXX_ASSERTIONS -D_LIBCPP_ENABLE_ASSERTIONS -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -D_DEBUG -D_GLIBCXX_ASSERTIONS -D_LIBCPP_ENABLE_ASSERTIONS -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS  -fno-exceptions -funwind-tables -fno-rtti -UNDEBUG -MD -MT mlir/lib/Dialect/Rock/Transforms/CMakeFiles/obj.MLIRRockTransforms.dir/ViewToTransform.cpp.o -MF mlir/lib/Dialect/Rock/Transforms/CMakeFiles/obj.MLIRRockTransforms.dir/ViewToTransform.cpp.o.d -o mlir/lib/Dialect/Rock/Transforms/CMakeFiles/obj.MLIRRockTransforms.dir/ViewToTransform.cpp.o -c /build/source/mlir/lib/Dialect/Rock/Transforms/ViewToTransform.cpp
In file included from /build/source/mlir/lib/Dialect/Rock/Transforms/ViewToTransform.cpp:14:
/build/source/mlir/include/mlir/Conversion/TosaToRock/TosaToRock.h:21:10: fatal error: 'mlir/Conversion/RocMLIRPasses.h.inc' file not found
#include "mlir/Conversion/RocMLIRPasses.h.inc"
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Is there an easy fix for it?

AngryLoki commented 4 weeks ago

@DerDennisOP , it was addressed in pull request https://github.com/ROCm/rocMLIR/pull/1640 (issue https://github.com/ROCm/rocMLIR/issues/1620); you may want to use it.

ilylily commented 4 weeks ago

@DerDennisOP @AngryLoki i think you'll actually also need ROCm/rocMLIR#1542 (closes ROCm/rocMLIR#1500). similar patch in a nearby file
