Open Madouura opened 2 years ago
I believe your GPU is GFX11?
Yes.
New tensorflow-rocm WIP at https://github.com/Madouura/nixpkgs/commit/344aa780809455f545eae4895bed72e3e9af0de6. Current blocking factor is an LLVM mismatch. Most likely, tensorflow 2.13.0 isn't nearly up-to-date enough with rocm 5.7.1.
@Flakebi I have some basic impureTests stuff at https://github.com/Madouura/nixpkgs/blob/pr/rocm/pkgs/development/rocm-modules/5/rocm-thunk/generic.nix as well as some other stuff.
Please tell me if you think this is the best way forward.
Nice!
I think we shouldn’t add anything to <package>.tests that is not also runnable as a (pure) nix test because these get parsed by scripts and bots.
Why not set the testScript = "${rocmPackages_5.rocm-smi-variants.shared}/bin/rocm-smi"?
That would make most tests easier to build :)
I think the rocminfo test can check that the output shows it actually detected something (like rocminfo | grep -E 'Device Type: +GPU' and rocm_agent_enumerator | grep -E 'gfx[^0]'). That makes sure we don’t ship something that’s unable to find GPUs.
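The two checks suggested above could be wrapped as small shell helpers. This is only a sketch: the function names are hypothetical, and it assumes that rocm_agent_enumerator reports the CPU agent as gfx000, which is why the second pattern rejects a leading 0.

```shell
# Succeeds if rocminfo-style output on stdin lists at least one GPU agent.
has_gpu_device() {
  grep -Eq 'Device Type: +GPU'
}

# Succeeds if rocm_agent_enumerator-style output on stdin contains a target
# whose number does not start with 0 (assumption: gfx000 is the CPU entry).
has_gfx_target() {
  grep -Eq 'gfx[^0]'
}

# Impure usage on a machine with a GPU:
#   rocminfo | has_gpu_device && rocm_agent_enumerator | has_gfx_target
```

Reading from stdin keeps the pattern matching testable without GPU access; only the final pipeline needs real hardware.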
I'm going to take a bit of a break from ROCm and work on another project. I'll try to work on the major updates/upgrades here and there, but until early-mid next year the other project is going to be my focus. If there are any major issues or if you just need something explained, don't hesitate to ping me.
Hi. Thanks for maintaining rocm for nix!
When I try to use torchWithRocm I get the following error:
MIOpen(HIP): Error [Compile] 'hiprtcCompileProgram(prog.get(), c_options.size(), c_options.data())' naive_conv.cpp: HIPRTC_ERROR_COMPILATION (6)
MIOpen(HIP): Error [BuildHip] HIPRTC status = HIPRTC_ERROR_COMPILATION (6), source file: naive_conv.cpp
MIOpen(HIP): Warning [BuildHip] hip runtime failed to load. Error: Please provide architecture for which code is to be generated.
MIOpen Error: /build/source/src/hipoc/hipoc_program.cpp:304: Code object build failed. Source: naive_conv.cpp
Any idea what should be in the environment? I tried adding a recent meta.rocm-all but it didn't help.
Same problem here with the same GPU (7900 XTX). After running strace on your minimal example I noticed that:
openat(AT_FDCWD, "/nix/store/mkih90ygzxczv4k0fn6gapgi7i7wy292-rocm-llvm-libunwind-5.7.1/lib/libamdhip64.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
...
openat(AT_FDCWD, "./libamdhip64.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "./libamdhip64.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
write(2, "MIOpen(HIP): Error [Compile] 'hi"..., 145MIOpen(HIP): Error [Compile] 'hiprtcCompileProgram(prog.get(), c_options.size(), c_options.data())' naive_conv.cpp: HIPRTC_ERROR_COMPILATION (6)
) = 145
It appears that libamdhip64.so is not being added to the runtime library path:
ls $(dirname $(nix-shell -p rocmPackages.meta.rocm-hip-runtime --run "which hipcc"))/../lib/libamdhip64.so
# /nix/store/09ic1qizx0aacml0vi83k9lgq23fz0wg-rocm-hip-runtime-meta/bin/../lib/libamdhip64.so
By setting environment variable manually:
export LD_LIBRARY_PATH=/nix/store/bz15zrilgr04ghdiz4cd73sam5wvmhhw-clr-5.7.1/lib/
The problem is temporarily fixed and I can now run Stable Diffusion WebUI.
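The strace inspection described above can be sketched as a small filter that lists where the loader looked for a library and failed. The function name and the script name in the usage comment are hypothetical; the filter assumes strace's usual openat(...) = -1 ENOENT line format as shown in the excerpt.

```shell
# Print the paths of failed (ENOENT) openat() lookups for a given library.
# Reads strace output on stdin.
failed_lookups() {
  lib=$1
  grep -F 'openat(' | grep -F "$lib" | grep -F 'ENOENT' \
    | sed -E 's/^[^"]*"([^"]*)".*$/\1/'
}

# Hypothetical usage:
#   strace -f -e trace=openat python minimal_example.py 2>&1 \
#     | failed_lookups libamdhip64.so
```

If every printed path is a miss and none of them is under the clr store path, the library is most likely missing from the runtime search path.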
ROCm 6.0.0 has been released.
rocmPackages_5 is now in maintenance mode.
I will eventually backport the changes I am making with rocmPackages_6 to rocmPackages_5, however it is not a high priority.
By setting environment variable manually
Interesting - now pytorch works for me, but it doesn't seem to work correctly. I'm trying to generate an image from sdxl+lora with diffusers, and it generates an incorrect image...
I tried identical code and model with manually defined seeds in google colab with cuda - it works there. Also seems to work locally on cpu with f32 types.
(or it might be some problem in one of the libs, since locally I use all python libs from nix)
The export LD_LIBRARY_PATH=/nix/store/...-clr-5.7.1/lib solution fixed the same torchWithRocm problem for me, also with a 7900 XTX. I couldn't see how you got that path – it's returned by nix build --print-out-paths nixpkgs#rocmPackages.clr, right?
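A minimal sketch of the workaround that avoids hard-coding the store hash, assuming the nix build command quoted above does return the clr output path (that is the unconfirmed guess from this comment). The helper name is hypothetical.

```shell
# Prepend a directory to LD_LIBRARY_PATH without leaving an empty entry
# when the variable was previously unset.
prepend_ld_library_path() {
  dir=$1
  export LD_LIBRARY_PATH="${dir}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
}

# Hypothetical usage (needs nix with flakes enabled):
#   clr=$(nix build --no-link --print-out-paths nixpkgs#rocmPackages.clr)
#   prepend_ld_library_path "$clr/lib"
```

The `${VAR:+...}` expansion matters because an empty component in LD_LIBRARY_PATH makes the loader search the current directory.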
Hey, giving this a try. Still very much WIP, but it's working so far for my current project.
@Madouura First, thanks for all your work on this front.
You left a comment to the effect that rocBLASLt is "Very broken with Tensile at the moment, only supports GFX9". It looks like other platforms might be supported now, but I wondered if you might be able to elaborate on the "very broken with Tensile" part. I notice that they ship a vendored "Tensilelite"; was that what you were trying to use?
Any pointers you have on how I might manage to build this would be useful. I'm currently eyeing the rocBLAS derivation as a potentially good starting point.
Edit: no longer a priority for me
pytorch now fails to build after the 5 -> 6 transition, because it depends on miopengemm, which was removed.
I edited the description to add an entry for rocblaslt. It's, apparently, a dependency for zluda
Apparently pytorch now requires hipBLASLt:
python3.11-torch> CMake Error at cmake/public/LoadHIP.cmake:37 (find_package):
python3.11-torch> By not providing "Findhipblaslt.cmake" in CMAKE_MODULE_PATH this project
python3.11-torch> has asked CMake to find a package configuration file provided by
python3.11-torch> "hipblaslt", but CMake did not find one.
python3.11-torch> Could not find a package configuration file provided by "hipblaslt" with
python3.11-torch> any of the following names:
python3.11-torch> hipblasltConfig.cmake
python3.11-torch> hipblaslt-config.cmake
python3.11-torch> Add the installation prefix of "hipblaslt" to CMAKE_PREFIX_PATH or set
python3.11-torch> "hipblaslt_DIR" to a directory containing one of the above files. If
python3.11-torch> "hipblaslt" provides a separate development package or SDK, be sure it has
python3.11-torch> been installed.
python3.11-torch> Call Stack (most recent call first):
python3.11-torch> cmake/public/LoadHIP.cmake:160 (find_package_and_print_version)
python3.11-torch> cmake/Dependencies.cmake:1258 (include)
python3.11-torch> CMakeLists.txt:754 (include)
python3.11-torch>
python3.11-torch> -- Configuring incomplete, errors occurred!
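The CMake message above names the two config files it searched for. A rough way to check whether a given install prefix actually provides either of them is sketched below; the function name is hypothetical and the prefix in the usage comment is a placeholder.

```shell
# List CMake package config files for a package under an install prefix.
# Looks for both names CMake reports: <pkg>Config.cmake and <pkg>-config.cmake.
find_cmake_config() {
  prefix=$1
  pkg=$2
  find "$prefix" \( -name "${pkg}Config.cmake" -o -name "${pkg}-config.cmake" \) \
    -print 2>/dev/null
}

# Hypothetical usage:
#   find_cmake_config /path/to/hipblaslt-prefix hipblaslt
```

An empty result means that prefix cannot satisfy find_package(hipblaslt), no matter how CMAKE_PREFIX_PATH is set.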
As per https://github.com/pytorch/pytorch/issues/119081#issuecomment-2166504992 in 2.4.0+ (future release) it should be possible to use something like:
pythonPackagesExtensions = prev.pythonPackagesExtensions ++ [
  (python-final: python-prev: {
    torch = python-prev.torch.overrideDerivation (oldAttrs: {
      TORCH_BLAS_PREFER_HIPBLASLT = 0; # not yet in nixpkgs
    });
  })
];
@ony, TORCH_BLAS_PREFER_HIPBLASLT is a runtime environment variable; pytorch still links against and requires hipblaslt, even when unused. https://github.com/pytorch/pytorch/pull/120551 should help, but I have no idea whether and when it could be accepted.
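Since the variable is read at runtime, it would be set when launching the program rather than in the derivation. A minimal sketch, with a hypothetical wrapper name and script name; it does not remove the link-time dependency on hipblaslt described above.

```shell
# Run a command with TORCH_BLAS_PREFER_HIPBLASLT=0 in its environment only.
run_without_hipblaslt() {
  TORCH_BLAS_PREFER_HIPBLASLT=0 "$@"
}

# Hypothetical usage:
#   run_without_hipblaslt python train.py
```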
By the way, hipblaslt is not difficult to build. Just don't build the 6.0 release; skip directly to 6.1. When I tried, the bundled Tensilelite in 6.0 generated a wall of unreadable errors, while 6.1 worked on the first attempt.
This issue has been mentioned on NixOS Discourse. There might be relevant details there:
https://discourse.nixos.org/t/testing-gpu-compute-on-amd-apu-nixos/47060/4
I'm not able to build rocmlir-rock-6.0.2 when trying to install zluda:
FAILED: mlir/lib/Dialect/Rock/Transforms/CMakeFiles/obj.MLIRRockTransforms.dir/ViewToTransform.cpp.o
/nix/store/16pvlpl13g06f1rqxp7z0il9i4l9mlww-rocm-llvm-clang-wrapper-6.0.2/bin/clang++ -DGTEST_HAS_RTTI=0 -D_DEBUG -D_GLIBCXX_ASSERTIONS -D_LIBCPP_ENABLE_ASSERTIONS -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/build/source/build/mlir/lib/Dialect/Rock/Transforms -I/build/source/mlir/lib/Dialect/Rock/Transforms -I/build/source/external/llvm-project/llvm/include -I/build/source/build/external/llvm-project/llvm/include -I/build/source/external/llvm-project/mlir/include -I/build/source/build/external/llvm-project/llvm/tools/mlir/include -I/build/source/external/mlir-hal/mlir/include -I/build/source/build/external/mlir-hal/include -I/build/source/external/mlir-hal/include -I/build/source/mlir/include -I/build/source/build/mlir/include -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wc++98-compat-extra-semi -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wc++98-compat-extra-semi -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -Werror=global-constructors -O3 -DNDEBUG -std=gnu++17 -fPIC -D_DEBUG -D_GLIBCXX_ASSERTIONS -D_LIBCPP_ENABLE_ASSERTIONS -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -D_DEBUG -D_GLIBCXX_ASSERTIONS -D_LIBCPP_ENABLE_ASSERTIONS -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -fno-exceptions -funwind-tables -fno-rtti -UNDEBUG -MD -MT mlir/lib/Dialect/Rock/Transforms/CMakeFiles/obj.MLIRRockTransforms.dir/ViewToTransform.cpp.o -MF mlir/lib/Dialect/Rock/Transforms/CMakeFiles/obj.MLIRRockTransforms.dir/ViewToTransform.cpp.o.d -o mlir/lib/Dialect/Rock/Transforms/CMakeFiles/obj.MLIRRockTransforms.dir/ViewToTransform.cpp.o -c /build/source/mlir/lib/Dialect/Rock/Transforms/ViewToTransform.cpp
In file included from /build/source/mlir/lib/Dialect/Rock/Transforms/ViewToTransform.cpp:14:
/build/source/mlir/include/mlir/Conversion/TosaToRock/TosaToRock.h:21:10: fatal error: 'mlir/Conversion/RocMLIRPasses.h.inc' file not found
#include "mlir/Conversion/RocMLIRPasses.h.inc"
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Is there an easy fix for it?
@DerDennisOP, it was addressed in pull request https://github.com/ROCm/rocMLIR/pull/1640 (issue https://github.com/ROCm/rocMLIR/issues/1620); you may want to use it.
@DerDennisOP @AngryLoki I think you'll actually also need ROCm/rocMLIR#1542 (closes ROCm/rocMLIR#1500), a similar patch in a nearby file.
Tracking issue for ROCm derivations.
Key
WIP - Ready - TODO
Merged
ROCm-related
261155
263048
Notes
nix-shell maintainers/scripts/update.nix --argstr commit true --argstr keep-going true --arg predicate '(path: pkg: builtins.elem (pkg.pname or null) [ "rocm-llvm-llvm" "rocm-core" "rocm-cmake" "rocm-thunk" "rocm-smi" "rocm-device-libs" "rocm-runtime" "rocm-comgr" "rocminfo" "clang-ocl" "rdc" "rocm-docs-core" "hip-common" "hipcc" "clr" "hipify" "rocprofiler" "roctracer" "rocgdb" "rocdbgapi" "rocr-debug-agent" "rocprim" "rocsparse" "rocthrust" "rocrand" "rocfft" "rccl" "hipcub" "hipsparse" "hipfort" "hipfft" "tensile" "rocblas" "rocsolver" "rocwmma" "rocalution" "rocmlir" "hipsolver" "hipblas" "miopengemm" "composable_kernel" "half" "miopen" "migraphx" "rpp-hip" "mivisionx-hip" "hsa-amd-aqlprofile-bin" ])'
Won't implement
strictDeps for all derivations