JuliaGPU / AMDGPU.jl

AMD GPU (ROCm) programming in Julia

Tests hang on Windows (RX7900XT) #653

Open Victorious3 opened 1 month ago

Victorious3 commented 1 month ago

I'm trying to get AMDGPU.jl to work on Windows with an RX 7900 XT. The test output shows that both the GPU and my iGPU are detected, but the tests hang. After interrupting them, I got this output:

┌ Warning: MIOpen is unavailable, functionality will be disabled.
└ @ AMDGPU C:\Users\Vic\.julia\packages\AMDGPU\WqMSe\src\AMDGPU.jl:216
Julia Version 1.10.2
Commit bd47eca2c8 (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 32 × AMD Ryzen 9 7950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 32 virtual cores)
Environment:
  JULIA_LOAD_PATH = @;C:\Users\Vic\AppData\Local\Temp\jl_1YXWuJ
[ Info: AMDGPU versioninfo
┌───────────┬──────────────────┬───────────┬────────────────────────────────────────────────────────────────────────────
│ Available │ Name             │ Version   │ Path                                                                      ⋯
├───────────┼──────────────────┼───────────┼────────────────────────────────────────────────────────────────────────────
│     +     │ LLD              │ -         │ C:\\Program Files\\AMD\\ROCm\\6.1\\bin\\ld.lld.exe                        ⋯
│     +     │ Device Libraries │ -         │ C:\\Users\\Vic\\.julia\\artifacts\\5ad5ecb46e3c334821f54c1feecc6c152b7b6a ⋯
│     +     │ HIP              │ 5.7.32000 │ C:\\Windows\\SYSTEM32\\amdhip64.DLL                                       ⋯
│     +     │ rocBLAS          │ 4.1.2     │ C:\\Program Files\\AMD\\ROCm\\6.1\\bin\\rocblas.dll                       ⋯
│     +     │ rocSOLVER        │ 3.25.0    │ C:\\Program Files\\AMD\\ROCm\\6.1\\bin\\rocsolver.dll                     ⋯
│     +     │ rocALUTION       │ -         │ C:\\Program Files\\AMD\\ROCm\\6.1\\bin\\rocalution.dll                    ⋯
│     +     │ rocSPARSE        │ -         │ C:\\Program Files\\AMD\\ROCm\\6.1\\bin\\rocsparse.dll                     ⋯
│     +     │ rocRAND          │ 2.10.5    │ C:\\Program Files\\AMD\\ROCm\\6.1\\bin\\rocrand.dll                       ⋯
│     +     │ rocFFT           │ 1.0.27    │ C:\\Program Files\\AMD\\ROCm\\6.1\\bin\\rocfft.dll                        ⋯
│     -     │ MIOpen           │ -         │ -                                                                         ⋯
└───────────┴──────────────────┴───────────┴────────────────────────────────────────────────────────────────────────────
                                                                                                        1 column omitted

[ Info: AMDGPU devices
┌────┬─────────────────────────┬──────────┬───────────┬────────────┐
│ Id │                    Name │ GCN arch │ Wavefront │     Memory │
├────┼─────────────────────────┼──────────┼───────────┼────────────┤
│  1 │   AMD Radeon RX 7900 XT │  gfx1100 │        32 │ 19.984 GiB │
│  2 │ AMD Radeon(TM) Graphics │  gfx1036 │        32 │ 24.003 GiB │
└────┴─────────────────────────┴──────────┴───────────┴────────────┘

[ Info: Test suite info
┌─────────┬───────────────────────────────────────────────────────────────┬───────────────────────────────────────────────┐
│ Workers │                                                        Device │                                         Tests │
├─────────┼───────────────────────────────────────────────────────────────┼───────────────────────────────────────────────┤
│       2 │ HIPDevice(id=1, name=AMD Radeon RX 7900 XT, gcn_arch=gfx1100) │ core, hip, ext, gpuarrays, kernelabstractions │
└─────────┴───────────────────────────────────────────────────────────────┴───────────────────────────────────────────────┘
[ Info: Scanning for test items in project `AMDGPU` at paths: C:\Users\Vic\.julia\packages\AMDGPU\WqMSe
[ Info: Finished scanning for test items in 0.51 seconds. Scheduling 34 tests on pid 15192 with 2 worker processes and 1 threads per worker.
[ Info: Starting test workers
  Worker 27588:  [ Info: Starting test worker 2 on pid = 27588, with 1 threads
  Worker 27644:  [ Info: Starting test worker 1 on pid = 27644, with 1 threads
[ Info: Starting running test items
  Worker 27588:  18:57:42 | maxrss  0.5% | mem 21.1% | START ( 2/34) test item "gpuarrays - reductions/== isequal" at test\gpuarrays_tests.jl:57
  Worker 27644:  18:57:42 | maxrss  0.6% | mem 21.1% | START ( 1/34) test item "core" at test\core_tests.jl:1

     Testing Tests interrupted. Exiting the test process

  Worker 27588:  18:58:30 | maxrss  1.7% | mem 21.5% | DONE  ( 2/34) test item "gpuarrays - reductions/== isequal" 45.3 secs (72.7% compile, <0.1% recompile, 3.3% GC), 82.24 M allocs (4.883 GB)

Captured Logs for test item "core" at test\core_tests.jl:1 on worker 27644
:0:C:\constructicon\builds\gfx\eleven\24.10\drivers\compute\clr\hipamd\src\hip_fatbin.hpp:74  : 59097994949 us: [pid:27644 tid:0x7412] Invalid DeviceId less than 0
┌ Error: Worker(pid=27644, terminated=true, termsignal=0) died running test item "core". Recording test error.
└ @ ReTestItems C:\Users\Vic\.julia\packages\ReTestItems\VrjGK\src\ReTestItems.jl:585
  Worker 27588:  fatal: error thrown and no exception handler available.
InterruptException()

Captured logs for test setup "TSGPUArrays" (dependency of "gpuarrays - reductions/== isequal") at test\gpuarrays_tests.jl:1 on worker 27588
┌ Warning: MIOpen is unavailable, functionality will be disabled.
└ @ AMDGPU C:\Users\Vic\.julia\packages\AMDGPU\WqMSe\src\AMDGPU.jl:216
No Captured Logs for test item "gpuarrays - reductions/== isequal" at test\gpuarrays_tests.jl:57 on worker 27588
┌ Error: Worker(pid=27588, terminated=true, termsignal=15) timed out running test item "gpuarrays - reductions/== isequal" after 1800 seconds. Recording test error.
└ @ ReTestItems C:\Users\Vic\.julia\packages\ReTestItems\VrjGK\src\ReTestItems.jl:579
  Worker 35092:  [ Info: Starting test worker on pid = 35092, with 1 threads
  Worker 3380:  [ Info: Starting test worker on pid = 3380, with 1 threads
  Worker 35092:  20:00:10 | maxrss  0.6% | mem 37.0% | START ( 3/34) test item "core: device" at test\device_tests.jl:1
  Worker 3380:  20:00:10 | maxrss  0.6% | mem 37.0% | START ( 4/34) test item "gpuarrays - reductions/any all count" at test\gpuarrays_tests.jl:60

The relevant line seems to be:

:0:C:\constructicon\builds\gfx\eleven\24.10\drivers\compute\clr\hipamd\src\hip_fatbin.hpp:74  : 59097994949 us: [pid:27644 tid:0x7412] Invalid DeviceId less than 0

Do I have to explicitly give it a GPU to run on or is this some other issue?

pxl-th commented 1 month ago

There are some known issues with multi-GPU setups; I'm not sure whether this is one of them: https://github.com/JuliaGPU/AMDGPU.jl/issues/648

You can disable the multi-GPU tests by setting HIP_VISIBLE_DEVICES=0 to see if the hangs disappear.
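
A minimal sketch of that suggestion, assuming the variable is set from inside Julia before the test run rather than in the shell. Child processes inherit `ENV`, so the test workers spawned by `Pkg.test` should see it as well. That HIP index `0` corresponds to the RX 7900 XT (Id 1 in the device table above) is an assumption: HIP numbers devices from zero, while the table numbers them from one.

```julia
# Expose only HIP device 0 to this process and everything it spawns.
# Set this before AMDGPU is loaded so the HIP runtime picks it up.
ENV["HIP_VISIBLE_DEVICES"] = "0"

using Pkg
Pkg.test("AMDGPU")
```

If the restriction took effect, the `AMDGPU devices` table printed at the start of the test run should list a single device.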

Another possible cause is the one behind the hangs we had with Navi 3 until recently. Those were driver/ROCm issues and were fixed upstream: https://github.com/JuliaGPU/AMDGPU.jl/pull/650#issuecomment-2212543523