JuliaGPU / AMDGPU.jl

AMD GPU (ROCm) programming in Julia

Broken tests on RX 6950 XT #670

Open ffrancesco94 opened 1 week ago

ffrancesco94 commented 1 week ago

Hi, I have been having issues with some downstream packages that use AMDGPU.jl, so I tried to trace the problem back. I am running Manjaro with an RX 6950 XT and ROCm 6.1 from the Arch repositories. I am on Julia 1.10.4 from juliaup, and when I run `Pkg.test("AMDGPU")`, I get the following output:

Test Summary:                               |  Pass  Fail  Error  Broken  Total      Time
AMDGPU                                      | 13173     2      2     151  13328  10m11.5s
  test                                      | 13173     2      2     151  13328          
    test/core_tests.jl                      |   615     1                   616          
      core                                  |   615     1                   616   1m14.8s
        Functional                          |     2                           2      0.1s
        HIPDevice                           |     8                           8      0.0s
        ISA parsing                         |    10                          10      0.0s
        Exception holder                    |                              None      2.1s
        Comparison                          |     3                           3      0.0s
        Synchronization                     |     1                           1      5.3s
        Trapping                            |     2                           2      0.0s
        Base                                |   557     1                   558     55.8s
          Specifying buffer type            |     4                           4      0.0s
          ones/zeros                        |     2                           2      1.2s
          view                              |    10                          10      1.7s
          resize!                           |     3                           3      0.3s
          unsafe_wrap                       |    17                          17      3.9s
          unsafe_free                       |                              None      0.0s
          accumulate                        |    25                          25      6.4s
          Atomics                           |     1                           1      0.3s
          Sorting                           |   384                         384     31.2s
          Reverse kernel                    |    88                          88      2.9s
          Selection                         |     3                           3      1.6s
          Multi-GPU                         |    20     1                    21      3.2s
            Device switching                |     7                           7      0.2s
            Arrays                          |     5                           5      1.0s
            Copying                         |     1                           1      0.8s
            Kernel                          |     1     1                     2      1.0s
            Correctly switching HIP context |     6                           6      0.3s
        broadcast                           |    18                          18      6.4s
        Ref Broadcast                       |     1                           1      0.5s
        Broadcast Fix                       |     2                           2      0.7s
        Broadcast Ref{<:Type}               |     1                           1      0.3s
        Device                              |     3                           3      0.0s
        Stream                              |     7                           7      0.3s
    test/device_tests.jl                    |   473                    9    482          
    test/external_tests.jl                  |    18                          18          
    test/gpuarrays_tests.jl                 |  7213                        7213          
    test/hip_core_tests.jl                  |     4            1              5          
      hip - core                            |     4            1              5      2.3s
        AMDGPU.@elapsed                     |     4                           4      0.6s
        HIP Peer Access                     |                  1              1      0.4s
    test/hip_miopen_tests.jl                |                  1              1          
      hip - MIOpen                          |                  1              1      0.0s
    test/hip_rocblas_tests.jl               |   672     1                   673          
      hip - rocBLAS                         |   672     1                   673   1m08.2s
        BLAS                                |   672     1                   673   1m05.4s
          Build Information                 |     1                           1      0.2s
          Highlevel                         |     2                           2      3.8s
          Level 1                           |    51     1                    52     10.6s
            T = Float32                     |    13                          13      1.0s
            T = Float64                     |    13                          13      0.7s
            T = ComplexF32                  |    12     1                    13      7.5s
            T = ComplexF64                  |    13                          13      1.4s
          Level 2                           |   172                         172     12.7s
          Level 3                           |   446                         446     38.1s
    test/hip_rocfft_tests.jl                |   199                         199          
    test/hip_rocrand_tests.jl               |   141                         141          
    test/hip_rocsolver_tests.jl             |   538                         538          
    test/hip_rocsparse_tests.jl             |  1099                  136   1235          
    test/ka_tests.jl                        |  2201                    6   2207          
ERROR: LoadError: Some tests did not pass: 13173 passed, 2 failed, 2 errored, 151 broken.
in expression starting at /home/fra/.julia/packages/AMDGPU/a1v0k/test/runtests.jl:107
ERROR: Package AMDGPU errored during testing

Is this expected behaviour? I also have an integrated APU (which I don't use at the moment), which might be why some of the Multi-GPU tests are failing.

pxl-th commented 1 week ago

I think ROCm does not support integrated APUs, but since the APU is visible, the test suite tries to run the multi-GPU tests. If you hide it with `HIP_VISIBLE_DEVICES` and some tests still fail, you can share the error messages for those.
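
For reference, here is a minimal sketch of one way to hide the APU for a single test run. The helper `with_visible_devices` is hypothetical (it is not part of AMDGPU.jl) and only wraps Julia's `withenv`. The device index `"0"` is an assumption; which index belongs to the RX 6950 XT should be checked first, for example with `rocminfo`. This should work because `Pkg.test` runs the tests in a separate Julia process that inherits the environment.

```julia
# Hypothetical helper (not part of AMDGPU.jl): run `f` with only the given
# HIP device indices visible. The previous value of the variable is
# restored when `f` returns.
function with_visible_devices(f, ids::AbstractString)
    return withenv(f, "HIP_VISIBLE_DEVICES" => ids)
end

# Usage (assumes the discrete GPU is HIP device 0, which may not hold on
# every system):
#
#   using Pkg
#   with_visible_devices("0") do
#       Pkg.test("AMDGPU")
#   end
```

Setting the variable in the shell before starting Julia would have the same effect; the wrapper just keeps the change scoped to the test run.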