ROCm / HIP

HIP: C++ Heterogeneous-Compute Interface for Portability
https://rocmdocs.amd.com/projects/HIP/
MIT License

`hipFreeAsync` hangs #3370

Closed pxl-th closed 1 month ago

pxl-th commented 11 months ago

Hi, I'm experiencing hangs with hipFreeAsync and was wondering what could potentially cause that. From my perspective it looks like some kind of race condition.

It consistently happens at the end of the test suite, when we start releasing the memory of device arrays used in the process, in AMDGPU.jl, which provides an AMD GPU programming interface for the Julia language. Just to note, memory is freed a lot during the tests; it only hangs at the end. I made sure that we do not destroy the streams or the respective context. Also, freeing arrays uses the NULL stream, while other operations use other streams. I started seeing this issue with ROCm 5.6-5.7.1 on an RX 7900 XT.
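
To illustrate, here is a minimal sketch of that usage pattern (my paraphrase for illustration, not the actual AMDGPU.jl code): operations are enqueued on a dedicated stream, while the free goes to the NULL stream.

#include <hip/hip_runtime.h>

int main() {
    hipStream_t work;
    hipStreamCreate(&work);

    float *x;
    hipMallocAsync((void**)&x, 1024 * sizeof(float), work);
    // ... kernels and copies are enqueued on `work` here ...
    hipFreeAsync(x, nullptr);  // the free is issued on the NULL (default) stream

    hipStreamSynchronize(work);
    hipStreamDestroy(work);
    return 0;
}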

Here's the gdb output of the process when it hangs: hang

On ROCm 5.4 it was not observed and the whole test suite ran fine.

If you need any additional info, I'm happy to provide it.

pxl-th commented 11 months ago

I also ran the tests using a debug Julia & HIP build, and besides hitting this assert (which I commented out) there were no other issues.

iassiour commented 11 months ago

Hi @pxl-th, can you please attach a reproducer for the issue? Can you reproduce the hang in C++ as well?

pxl-th commented 11 months ago

Unfortunately, I was unable to create an MWE, as it is unclear to me what causes the hang. Running the tests one-by-one does not reproduce it; only running them all does. I tried running them on multiple workers and on just a single thread, and in all cases it hangs. But the place where it hangs may change from run to run.

When running tests I get a lot of page faults in dmesg, as described here, although I'm not sure if they are critical enough to cause hangs.

Also, reproducing the tests in C++ is not easy, because we have almost 13k tests. So the best I can suggest is to try running the AMDGPU.jl tests yourself, which is quite easy:

  1. Have a ROCm installation in the default directory /opt/rocm.
  2. Download & unpack Julia 1.10: https://julialang-s3.julialang.org/bin/linux/x64/1.10/julia-1.10.0-rc1-linux-x86_64.tar.gz
  3. Launch the Julia REPL with <julia-dir>/bin/julia --threads=auto
  4. Enter package mode with the ] key
  5. Add the AMDGPU.jl package: add AMDGPU#master
  6. Run the AMDGPU tests with test AMDGPU

At some point, test workers will become idle and inspecting them with gdb will show this hang.

I'm also not sure whether this is an issue with Julia or with the AMDGPU.jl package; we've been successfully running CI on an RX 6700 XT for several months now without issues using ROCm 5.4 - 5.6, and have tried other GPUs like the MI200.

pxl-th commented 11 months ago

Also, on Windows there are no issues at all with the RX 7900 XT; it passes all AMDGPU.jl tests without hanging.

pxl-th commented 11 months ago

@iassiour, not sure if this is expected, but I noticed that async malloc/free is ~300x slower than non-async (tried on an RX 6700 XT and an RX 7900 XT).

MWE:

#include <hip/hip_runtime.h>
#include <iostream>

void check(int res) {
    if (res != 0) {
        std::cerr << "Fail" << std::endl;
    }
}

int main(int argc, char* argv[]) {
    hipStream_t s;
    check(hipStreamCreateWithPriority(&s, 0, 0));

    // Toggle between the two loops to compare timings.
    /*
    std::cout << "Regular" << std::endl;
    for (int i = 1; i < 100000; i++) {
        float *x;
        check(hipMalloc((void**)&x, 4));
        check(hipFree(x));
    }
    */

    std::cout << "Async" << std::endl;
    for (int i = 1; i < 100000; i++) {
        float *x;
        check(hipMallocAsync((void**)&x, 4, s));
        check(hipFreeAsync(x, s));
    }

    return 0;
}
pxl-th@Leleka:~/code$ time ./a.out 
Regular

real    0m0,256s
user    0m0,206s
sys 0m0,033s

pxl-th@Leleka:~/code$ time ./a.out 
Async

real    1m15,237s
user    1m47,751s
sys 0m0,828s

iassiour commented 11 months ago

Hi @pxl-th, I think the slowness in async malloc/free is caused by a bug triggered by small (<8-byte) allocations. I will create a PR internally to fix it. Thank you for reporting this.
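
In the meantime, a possible caller-side mitigation (an untested sketch based on the <8-byte trigger, using a hypothetical helper mallocAsyncPadded, not a HIP API) is to round small requests up to 8 bytes:

#include <hip/hip_runtime.h>

// Untested workaround sketch: pad sub-8-byte requests to 8 bytes so they
// avoid the slow path described above.
static hipError_t mallocAsyncPadded(void **ptr, size_t size, hipStream_t s) {
    const size_t kMinSize = 8;  // threshold reported in this issue
    return hipMallocAsync(ptr, size < kMinSize ? kMinSize : size, s);
}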

pxl-th commented 11 months ago

Indeed, allocations smaller than 8 bytes are much slower. Thanks! However, with e.g. 16 bytes it is still 3-5x slower:

pxl-th@Leleka:~/code$ time ./a.out 
Regular

real    0m0,255s
user    0m0,203s
sys 0m0,034s

pxl-th@Leleka:~/code$ time ./a.out 
Async

real    0m0,684s
user    0m1,005s
sys 0m0,137s

As a note, the reason I stumbled upon this is that users of AMDGPU.jl reported 20x slower performance than on the CPU when training an ODE to solve MNIST. And it got progressively worse as the task was repeated (e.g. running the training over and over again).

Moving to non-async malloc/free led to a 6x improvement in performance and stable compute time, although I haven't looked at how big the allocations are there.

iassiour commented 10 months ago

Hi @pxl-th, the fix for <8-byte allocations has been merged in develop (https://github.com/ROCm/clr/commit/2ede1c9adb76e04627fea8126f9cea9c03740574) and it should appear in a future release.

Regarding the 16-byte allocation timing test: there is extra bookkeeping overhead associated with the memory pool APIs. While this overhead is quite small, small memory allocations are themselves generally fast, so the overhead is noticeable in relative terms. In addition, this latency can be hidden if the application takes full advantage of the async API; in this particular example there is no computation to overlap with the allocation overhead, hence the latency is not hidden. Additionally, please note that this API is currently in Beta state, so it is subject to ongoing changes which might improve/impact the performance as we polish our implementation.
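
To illustrate the point about latency hiding (a generic sketch, not taken from the application above), enqueueing real work between the async alloc and free lets device execution overlap the host-side pool bookkeeping:

#include <hip/hip_runtime.h>

__global__ void axpb(float *x, float a, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = a * x[i] + b;
}

int main() {
    hipStream_t s;
    hipStreamCreate(&s);

    const int n = 1 << 20;
    for (int iter = 0; iter < 1000; iter++) {
        float *x;
        hipMallocAsync((void**)&x, n * sizeof(float), s);
        // While the kernel runs on the device, the host thread can perform
        // the pool bookkeeping for the next alloc/free.
        hipLaunchKernelGGL(axpb, dim3((n + 255) / 256), dim3(256), 0, s,
                           x, 2.0f, 1.0f, n);
        hipFreeAsync(x, s);
    }
    hipStreamSynchronize(s);
    hipStreamDestroy(s);
    return 0;
}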

Regarding the hang in hipFreeAsync mentioned in the original post: I could not immediately reproduce the issue with 5.7.1 on Ubuntu 22.04, albeit with an RX 7900 XTX. Is there a specific subtest at which the workers become idle, or does it happen at the end of the process? If possible, can you please attach the logs up to the hanging point?

pxl-th commented 10 months ago

Thank you for the fix!

Regarding hipFreeAsync and the hangs: I recently upgraded to ROCm 6 and when running the AMDGPU.jl tests it reported some page faults (and errored instead of hanging), so I was able to fix those (they were rocBLAS-related).

Now I'm able to successfully run the test suite; however, it still hangs randomly when running tests and doing some graphics stuff at the same time. Here's the hang from yesterday's CI run: link.

I was screencasting at the same time as running the tests, but just re-running the tests without it worked fine (see the CI run just below the failed one). I still see some page faults occasionally, as described here, but I'm not sure if they are related to the hangs.

Is there a specific subtest that the workers become idle or it happens in the end of the process?

Usually it hangs at some internal synchronization point. The gdb backtrace is either the same as in the original post, or similar but in hipMemcpyDtoH, waiting for all streams.

So besides suggesting to run the tests and do some graphics-related stuff at the same time, I'm not sure how else to reproduce it... But at least CI now passes with Navi 3, so that's an improvement :) We still have some tests that fail on Navi 3, so I'll investigate those and update here if they turn out to be related.

saleelk commented 10 months ago

Find the smallest test case and dump the AMD_LOG_LEVEL=4 log for it.

pxl-th commented 10 months ago

There are tests that reliably trigger the hang. In Julia we use Task-Local State (TLS) as opposed to Thread-Local State, and each Task in Julia has its own HIP stream; that's how users are advised to use multiple GPUs at the same time.
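
A rough C++ analogue of this per-task stream scheme (an assumption for illustration, not AMDGPU.jl's actual implementation) keeps one lazily-created stream per thread in thread-local storage:

#include <hip/hip_runtime.h>

// Hypothetical helper, not part of AMDGPU.jl or HIP: one stream per OS
// thread, created lazily on first use, mirroring one-stream-per-Task.
hipStream_t currentStream() {
    thread_local hipStream_t s = [] {
        hipStream_t t;
        hipStreamCreate(&t);
        return t;
    }();
    return s;
}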

For this we have tests that check that TLS is working properly, where we create streams with different priorities and check that TLS is updated accordingly (the streams are then destroyed once the GC collects them). When running these tests (among other tests) with 2+ workers, it causes the hang.

By default those tests are disabled for Navi 3, so I've uncommented them in the pxl-th/tls branch of AMDGPU.jl. Just in case: AMDGPU.jl for this branch can be installed with the ]add AMDGPU#pxl-th/tls command.

pxl-th commented 8 months ago

Reviving this, as I have a fairly small MWE that consistently reproduces the issue on ROCm 6.0.2 and an RX 7900 XTX.

Again in Julia, as it is much easier to set up the code.

MWE:

using AMDGPU
function main()
    data = rand(Float64, 1024, 1024)
    Threads.@threads for i in 1:1000
        sum(ROCArray(data))
    end
end
main()

The MWE:

  1. starts multiple threads (2 are enough, but more threads trigger this more reliably)
  2. in each thread, copies the data from the host to the device (hipMallocAsync, hipMemcpyHtoDAsync)
  3. computes the sum of the array (hipModuleLaunchKernel)
  4. frees the array (hipFreeAsync)

And at some point during execution it hangs. Note that if I replace hipFreeAsync with hipFree, it never hangs.
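
For anyone hitting this, a conservative fallback based on that observation (a sketch with a hypothetical helper freeBlocking, not necessarily what AMDGPU.jl does) is to drain the stream and free synchronously:

#include <hip/hip_runtime.h>

// Fallback sketch: wait for pending work on the stream, then free
// synchronously instead of using hipFreeAsync.
static hipError_t freeBlocking(void *ptr, hipStream_t s) {
    hipError_t err = hipStreamSynchronize(s);
    if (err != hipSuccess) return err;
    return hipFree(ptr);
}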

Output of kill -USR1 PID for each Julia thread (there are two of them). Notice that one thread hangs at hipModuleLaunchKernel and the other at hipFreeAsync. This is with a debug HIP build.

cmd: /home/pxl-th/bin/julia-1.10.1/bin/julia 55042 running 2 of 2

# Thread 1

unknown function (ip: 0x7f8322c91115)
unknown function (ip: 0x7f8322c9cc77)
wait at /home/pxl-th/code/clr/rocclr/thread/semaphore.cpp:96
finishLock at /home/pxl-th/code/clr/rocclr/thread/monitor.cpp:121
lock at /home/pxl-th/code/clr/rocclr/thread/monitor.hpp:207
ScopedLock at /home/pxl-th/code/clr/rocclr/thread/monitor.hpp:163
isValid at /home/pxl-th/code/clr/hipamd/src/hip_stream.cpp:98
hipModuleLaunchKernel at /home/pxl-th/code/clr/hipamd/src/hip_module.cpp:440
macro expansion at /home/pxl-th/.julia/dev/AMDGPU/src/hip/call.jl:38 [inlined]
hipModuleLaunchKernel at /home/pxl-th/.julia/dev/AMDGPU/src/hip/libhip.jl:282
#24 at /home/pxl-th/.julia/dev/AMDGPU/src/runtime/hip-execution.jl:123 [inlined]
macro expansion at /home/pxl-th/.julia/dev/AMDGPU/src/runtime/hip-execution.jl:110 [inlined]
macro expansion at ./none:0 [inlined]
pack_arguments at ./none:0
#launch#23 at /home/pxl-th/.julia/dev/AMDGPU/src/runtime/hip-execution.jl:122 [inlined]
launch at /home/pxl-th/.julia/dev/AMDGPU/src/runtime/hip-execution.jl:116 [inlined]
#18 at /home/pxl-th/.julia/dev/AMDGPU/src/runtime/hip-execution.jl:85 [inlined]
macro expansion at /home/pxl-th/.julia/dev/AMDGPU/src/runtime/hip-execution.jl:78 [inlined]
macro expansion at ./none:0 [inlined]
convert_arguments at ./none:0 [inlined]
#roccall#17 at /home/pxl-th/.julia/dev/AMDGPU/src/runtime/hip-execution.jl:86 [inlined]
roccall at /home/pxl-th/.julia/dev/AMDGPU/src/runtime/hip-execution.jl:84 [inlined]
macro expansion at /home/pxl-th/.julia/dev/AMDGPU/src/runtime/hip-execution.jl:50 [inlined]
macro expansion at ./none:0 [inlined]
#call#1 at ./none:0
unknown function (ip: 0x7f83219799ed)
_jl_invoke at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:3076
call at ./none:0 [inlined]
#_#15 at /home/pxl-th/.julia/dev/AMDGPU/src/runtime/hip-execution.jl:59
HIPKernel at /home/pxl-th/.julia/dev/AMDGPU/src/runtime/hip-execution.jl:54
unknown function (ip: 0x7f8321979415)
_jl_invoke at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:3076
macro expansion at /home/pxl-th/.julia/dev/AMDGPU/src/highlevel.jl:175 [inlined]
#mapreducedim!#59 at /home/pxl-th/.julia/dev/AMDGPU/src/kernels/mapreduce.jl:155
mapreducedim! at /home/pxl-th/.julia/dev/AMDGPU/src/kernels/mapreduce.jl:86
unknown function (ip: 0x7f8321977230)
#mapreducedim!#59 at /home/pxl-th/.julia/dev/AMDGPU/src/kernels/mapreduce.jl:167
mapreducedim! at /home/pxl-th/.julia/dev/AMDGPU/src/kernels/mapreduce.jl:86 [inlined]
#mapreducedim!#59 at /home/pxl-th/.julia/dev/AMDGPU/src/kernels/mapreduce.jl:167
mapreducedim! at /home/pxl-th/.julia/dev/AMDGPU/src/kernels/mapreduce.jl:86 [inlined]
#_mapreduce#43 at /home/pxl-th/.julia/packages/GPUArrays/Hd5Sk/src/host/mapreduce.jl:67
_mapreduce at /home/pxl-th/.julia/packages/GPUArrays/Hd5Sk/src/host/mapreduce.jl:33 [inlined]
#mapreduce#41 at /home/pxl-th/.julia/packages/GPUArrays/Hd5Sk/src/host/mapreduce.jl:28 [inlined]
mapreduce at /home/pxl-th/.julia/packages/GPUArrays/Hd5Sk/src/host/mapreduce.jl:28 [inlined]
#_sum#831 at ./reducedim.jl:1015 [inlined]
_sum at ./reducedim.jl:1015 [inlined]
#_sum#830 at ./reducedim.jl:1014 [inlined]
_sum at ./reducedim.jl:1014 [inlined]
#sum#828 at ./reducedim.jl:1010 [inlined]
sum at ./reducedim.jl:1010 [inlined]
macro expansion at /home/pxl-th/.julia/dev/t.jl:26 [inlined]
#39#threadsfor_fun#7 at ./threadingconstructs.jl:215
#39#threadsfor_fun at ./threadingconstructs.jl:182 [inlined]
#1 at ./threadingconstructs.jl:154
unknown function (ip: 0x7f83218af892)
_jl_invoke at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
start_task at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/task.c:1238
unknown function (ip: (nil))

# Thread 2

unknown function (ip: 0x7f8322c91115)
unknown function (ip: 0x7f8322c9cc77)
wait at /home/pxl-th/code/clr/rocclr/thread/semaphore.cpp:96
finishLock at /home/pxl-th/code/clr/rocclr/thread/monitor.cpp:121
lock at /home/pxl-th/code/clr/rocclr/thread/monitor.hpp:207
ScopedLock at /home/pxl-th/code/clr/rocclr/thread/monitor.hpp:163
FreeMemory at /home/pxl-th/code/clr/hipamd/src/hip_device.cpp:90
submit at /home/pxl-th/code/clr/hipamd/src/hip_mempool.cpp:112
enqueue at /home/pxl-th/code/clr/rocclr/platform/command.cpp:391
hipFreeAsync at /home/pxl-th/code/clr/hipamd/src/hip_mempool.cpp:137
macro expansion at /home/pxl-th/.julia/dev/AMDGPU/src/hip/call.jl:38 [inlined]
hipFreeAsync at /home/pxl-th/.julia/dev/AMDGPU/src/hip/libhip.jl:174 [inlined]
#free#9 at /home/pxl-th/.julia/dev/AMDGPU/src/runtime/memory/hip.jl:134
free at /home/pxl-th/.julia/dev/AMDGPU/src/runtime/memory/hip.jl:129 [inlined]
#43 at /home/pxl-th/.julia/dev/AMDGPU/src/array.jl:30 [inlined]
context! at /home/pxl-th/.julia/dev/AMDGPU/src/tls.jl:131
_free_buf at /home/pxl-th/.julia/dev/AMDGPU/src/array.jl:28
unknown function (ip: 0x7f83219788cc)
_jl_invoke at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:3076
release at /home/pxl-th/.julia/packages/GPUArrays/Hd5Sk/src/host/abstractarray.jl:42
unsafe_free! at /home/pxl-th/.julia/packages/GPUArrays/Hd5Sk/src/host/abstractarray.jl:91 [inlined]
unsafe_free! at /home/pxl-th/.julia/dev/AMDGPU/src/array.jl:34 [inlined]
#mapreducedim!#59 at /home/pxl-th/.julia/dev/AMDGPU/src/kernels/mapreduce.jl:168
mapreducedim! at /home/pxl-th/.julia/dev/AMDGPU/src/kernels/mapreduce.jl:86 [inlined]
#mapreducedim!#59 at /home/pxl-th/.julia/dev/AMDGPU/src/kernels/mapreduce.jl:167
mapreducedim! at /home/pxl-th/.julia/dev/AMDGPU/src/kernels/mapreduce.jl:86 [inlined]
#_mapreduce#43 at /home/pxl-th/.julia/packages/GPUArrays/Hd5Sk/src/host/mapreduce.jl:67
_mapreduce at /home/pxl-th/.julia/packages/GPUArrays/Hd5Sk/src/host/mapreduce.jl:33 [inlined]
#mapreduce#41 at /home/pxl-th/.julia/packages/GPUArrays/Hd5Sk/src/host/mapreduce.jl:28 [inlined]
mapreduce at /home/pxl-th/.julia/packages/GPUArrays/Hd5Sk/src/host/mapreduce.jl:28 [inlined]
#_sum#831 at ./reducedim.jl:1015 [inlined]
_sum at ./reducedim.jl:1015 [inlined]
#_sum#830 at ./reducedim.jl:1014 [inlined]
_sum at ./reducedim.jl:1014 [inlined]
#sum#828 at ./reducedim.jl:1010 [inlined]
sum at ./reducedim.jl:1010 [inlined]
macro expansion at /home/pxl-th/.julia/dev/t.jl:26 [inlined]
#39#threadsfor_fun#7 at ./threadingconstructs.jl:215
#39#threadsfor_fun at ./threadingconstructs.jl:182 [inlined]
#1 at ./threadingconstructs.jl:154
unknown function (ip: 0x7f83218af892)
_jl_invoke at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
start_task at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/task.c:1238
unknown function (ip: (nil))

Output of gdb -p PID for one thread:

(gdb) bt
#0  __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0xde5448)
    at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=<optimized out>, abstime=0x0, clockid=0, expected=0, futex_word=0xde5448)
    at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0xde5448, expected=expected@entry=0, 
    clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>) at ./nptl/futex-internal.c:139
#3  0x00007ff13e29cbdf in do_futex_wait (sem=sem@entry=0xde5448, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:111
#4  0x00007ff13e29cc78 in __new_sem_wait_slow64 (sem=0xde5448, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:183
#5  0x00007ff06957fffe in amd::Semaphore::wait (this=0xde5440) at /home/pxl-th/code/clr/rocclr/thread/semaphore.cpp:96
#6  0x00007ff06957f43d in amd::Monitor::finishLock (this=0x7ff06ab140c0 <streamSetLock>)
    at /home/pxl-th/code/clr/rocclr/thread/monitor.cpp:121
#7  0x00007ff069243506 in amd::Monitor::lock (this=0x7ff06ab140c0 <streamSetLock>)
    at /home/pxl-th/code/clr/rocclr/thread/monitor.hpp:207
#8  0x00007ff069243318 in amd::ScopedLock::ScopedLock (this=0x7fef5d5fd810, lock=...)
    at /home/pxl-th/code/clr/rocclr/thread/monitor.hpp:163
#9  0x00007ff06945d5cc in iHipWaitActiveStreams (blocking_stream=0x1885a00, wait_null_stream=true)
    at /home/pxl-th/code/clr/hipamd/src/hip_stream.cpp:204
#10 0x00007ff069251f31 in hip::getStream (stream=0x1885a00, wait=true) at /home/pxl-th/code/clr/hipamd/src/hip_context.cpp:99
#11 0x00007ff069293448 in hip::Event::addMarker (this=0x19ae190, stream=0x1885a00, command=0x0, record=true)
    at /home/pxl-th/code/clr/hipamd/src/hip_event.cpp:251
#12 0x00007ff0693fc532 in hip::MemoryPool::FreeMemory (this=0x15acb70, memory=0x1c6da90, stream=0x1885a00)
    at /home/pxl-th/code/clr/hipamd/src/hip_mempool_impl.cpp:249
#13 0x00007ff06927f94f in hip::Device::FreeMemory (this=0xef7400, memory=0x1c6da90, stream=0x1885a00)
    at /home/pxl-th/code/clr/hipamd/src/hip_device.cpp:93
#14 0x00007ff0693f8792 in FreeAsyncCommand::submit (this=0x1c6e860, device=...)
    at /home/pxl-th/code/clr/hipamd/src/hip_mempool.cpp:112
#15 0x00007ff069546692 in amd::Command::enqueue (this=0x1c6e860) at /home/pxl-th/code/clr/rocclr/platform/command.cpp:391
#16 0x00007ff0693e9dd0 in hipFreeAsync (dev_ptr=0x7fef3c220000, stream=0x1885a00)
    at /home/pxl-th/code/clr/hipamd/src/hip_mempool.cpp:137
#17 0x00007ff13cfe8cd3 in ?? ()
#18 0x000000000000000c in ?? ()

CC @saleelk @iassiour

pxl-th commented 8 months ago

Mixing default and non-default streams in hip*Async functions seems to cause hangs.

Here's a C++ reproducer:

#include <hip/hip_runtime.h>
#include <thread>

void fn() {
    hipStream_t stream;
    hipStreamCreate(&stream);

    int n_elements = 1024 * 1024;
    int size = n_elements * sizeof(int);

    int *a = new int[n_elements];
    int *b = new int[n_elements];

    int *da, *db;
    hipMallocAsync(&da, size, stream);
    hipMallocAsync(&db, size, stream);

    hipMemcpyHtoDAsync(da, a, size, stream);
    hipMemcpyHtoDAsync(db, b, size, stream);

    /* hipFreeAsync(da, stream); */ // <--- Works fine.
    hipFreeAsync(da, nullptr); // <--- Mixing default stream with non-default causes hang!
    hipFreeAsync(db, stream);

    hipStreamSynchronize(stream);
    hipStreamDestroy(stream);

    delete[] a;
    delete[] b;
}

void thread_fn() {
    for (int i = 0; i < 1000; i++) {
        fn();
    }
}

int main() {
    std::thread t1(thread_fn);
    std::thread t2(thread_fn);
    std::thread t3(thread_fn);
    std::thread t4(thread_fn);

    t1.join();
    t2.join();
    t3.join();
    t4.join();
    return 0;
}

pxl-th commented 7 months ago

A kind ping to see if someone can take a look at the issue.

luraess commented 6 months ago

Testing on ROCm 6.1 with an RX 7800 XT, the Julia MWE no longer hangs. However, the C++ reproducer still cannot complete.

torrance commented 6 months ago

This might be related to this issue: https://github.com/ROCm/hipFFT/issues/91

pxl-th commented 6 months ago

@torrance thanks for the update! This should significantly help with CI in AMDGPU.jl

luraess commented 6 months ago

Indeed - thanks! So this should land in ROCm 6.1.1, right?

ppanchad-amd commented 5 months ago

@luraess It's fixed in the upcoming ROCm 6.1.2 release. Thanks!

darren-amd commented 1 month ago

Hi @pxl-th,

I tested the C++ reproducer code on the latest version of ROCm (6.2.2) and can confirm that the issue appears to be fixed. Please give it a try on the latest version and let me know if the issue persists. Thanks!