ROCm / MIOpen

AMD's Machine Intelligence Library
https://rocm.docs.amd.com/projects/MIOpen/en/latest/
Other
1.01k stars 210 forks source link

naive_conv_nonpacked_fwd_nchw_half_double_half in KDB cache breaks after #2863 #3023

Open junliume opened 4 weeks ago

junliume commented 4 weeks ago

[Observations}: Reproducer:

./bin/MIOpenDriver convfp16 -n 128 -c 3 -H 224 -W 224 -k 96 -y 7 -x 7 -p 0 -q 0 -u 2 -v 2 -l 1 -j 1 -m conv -g 1 -F 1 -t 1

and

MIOpen(HIP): Info2 [PrepareInvoker] Preparing kernel: naive_conv_nonpacked_fwd_nchw_half_double_half
MIOpen(HIP): Info2 [run] kernel_name = naive_conv_nonpacked_fwd_nchw_half_double_half, global_work_dim = { 3145728, 1, 1 }, local_work_dim = { 256, 1, 1 }
GPU core dump failed
:0:rocdevice.cpp            :2958: 129010028118 us: [pid:163363 tid:0x7f35d0dff640] Callback: Queue 0x7ef5c8a00000 aborting with error : HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to access memory beyond the largest legal address. code: 0x29
Aborted (core dumped)

However, if we delete the system KDB, this workload can run through

[Analysis] Something in #2863 has caused the incompatibility

FYI: @JehandadKhan @atamazov


junliume commented 4 weeks ago

It should be changes in this file which are causing this problem: https://github.com/ROCm/MIOpen/blob/9064d096005af004c64d76a8c4c326a6cfd01c0f/src/solver/conv_direct_naive_conv_fwd.cpp

and similar issues might be in other directions too

junliume commented 4 weeks ago

By removing these lines: https://github.com/ROCm/MIOpen/blob/9064d096005af004c64d76a8c4c326a6cfd01c0f/src/solver/conv_direct_naive_conv.cpp#L441-L442 It starts to work properly again.

So the KDB cache does not accept these parameters. However, why the refreshed KDB still does not accept them? @cderb

junliume commented 4 weeks ago

@atamazov FYI, this issue and the other GPU Target embedded in compiler issues are related. We will branch soon and need to regenerate quite a few KDBs with all fixes merged in develop branch first.

atamazov commented 4 weeks ago

@junliume I recommend reverting #2863, if we have this possibility. Adding alpha/beta to the existing convolution primitive is a nontrivial thing and should be carefully analyzed prior implementing.

atamazov commented 2 weeks ago

@junliume

[Analysis] Something in #2863 has caused the incompatibility

Of course! If a PR changes the number of input arguments, or their types, then KDB must be regenerated. And changing the number of input arguments is a substantial change (see https://github.com/ROCm/MIOpen/pull/2863/files#r1641574996).

[Note] Unfortunately, I see that some deeper redesign is required in order to make this bilinear stuff (alpha/beta) working correctly, see https://github.com/ROCm/MIOpen/pull/2863#discussion_r1641574235.