Memory access fault by GPU node-2 on Fiji gfx803 (with test cases and ISA)

ex-rzr commented 5 years ago

This issue is related to https://github.com/RadeonOpenCompute/ROCR-Runtime/issues/52 https://github.com/ROCmSoftwarePlatform/rocPRIM/issues/15

HCC 1.3.18482 (ROCm 2.0)

Memory access fault by GPU node-2 (Agent handle: 0x7f3e4aa1b090) on address 0x1e04405000. Reason: Page not present or supervisor privilege.

https://github.com/ROCmSoftwarePlatform/rocPRIM/tree/for_hcc_issue_1024

This branch has the workaround disabled and only one active test for simplicity (https://github.com/ROCmSoftwarePlatform/rocPRIM/commit/e9753c4ca92eb3d8b88a4bdb0e81ce763c86cea5).

git clone -b for_hcc_issue_1024 https://github.com/ROCmSoftwarePlatform/rocPRIM.git
cd rocPRIM
mkdir build && cd build
CXX=hcc cmake ..
make test_hip_device_scan && test/rocprim/test_hip_device_scan

https://github.com/ROCmSoftwarePlatform/rocPRIM/blob/for_hcc_issue_1024/rocprim/include/rocprim/device/detail/device_scan_lookback.hpp#L279

Both branches of if statement (if(flat_block_id == 0)) can be executed for a thread block with flat_block_id == 0. I checked it by disabling some parts of the code and writing intermediate values to buffers, and indeed, both if-branch and else-branch are executed leading to incorrect memory access.

So this "workaround" separates two branches of the if statement. syncthreads and the second if are not required for the algorithm. This change makes the kernel work correctly, however, without any guarantee (I mean it's possible that for other types and scan op, for example, the workaround won't help).

ISA for this test: isa_memory_access_fault.zip

test_hip_device_scan-gfx803-workaround.isa and test_hip_device_scan-gfx900-workaround.isa, correct results:

    s_barrier                                                  // 0000000035A0: BF8A0000 
    v_cmp_ne_u32_e32 vcc, 0, v4                                // 0000000035A4: 7D9A0880 
    s_and_b64 vcc, exec, vcc                                   // 0000000035A8: 86EA6A7E 
    s_cbranch_vccz BB1_40                                      // 0000000035AC: BF860015

test_hip_device_scan-gfx900.isa (without workaround), correct results:

    v_cmp_ne_u32_e32 vcc, 0, v5                                // 00000000361C: 7D9A0A80 
    s_and_b64 vcc, exec, vcc                                   // 000000003620: 86EA6A7E 

    ... no vcc modification

    v_add_u32_e32 v28, -1, v28                                 // 000000003758: 683838C1 
    v_add_u32_e32 v26, v27, v26                                // 00000000375C: 6834351B 
    v_lshrrev_b32_e32 v27, 6, v0                               // 000000003760: 20360086 
    v_cmp_eq_u32_e64 s[6:7], v28, v0                           // 000000003764: D0CA0006 0002011C 
    s_cbranch_vccz BB1_64                                      // 00000000376C: BF860144

test_hip_device_scan-gfx803.isa (without workaround), incorrect results:

    ... no v_cmp_ne_u32_e32 vcc, 0, v5 equivalent

    v_add_u32_e32 v28, vcc, -1, v28                            // 00000000375C: 323838C1 
    s_and_b64 vcc, exec, vcc                                   // 000000003760: 86EA6A7E <-- totally unrelated value?
    v_cmp_eq_u32_e64 s[6:7], v28, v0                           // 000000003764: D0CA0006 0002011C 
    s_cbranch_vccz BB1_64                                      // 00000000376C: BF860149

Here v4 and v5 are unsinged int flat_block_id, i.e. the variable in the if.

Another kernel with a similar problem:

tabulate-gfx900.isa, correct results:

    v_cmp_ne_u32_e32 vcc, s6, v3                               // 000000002358: 7D9A0606 
    s_addc_u32 s2, s13, s3                                     // 00000000235C: 8202030D 
    v_mov_b32_e32 v1, 0                                        // 000000002360: 7E020280 
    s_and_b64 vcc, exec, vcc                                   // 000000002364: 86EA6A7E 
    v_add_u32_e32 v3, s0, v4                                   // 000000002368: 68060800 
    s_cbranch_vccz BB1_2                                       // 00000000236C: BF86000A

tabulate-gfx803.isa, incorrect results:

    v_add_u32_e32 v3, vcc, s0, v1                              // 000000002340: 32060200 
    s_and_b64 vcc, exec, vcc                                   // 000000002344: 86EA6A7E 
    v_mov_b32_e32 v1, 0                                        // 000000002348: 7E020280 
    s_cbranch_vccz BB1_2                                       // 00000000234C: BF86000A

In both cases, for gfx803, the compiler overwrites vcc register in v_add, so v_cmp_* vcc is completely removed (I suppose it was there just like for gfx900), s_cbranch_vccz proceeds with the ruined value of vcc.

b-sumner commented 5 years ago

@ex-rzr could you point back to the specific source lines which are being miscompiled and describe the changes you made to avoid the problem. Anything you can do to help us quickly zero in on the the problem without having to try to understand the entire application and rocPRIM library will save some time.

ex-rzr commented 5 years ago

@b-sumner I've created a branch with the workaround disabled and added some instructions. I hope, it's more clear now.

ex-rzr commented 5 years ago

This issue is not fixed in ROCm 2.3.

But good news: I managed to create a small test case without external dependencies: https://gist.github.com/ex-rzr/98e429dad13aa17ba1e67fc7145ba827

hipcc vcc.cpp -o vcc && ./vcc

The last 10 items are a "canary" area, they must be 0 (not modified by the kernel). Like this (MI25, gfx900):

However, on Fiji (gfx803)

...
995 995
996 996
997 997
998 998
999 999
1000    1000
1001    1001
1002    1002
1003    1003
1004    1004
1005    1005
1006    1006
1007    1007
1008    1008
1009    1009

(So it's just a good luck that there is no "Memory access fault").

The kernel looks weird and stupid, but the problem with this issue is that any change in the kernel, types, etc. may lead to correct instructions.

Here are ISAs: vcc.zip

Again, v_add* modifies vcc, s_cbranch_vcc* uses this irrelevant value (v_cmp* is missing):

    v_add_u32_e32 v3, vcc, v3, v0                              // 000000001140: 32060103 
    v_mov_b32_e32 v2, s1                                       // 000000001144: 7E040201 
    v_add_u32_e64 v1, s[0:1], s0, v1                           // 000000001148: D1190001 00020200 
    s_and_b64 vcc, exec, vcc                                   // 000000001150: 86EA6A7E 
    v_addc_u32_e64 v2, s[0:1], 0, v2, s[0:1]                   // 000000001154: D11C0002 00020480 
    s_cbranch_vccnz BB0_2                                      // 00000000115C: BF870007

gfx900 has correct instructions v_cmp_eq_u32_e32 vcc -> s_cbranch_vccnz.

    v_cmp_eq_u32_e32 vcc, s8, v3                               // 0000000011E4: 7D940608 
    s_and_b64 vcc, exec, vcc                                   // 0000000011E8: 86EA6A7E 
    v_add_u32_e32 v3, v4, v0                                   // 0000000011EC: 68060104 
    s_cbranch_vccnz BB0_2                                      // 0000000011F0: BF870007

ex-rzr commented 5 years ago

Another example (a test case from rocPRIM):

gfx803:

    v_add_u32_e32 v1, vcc, s1, v1                              // 00000000111C: 32020201 
    s_and_b64 vcc, exec, vcc                                   // 000000001120: 86EA6A7E 
    s_cbranch_vccnz BB0_2                                      // 000000001124: BF870073

gfx900:

    v_cmp_eq_u32_e32 vcc, s8, v1                               // 0000000011B8: 7D940208 
    s_and_b64 vcc, exec, vcc                                   // 0000000011BC: 86EA6A7E 
    v_add_u32_e32 v1, s2, v2                                   // 0000000011C0: 68020402 
    s_cbranch_vccnz BB0_2                                      // 0000000011C4: BF870069

test_hip_counting_iterator.zip

b-sumner commented 5 years ago

Thank you for the helpful new standalone code. And let me apologize for the delay on getting to this.

arsenm commented 5 years ago

https://github.com/llvm-mirror/llvm/commit/aa2f57df85af2056eafbef94380664a55570fa83 should fix this particular case

skeelyamd commented 5 years ago

@b-sumner, can you comment on when HCC might pickup this fix?

arsenm commented 5 years ago

On Jun 4, 2019, at 8:56 PM, Sean Keely notifications@github.com wrote:

@b-sumner https://github.com/b-sumner or @arsenm https://github.com/arsenm: It looks like the compiler fix is in. Can we close this issue?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/RadeonOpenCompute/hcc/issues/1024?email_source=notifications&email_token=AABBYY7YL3Y3Y334CSZLANTPY4FK5A5CNFSM4GUK4ZJ2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODW6JCMA#issuecomment-498897200, or mute the thread https://github.com/notifications/unsubscribe-auth/AABBYY75EX4NKQO2VKFGS5TPY4FK5ANCNFSM4GUK4ZJQ.

I don’t have the option of closing it

scchan commented 5 years ago

Currently the fix is only available in HCC's development branch (clang_tot_upgrade), you'll have to build the compiler from source to get it.

ROCm / hcc

Memory access fault by GPU node-2 on Fiji gfx803 (with test cases and ISA) #1024