Snektron / goniometer


Supporting GFX9? #2

Open Epliz opened 1 month ago

Epliz commented 1 month ago

Hi,

Thanks for this very interesting project. It is very cool to see someone taking matters into their own hands and trying to get instruction profiling working for HIP on Linux.

Do you think it might be possible to add support for GFX9 GPUs? In particular, I have an MI100 GPU (gfx908), and I would be quite interested in being able to see instruction profiles for it. I have no idea whether it would just work the same way as on other GFX9 chips, or whether the Mesa code would be a sufficient source of information. Maybe worth trying?

Best regards, Epliz

Snektron commented 1 month ago

Unfortunately, the instruction tracing functionality is only available on RDNA2+ afaik, so gfx10 or newer. Some other information can be fetched from the GPU, but most of it is covered by rocprof as well.
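The rule of thumb above (instruction tracing in goniometer needs gfx10 or newer) can be expressed as a small check on an LLVM AMDGPU target name. This is a minimal sketch: `gfx_major` and `goniometer_can_trace` are hypothetical helpers, not part of goniometer, and they assume the usual `gfx<major><minor><stepping>` naming in which the last two characters are single hex digits (so `gfx908` and `gfx90a` are both major version 9).

```python
import re


def gfx_major(target: str) -> int:
    """Return the major GFX version of an LLVM AMDGPU target name.

    Assumes the naming scheme gfx<major><minor><stepping>, where minor and
    stepping are one hex digit each (gfx908 -> 9, gfx90a -> 9, gfx1030 -> 10).
    """
    m = re.fullmatch(r"gfx(\d+)([0-9a-f])([0-9a-f])", target.lower())
    if m is None:
        raise ValueError(f"not a gfx target name: {target!r}")
    return int(m.group(1))


def goniometer_can_trace(target: str) -> bool:
    # Hypothetical helper encoding only the rule stated in this thread:
    # instruction tracing as used by goniometer needs gfx10 or newer.
    return gfx_major(target) >= 10
```

For example, `goniometer_can_trace("gfx908")` returns `False` for the MI100, while `goniometer_can_trace("gfx1030")` returns `True`. Note that this only reflects goniometer's requirement, not what other tools such as rocprofv2 can trace.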

Snektron commented 1 month ago

By the way, according to the ROCm docs, rocprofv2 can also gather these instruction traces now.

Epliz commented 1 month ago

Thanks for the reply and for pointing me to rocprofv2. I have just tried rocprofv2, and it seems to collect instruction traces for gfx908 just fine (so extending your support to GFX9 would probably be possible).

The output, though, is pretty hard to use, and it would be much better to have something loadable in RGP. Here is an example trace from profiling the BabelStream copy kernel:

```csv
Addr,Instruction,Hitcount,Cycles,C++ Reference
0x0,; Begin ATT ASM,0,0,
0x7a610c016c00,; _Z11copy_kernelIdEvPKT_PS0_,0,0,
0x7a610c016c00,"s_load_dword s7, s[4:5], 0x1c",13152,76092,
0x7a610c016c08,"s_load_dwordx4 s[0:3], s[4:5], 0x0",13152,65744,
0x7a610c016c10,"v_mov_b32_e32 v1, 0",13152,52608,
0x7a610c016c14,s_waitcnt lgkmcnt(0),13152,381348,
0x7a610c016c18,"s_and_b32 s4, s7, 0xffff",13152,78828,
0x7a610c016c20,"s_mul_i32 s6, s6, s4",13152,65760,
0x7a610c016c24,"v_add_u32_e32 v0, s6, v0",13152,52608,
0x7a610c016c28,"v_lshlrev_b64 v[0:1], 3, v[0:1]",13152,79764,
0x7a610c016c30,"v_mov_b32_e32 v3, s1",13152,101588,
0x7a610c016c34,"v_add_co_u32_e32 v2, vcc, s0, v0",13152,65756,
0x7a610c016c38,"v_addc_co_u32_e32 v3, vcc, v3, v1, vcc",13152,54184,
0x7a610c016c3c,"global_load_dwordx2 v[2:3], v[2:3], off",13152,305720,
0x7a610c016c44,"v_mov_b32_e32 v4, s3",13152,52792,
0x7a610c016c48,"v_add_co_u32_e32 v0, vcc, s2, v0",13152,52632,
0x7a610c016c4c,"v_addc_co_u32_e32 v1, vcc, v4, v1, vcc",13152,52608,
0x7a610c016c50,s_waitcnt vmcnt(0),13152,40815672,
0x7a610c016c54,"global_store_dwordx2 v[0:1], v[2:3], off",13152,54452,
```

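Until there is an RGP-loadable format, a quick way to make this CSV more readable is to rank the rows by cycle count. The sketch below is a hypothetical helper (not part of rocprofv2 or goniometer). It uses Python's standard `csv` module, skips the `;` marker rows, and assumes that the `Cycles` column is accumulated over all `Hitcount` hits when it computes cycles per hit. It runs on a few rows copied from the trace above.

```python
import csv
import io

# A few rows copied from the rocprofv2 ATT CSV above (not the whole trace).
TRACE = """Addr,Instruction,Hitcount,Cycles,C++ Reference
0x0,; Begin ATT ASM,0,0,
0x7a610c016c00,; _Z11copy_kernelIdEvPKT_PS0_,0,0,
0x7a610c016c00,"s_load_dword s7, s[4:5], 0x1c",13152,76092,
0x7a610c016c14,s_waitcnt lgkmcnt(0),13152,381348,
0x7a610c016c3c,"global_load_dwordx2 v[2:3], v[2:3], off",13152,305720,
0x7a610c016c50,s_waitcnt vmcnt(0),13152,40815672,
0x7a610c016c54,"global_store_dwordx2 v[0:1], v[2:3], off",13152,54452,
"""


def hotspots(text):
    """Return (instruction, cycles, share, cycles_per_hit) tuples, most cycles first."""
    rows = []
    for r in csv.DictReader(io.StringIO(text)):
        # Rows whose Instruction starts with ';' are markers, not instructions.
        if r["Instruction"].startswith(";"):
            continue
        rows.append((r["Instruction"], int(r["Hitcount"]), int(r["Cycles"])))
    total = sum(cycles for _, _, cycles in rows)
    # Assumption: Cycles is summed over all hits, so Cycles / Hitcount is a per-hit average.
    out = [
        (ins, cycles, cycles / total, cycles / hits if hits else 0.0)
        for ins, hits, cycles in rows
    ]
    return sorted(out, key=lambda t: t[1], reverse=True)


for ins, cycles, share, per_hit in hotspots(TRACE):
    print(f"{share:6.1%} {cycles:>10} {per_hit:9.1f}  {ins}")
```

On these rows, the `s_waitcnt vmcnt(0)` that waits for the `global_load_dwordx2` accounts for about 98% of the cycles, as one would expect from a memory-bound copy kernel. For a full trace, the file contents can be passed instead, e.g. `hotspots(open(path).read())`, where `path` is wherever rocprofv2 wrote its CSV.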
Snektron commented 1 month ago

Ah, that's interesting. Perhaps it works via a different mechanism... Or maybe it just works; I haven't tested it.