ROCm / omnitrace

Omnitrace: Application Profiling, Tracing, and Analysis
https://rocm.docs.amd.com/projects/omnitrace/en/latest/
MIT License

Inaccurate device counter trace #318

Open sfantao opened 9 months ago

sfantao commented 9 months ago

Using https://github.com/amd/HPCTrainingExamples/tree/main/HIPIFY/mini-nbody/hip as an example, if I collect device counters with rocprof using:

> cat $wd/counters.txt
pmc : WriteSize FetchSize
> bash -c "export ROCR_VISIBLE_DEVICES=0 ; rocprof -i $wd/counters.txt ./nbody-orig $((12*65536))"

I get:

Index,KernelName,gpu-id,queue-id,queue-index,pid,tid,grd,wgr,lds,scr,arch_vgpr,accum_vgpr,sgpr,wave_size,sig,obj,WriteSize,FetchSize
0,"bodyForce(Body*, float, int) [clone .kd]",4,0,0,148495,148495,786432,256,0,0,16,0,16,64,0x0,0x7f4abf7508c0,36723.0000000000,524628.5625000000
1,"bodyForce(Body*, float, int) [clone .kd]",4,0,2,148495,148495,786432,256,0,0,16,0,16,64,0x0,0x7f4abf7508c0,17505.1250000000,488091.6250000000
2,"bodyForce(Body*, float, int) [clone .kd]",4,0,4,148495,148495,786432,256,0,0,16,0,16,64,0x0,0x7f4abf7508c0,17510.6250000000,487910.1250000000
3,"bodyForce(Body*, float, int) [clone .kd]",4,0,6,148495,148495,786432,256,0,0,16,0,16,64,0x0,0x7f4abf7508c0,33072.5000000000,2820859.8125000000
4,"bodyForce(Body*, float, int) [clone .kd]",4,0,8,148495,148495,786432,256,0,0,16,0,16,64,0x0,0x7f4abf7508c0,32875.0000000000,1719172.6875000000
5,"bodyForce(Body*, float, int) [clone .kd]",4,0,10,148495,148495,786432,256,0,0,16,0,16,64,0x0,0x7f4abf7508c0,31081.0000000000,668958.1250000000
6,"bodyForce(Body*, float, int) [clone .kd]",4,0,12,148495,148495,786432,256,0,0,16,0,16,64,0x0,0x7f4abf7508c0,17516.0000000000,488220.2500000000
7,"bodyForce(Body*, float, int) [clone .kd]",4,0,14,148495,148495,786432,256,0,0,16,0,16,64,0x0,0x7f4abf7508c0,32861.8750000000,3522902.0625000000
8,"bodyForce(Body*, float, int) [clone .kd]",4,0,16,148495,148495,786432,256,0,0,16,0,16,64,0x0,0x7f4abf7508c0,17505.0000000000,488151.7500000000
9,"bodyForce(Body*, float, int) [clone .kd]",4,0,18,148495,148495,786432,256,0,0,16,0,16,64,0x0,0x7f4abf7508c0,32938.8750000000,2949121.8750000000
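To put a number on the fluctuation in the rocprof output above, here is a minimal Python sketch that parses the CSV and reports the max/min ratio of each counter across the ten kernel launches. The CSV is trimmed to the columns used here, and the helper names `read_counters` and `spread` are hypothetical, introduced only for illustration.

```python
import csv
import io

# rocprof output from this issue, trimmed to the kernel name and the two counters.
ROCPROF_CSV = """Index,KernelName,WriteSize,FetchSize
0,"bodyForce(Body*, float, int) [clone .kd]",36723.0000000000,524628.5625000000
1,"bodyForce(Body*, float, int) [clone .kd]",17505.1250000000,488091.6250000000
2,"bodyForce(Body*, float, int) [clone .kd]",17510.6250000000,487910.1250000000
3,"bodyForce(Body*, float, int) [clone .kd]",33072.5000000000,2820859.8125000000
4,"bodyForce(Body*, float, int) [clone .kd]",32875.0000000000,1719172.6875000000
5,"bodyForce(Body*, float, int) [clone .kd]",31081.0000000000,668958.1250000000
6,"bodyForce(Body*, float, int) [clone .kd]",17516.0000000000,488220.2500000000
7,"bodyForce(Body*, float, int) [clone .kd]",32861.8750000000,3522902.0625000000
8,"bodyForce(Body*, float, int) [clone .kd]",17505.0000000000,488151.7500000000
9,"bodyForce(Body*, float, int) [clone .kd]",32938.8750000000,2949121.8750000000
"""


def read_counters(text, names=("WriteSize", "FetchSize")):
    """Parse rocprof CSV text into one dict of counter values per kernel launch."""
    # csv handles the quoted kernel name, which itself contains commas.
    reader = csv.DictReader(io.StringIO(text))
    return [{name: float(row[name]) for name in names} for row in reader]


def spread(values):
    """Ratio of the largest to the smallest value; 1.0 means no fluctuation."""
    return max(values) / min(values)


counters = read_counters(ROCPROF_CSV)
for name in ("WriteSize", "FetchSize"):
    ratio = spread([c[name] for c in counters])
    print(f"{name}: max/min across launches = {ratio:.2f}")
```

A ratio well above 1 for either counter is the launch-to-launch variation that a per-kernel view of the same counters would be expected to show.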

If I use omnitrace with a configuration containing:

OMNITRACE_ROCM_EVENTS                              = FetchSize:device=0 WriteSize:device=0

and run:

bash -c "export ROCR_VISIBLE_DEVICES=0 ; omnitrace-sample ./nbody-orig $((12*65536))"

I get:

image

i.e., the counters do not show the fluctuation that the rocprof output says they should.

Tested on ROCm 5.7.0 with omnitrace 1.10.4 (installer: omnitrace-1.10.4-ubuntu-20.04-ROCm-50700-PAPI-OMPT-Python3.sh).

For completeness, on a different machine with ROCm 5.6.1 I see things like:

image

Again there are no fluctuations, but for the first kernel the reading starts out correct and then shifts in the middle of the kernel.

jrmadsen commented 8 months ago

There are a few things going on here. First, I believe the default view of the timelines shows the accumulation of the counters, so you will not see them fluctuate but instead grow over time. If you click on the lightning-bolt-looking thing, you can change the view; I think one of the options will be the delta. Second, there are likely some discrepancies from mapping hardware counters for kernels onto the kernel-independent timeline. Third, I don't have a ton of confidence in the timing alignment between omnitrace's current use of roctracer for kernel timing and the kernel timings that rocprofiler reports alongside the HW counters. This needs to be investigated.
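The accumulated-versus-delta distinction in the first point can be illustrated with a short Python sketch. It takes the per-kernel WriteSize values from the rocprof output earlier in this issue, builds a running total, and then differences it again. This only illustrates the idea; it is an assumption that the trace view behaves this way, and the function names `to_cumulative` and `to_delta` are hypothetical, not part of omnitrace or the trace viewer.

```python
from itertools import accumulate

# Per-kernel WriteSize values from the rocprof output in this issue.
per_kernel = [36723.0, 17505.125, 17510.625, 33072.5, 32875.0,
              31081.0, 17516.0, 32861.875, 17505.0, 32938.875]


def to_cumulative(samples):
    """Running total: what an accumulating counter track would display."""
    return list(accumulate(samples))


def to_delta(cumulative):
    """Difference of consecutive readings: recovers the per-sample values."""
    deltas = []
    previous = 0.0
    for value in cumulative:
        deltas.append(value - previous)
        previous = value
    return deltas


cumulative = to_cumulative(per_kernel)
deltas = to_delta(cumulative)

for total, delta in zip(cumulative, deltas):
    print(f"cumulative={total:12.3f}  delta={delta:10.3f}")
```

The cumulative column only ever grows, which hides the launch-to-launch variation, while the delta column reproduces the fluctuating per-kernel values.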