NVlabs / NVBit

220 stars 20 forks source link

Different statistics from mem_trace and opcode_hist #68

Open mahmoodn opened 3 years ago

mahmoodn commented 3 years ago

Hi I have noticed that some statistics are counted differently with different tools. For example in one of the codes in SDK, 6_Advanced/cdpQuadtree, I tried opcode_hist and mem_trace and saw that the number of STG and LDG instructions are different. I paste the output of opcode_hist for STG/LDG related instructions here:

GPU device GeForce RTX 3080 has compute capabilities (SM 8.6)
kernel 0 - void thrust::cuda_cub::core::_kernel_agent...
  STG.E = 32
kernel 1 - void thrust::cuda_cub::core::_kernel_agent...
  STG.E = 32
kernel 2 - void thrust::cuda_cub::core::_kernel_agent...
  STG.E = 32
kernel 3 - void thrust::cuda_cub::core::_kernel_agent...
  STG.E = 32
kernel 4- void thrust::cuda_cub::core::_kernel_agent...
  STG.E = 64
kernel 5 - void build_quadtree_kernel<128>...
  LDG.E = 4950
  LDG.E.64 = 18594
  STG.E = 3816
  STG.E.64 = 2268
Results: OK

Next, I ran mem_trace and redirected the output to a file and counted the number of STG.E instances.

$ grep STG.E mem_trace.txt | wc -l
868
$ grep STG.E.64 mem_trace.txt | wc -l
252
$ grep LDG.E.64 mem_trace.txt | wc -l
2066

I also see that memory traces for kernel_agent are not printed. I mean it lacks some STG.E strings. Do you confirm that? Any thoughts?

UPDATE:

I see that for LDG.E.64 18594/2066 is 9 and for STG.E.64 2268/252 is also 9. Don't know what is the interpretation of that 9...

ovilla commented 3 years ago

It is possible there is a bug somewhere (and we will check for it), but the 2 tools are very different.

You could say mem_trace is more accurate (and slower), while opcode_hist provides an upper bound of what is executed (but it is very fast as it only instruments once per basic block). But again the tools are doing very different things.

For instance, predicated off instructions are logically executed (with a zeroed mask) and thus could be of interest for something like "opcode_hist", this is not true if you only care about seeing memory references.

In general the tools are examples of how to use nvbit rather than provide specific profiling functionality, and for that NVIDIA supported profiling tools are already excellent.

Nvbit is more about giving you freedom to instrument exactly what you need. You should modify and adapt those tools to your needs. For instance, do you care about predicated off instructions? Do you care about thread level count or warp level count? etc...

mahmoodn commented 3 years ago

mem_trace instead prints memory instructions (at instruction level) and does not print any line if the instruction is entirely predicated off.

You mean this part in the mem_trace?

    /* if thread is predicated off, return */
    if (!pred) {
        return;
    }

I commented that part and reran mem_trace hoping to see memory traces of kernels that I didn't see in the original post. Still I don't see them. Don't know what is the effect of that condition then.