NVlabs / NVBit


Is it possible to use device memory (global memory) instead of unified memory (__managed__) to store the statistical data #98

Closed xinyi-li7 closed 2 years ago

xinyi-li7 commented 2 years ago

In your NVBit tool examples, you always store the statistical data in unified memory. For example, the opcode_hist tool uses __managed__ uint64_t histogram[MAX_OPCODES] to count the opcodes and passes it to the injected function through nvbit_add_call_arg_const_val64(i, (uint64_t)histogram).

Since unified memory is expensive, I was wondering if we could allocate device memory (global memory, since shared/constant memory is not available) before a kernel is launched.

Now I am trying to allocate a "d_histogram" variable, and pass it to the injection function through the function nvbit_add_call_arg_const_val64(i, (uint64_t)d_histogram). But it doesn't seem to work: in the injection function, it reports "an illegal memory access."

Could you please confirm whether I can do that in order to debug my program?

Thank you in advance!

ovilla commented 2 years ago

The mechanism you describe should work. At which point are you allocating the device memory? If you attach a very small example that reproduces the error, I can take a look at it.

xinyi-li7 commented 2 years ago

Hi Oriste, Thanks for your response! Sure, I just attached the modified program here. opcode_hist.zip.

For your convenience, I post a diff gist of opcode_hist.cu.

In inject_funcs.cu (gist), if I print this pointer, it is 0, which is null. And I cannot access the content in this pointer.

I hope this information can help! Thank you so much!

ovilla commented 2 years ago

Because you should be passing:

nvbit_add_call_arg_const_val64(i, (uint64_t)d_histogram);

and not

nvbit_add_call_arg_const_val64(i, (uint64_t)*d_histogram);

Let me know if that fixes it.
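The difference between the two calls can be sketched in plain host C++, without a GPU or NVBit. This is only a model: count_opcode and as_call_arg are hypothetical names standing in for the injected device function and for the cast done at the nvbit_add_call_arg_const_val64 call site. The argument carries the address itself as a 64-bit integer, and the receiving side casts it back to a pointer.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the injected function: it receives the address
// as a plain 64-bit integer and casts it back to a pointer before use.
void count_opcode(uint64_t phist, int opcode) {
    assert(phist != 0);  // zero means no valid address was passed
    uint64_t* hist = reinterpret_cast<uint64_t*>(phist);
    hist[opcode] += 1;
}

// Hypothetical stand-in for the call site: pass the pointer's value (the
// address), not *histogram, which would read the first element instead.
uint64_t as_call_arg(uint64_t* histogram) {
    return reinterpret_cast<uint64_t>(histogram);
}
```

With a real device pointer, writing *d_histogram on the host would additionally try to read device memory from host code, which is consistent with the crash reported later in the thread.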

xinyi-li7 commented 2 years ago

Oops, I forgot to modify this snippet; what I tested (ran on my computer) is nvbit_add_call_arg_const_val64(i, (uint64_t)d_histogram);.

The result I described before (the pointer is 0) is for the former; the latter (with *) just fails with a segmentation fault.

Sorry for this typo.

I attached the correct version here. opcode_hist.zip

ovilla commented 2 years ago

You are allocating and freeing inside the launch (so the pointer changes on every launch), whereas the instrumentation passes a constant value fixed at the moment of instrumentation. Moving cudaMalloc and cudaFree into nvbit_at_ctx_init and nvbit_at_ctx_term, respectively, should solve your problem. If you really need to allocate and free memory at launch time, it is more complicated, but it can be done.
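A minimal host-only sketch of why a per-launch allocation goes stale, assuming the kernel is instrumented once (the first time it is seen) and the argument value is copied at that moment. ToolModel and all of its members are hypothetical names for illustration, the addresses are fake, and the exact ordering in the attached tool may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a tool; none of these names are NVBit API.
struct ToolModel {
    uint64_t live_buffer = 0;     // address of the current allocation
    uint64_t baked_arg = 0;       // constant frozen into the injected call
    bool instrumented = false;    // instrumentation happens only once
    uint64_t next_addr = 0x1000;  // source of fake, distinct addresses

    uint64_t fake_alloc() { next_addr += 0x100; return next_addr; }

    // Copies the pointer's value as it is right now, then never again.
    void instrument_if_needed() {
        if (!instrumented) { baked_arg = live_buffer; instrumented = true; }
    }

    // Problem pattern: a fresh allocation at every launch.
    // Returns true if the injected call would see the live buffer.
    bool launch_with_per_launch_alloc() {
        live_buffer = fake_alloc();
        instrument_if_needed();
        return baked_arg == live_buffer;
    }

    // Suggested pattern: allocate once at context init, reuse afterwards.
    void ctx_init() { live_buffer = fake_alloc(); }
    bool launch_with_ctx_alloc() {
        assert(live_buffer != 0);  // ctx_init must have run first
        instrument_if_needed();
        return baked_arg == live_buffer;
    }
};
```

In this model the first per-launch allocation happens to match the frozen value, and every later launch receives an address that is no longer current. Allocating once keeps the frozen value valid for the lifetime of the context.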

xinyi-li7 commented 2 years ago

Ah, gotcha!

Can you give a hint on how to do it inside the kernel?

I will try to modify my algorithm so that I can keep a global table to record the data throughout the whole program. But my initial idea is to keep one table for each kernel so that I can have a fixed-size table in global memory. Just in case I still need a kernel scope table:-). Thanks!

ovilla commented 2 years ago

Look at the mem_trace.cu example, in particular at nvbit_add_call_arg_launch_val64(instr, 0); and nvbit_set_at_launch(ctx, p->f, (uint64_t)&grid_launch_id);. Good luck, closing issue (as non issue).
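The idea behind those two calls can be modeled in plain host C++: at instrumentation time only the location of the argument (an offset into a per-launch buffer) is recorded, and the value is written before each launch. LaunchArgModel and its members are hypothetical names and do not reproduce NVBit's internals or signatures. Treat this as a sketch of the concept and check mem_trace.cu for the real usage.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical model; names are illustrative and not NVBit API.
struct LaunchArgModel {
    unsigned char launch_buf[8] = {0};  // per-launch argument storage
    int arg_offset = -1;                // where the injected call reads from

    // Instrumentation time: record only WHERE the value will live.
    void add_launch_arg(int offset) {
        assert(offset == 0);  // this model holds a single 64-bit slot
        arg_offset = offset;
    }

    // Before each launch: store the value for this particular launch,
    // for example the address of a buffer allocated for this kernel.
    void set_at_launch(uint64_t value) {
        std::memcpy(launch_buf, &value, sizeof(value));
    }

    // What the injected function receives when the kernel runs.
    uint64_t injected_arg() const {
        assert(arg_offset == 0);  // add_launch_arg must have been called
        uint64_t v = 0;
        std::memcpy(&v, launch_buf + arg_offset, sizeof(v));
        return v;
    }
};
```

Because the value is read at launch rather than frozen at instrumentation, a per-kernel table could be allocated before each launch and its address supplied this way, under the assumption that the tool updates the slot before every launch.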