Can't get function name and args

AllenLyu commented 1 month ago

Hello! I'm trying to use this wonderful tool to profile my CUDA application, but when I attach to my process, I don't get some of the information. Here's my command:

./gpuevent_snoop --pid xxx -asv

The process I attach to is a simple program that runs inference on an ONNX model with onnxruntime in an infinite loop.

The command output looks like this:

Found Symbol cudaLaunchKernel at /home/didi/code/model_benchmark/build/model_tester Offset: 0x97100
Found Symbol cudaLaunchKernel at /usr/lib/libonnxruntime_providers_cuda.so Offset: 0x0
Found Symbol cudaLaunchKernel at /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudart.so.11.8.89 Offset: 0x6c140
Started profiling at Tue Jul 30 15:24:36 2024
model_tester [2612112] KERNEL [0x784923f12430] STREAM 0x611cdf0e7bb0 GRID (3,1,1) BLOCK (256,1,1) [Unknown] Args: Stack: 000000000006c140: cudaLaunchKernel @ 0x6c140+0x0

So, what should I do to get more information?

AllenLyu commented 1 month ago

I found that my onnxruntime lib is stripped. When I use the debug version of the lib, I get what I want.
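For anyone hitting the same symptom, here is a minimal sketch of one way to check whether a library has been stripped. It assumes a 64-bit little-endian ELF file and simply looks for a `.symtab` section, which `strip` removes by default while leaving `.dynsym` in place. The function name `has_symtab` is made up for this example and is not part of strobelight.

```python
import struct

SHT_SYMTAB = 2  # section type of the full symbol table (.symtab)


def has_symtab(elf: bytes) -> bool:
    """Return True if a 64-bit little-endian ELF image has a .symtab section."""
    if len(elf) < 64 or elf[:4] != b"\x7fELF" or elf[4] != 2 or elf[5] != 1:
        raise ValueError("expected a 64-bit little-endian ELF file")
    # ELF64 header: e_shoff at 0x28, e_shentsize at 0x3A, e_shnum at 0x3C.
    (shoff,) = struct.unpack_from("<Q", elf, 0x28)
    shentsize, shnum = struct.unpack_from("<HH", elf, 0x3A)
    for i in range(shnum):
        # sh_type is the second 4-byte field of each section header.
        (sh_type,) = struct.unpack_from("<I", elf, shoff + i * shentsize + 4)
        if sh_type == SHT_SYMTAB:
            return True
    return False
```

Usage would be something like `has_symtab(open(path, "rb").read())`, where `path` points at the library in question. A `False` result suggests the file was stripped and that function names may not resolve.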

Another question: how can I get timing (time cost) information?

slowbreathing commented 1 month ago

Hi Allen, how did you build strobelight or is there a prebuilt binary?

RihamSelim commented 1 month ago

Hi @slowbreathing - I added build instructions, please give it a try

RihamSelim commented 1 month ago

Closing since issue was resolved by not stripping symbols

slowbreathing commented 1 month ago

Hi Riham, thanks a lot. I was trying it on Ubuntu 22.04. For those trying it out, install libfmt-dev and the build should go through.

slowbreathing commented 1 month ago

Hi Riham, I have 2 quick questions. And thanks in advance.

1. Is there a way to format strobelight output for a flame graph? I mean an existing option or tool.
2. How do I get a trace for cudaMalloc? I tried strobelight but only got addresses. AFAIK the probe 'uprobe:/lib/x86_64-linux-gnu/libcuda.so.535.104.12:cuMemAllocAsync*' works with bpftrace, but obviously not all addresses are recognised. I have also tried perf and the bcc profiler; they all work but require painstakingly un-stripping all the libraries. My question is: how do I effectively trace cudaMalloc or some variant of it?

RihamSelim commented 1 month ago

Hi @slowbreathing, thanks for your questions, here are some thoughts:

1. We don't support this out of the box, but it seems like it could be added easily. We just need to update the handle_event function to log the events to a file in a format that FlameGraph would understand. I am looking at the FlameGraph repo and the format seems pretty straightforward: it appears to be a flat list of stacks (e.g. https://github.com/brendangregg/FlameGraph/blob/master/example-dtrace-stacks.txt).
2. Can you explain more about what you mean by this?

but obviously not all addresses are recognised

Is strobelight not able to find the cudaMalloc symbol at all? We should be able to locate it and handle it, but the bpf code would look different - this is an example handler that we plan to open source soon:

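SEC("uprobe")
int BPF_KPROBE(handle_cuda_malloc_enter, void** devPtr, size_t size) {
  // Debug message written to the kernel trace buffer.
  bpf_printk("Malloc Enter addr = 0x%llx!", devPtr);
  // Upper 32 bits hold the process id (tgid), lower 32 bits the thread id.
  uint64_t pid_tgid = bpf_get_current_pid_tgid();

  // Stash the call arguments, keyed by the calling thread.
  struct gpu_alloc_request_t alloc_request;
  alloc_request.ptr_addr = (uint64_t)devPtr;
  alloc_request.size = size;

  // BPF_NOEXIST: insert only if no entry exists yet for this key.
  bpf_map_update_elem(&alloc_requests, &pid_tgid, &alloc_request, BPF_NOEXIST);
  return 0;
}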
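On the flame graph question, a rough sketch of the conversion step might look like the following. It assumes the stacks have already been parsed into lists of frame names, leaf first, and it emits the "folded" one-line-per-stack format that flamegraph.pl consumes (frames joined by `;`, root first, followed by a count). The function name `fold_stacks` is hypothetical; nothing like it exists in strobelight today.

```python
from collections import Counter
from typing import Iterable, List


def fold_stacks(stacks: Iterable[List[str]]) -> List[str]:
    """Collapse stacks (leaf-first frame lists) into folded flame graph lines."""
    counts: Counter = Counter()
    for frames in stacks:
        if not frames:
            continue  # skip events that carried no stack
        # ';' separates frames in the folded format, so it must not
        # appear inside a frame name.
        clean = [frame.replace(";", ":") for frame in frames]
        # The folded format lists the root frame first.
        counts[";".join(reversed(clean))] += 1
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]
```

The resulting lines could be written to a file and passed to flamegraph.pl to render an SVG.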

slowbreathing commented 1 month ago

Hi Riham,

Thank you very much for your very very quick responses.

1. Yes, the flame graph format is very simple; I was just looking for some tool or option that did this already. I remember seeing a YouTube video of yours (https://www.youtube.com/watch?v=5xAghByteYc&t=349s) where you show a screenshot. I was wondering if that was part of the open-sourced strobelight.
2. cudaMalloc does not trigger, and it is not present as a uprobe target in libcuda.so; I checked using bpftrace -l. I did trace cuMemAllocAsync and got the trace below.

Found Symbol cuMemAllocAsync at tensorflow/core/kernels/libtfkernel_sobol_op.so Offset: 0x853b0
Found Symbol cuMemAllocAsync at tensorflow/python/framework/_test_metrics_util.so Offset: 0x41b90
Found Symbol cuMemAllocAsync at tensorflow/python/util/_pywrap_tfprof.so Offset: 0x42830
Found Symbol cuMemAllocAsync at tensorflow/compiler/tf2tensorrt/_pywrap_py_utils.so Offset: 0x47690
Found Symbol cuMemAllocAsync at tensorflow/python/profiler/internal/_pywrap_profiler.so Offset: 0x11ef90
Found Symbol cuMemAllocAsync at tensorflow/python/_pywrap_parallel_device.so Offset: 0x4e830
Found Symbol cuMemAllocAsync at tensorflow/python/util/_pywrap_checkpoint_reader.so Offset: 0x47a60
Found Symbol cuMemAllocAsync at tensorflow/python/framework/_proto_comparators.so Offset: 0x44d30
Found Symbol cuMemAllocAsync at tensorflow/libtensorflow_framework.so.2 Offset: 0x17e49f0
Found Symbol cuMemAllocAsync at tensorflow/python/_pywrap_tensorflow_internal.so Offset: 0x10a24210
Found Symbol cuMemAllocAsync at /usr/lib/x86_64-linux-gnu/libcuda.so.535.104.12 Offset: 0x3390d0
Started profiling at Sun Aug 4 20:55:47 2024
python [205940] KERNEL [0x7f3b519e39a8] STREAM 0x7f39fa03df49 GRID (768000,0,-134197856) BLOCK (2,0,-134198336) [Unknown] Stack: 00000000003390d0: cuMemAllocAsync @ 0x3390d0+0x0

python [205940] KERNEL [0x7f3b519e3938] STREAM 0x7f39fa03df49 GRID (768000,0,-134197856) BLOCK (2,0,-134198336) [Unknown] Stack: 00000000003390d0: cuMemAllocAsync @ 0x3390d0+0x0

python [205940] KERNEL [0x7f3b519e3948] STREAM 0x7f39fa03df49 GRID (7680000,0,-134197856) BLOCK (2,0,-134198336) [Unknown] Stack: 00000000003390d0: cuMemAllocAsync @ 0x3390d0+0x0

python [205940] KERNEL [0x7f3b519e3948] STREAM 0x7f39fa03df49 GRID (9436176,0,-134197856) BLOCK (2,0,-1133326336) [Unknown] Stack: 00000000003390d0: cuMemAllocAsync @ 0x3390d0+0x0

python [205940] KERNEL [0x7f3b519e3948] STREAM 0x7f39fa03df49 GRID (3840000,0,-134197856) BLOCK (2,0,-134198336) [Unknown] Stack: 00000000003390d0: cuMemAllocAsync @ 0x3390d0+0x0

Could you shed some light on this? And thanks again for your quick responses.

I will eagerly wait for the cudaMalloc handler to be open sourced.

slowbreathing commented 1 month ago

Hi Riham, a small update: we have a model that takes voice input and generates both SQL and mongoQL. Internally it is made up of 2-3 different models, one of which is a variation of T5. gpuevent_snoop does not terminate for this application. Attaching the log file.

[Uploading cudaKL.txt…]()

RihamSelim commented 1 month ago

Hi @slowbreathing,

That specific flame graph is actually from a Chrome Trace Format file opened with chrome://tracing. For example, this is a trace that can be opened with chrome://tracing or Perfetto:

{
    "traceEvents": [
        {"name":"add_vectors","ts":16257092434309,"dur":1000,"ph":"X","tid":"3110469","pid":"3110467"},
        {"name":"test1","ts":16257092444309,"dur":1000,"ph":"X","tid":"3110469","pid":"3110467"},
        {"name":"add_vectors2","ts":16257092454309,"dur":1000,"ph":"X","tid":"3110469","pid":"3110467"},
        {"name":"test2","ts":16257092464309,"dur":1000,"ph":"X","tid":"3110469","pid":"3110467"},

        {"name":"do_stuff1","ts":16257092434309,"dur":1000,"ph":"X","tid":"3110467","pid":"3110467"},
        {"name":"do_stuff2","ts":16257092444309,"dur":1000,"ph":"X","tid":"3110467","pid":"3110467"},
        {"name":"do_stuff3","ts":16257092454309,"dur":1000,"ph":"X","tid":"3110467","pid":"3110467"},
        {"name":"do_stuff4","ts":16257092464309,"dur":1000,"ph":"X","tid":"3110467","pid":"3110467", "sf": 9},
        {}
    ],
    "stackFrames": {
        "5": { "name": "main", "category": "my app" },
        "7": { "parent": "5", "name": "parent_frame", "category": "my app" },
        "8": { "parent": "7", "name": "parent_frame2", "category": "my app" },
        "9": { "parent": "7", "name": "do_stuff4", "category": "my app" }
    }
  }
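As an illustration, a file like the one above could be produced with a few lines of Python. This is only a sketch under the assumption that each event is available as a (name, start, duration, pid, tid) tuple with times in microseconds, which is the unit the Chrome Trace Format uses for `ts` and `dur`. The helper name `to_chrome_trace` is made up for this example.

```python
import json
from typing import Iterable, Tuple

Event = Tuple[str, int, int, int, int]  # name, ts_us, dur_us, pid, tid


def to_chrome_trace(events: Iterable[Event]) -> str:
    """Serialize events as Chrome Trace Format "complete" (ph = "X") events."""
    trace_events = [
        {"name": name, "ph": "X", "ts": ts_us, "dur": dur_us, "pid": pid, "tid": tid}
        for name, ts_us, dur_us, pid, tid in events
    ]
    return json.dumps({"traceEvents": trace_events}, indent=2)
```

Writing the returned string to a .json file should be enough to load it in chrome://tracing or Perfetto. The optional stackFrames section is left out here to keep the sketch short.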

The log file is not working for me. cudaMalloc could be statically linked, but gpuevent_snoop should still be able to locate it. Can you try to search for the symbol manually? I use this quick command:

export pid=<your process id>

cat /proc/$pid/maps | cut -c 74- | sort | uniq | sort -n | while read line; do nm /proc/$pid/root$line 2>/dev/null | if [ "$(grep cudaMalloc)" ]; then echo /proc/$pid/root$line; fi ; done
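If the fixed column offset in `cut -c 74-` does not line up on your system, the same search can be sketched in Python by splitting each maps line on whitespace instead. This is an untested-in-production sketch: the function names `mapped_files` and `files_defining` are made up for this example, and it assumes `nm` is on the PATH.

```python
import subprocess
from typing import List


def mapped_files(maps_text: str) -> List[str]:
    """Return the unique file paths found in the text of /proc/<pid>/maps."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, pathname (optional).
        fields = line.split(None, 5)
        if len(fields) == 6 and fields[5].startswith("/"):
            paths.add(fields[5])
    return sorted(paths)


def files_defining(pid: int, symbol: str) -> List[str]:
    """List mapped files of a process whose nm output mentions the symbol."""
    with open(f"/proc/{pid}/maps") as f:
        candidates = mapped_files(f.read())
    hits = []
    for path in candidates:
        # Resolve through /proc/<pid>/root, as the shell command does.
        target = f"/proc/{pid}/root{path}"
        result = subprocess.run(["nm", target], capture_output=True, text=True)
        if symbol in result.stdout:
            hits.append(target)
    return hits
```

For example, `files_defining(pid, "cudaMalloc")` would return the mapped files whose symbol table mentions cudaMalloc. Like the shell command, it finds nothing in files that `nm` cannot read or that have been stripped.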

I will also check in a test program that can be used for testing cudaMalloc attachment.
