Closed AllenLyu closed 1 month ago
I found that my onnxruntime library was stripped. When I use the debug version of the library, I get what I want.
Another question: how can I get timing (time cost) information?
Hi Allen, how did you build strobelight or is there a prebuilt binary?
Hi @slowbreathing - I added build instructions, please give it a try
Closing since issue was resolved by not stripping symbols
Hi Riham, thanks a lot. I was trying it on Ubuntu 22.04. For those trying it out, install libfmt-dev and the build should go through.
Hi Riham, I have 2 quick questions. And thanks in advance.
Hi @slowbreathing, thanks for your questions, here are some thoughts:
We don't support this out of the box, but it seems like it could be added easily. We just need to update the handle_event function to log the events to a file in a format that FlameGraph would understand. I am looking at the FlameGraph repo and the format seems pretty straightforward, as it seems to be a flat list of stacks (e.g. https://github.com/brendangregg/FlameGraph/blob/master/example-dtrace-stacks.txt).
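As a rough illustration of the "flat list of stacks" idea, here is a minimal sketch of producing the folded format that flamegraph.pl reads: one line per unique stack, frames joined by semicolons (root first), then a space and a sample count. The function name `fold_stacks` and the sample frames are hypothetical, not part of strobelight, and captured stacks that are ordered leaf first would need to be reversed before folding.

```python
from collections import Counter


def fold_stacks(stacks):
    """Collapse stacks into FlameGraph folded lines.

    `stacks` is an iterable of lists of frame names, ordered root first.
    Returns one "frame;frame;frame count" string per unique stack.
    """
    counts = Counter(";".join(frames) for frames in stacks if frames)
    return [f"{stack} {count}" for stack, count in sorted(counts.items())]


# Hypothetical samples, for illustration only (not real strobelight output).
samples = [
    ["main", "run_model", "cudaLaunchKernel"],
    ["main", "run_model", "cudaLaunchKernel"],
    ["main", "load_weights", "cudaMalloc"],
]

folded = fold_stacks(samples)
for line in folded:
    print(line)
```

Writing these lines to a file and passing that file to flamegraph.pl should then render an SVG.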
Can you explain more what you mean by this?
but obviously not all addresses are recognised
Is strobelight not able to find the cudaMalloc symbol at all? We should be able to locate it and handle it, but the BPF code would look different. This is an example handler that we plan to open source soon:
SEC("uprobe")
int BPF_KPROBE(handle_cuda_malloc_enter, void** devPtr, size_t size) {
    bpf_printk("Malloc Enter addr = 0x%llx!", devPtr);
    uint64_t pid_tgid = bpf_get_current_pid_tgid();
    struct gpu_alloc_request_t alloc_request;
    alloc_request.ptr_addr = (uint64_t)devPtr;
    alloc_request.size = size;
    bpf_map_update_elem(&alloc_requests, &pid_tgid, &alloc_request, BPF_NOEXIST);
    return 0;
}
Hi Riham,
Thank you very much for your very very quick responses.
Yes, the FlameGraph format is very simple; I was just looking for a tool or option that does this already. I remember seeing a YouTube video of yours, "https://www.youtube.com/watch?v=5xAghByteYc&t=349s", where you show a screenshot. I was wondering if that was part of the open-sourced strobelight.
cudaMalloc does not trigger, and it is not present as a uprobe in libcuda.so. I checked using bpftrace -l. I did trace cuMemAllocAsync and got the trace below.
python [205940] KERNEL [0x7f3b519e3948] STREAM 0x7f39fa03df49 GRID (3840000,0,-134197856) BLOCK (2,0,-134198336) [Unknown]
Stack:
00000000003390d0: cuMemAllocAsync @ 0x3390d0+0x0
Could you shed some light on this? And thanks again for your quick responses.
I will eagerly wait for the cudaMalloc handler to be open sourced.
Hi Riham, a small update: we have a model that takes voice and generates both SQL and mongoQL. Internally it is made up of 2-3 different models, one of which is a variation of T5. The gpuevent_snoop does not terminate for this application. Attaching the log file.
[Uploading cudaKL.txt…]()
Hi @slowbreathing,
chrome://tracing, e.g. this is an example trace that can be opened with chrome://tracing or perfetto:
{
  "traceEvents": [
    {"name":"add_vectors","ts":16257092434309,"dur":1000,"ph":"X","tid":"3110469","pid":"3110467"},
    {"name":"test1","ts":16257092444309,"dur":1000,"ph":"X","tid":"3110469","pid":"3110467"},
    {"name":"add_vectors2","ts":16257092454309,"dur":1000,"ph":"X","tid":"3110469","pid":"3110467"},
    {"name":"test2","ts":16257092464309,"dur":1000,"ph":"X","tid":"3110469","pid":"3110467"},
    {"name":"do_stuff1","ts":16257092434309,"dur":1000,"ph":"X","tid":"3110467","pid":"3110467"},
    {"name":"do_stuff2","ts":16257092444309,"dur":1000,"ph":"X","tid":"3110467","pid":"3110467"},
    {"name":"do_stuff3","ts":16257092454309,"dur":1000,"ph":"X","tid":"3110467","pid":"3110467"},
    {"name":"do_stuff4","ts":16257092464309,"dur":1000,"ph":"X","tid":"3110467","pid":"3110467", "sf": 9},
    {}
  ],
  "stackFrames": {
    "5": { "name": "main", "category": "my app" },
    "7": { "parent": "5", "name": "parent_frame", "category": "my app" },
    "8": { "parent": "7", "name": "parent_frame2", "category": "my app" },
    "9": { "parent": "7", "name": "do_stuff4", "category": "my app" }
  }
}
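For anyone who wants to produce a file like the one above from their own measurements, here is a minimal sketch that turns a list of timed events into the same "traceEvents" layout using complete events ("ph": "X"), where "ts" and "dur" are in microseconds. The input field names (`start_us`, `dur_us`) and the helper `to_trace_events` are assumptions made for illustration; they are not strobelight output fields.

```python
import json


def to_trace_events(events):
    """Build a Trace Event Format document from simple event dicts.

    Each input dict is assumed to carry: name, start_us, dur_us, pid, tid.
    """
    return {
        "traceEvents": [
            {
                "name": e["name"],
                "ts": e["start_us"],   # start timestamp in microseconds
                "dur": e["dur_us"],    # duration in microseconds
                "ph": "X",             # "X" marks a complete (begin + end) event
                "pid": e["pid"],
                "tid": e["tid"],
            }
            for e in events
        ]
    }


# Hypothetical events, for illustration only.
events = [
    {"name": "add_vectors", "start_us": 100, "dur_us": 1000, "pid": 1, "tid": 2},
    {"name": "do_stuff1", "start_us": 100, "dur_us": 500, "pid": 1, "tid": 1},
]

trace_json = json.dumps(to_trace_events(events), indent=2)
print(trace_json)
```

Saving `trace_json` to a file gives something that can be loaded in chrome://tracing or perfetto.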
cudaMalloc could be statically linked, but gpuevent_snoop should still be able to locate it. Can you try to manually search for the symbol? I use this quick command:
export pid=<your process id>
cat /proc/$pid/maps | cut -c 74- | sort | uniq | sort -n | while read line; do nm /proc/$pid/root$line 2>/dev/null | if [ "$(grep cudaMalloc)" ]; then echo /proc/$pid/root$line; fi ; done
I will also check in a test program that can be used for testing cudaMalloc attachment.
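The shell one-liner above relies on a fixed column offset (`cut -c 74-`) to pull file paths out of /proc/<pid>/maps, which can be fragile if the column widths differ. As a sketch of the same search, the following splits each maps line into its fields instead and then runs `nm` on every mapped file. The helper names `mapped_files` and `files_defining` are hypothetical, and the code assumes `nm` (binutils) is installed.

```python
import subprocess


def mapped_files(maps_text):
    """Return the sorted unique file paths found in /proc/<pid>/maps content."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, then an optional pathname.
        fields = line.split(None, 5)
        if len(fields) == 6:
            path = fields[5].strip()
            if path.startswith("/"):  # skips [heap], [stack], anonymous maps
                paths.add(path)
    return sorted(paths)


def files_defining(pid, symbol):
    """List mapped files whose `nm` output mentions `symbol` (substring match)."""
    with open(f"/proc/{pid}/maps") as f:
        candidates = mapped_files(f.read())
    hits = []
    for path in candidates:
        target = f"/proc/{pid}/root{path}"
        # nm prints nothing useful on stripped or non-ELF files; that is fine here.
        result = subprocess.run(["nm", target], capture_output=True, text=True)
        if symbol in result.stdout:
            hits.append(target)
    return hits
```

A call such as `files_defining(12345, "cudaMalloc")` (with a real process id) would return the matching paths. For libraries whose regular symbol table has been stripped, passing `-D` to `nm` to list dynamic symbols may be worth trying as well.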
Hello! I'm trying to use this wonderful tool to profile my CUDA application, but when I attach to my process, I don't get much information. Here's my command:
./gpuevent_snoop --pid xxx -asv
The process I attach to is a simple program that runs inference on an ONNX model with onnxruntime in an infinite loop.
The command output looks like this:
Found Symbol cudaLaunchKernel at /home/didi/code/model_benchmark/build/model_tester Offset: 0x97100
Found Symbol cudaLaunchKernel at /usr/lib/libonnxruntime_providers_cuda.so Offset: 0x0
Found Symbol cudaLaunchKernel at /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudart.so.11.8.89 Offset: 0x6c140
Started profiling at Tue Jul 30 15:24:36 2024
model_tester [2612112] KERNEL [0x784923f12430] STREAM 0x611cdf0e7bb0 GRID (3,1,1) BLOCK (256,1,1) [Unknown]
Args:
Stack:
000000000006c140: cudaLaunchKernel @ 0x6c140+0x0
So, what should I do to get more information?