facebookincubator / dynolog

Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from different components in the system, such as the Linux kernel, CPU, disks, Intel PT, and GPUs. Dynolog also integrates with PyTorch and can trigger traces for distributed training applications.
MIT License

Curious about how it works #195

Closed stricklandye closed 9 months ago

stricklandye commented 10 months ago

Hi guys. I have recently been investigating whether there are approaches to profile a GPU program at runtime without intruding into its source code (i.e. no need to restart the program being profiled or modify its source code). The PyTorch profiler (Kineto) combined with dynolog seems pretty close to this goal, except that the environment variable KINETO_USE_DAEMON=1 must be set when launching the PyTorch program.

So I am curious about the principle under the hood, and I have some questions:

  1. What is the purpose of KINETO_USE_DAEMON=1? According to a blog post by some people from Meta, I quote a paragraph:

    First, we modified PyTorch to register with the Dynolog daemon on start up. This feature is switched on by setting the environment variable KINETO_USE_DAEMON=True. With this environment variable set to True, the PyTorch Profiler periodically polls Dynolog to check for on-demand tracing requests.

    So my understanding is that the PyTorch profiler collects trace info by default, and only when the user requests tracing and KINETO_USE_DAEMON=1 is set does it send the trace info it has collected to the user. If the env variable is not set, the PyTorch profiler just discards everything. Am I right?

  2. How does the PyTorch profiler work? Maybe this isn't the right place to discuss Kineto; if not, I will open an issue on the Kineto repo. After a quick look at the Kineto source code, I guess the mystery behind Kineto is CUPTI. I am wondering whether I can implement non-intrusive profiling using CUPTI. For example, while an AI application is running, use a command like profile-gpu --pid=$PID (where profile-gpu is a profiling tool implemented with CUPTI) to collect trace info, with no need to attach a hook and restart the application. I don't know if that is feasible.

  3. Is there another way to implement non-intrusive profiling? eBPF may be a good choice. The only relevant doc I can find about AI observability is here, but the eBPF part of it is not open source :(.

Thanks in advance, I hope you can give me some insights. :D

briancoutinho commented 9 months ago

So my understanding is that the PyTorch profiler collects trace info by default, and only when the user requests tracing and KINETO_USE_DAEMON=1 is set does it send the trace info it has collected to the user. If the env variable is not set, the PyTorch profiler just discards everything. Am I right?

Actually, the way it works is that the PyTorch profiler is a layer above Kineto (the CUPTI/GPU profiling library).

  1. When KINETO_USE_DAEMON=1 is set, Kineto checks for a running dynolog instance; if one is found, it periodically polls dynolog.
  2. As a user you can make a remote tracing request using the dyno CLI tool; this goes over the network to the dynolog instance.
  3. The next time Kineto polls dynolog, it receives the tracing command.
  4. Kineto then switches on the PyTorch profiler instrumentation too, so it can get the function/operator names and annotations. In addition, it also turns on CUPTI, which adds CUDA runtime and GPU kernel tracing information.

Kineto basically acts as a control center for tracing inside the application (see the sketch below). Note that it also works with AMD's ROCm API, which is basically CUPTI for AMD.
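
To make this flow concrete, here is a minimal sketch of the on-demand path from the user's side. It assumes dynolog is already running on the host and a CUDA-capable machine; the script name, model, and training loop are placeholders, and the exact dyno CLI flags may differ (check dyno gputrace --help).

```python
# train.py: an ordinary PyTorch program with no profiler code in it.
# Launch it so Kineto registers with and polls dynolog:
#   KINETO_USE_DAEMON=1 python train.py
# Then, from another shell, request an on-demand trace by PID:
#   dyno gputrace --pid <pid of train.py>
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(100_000):
    x = torch.randn(64, 1024, device="cuda")
    loss = model(x).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # standard training step; nothing profiler-related in this file
```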

If the env variable is not set, the PyTorch profiler just discards everything. Am I right?

Not really, you can still use the PyTorch profiler via its Python API. The env variable mainly asks Kineto to talk to dynolog on the server: https://github.com/search?q=repo%3Apytorch%2Fpytorch%20KINETO_USE_DAEMON&type=code. One more detail: the env variable also automatically adds iteration markers using a hook.

Maybe this will clarify a bit. Irrespective of the env variable, PyTorch can switch between profiling ON and OFF modes pretty freely; you never need to re-run the program. However, you have to manually write profiler.start() and stop() calls in your PyTorch program (see the sketch below). The main feature dynolog adds is the ability to do this externally using a command like 'dyno gputrace --pid $PID'. This is essentially what you are looking for.
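
For comparison, a minimal sketch of the manual (in-code) path using the public torch.profiler API; the model, input shape, and output file name are placeholder assumptions, and it assumes a CUDA device (drop ProfilerActivity.CUDA for CPU-only).

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")

# Explicit start()/stop() written into the program itself; this is the
# switch that dynolog lets you flip externally instead.
prof = profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA])
prof.start()
for _ in range(10):
    model(x)
prof.stop()
prof.export_chrome_trace("manual_trace.json")
```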

So using dynolog and PyTorch/Kineto you should basically get non-intrusive, on-demand tracing.

PS: The env variable bit is just a safety/gating thing so we do not blow up stuff for all PyTorch users :)

briancoutinho commented 9 months ago

Is there another way to implement non-intrusive profiling? eBPF may be a good choice

Yes, folks at Meta have built an eBPF-based tool, but it's not open source :( However, it cannot obtain GPU-side kernel timing; that is basically on NVIDIA's side of things, and they would need to support eBPF. We tried asking, but it's not here yet :( :( :( The eBPF approach is definitely helpful for kernel launch information and debugging memory allocations/deallocations, but we would still need to open source it.

stricklandye commented 9 months ago

Thanks for your reply. Hope you have a nice holiday :D. So KINETO_USE_DAEMON=1 just lets Kineto check whether there is a tracing request from the dyno CLI; if there is, Kineto switches on the PyTorch profiler and transfers the tracing info to the dynolog user. Right?

briancoutinho commented 9 months ago

Yes, that is correct :)

stricklandye commented 9 months ago

I see. Thanks for your patience.