CoreCLR sample profiling seems expensive

LakshanF commented 1 year ago

As part of exploring enabling the sample profiler in NativeAOT, I came across the CoreCLRimplementation and that seems relatively expensive. We suspend managed execution, iterate over managed threads to get the stack information, and then resume execution.

Wrt NativeAOT, it might not be necessary for the runtime to provide this service and perhaps NativeAOTcan get away with relying on platform profilers for this service.

Talking over this with @VSadov, it seems worthwhile to explore less expensive ways to do this in CoreCLRwhen large numbers of threads are involved.

Filing this issue to track the problem, relevant details of the call stack is copied in case its useful.

0:010> kn
Child-SP          RetAddr               Call Site
05 00000026`d707fba0 00007ffd`3d4baf1e     coreclr!ep_rt_coreclr_sample_profiler_write_sampling_event_for_threads+0x5c [runtime\src\coreclr\vm\eventing\eventpipe\ep-rt-coreclr.cpp @ 159] 
06 00000026`d707fbd0 00007ffd`3d4c4ac0     coreclr!ep_rt_sample_profiler_write_sampling_event_for_threads+0x1e [runtime\src\coreclr\vm\eventing\eventpipe\ep-rt-coreclr.h @ 565] 
07 00000026`d707fc00 00007ffd`3d4bb3f2     coreclr!sampling_thread+0xe0 [runtime\src\native\eventpipe\ep-sample-profiler.c @ 101] 
08 00000026`d707fcf0 00007ffd`ec0c26bd     coreclr!ep_rt_thread_coreclr_start_func+0x32 [runtime\src\coreclr\vm\eventing\eventpipe\ep-rt-coreclr.h @ 813] 
09 00000026`d707fd40 00007ffd`edd0a9f8     KERNEL32!BaseThreadInitThunk+0x1d
0a 00000026`d707fd70 00000000`00000000     ntdll!RtlUserThreadStart+0x28

VSadov commented 1 year ago

It was observed that when profiling concurrent workloads the effect of frequent runtime suspension alters the behavior of the application to the degree that it can't be considered the same scenario.

There are ways to make this much less intrusive by, for example, sending signals/APCs to the threads and perform sampling in the handlers. This way the sampling will scale with the app and will not stand in the way of concurrency.

alexrp commented 1 year ago

There are ways to make this much less intrusive by, for example, sending signals

I have a fair amount of experience in this area from my time maintaining Mono's log profiler (for Unix systems), where we did exactly this. I can share some insights here that might be useful for anyone doing work in this area:

At a certain point, regular signals (e.g. SIGPROF) are going to get throttled by the kernel.
- The kernel seems to prioritize delivery to the main thread.
- This can get so bad that you get little-to-literally-no data for some threads.
- The throttling is process-wide, so no matter what you do, this can't actually scale with number of threads.
- macOS seems to throttle much more aggressively than e.g. Linux.
- macOS doesn't allow you to pthread_kill GCD threads (ESRCH).
Real-time signals don't suffer from throttling.
- Some older Android versions don't support them properly (small sigset_t, incorrect SIGRTMIN/SIGRTMAX), but I doubt this is relevant today.
- They are not supported on macOS. AFAIK, thread suspension/resumption is still your best bet there.
- Spamming threads with RT signals is subject to similar issues to thread suspension/resumption, so should be done with care.
- An RT signal should be ack'd before sending the next signal to prevent signal queue overflow (EAGAIN), which, in Mono's case, caused the whole STW machinery to break.
It is very difficult to write correct code in signal handlers.
- This is especially true when you need to perform a task as (relatively) complex as walking the stack at an arbitrary point and recording an arbitrary number of frames.
- On some platforms, profiling signals can arrive before a thread has even had a chance to initialize TLS, making TLS access in the handler unsafe. Coordination with the runtime's thread manager is required to avoid this.
- For loadable profilers (i.e. in .so/.dylib), it's important to load them with RTLD_NOW, since lazy symbol binding happening in a callback invoked from a signal handler can break horribly.

VSadov commented 1 year ago

@alexrp Thanks! This is very useful info.

We have seen some of these problems as a part of working on GC suspension, which is also based on signals, but some concerns are new - i.e. throttling. I guess GC suspension is just not frequent enough to suffer from that.

Many concerns like signal safety are mitigated in GC suspension scenarios by the fact that we are only interested in stopping at safe points and that must be in managed code and in most cases at call sites. At safe points signal safety can be ignored as we know what managed code could or could not be doing. (can't be inside the kernel or holding malloc locks for example) Perhaps such constraint is acceptable. The current GC suspension-based tracing would have the same limitation anyways.

The throttling is a new concern though, although it must be already affecting the current suspension-based tracing, since signals are used (RT where available, ordinary signals otherwise).

Basically - yes. It looks like it might be more challenging than it seems at first, but it also looks like all the concerns are applicable to the current approach, so it probably can't get worse than what it is today.

alexrp commented 1 year ago

We have seen some of these problems as a part of working on GC suspension, which is also based on signals, but some concerns are new - i.e. throttling. I guess GC suspension is just not frequent enough to suffer from that.

That sounds likely. In Mono we will opportunistically use RT signals for both GC and profiling if they are available, and fall back to regular signals if not.

Many concerns like signal safety are mitigated in GC suspension scenarios by the fact that we are only interested in stopping at safe points and that must be in managed code and in most cases at call sites. At safe points signal safety can be ignored as we know what managed code could or could not be doing. (can't be inside the kernel or holding malloc locks for example) Perhaps such constraint is acceptable. The current GC suspension-based tracing would have the same limitation anyways.

That might be okay. It still makes me super nervous to do non-async-signal-safe stuff in a signal handler, but maybe it can work in practice under these controlled circumstances. You're essentially still relying on undefined behavior, though.

The throttling is a new concern though, although it must be already affecting the current suspension-based tracing, since signals are used (RT where available, ordinary signals otherwise).

I would assume that the throttling isn't necessarily a huge concern for suspension since you're likely to always make forward progress as long as you keep retrying the suspension signal on threads that haven't ack'd it yet (in the non-RT-signal case). Sampling signals are more troublesome because you usually want a guaranteed X number of signals multiplied by Y number of threads per second, in order to get high quality data.

tommcdon commented 1 year ago

After careful consideration, we do not plan to action this particular item in this release. We will continue to evaluate it for future releases. Ideally, we would like to fix every issue and implement every idea people submit. Realistically, we cannot address every item.

VSadov commented 1 year ago

Makes sense. On the benefit/cost/risk scale this may not be the most urgent item to go after.

dotnet / runtime

CoreCLR sample profiling seems expensive #83754