Open · LakshanF opened this issue 1 year ago
It was observed that when profiling concurrent workloads, frequent runtime suspension alters the behavior of the application to such a degree that it can no longer be considered the same scenario.
There are ways to make this much less intrusive, for example by sending signals/APCs to the threads and performing the sampling in the handlers. This way the sampling will scale with the app and will not stand in the way of concurrency.
> There are ways to make this much less intrusive by, for example, sending signals […]
I have a fair amount of experience in this area from my time maintaining Mono's log profiler (for Unix systems), where we did exactly this. I can share some insights here that might be useful for anyone doing work in this area:
- Regular (non-real-time) signals (e.g. `SIGPROF`) are going to get throttled by the kernel.
- `pthread_kill` can fail for certain threads, e.g. exiting GCD threads on macOS (`ESRCH`).
- Some older kernel/libc combinations had broken signal support (e.g. mismatched `sigset_t` sizes, incorrect `SIGRTMIN`/`SIGRTMAX`), but I doubt this is relevant today.
- Real-time signal queues can overflow, making delivery fail (`EAGAIN`), which, in Mono's case, caused the whole STW machinery to break.
- If signal handlers live in dynamically loaded libraries (`.so`/`.dylib`), it's important to load them with `RTLD_NOW`, since lazy symbol binding happening in a callback invoked from a signal handler can break horribly.

@alexrp Thanks! This is very useful info.
We have seen some of these problems as part of working on GC suspension, which is also based on signals, but some concerns are new, e.g. throttling. I guess GC suspension is just not frequent enough to suffer from that.
Many concerns like signal safety are mitigated in GC-suspension scenarios by the fact that we are only interested in stopping at safe points, which must be in managed code and, in most cases, at call sites. At safe points signal safety can be ignored, since we know what managed code could or could not be doing (it can't be inside the kernel or holding malloc locks, for example). Perhaps such a constraint is acceptable; the current GC-suspension-based tracing would have the same limitation anyway.
Throttling is a new concern, though it must already be affecting the current suspension-based tracing, since signals are used (RT signals where available, ordinary signals otherwise).
Basically, yes. It looks like it might be more challenging than it seems at first, but it also looks like all the concerns apply to the current approach as well, so it probably can't get worse than it is today.
> We have seen some of these problems as a part of working on GC suspension, which is also based on signals, but some concerns are new - i.e. throttling. I guess GC suspension is just not frequent enough to suffer from that.
That sounds likely. In Mono we will opportunistically use RT signals for both GC and profiling if they are available, and fall back to regular signals if not.
> Many concerns like signal safety are mitigated in GC suspension scenarios by the fact that we are only interested in stopping at safe points and that must be in managed code and in most cases at call sites. At safe points signal safety can be ignored as we know what managed code could or could not be doing. (can't be inside the kernel or holding malloc locks for example) Perhaps such constraint is acceptable. The current GC suspension-based tracing would have the same limitation anyways.
That might be okay. It still makes me super nervous to do non-async-signal-safe stuff in a signal handler, but maybe it can work in practice under these controlled circumstances. You're essentially still relying on undefined behavior, though.
> The throttling is a new concern though, although it must be already affecting the current suspension-based tracing, since signals are used (RT where available, ordinary signals otherwise).
I would assume that the throttling isn't necessarily a huge concern for suspension since you're likely to always make forward progress as long as you keep retrying the suspension signal on threads that haven't ack'd it yet (in the non-RT-signal case). Sampling signals are more troublesome because you usually want a guaranteed X number of signals multiplied by Y number of threads per second, in order to get high quality data.
After careful consideration, we do not plan to action this particular item in this release. We will continue to evaluate it for future releases. Ideally, we would like to fix every issue and implement every idea people submit. Realistically, we cannot address every item.
Makes sense. On the benefit/cost/risk scale this may not be the most urgent item to go after.
As part of exploring enabling the sample profiler in `NativeAOT`, I came across the `CoreCLR` implementation, and it seems relatively expensive: we suspend managed execution, iterate over the managed threads to collect stack information, and then resume execution.

Wrt `NativeAOT`, it might not be necessary for the runtime to provide this service; perhaps `NativeAOT` can get away with relying on platform profilers instead.

Talking this over with @VSadov, it seems worthwhile to explore less expensive ways to do this in `CoreCLR` when large numbers of threads are involved.

Filing this issue to track the problem; relevant details of the call stack are copied in case they're useful.