Closed: mmeeks closed this 1 month ago
It seems that if we have a magic symbol we can add a perf probe point on it remotely, for example: perf probe -x /bin/zsh zzfree. After that, perf probe --list should show the new probe, so we just need to add a simple watchdog, I guess.
https://github.com/CollaboraOnline/online/compare/private/mmeeks/watchdog?expand=1 has a watchdog for this.
This was done with https://github.com/CollaboraOnline/online/pull/8556. Automated uploads of watchdog profiles, e.g. for 24.04, are at https://github.com/caolanm/profiles/tree/co-24.04/watchdog
We have nice profiles via perf of things going on - but it is hard to focus these on where users experience latency.
We should add a watchdog timer to the 'Kit' process: a thread that wakes up every ~5 ms or so, and if we have not entered the Kit main-loop in the last ~10 ms, it starts doing an unusual system call or operation that perf can pick up.
https://www.brendangregg.com/perf.html
perf probe --add tcp_sendmsg # and https://docs.kernel.org/trace/tracepoints.html
I believe we can sample all CPUs system-wide with perf record -a, so we would get more than just our watchdog process.
I imagine something like the 'access' system call would be a good one as a first cut; it is not frequently used.
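As a rough illustration of the idea above, here is a minimal C++ sketch of such a watchdog thread. It is not the code from the linked branch or PR: the class name Watchdog, the methods pet() and stallMarkers(), and the marker path /tmp/kit-watchdog-marker are all hypothetical, and the 5 ms / 10 ms values are just the rough numbers suggested in this issue. The main loop would call pet() each time it wakes; when it fails to do so for longer than the threshold, the watchdog thread issues an access() call on every poll so that a probe or tracepoint on that system call fires only while the loop is stalled.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>
#include <unistd.h> // access(), F_OK

// Hypothetical sketch: names and the marker path are illustrative only.
class Watchdog
{
public:
    using Clock = std::chrono::steady_clock;

    Watchdog(std::chrono::milliseconds poll, std::chrono::milliseconds threshold)
        : _poll(poll), _threshold(threshold)
    {
        assert(poll.count() > 0 && threshold >= poll);
        pet(); // start the clock before the thread begins polling
        _thread = std::thread([this] { run(); });
    }

    ~Watchdog()
    {
        _stop.store(true);
        _thread.join();
    }

    // Called by the main loop every time it wakes up.
    void pet() { _lastPet.store(Clock::now().time_since_epoch().count()); }

    // Number of marker system calls issued so far (handy for testing).
    unsigned long stallMarkers() const { return _markers.load(); }

private:
    void run()
    {
        while (!_stop.load())
        {
            std::this_thread::sleep_for(_poll);
            const Clock::duration sincePet =
                Clock::now().time_since_epoch() - Clock::duration(_lastPet.load());
            if (sincePet > _threshold)
            {
                // Rarely used system call that serves as the marker for perf;
                // the path is an assumption and the result is ignored.
                (void)::access("/tmp/kit-watchdog-marker", F_OK);
                ++_markers;
            }
        }
    }

    const std::chrono::milliseconds _poll;
    const std::chrono::milliseconds _threshold;
    std::atomic<Clock::rep> _lastPet{0};
    std::atomic<unsigned long> _markers{0};
    std::atomic<bool> _stop{false};
    std::thread _thread;
};
```

A main loop would construct something like Watchdog dog(std::chrono::milliseconds(5), std::chrono::milliseconds(10)) and call dog.pet() at the top of each iteration. On x86-64 Linux, glibc's access() normally maps to the access system call, so a tracepoint such as syscalls:sys_enter_access should see the markers; other architectures may route it through faccessat instead, so the event name is an assumption to verify on the target machine.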
@grandinj or @caolanm this might be a nice 31337 quick-hack - I expect it belongs mostly in Kit, but we could prolly do it in wsd as well. I imagine we could use other triggers than time-since-last-wakeup, such as growing queue length or socket buffer size, to find other problems too.