perf + watchdog - GitHub issues

mmeeks commented 8 months ago

We have nice profiles via perf of things going on - but it is hard to focus these on where users experience latency.

We should add a watchdog timer to the 'Kit' process: a thread that wakes up every ~5ms and, if we have not entered the Kit main-loop in the last ~10ms, starts doing an unusual system-call / operation that perf can pick out.
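A minimal sketch of that watchdog thread, assuming C++17. The `Watchdog` class, its `heartbeat()` method and the `onStall` callback are hypothetical names for illustration and are not taken from the Kit sources; the 5ms / 10ms values from the comment are passed in by the caller.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>
#include <utility>

// Hypothetical sketch: a thread that wakes up every `period` and calls
// `onStall` whenever the main loop has not reported in for `threshold`.
class Watchdog
{
public:
    using Clock = std::chrono::steady_clock;

    Watchdog(std::chrono::milliseconds period,
             std::chrono::milliseconds threshold,
             std::function<void()> onStall)
        : _period(period)
        , _threshold(threshold)
        , _onStall(std::move(onStall))
        , _lastBeat(Clock::now().time_since_epoch().count())
        , _thread([this] { run(); })
    {
        assert(period.count() > 0 && threshold >= period);
    }

    ~Watchdog()
    {
        _stop.store(true);
        _thread.join();
    }

    // Call this each time the main loop is entered.
    void heartbeat()
    {
        _lastBeat.store(Clock::now().time_since_epoch().count());
    }

private:
    void run()
    {
        while (!_stop.load())
        {
            std::this_thread::sleep_for(_period);
            const Clock::duration sinceBeat =
                Clock::now().time_since_epoch() - Clock::duration(_lastBeat.load());
            if (sinceBeat > _threshold)
                _onStall(); // e.g. an unusual system-call or a probe-able function
        }
    }

    const std::chrono::milliseconds _period;
    const std::chrono::milliseconds _threshold;
    const std::function<void()> _onStall;
    std::atomic<Clock::rep> _lastBeat;
    std::atomic<bool> _stop{false};
    std::thread _thread; // declared last so it starts after the other members
};
```

The main loop would construct one `Watchdog` at start-up and call `heartbeat()` at the top of every iteration. While the loop is stalled the callback fires on every wake-up, so the marker shows up repeatedly in the trace for as long as the stall lasts.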

https://www.brendangregg.com/perf.html

perf probe --add tcp_sendmsg # and https://docs.kernel.org/trace/tracepoints.html

I believe we can sample system-wide, across all CPUs, with perf record -a, so we would get more than just our watchdog process.

I imagine something like the 'access' system-call would be a good one as a 1st cut; it's not frequently used.
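As a rough illustration of the 'access' idea, the stall handler could issue one `access(2)` call on a distinctive path. The function name and the marker path below are hypothetical. On x86-64 Linux the call should then be visible through the `syscalls:sys_enter_access` tracepoint, though that depends on how the C library implements `access()` on a given architecture.

```cpp
#include <cassert>
#include <cerrno>
#include <unistd.h>

// Hypothetical marker path: chosen only so the call is easy to spot in a trace.
static const char* const kStallMarkerPath = "/tmp/kit-watchdog-stall-marker";

// Issue one access(2) call as a trace marker.
// Returns 0 if the path exists, otherwise the errno reported by access().
int emitStallMarker(const char* path = kStallMarkerPath)
{
    assert(path != nullptr);
    errno = 0;
    if (::access(path, F_OK) == 0)
        return 0;
    return errno;
}
```

The result of the call does not matter for profiling purposes; the path normally does not exist, so the call is cheap and has no side effects.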

@grandinj or @caolanm - this might be a nice 31337 quick-hack. I expect this belongs mostly in Kit, but we could probably do it in wsd as well. I imagine we could use triggers other than time-since-last-wakeup, such as growing queue length or socket buffer size, to find other problems too.

mmeeks commented 7 months ago

Seems like if we have a magic symbol we can add a probe point on it remotely for perf, for example: perf probe -x /bin/zsh zzfree. After that, perf probe --list should show it. So we just need to add a simple watchdog, I guess.
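A sketch of what such a magic symbol could look like, assuming GCC or Clang. The names `kit_watchdog_stall` and `stallCount` are hypothetical; the point is only that the function has C linkage, is never inlined and is not optimised away, so its symbol stays available as a probe point for `perf probe -x`.

```cpp
#include <atomic>
#include <cassert>

// Counts how often the marker was hit; also gives the function a visible effect.
static std::atomic<unsigned long> g_stallCount{0};

// Hypothetical magic symbol: C linkage keeps the name unmangled, noinline
// keeps a real call in the binary for a probe to attach to.
extern "C" __attribute__((noinline)) void kit_watchdog_stall()
{
    g_stallCount.fetch_add(1, std::memory_order_relaxed);
    asm volatile("" ::: "memory"); // compiler barrier so the body is not elided
}

// Number of times the marker has been called so far.
unsigned long stallCount()
{
    return g_stallCount.load(std::memory_order_relaxed);
}
```

The watchdog would call `kit_watchdog_stall()` each time it detects a stall, and the probe would be added against the binary that contains the symbol, in the same way as the zsh example above.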

mmeeks commented 7 months ago

https://github.com/CollaboraOnline/online/compare/private/mmeeks/watchdog?expand=1 has a watchdog for this.

caolanm commented 1 month ago

This was done with https://github.com/CollaboraOnline/online/pull/8556. Automated uploads of watchdog profiles, e.g. for 24.04, are at https://github.com/caolanm/profiles/tree/co-24.04/watchdog

CollaboraOnline / online

perf + watchdog #8540