Open skinowski opened 6 years ago
This is stats related: if collectStats (https://github.com/fnproject/fn/blob/master/api/agent/drivers/docker/docker.go#L298) is disabled, containerd CPU usage goes away.
As a side note, the stats we currently use are standard cgroup accounting metrics available under /sys/fs/cgroup, so in the worst case, if this issue becomes a priority, we can pull them ourselves.
We have since removed logs from containerd's responsibility list (https://github.com/fnproject/fn/pull/768); I've seen CPU usage of dockerd/containerd drop substantially since then.
Stats are still likely responsible for some CPU usage, and we should debug it. The way we have things set up, polling stats every 1s isn't very useful: for a function that takes 10ms to run, only about 1 in 100 calls gets docker metrics. Grabbing stats often enough to cover every call (100/100) would only increase CPU usage. It seems weird that reading proc files takes so much CPU anyway (I looked around docker and nothing stuck out). An API to get container stats / logs seems weird, too, but could be useful; maybe we just allow hooking up to our metrics system?
Some notes for debugging dockerd CPU usage. I've had success doing the following to get pprof profiles; they aren't perfectly detailed, but the gaps aren't so hard to fill in:
Run dockerd with -D for debug mode, then:
$ socat -d -d TCP-LISTEN:8000,fork,bind=0.0.0.0 UNIX:/var/run/docker.sock
$ go tool pprof -raw http://0.0.0.0:8000/debug/pprof/profile
$ go tool pprof -http 0.0.0.0:5000 /usr/bin/dockerd /home/reed/pprof/pprof.dockerd.samples.cpu.003.pb.gz
When the fn system is stable (no containers are being spawned or killed) and running hot functions, containerd is still observed to take some CPU. For example, with 200 containers under a steady 20 req/sec load, we observe 200 shim processes and 1 containerd process (with many threads).
Seems like containerd is on the data processing path.
Why is containerd taking CPU? Is it due to logs, stats, or IO?
If we moved to a port/network model instead of stdin/out/err, would this take containerd out of the data flow?