investigate containerd cpu consumption (and perhaps added latency) in a stable system

skinowski commented 6 years ago

When the fn system is stable (no containers are being spawned or killed) and running hot functions, containerd is still observed to take some CPU. For example, with 200 containers serving a steady 20 req/sec load, we observe 200 shim processes and 1 containerd process (with many threads).

It seems that containerd is on the data processing path.

Why is containerd taking CPU? Is it due to logs, stats, or I/O?

If we moved to a port/network model instead of stdin/stdout/stderr, would this take containerd out of the data flow?

skinowski commented 6 years ago

This is stats related. If collectStats is disabled, containerd CPU usage goes away:

https://github.com/fnproject/fn/blob/master/api/agent/drivers/docker/docker.go#L298

skinowski commented 6 years ago

As a side note, the stats we currently use are standard cgroup accounting metrics available under /sys/fs/cgroup, so in the worst case, if this issue becomes a priority, we can pull them ourselves.
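A minimal sketch of what pulling these ourselves could look like, assuming the "key value" per-line format used by cgroup files such as cpu.stat and memory.stat. The function name parseFlatKeyed and the default path are hypothetical; the real per-container path depends on the cgroup version and the cgroup driver in use, so it is taken as an argument here.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseFlatKeyed parses cgroup files that hold one "key value" pair per line
// (for example cpu.stat or memory.stat). Lines that do not consist of exactly
// a key and an unsigned integer are skipped.
func parseFlatKeyed(content string) map[string]uint64 {
	stats := make(map[string]uint64)
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue
		}
		v, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			continue
		}
		stats[fields[0]] = v
	}
	return stats
}

func main() {
	// Assumption: this default is only a placeholder. Pass the container's
	// actual cgroup file as the first argument.
	path := "/sys/fs/cgroup/cpu.stat"
	if len(os.Args) > 1 {
		path = os.Args[1]
	}
	data, err := os.ReadFile(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read:", err)
		os.Exit(1)
	}
	for k, v := range parseFlatKeyed(string(data)) {
		fmt.Printf("%s=%d\n", k, v)
	}
}
```

Reading the file directly is a single open/read per sample and does not go through dockerd or containerd, which is the point of the suggestion above.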

rdallman commented 6 years ago

We have since removed logs from containerd's responsibility list (https://github.com/fnproject/fn/pull/768), and I've seen the CPU usage of dockerd/containerd drop substantially since then.

Stats are still likely responsible for some CPU usage, and we should debug it. The way we have things set up, polling stats every 1s isn't very useful for, e.g., a function that takes 10ms to run: only about 1 in 100 calls gets docker metrics. If we grabbed stats more frequently (for 100/100 calls), it would only increase CPU usage. It seems odd that reading proc files takes so much CPU anyway (I looked around docker and nothing stuck out). An API to get container stats/logs seems weird, too, but could be useful; maybe we just allow hooking up to our metrics system?

Some notes for debugging dockerd CPU usage: I've had success doing the following to get pprof profiles. They aren't perfectly detailed, but the gaps aren't so hard to fill in:

Run dockerd with -D for debug mode.
Expose the docker socket over TCP:
$ socat -d -d TCP-LISTEN:8000,fork,bind=0.0.0.0 UNIX:/var/run/docker.sock
Run rigorous tests and concurrently capture a CPU profile:
$ go tool pprof -raw http://0.0.0.0:8000/debug/pprof/profile
Once the capture finishes, open the saved profile:
$ go tool pprof -http 0.0.0.0:5000 /usr/bin/dockerd /home/reed/pprof/pprof.dockerd.samples.cpu.003.pb.gz
Browse to :5000 and view. Flame graphs are also an option with the profile file in hand (https://github.com/uber/go-torch).

fnproject/fn #700