aurae-runtime / aurae

Distributed systems runtime daemon written in Rust.
https://aurae.io
Apache License 2.0
1.85k stars, 91 forks

Mapping host PIDs to namespace PIDs #422

Closed (JeroenSoeters closed 1 year ago)

JeroenSoeters commented 1 year ago

Why

As outlined in this RFC, we need to be able to map host PIDs surfaced from eBPF instrumentation to namespaced PIDs so that users can make sense of this instrumentation.

What

The bulk of the implementation for this PR lives in the ProcCache. This cache allows us to access information about processes beyond the lifetime of those processes. This is needed in cases where we receive "post-mortem" instrumentation for processes, for example when a process has been OOM-killed or has received a SIGKILL. When we receive such events, we can no longer reliably look up the namespace PID from /proc, so we need to make sure we cache this information.
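To illustrate why the lookup stops working once a process is gone, here is a minimal sketch of reading a namespace PID from /proc. It assumes the lookup is based on the `NSpid:` line of /proc/<pid>/status, which lists the PID in each nested PID namespace with the host value first. The function names `parse_ns_pid` and `lookup_ns_pid` are hypothetical and not taken from the Aurae codebase, which may do this differently.

```rust
/// Extracts the innermost namespace PID from the text of /proc/<pid>/status.
/// The `NSpid:` line lists the PID in each nested PID namespace, outermost
/// (host) first, so the last value is the PID as seen inside the namespace.
fn parse_ns_pid(status: &str) -> Option<u32> {
    status
        .lines()
        .find_map(|line| line.strip_prefix("NSpid:"))
        .and_then(|rest| rest.split_whitespace().last())
        .and_then(|pid| pid.parse().ok())
}

/// Hypothetical helper: reads the status file of a live process.
/// Returns None once the process has exited and its /proc entry is gone,
/// which is the situation a cache has to cover.
fn lookup_ns_pid(host_pid: u32) -> Option<u32> {
    let status = std::fs::read_to_string(format!("/proc/{host_pid}/status")).ok()?;
    parse_ns_pid(&status)
}
```

For a process that is not in a nested PID namespace, the `NSpid:` line holds a single value and the sketch returns the host PID itself.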

The ProcCache is an expiring cache: it holds on to process info (right now we only cache the namespace PID) for a period configurable by evict_at, after which the cache entries are evicted. Evictions happen when events are queried from the cache, at most once per eviction interval, as specified by evict_every. The ProcCache is hooked up to 2 PerfEventBroadcast streams:

The first stream triggers the creation of new entries in the cache; the second marks cache entries for eviction.
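The behaviour described above can be sketched roughly as follows. This is not the code from the PR: the method names, the field layout, and the choice to start the evict_at clock when an entry is marked for eviction are all assumptions made for illustration. Time is passed in explicitly so the sketch stays deterministic.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Cached info for one process; only the namespace PID is kept for now.
struct Entry {
    ns_pid: u32,
    /// Set when the second stream marks the entry for eviction (assumption:
    /// the entry then expires `evict_at` after this instant).
    exited_at: Option<Instant>,
}

struct ProcCache {
    evict_at: Duration,
    evict_every: Duration,
    last_eviction: Instant,
    entries: HashMap<u32, Entry>,
}

impl ProcCache {
    fn new(evict_at: Duration, evict_every: Duration, now: Instant) -> Self {
        Self { evict_at, evict_every, last_eviction: now, entries: HashMap::new() }
    }

    /// First stream: a new process was observed, so create a cache entry.
    fn on_process_start(&mut self, host_pid: u32, ns_pid: u32) {
        self.entries.insert(host_pid, Entry { ns_pid, exited_at: None });
    }

    /// Second stream: the process is gone, so mark its entry for eviction.
    fn on_process_exit(&mut self, host_pid: u32, now: Instant) {
        if let Some(entry) = self.entries.get_mut(&host_pid) {
            entry.exited_at = Some(now);
        }
    }

    /// Lookup by host PID. An eviction pass runs first, but at most once
    /// per `evict_every`, so frequent queries stay cheap.
    fn get(&mut self, host_pid: u32, now: Instant) -> Option<u32> {
        if now.duration_since(self.last_eviction) >= self.evict_every {
            let evict_at = self.evict_at;
            self.entries.retain(|_, e| match e.exited_at {
                Some(t) => now.duration_since(t) < evict_at,
                None => true, // still running: never expires
            });
            self.last_eviction = now;
        }
        self.entries.get(&host_pid).map(|e| e.ns_pid)
    }
}
```

With evict_at set to 5 seconds, an entry marked at t+1s is still returned by a query at t+3s and is gone by a query at t+10s, which is what lets post-mortem events be mapped shortly after the process has exited.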

To avoid cluttering the ObserveService with filtering and mapping logic in every endpoint, a new struct, ObservedEventStream, has been introduced. It is a wrapper around PerfEventBroadcast<T> that has access to the CgroupCache and ProcCache, and therefore supports filtering by workload (via the cgroup path) and mapping host PIDs to namespace PIDs. As long as a PerfEventBroadcast<T> wraps a T that implements HasCgroupId and HasHostPid, it can be wrapped by an ObservedEventStream and exposed by the observe API. Not every PerfEventBroadcast<T> can or needs to be exposed by the observe API, hence the distinction. Some events will only ever be used to manage internal Aurae state, and that is perfectly fine.
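A simplified sketch of the wrapper idea, under several assumptions: a plain iterator stands in for the asynchronous PerfEventBroadcast<T>, two HashMaps stand in for the CgroupCache and ProcCache, and the trait method names, the `SignalEvent` type, and `collect_for` are hypothetical. Only the trait names HasCgroupId and HasHostPid come from the description above.

```rust
use std::collections::HashMap;

trait HasCgroupId {
    fn cgroup_id(&self) -> u64;
}

trait HasHostPid {
    fn host_pid(&self) -> u32;
}

/// Hypothetical event type, used only for this illustration.
#[derive(Debug)]
struct SignalEvent {
    cgroup_id: u64,
    host_pid: u32,
    signum: i32,
}

impl HasCgroupId for SignalEvent {
    fn cgroup_id(&self) -> u64 { self.cgroup_id }
}

impl HasHostPid for SignalEvent {
    fn host_pid(&self) -> u32 { self.host_pid }
}

/// Wraps an event source together with the two lookups it needs.
struct ObservedEventStream<'a, I> {
    events: I,
    cgroup_paths: &'a HashMap<u64, String>, // stand-in for the CgroupCache
    ns_pids: &'a HashMap<u32, u32>,         // stand-in for the ProcCache
}

impl<'a, I> ObservedEventStream<'a, I>
where
    I: Iterator,
    I::Item: HasCgroupId + HasHostPid,
{
    /// Keeps only events from the given workload (matched by cgroup path,
    /// or all events when `workload` is None) and pairs each event with
    /// its namespace PID, if one is known.
    fn collect_for(self, workload: Option<&str>) -> Vec<(I::Item, Option<u32>)> {
        let Self { events, cgroup_paths, ns_pids } = self;
        events
            .filter(|e| match workload {
                Some(path) => {
                    cgroup_paths.get(&e.cgroup_id()).map(String::as_str) == Some(path)
                }
                None => true,
            })
            .map(|e| {
                let ns_pid = ns_pids.get(&e.host_pid()).copied();
                (e, ns_pid)
            })
            .collect()
    }
}
```

Because the bounds sit on the wrapper rather than on the event source, an event type that implements neither trait can still be broadcast internally; it just cannot be wrapped and exposed this way.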

krisnova commented 1 year ago

Looks good to me. The next steps are going to be exercising this with the container work (yes, I know it's outstanding... I am working on it!) and seeing how your new-fangled filter works at scale.

Beautiful PR. Well done @JeroenSoeters 🥇