Closed JeroenSoeters closed 1 year ago
Looks good to me. The next steps are going to be exercising this with the container work (yes I know it's outstanding... I am working on it!) and seeing how your new-fangled filter works at scale.
Beautiful PR. Well done @JeroenSoeters 🥇
Why
As outlined in this RFC we need to be able to map host PIDs surfaced from eBPF instrumentation to namespaced PIDs so users can make sense of this instrumentation.
What
The bulk of the implementation for this PR lives in the `ProcCache`. This cache allows us to access information about processes beyond the lifetime of those processes. This is needed in cases where we receive "post-mortem" instrumentation for processes, for example when a process has been OOM-killed or has received a SIGKILL signal. When we receive such events we can no longer reliably look up the namespace PID from `/proc`, so we need to make sure we cache this information.

The `ProcCache` is an expiring cache: it holds on to process info (right now we only cache the namespace PID) for some amount of time, configurable by `evict_at`, after which the cache entries are evicted. Evictions happen when the cache is queried, at most once every eviction interval specified by `evict_every`. The `ProcCache` is hooked up to two `PerfEventBroadcast` streams:
- `ForkedProcess` events, by hooking an eBPF tracepoint probe to `sched/sched_process_fork`
- `ProcessExit` events, by hooking an eBPF kprobe to `taskstats_exit`
The first stream triggers the creation of new entries in the cache, the second marks cache entries for eviction.
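The lifecycle described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the field layout, the `on_fork`/`on_exit`/`get` method names, and passing `now` explicitly are assumptions made for the example. It also assumes the retention period (`evict_at`) is counted from the moment the exit event is seen, which the description does not spell out.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical cache entry: the namespace PID, plus the time the
/// process was marked for eviction (None while it is still alive).
struct Entry {
    nspid: i32,
    exited_at: Option<Instant>,
}

/// Illustrative expiring cache keyed by host PID.
pub struct ProcCache {
    entries: HashMap<i32, Entry>,
    /// How long an exited process stays in the cache (assumed semantics).
    evict_at: Duration,
    /// Minimum interval between two eviction passes.
    evict_every: Duration,
    last_eviction: Instant,
}

impl ProcCache {
    pub fn new(evict_at: Duration, evict_every: Duration, now: Instant) -> Self {
        Self { entries: HashMap::new(), evict_at, evict_every, last_eviction: now }
    }

    /// Would be driven by `ForkedProcess` events: creates a new entry.
    pub fn on_fork(&mut self, host_pid: i32, nspid: i32) {
        self.entries.insert(host_pid, Entry { nspid, exited_at: None });
    }

    /// Would be driven by `ProcessExit` events: marks the entry for eviction.
    pub fn on_exit(&mut self, host_pid: i32, now: Instant) {
        if let Some(entry) = self.entries.get_mut(&host_pid) {
            entry.exited_at = Some(now);
        }
    }

    /// Looks up the namespace PID; eviction piggybacks on queries.
    pub fn get(&mut self, host_pid: i32, now: Instant) -> Option<i32> {
        self.evict_if_due(now);
        self.entries.get(&host_pid).map(|entry| entry.nspid)
    }

    /// Runs at most once per `evict_every` and drops entries whose
    /// process exited at least `evict_at` ago.
    fn evict_if_due(&mut self, now: Instant) {
        if now.duration_since(self.last_eviction) < self.evict_every {
            return;
        }
        self.last_eviction = now;
        let evict_at = self.evict_at;
        self.entries.retain(|_, entry| match entry.exited_at {
            Some(exited) => now.duration_since(exited) < evict_at,
            None => true,
        });
    }
}
```

Passing `now` in explicitly keeps the sketch deterministic to test. With this shape, a lookup shortly after an exit event still resolves the namespace PID, which is the "post-mortem" case the cache exists for.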
To avoid cluttering the `ObserveService` with filtering and mapping logic in every endpoint, a new struct, `ObservedEventStream`, has been introduced. This is a wrapper around `PerfEventBroadcast<T>` that has access to the `CgroupCache` and `ProcCache` and therefore supports filtering by workload (via the cgroup path) and mapping of host PIDs to namespace PIDs. As long as a `PerfEventBroadcast<T>` wraps a `T` that implements `HasCgroupId` and `HasHostPid`, it can be wrapped by an `ObservedEventStream` and exposed by the observe API. Not every `PerfEventBroadcast<T>` can or needs to be exposed by the observe API, hence the distinction. Some events will only ever be used to manage internal Aurae state, and that is perfectly fine.
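To make the trait-bound idea concrete, here is a simplified, synchronous sketch. The trait names `HasCgroupId` and `HasHostPid` come from this PR, but everything else is assumed for illustration: the method signatures, the `Signal` event type, the use of a plain iterator in place of `PerfEventBroadcast<T>`, and the two `HashMap`s standing in for `CgroupCache` and `ProcCache`. Falling back to the host PID when no mapping is cached is also an assumption, as is matching the workload by path prefix.

```rust
use std::collections::HashMap;

pub trait HasCgroupId {
    fn cgroup_id(&self) -> u64;
}

pub trait HasHostPid {
    fn host_pid(&self) -> i32;
}

/// Hypothetical event type, used only for this example.
#[derive(Debug, Clone, PartialEq)]
pub struct Signal {
    pub cgroup_id: u64,
    pub pid: i32,
    pub signum: i32,
}

impl HasCgroupId for Signal {
    fn cgroup_id(&self) -> u64 { self.cgroup_id }
}

impl HasHostPid for Signal {
    fn host_pid(&self) -> i32 { self.pid }
}

/// Stand-in wrapper: yields each event together with its mapped PID.
pub struct ObservedEventStream<'a, I> {
    source: I,
    cgroup_paths: &'a HashMap<u64, String>, // stand-in for CgroupCache
    nspids: &'a HashMap<i32, i32>,          // stand-in for ProcCache
    workload: Option<&'a str>,
}

impl<'a, I> ObservedEventStream<'a, I> {
    pub fn new(
        source: I,
        cgroup_paths: &'a HashMap<u64, String>,
        nspids: &'a HashMap<i32, i32>,
    ) -> Self {
        Self { source, cgroup_paths, nspids, workload: None }
    }

    /// Keep only events whose cgroup path starts with `prefix`.
    pub fn filter_by_workload(mut self, prefix: &'a str) -> Self {
        self.workload = Some(prefix);
        self
    }
}

impl<'a, I> Iterator for ObservedEventStream<'a, I>
where
    I: Iterator,
    I::Item: HasCgroupId + HasHostPid,
{
    type Item = (I::Item, i32);

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            let event = self.source.next()?;
            if let Some(prefix) = self.workload {
                let in_workload = self
                    .cgroup_paths
                    .get(&event.cgroup_id())
                    .map_or(false, |path| path.starts_with(prefix));
                if !in_workload {
                    continue; // event belongs to another workload
                }
            }
            // Assumed fallback: report the host PID if no mapping is cached.
            let host_pid = event.host_pid();
            let pid = self.nspids.get(&host_pid).copied().unwrap_or(host_pid);
            return Some((event, pid));
        }
    }
}
```

Because the filtering and mapping only depend on the two traits, any event type that implements both can be passed through the same wrapper, while event types used purely for internal state never need to implement them.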