elastic / apm

Elastic Application Performance Monitoring - resources and general issue tracking for Elastic APM.
https://www.elastic.co/apm
Apache License 2.0

cgroups v2 container.id discovery #523

Open graphaelli opened 2 years ago

graphaelli commented 2 years ago

Is your feature request related to a problem? Please describe.

cgroups v2 is seeing increasing adoption as various distributions make it the default for containers. As noted in https://github.com/elastic/beats/issues/16958, Fedora 31 (late 2019) enables it by default, as does Ubuntu 21.10.

The current spec covers only cgroups v1; this issue is a feature request for v2 support.

Describe the solution you'd like

When running applications on systems with cgroups v2 enabled, for example on Docker, container.id should be filled in for events produced by APM agents.

Additional context

The current metrics spec touches on collecting cgroups v2 metrics without specific guidance on how to identify the cgroup itself; that should be updated as well. The Java and Python agents may provide insight into the updates required, such as consulting /proc/self/mountinfo instead of /proc/self/cgroup when cgroups v2 is detected.
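As a rough illustration of the detection step mentioned above (a minimal sketch, not taken from any agent): under pure cgroups v2, every entry in /proc/self/cgroup has the form "0::<path>", while v1 entries name a hierarchy ID and controller list. The function names and sample strings below are hypothetical and exist only for illustration.

```python
def is_cgroup_v2_only(proc_self_cgroup: str) -> bool:
    """Return True when every entry uses the cgroups v2 form "0::<path>"."""
    lines = [ln for ln in proc_self_cgroup.splitlines() if ln.strip()]
    # An empty file tells us nothing, so treat it as "not v2".
    return bool(lines) and all(ln.startswith("0::") for ln in lines)


def choose_source(proc_self_cgroup: str) -> str:
    """Pick which file to parse for the container ID (hypothetical helper)."""
    if is_cgroup_v2_only(proc_self_cgroup):
        return "/proc/self/mountinfo"
    return "/proc/self/cgroup"


# Illustrative contents only; the 64-character ID is made up.
V1_SAMPLE = "12:devices:/docker/" + "a" * 64 + "\n11:memory:/docker/" + "a" * 64 + "\n"
V2_SAMPLE = "0::/\n"

if __name__ == "__main__":
    print(choose_source(V1_SAMPLE))
    print(choose_source(V2_SAMPLE))
```

On a hybrid host that lists both v1 controllers and a "0::" entry, this sketch keeps using /proc/self/cgroup; whether that is the right choice for the spec is an open question here.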

trentm commented 2 years ago

https://stackoverflow.com/questions/68816329/how-to-get-docker-container-id-from-within-the-container-with-cgroup-v2 discusses using upperdir=(.+?) in an entry in /proc/self/mountinfo. That may be limited (my vague, perhaps obsolete, recollection from earlier Docker days is that OverlayFS wasn't always the storage driver). It also provides an ID that is different from Docker's container ID.

https://github.com/iovisor/bcc/issues/1119 discusses how there isn't a kernel concept of container ID, so this likely comes down to heuristics specific to each container runtime (docker, k8s, podman, systemd, etc.) ... or just being out of luck if nothing is exposed inside the container.

Gil, you mentioned perhaps having assist from a host-local APM server.

What breaks when a container.id is missing? Can hostname be a (poor) fallback?

graphaelli commented 2 years ago

> assist from a host-local APM server.

Good point, I'm not sure how that would work but it is worth considering if, as you wrote, the id is not reliably discoverable from within the container.

> What breaks when a container.id is missing?

Workflows that pivot on that data are impacted. For example, application service logs are either not shown or scoped only to the host/node level, which may (likely!) be running various unrelated containers. That is sometimes useful, but usually you want to start at the container and zoom out to that level if needed. That's a really simple example, but I hope it demonstrates the type of issue that missing this information causes.

graphaelli commented 2 years ago

Reminded me this is still a problem

graphaelli commented 2 years ago

One workaround for those coming across this issue is to start containers with --cgroupns=host. I've confirmed container.id is picked up under cgroups v2 with Docker using that option. The option is not available via Docker Compose yet; that is tracked in https://github.com/compose-spec/compose-spec/issues/148
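To illustrate why that workaround helps, here is a minimal sketch (not any agent's actual code) of parsing a cgroups v2 entry from /proc/self/cgroup. With a private cgroup namespace the path is typically just "/", so there is nothing to extract; with the host namespace the path is expected to include the container ID. The sample paths are assumptions about common Docker layouts (systemd and cgroupfs cgroup drivers), and the ID and function name are made up.

```python
import re
from typing import Optional

# Docker container IDs are 64 lowercase hex characters.
_CONTAINER_ID = re.compile(r"([0-9a-f]{64})")


def container_id_from_cgroup_line(line: str) -> Optional[str]:
    """Extract a 64-hex container ID from one /proc/self/cgroup entry."""
    # Each entry is "hierarchy-ID:controller-list:path".
    parts = line.strip().split(":", 2)
    if len(parts) != 3:
        return None
    match = _CONTAINER_ID.search(parts[2])
    return match.group(1) if match else None


# Illustrative values only (assumed layouts, fabricated ID).
FAKE_ID = "0123456789abcdef" * 4
PRIVATE_NS = "0::/"
HOST_NS_SYSTEMD = f"0::/system.slice/docker-{FAKE_ID}.scope"
HOST_NS_CGROUPFS = f"0::/docker/{FAKE_ID}"

if __name__ == "__main__":
    for sample in (PRIVATE_NS, HOST_NS_SYSTEMD, HOST_NS_CGROUPFS):
        print(sample, "->", container_id_from_cgroup_line(sample))
```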

Nacoma commented 1 year ago

This is reportedly an issue with at least three of the APM agents so far, with two of the three waiting for a decision in this thread before taking any action.

> What breaks when a container.id is missing?

trentm commented 1 year ago

The current state of the art (StackOverflow, Jenkins, OpenTelemetry JS) seems to be to read and parse /proc/self/mountinfo for the container ID -- as I saw back in Oct 2021.

https://github.com/opencontainers/runtime-spec/issues/1105 seems to be the issue to follow for a possible future standardized mechanism for this. Until then, we should update our spec to fall back to parsing /proc/self/mountinfo.
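A minimal sketch of what such a fallback could look like, assuming Docker's habit of bind-mounting /etc/hostname, /etc/hosts, and /etc/resolv.conf from a host directory named after the container ID. This is a Docker-specific heuristic, not the spec and not the code of any of the projects mentioned above; the function name and sample lines are hypothetical, and it makes no attempt to find a pod ID.

```python
import re
from typing import Optional

_HEX64 = re.compile(r"[0-9a-f]{64}")
# Files that Docker is assumed to bind-mount from the container's host directory.
_MOUNT_POINTS = ("/etc/hostname", "/etc/hosts", "/etc/resolv.conf")


def container_id_from_mountinfo(mountinfo: str) -> Optional[str]:
    """Return the first 64-hex ID found in the root of a known bind mount."""
    for line in mountinfo.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        # In /proc/<pid>/mountinfo, field 4 is the mount root and field 5 the mount point.
        root, mount_point = fields[3], fields[4]
        if mount_point in _MOUNT_POINTS:
            match = _HEX64.search(root)
            if match:
                return match.group(0)
    return None


# Illustrative content only; the ID, device numbers, and paths are made up.
FAKE_ID = "fedcba9876543210" * 4
SAMPLE = (
    "1419 1300 0:60 / / rw,relatime - overlay overlay rw\n"
    f"1437 1419 8:1 /var/lib/docker/containers/{FAKE_ID}/hostname "
    "/etc/hostname rw,relatime - ext4 /dev/sda1 rw\n"
)

if __name__ == "__main__":
    print(container_id_from_mountinfo(SAMPLE))
```

Other runtimes (podman, containerd under Kubernetes, and so on) may lay out these mounts differently, so a real implementation would likely need more than one pattern.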

SylvainJuge commented 1 year ago

On the OpenTelemetry Java side, cgroups v2 container ID discovery is currently implemented by parsing /proc/self/mountinfo.

There is no mention of pod ID however.