kramik1 closed this issue 10 months ago
For memory.max, if no limit is configured when the container is started, the file just contains the literal string 'max'. So that warn might be fine to downgrade: we could catch it as an info for an unset value. memory.stat has been replaced with a large group of memory data points. For now, I am just going to collect memory.current. I am personally not worried about peak, because I am collecting this information in a time-series database, so I can derive that value whenever I want. I am not sure why cgroup.procs doesn't have a value yet.
I have been working on this for podman with cgroup v2. I removed the recursive directory functions, since v2 is much better at having a stable directory structure, and I put in checks for when containers are no longer running.

I am attempting not to include the metrics of stopped containers, but it seems that the intentions in the comments are not what is happening in the code. For example, on line 291 with cmt_gauge_set, I think the intention is to only ever have one gauge, but when a container is turned off, the stale information is continually output. It is an array of gauges, and it looks to me like cmt_gauge.c line 99 always returns a gauge, even if one is not already defined for the opts and map, because cmt_true is always set.

What is best practice with gauges and counters? Should they be created every interval, particularly if I am maintaining the state of running containers? If so, I should probably reorganize the structures so that the metrics are per container and append them at the end of every iteration. It also seems to me that a lot of the code uses the heap instead of the stack for temporary values.

The plugin relies on the containers.json list for the initial list of containers on the system, which is fine. Each iteration is a clean system check for which containers are currently running. Any feedback would be appreciated: I either need to delete specific gauges based on the running containers or reorganize the code.
This issue is stale because it has been open 90 days with no activity. Remove the stale label or comment, or this will be closed in 5 days. Maintainers can add the exempt-stale label.
This issue was closed because it has been stalled for 5 days with no activity.
Bug Report
The podman metrics input plugin does not read from cgroups correctly. It attempts to read values that are only in v1, like memory.peak. There is some discussion on this (https://stackoverflow.com/questions/66291245/is-it-possible-to-implement-max-usage-in-bytes-for-cgroup-v2), and it seems like they are adding that value back in, but the plugin also attempts to search for rss in memory.stat and to read memory.max. I am looking at alternatives for my usage (container usage on IoT platforms for telemetry reporting), but since these values are not exposed in the config I will need to modify the code, hopefully only a little bit. I will report back with what I end up changing.
To Reproduce

I am seeing this on both Ubuntu 22.04 and Yocto with kernel 5.15 and systemd compiled with unified cgroups.
-- From Ubuntu --
[2023/07/27 17:31:36] [ warn] [input:podman_metrics:podman_metrics.1] Failed to read /sys/fs/cgroup/machine.slice/libpod-1a3d966474de5fd42bca5401dc47217e2fb2889bd1f612ea2eb38ad0b624a775.scope/memory.peak
[2023/07/27 17:31:36] [ warn] [input:podman_metrics:podman_metrics.1] rss not found in /sys/fs/cgroup/machine.slice/libpod-1a3d966474de5fd42bca5401dc47217e2fb2889bd1f612ea2eb38ad0b624a775.scope/memory.stat
[2023/07/27 17:31:36] [ warn] [input:podman_metrics:podman_metrics.1] Failed to read a number from /sys/fs/cgroup/machine.slice/libpod-1a3d966474de5fd42bca5401dc47217e2fb2889bd1f612ea2eb38ad0b624a775.scope/memory.max
[2023/07/27 17:31:36] [ warn] [input:podman_metrics:podman_metrics.1] Failed to read a number from /sys/fs/cgroup/machine.slice/libpod-1a3d966474de5fd42bca5401dc47217e2fb2889bd1f612ea2eb38ad0b624a775.scope/cgroup.procs
[2023/07/27 17:31:36] [ warn] [input:podman_metrics:podman_metrics.1] Failed to read /sys/fs/cgroup/machine.slice/libpod-1a3d966474de5fd42bca5401dc47217e2fb2889bd1f612ea2eb38ad0b624a775.scope/containers/cgroup.procs
-- From Yocto --
Yocto shows the same warnings for memory.peak, memory.stat, and memory.max.