Make /proc/pid/io lookup failures skippable

fearful-symmetry commented 4 months ago

What does this PR do?

This fixes a change in the behavior of the system metrics where fetching process metrics on a container without SYS_PTRACE results in a failure, as /proc/pid/io requires SYS_PTRACE.

This flies under the radar in a handful of use cases, as our docs mention SYS_PTRACE in other contexts, and many users monitoring system data will (in my experience) end up passing --privileged due to a variety of issues with permissions in proc. This issue also doesn't appear outside of docker.

There is a test already that does cover this (although it'll catch it for a permissions error, not a container capability error): TestFetchProcessFromOtherUser, however, that test is usually skipped in CI, as the CI environment lacks more than one user. The exact test condition itself can't really be tested without running in a container.

If we want to test this more thoroughly, we'll need some kind of integration test in beats that runs in a container and specifically looks out for skipped processes. These kinds of tests are often flaky, so they tend not to happen. Once we get buildkite set up in the beats repo, we should add a test that specifically looks for skipped metrics, probably by spinning up multiple sub-processes and having beats monitor them directly.

Why is it important?

This changes the behavior of system metrics inside a container.

Checklist

[x] My code follows the style guidelines of this project
[x] I have commented my code, particularly in hard-to-understand areas
[ ] I have added tests that prove my fix is effective or that my feature works
[ ] I have added an entry in CHANGELOG.md

cmacknz commented 4 months ago

If we want to test this more thoroughly, we'll need some kind of integration test in beats that runs in a container and specifically looks out for skipped processes

The buildkite agent is itself a container isn't it? Can it run an ubuntu container for us here? Can we just test this directly in CI in this repository?

fearful-symmetry commented 4 months ago

@cmacknz so, to clarify, we need to run it in a container where we're trying to monitor the host OS. In order to test this specific issue in a non-flaky way, we'd need to fetch a running process from outside the container, and attempt to read that. Not sure how possible that is with our current CI configs. The next-best way is to just monitor a process that doesn't belong to the same user the running beat instance, which is (slightly) easier in theory, but a lot of the CI environments are isolated enough that we can only see PIDs from the currently running test user.

fearful-symmetry commented 4 months ago

In the past we've occasionally run into adjacent problems testing the intersection of cgroups+various container configs. It's not enough to run the test in a container, the test needs to have control of the container orchestration, so it can mount in host file systems, alter container security options, have knowledge of the host OS, etc.

cmacknz commented 4 months ago

@cmacknz so, to clarify, we need to run it in a container where we're trying to monitor the host OS. In order to test this specific issue in a non-flaky way, we'd need to fetch a running process from outside the container, and attempt to read that. Not sure how possible that is with our current CI configs.

CI does not want you to escape the container, so not very possible. At least not without a dedicated CI running specifically for this purpose.

The next-best way is to just monitor a process that doesn't belong to the same user the running beat instance, which is (slightly) easier in theory, but a lot of the CI environments are isolated enough that we can only see PIDs from the currently running test user.

Can you try this and confirm?

It's not enough to run the test in a container, the test needs to have control of the container orchestration, so it can mount in host file systems, alter container security options, have knowledge of the host OS, etc.

Is there anything we can do with https://pkg.go.dev/io/fs to mock out the filesystem conditions we want to observe? Or by just capturing representative content from /proc and point the code at that?

Do we really need live running processes for this? Or can we create a reference instance of the part of the /proc tree we care about instead?

I am wary of declaring system metrics as untestable considering our recent experience with changes in here have unintended severe consequences.

fearful-symmetry commented 4 months ago

@cmacknz

Can you try this and confirm?

Already done in a previous PR, that's why the TestFetchProcessFromOtherUser has a Skip condition, since it would fail in a PR, as we can't see any processes from other users. I assume that getting this to work would require some more sophisticated setup process; we'd need to either create or run some helper process as a different user (assuming the container environment lets us) and then test against that helper process.

Or by just capturing representative content from /proc and point the code at that?

So, it depends on what we're trying to test. If we just want to test the more basic condition of "how does this library behave when it runs into some kind of permission failure" we can test that, but the more nefarious set of errors here are "does metricbeat behave as expected when monitoring the host system from inside a container" that's a little harder. This isn't a normal file permission error, as the man page explains, "Permission to access this file is governed by a ptrace access mode PTRACE_MODE_READ_FSCREDS check". This is really only relevant to docker, where it runs containers in a more locked-down permissions context, which means that a container needs --privileged or --cap-add sys_ptrace in addition to root. I don't think there's a good way to really emulate this exact behavior?

I do believe we should have some kind of CI or testing platform that's a little more low-level. In addition to issues like this, where it would help to spin up our own containers, there's a lot of edge cases that would benefit from a test matrix that allows for more fine-grained control over the specific OS. But that's a separate thing.

I'm gonna poke around and see if the test environment lets me make a user and start a process as it.

cmacknz commented 4 months ago

This is really only relevant to docker, where it runs containers in a more locked-down permissions context, which means that a container needs --privileged or --cap-add sys_ptrace in addition to root. I don't think there's a good way to really emulate this exact behavior?

The buildkite docker plugin supports --privileged for build steps: https://github.com/buildkite-plugins/docker-buildkite-plugin?tab=readme-ov-file#privileged-optional-boolean

We should be able to try creating a build step with the docker plugin, and if it doesn't work immediately I would ask eng prod if it can be setup.

It should allow privileged access to the underlying buildkite agent, which is what you'd have if you just ran on it directly, which seems reasonable.

leehinman commented 4 months ago

I don't think there's a good way to really emulate this exact behavior?

Could we maybe use bpf_override_return in ebpf to force a return value to test it? Not sure if we can use ebpf in CI though.

fearful-symmetry commented 4 months ago

I don't think there's a good way to really emulate this exact behavior?

Could we maybe use bpf_override_return in ebpf to force a return value to test it? Not sure if we can use ebpf in CI though.

Oh, that's a fun idea, I like that. No idea how practical it is.

Much to my surprise, the CI environment lets me make a new user. I've started trying to build out some kind of test that'll allow us to run the system code against a process that doesn't belong to the current user.

fearful-symmetry commented 4 months ago

So, it's still kind of a hack, but I've come up with a test that checks the base scenario: can we read from a process where we don't have sufficient without failing? This is a broader test than what actually spurred this issue, but I'd argue this is something of a positive, as it means we're testing for changes that might cause issues when beats is run as non-root.

elastic / elastic-agent-system-metrics