Open df7cb opened 4 days ago
Is this only happening on 17?
The most likely situation I've found where this file could be empty is if a Docker environment is setup as Rootless, which seems more likely to be a thing that could have changed most recently rather than Keith's guess that it's PG17. pgnodemx is expecting a line like this:
cpu memory pids
And in a rootless environment, not only can you sometimes not monitor those things, you can not even necessarily see what could be monitored. Our code expects the things to monitor to vary, but no one considered an environment so locked down the list was empty.
I'm not sure if just degrading this to a WARNING gets us to an ideal place. The whole point of pgnodemx is to collect data like this, so if there's nothing there to collect, there's nothing for the program to do. Poking around at what's happening and what some other projects do, there seem a few equally sensible options with good and bad implications:
Drop the error to a WARNING, just don't really do anything, and trust the program will be reaped by the container cleanup. That will let cases like the heavily locked down Debian CI environment validate the build works and no one has to do anything else. But CI systems will have to note the warning to even begin to realize that nothing was actually tested, and those are not programs whose output is typically reviewed to that level.
Exit with some comment when this happens but don't throw an error code, just point out the file is empty. That will get types of some CI tests to pass, others will note the program exited immediately and at least realize something is off. Like just warning, no one will still actually be testing anything, but that the program built and only started momentarily should flag more people to notice the problem exists via the quick exit.
Disable this part of CI testing for the module to explicitly document there's an issue. That will require everyone doing builds to individually do testing adjustments to their respective build scripts, but at least no one will be surprised about whether testing does or doesn't happen. For example, in the similar situation at https://github.com/containers/podman/pull/11935 I see "we have disabled this test in the CI because it is difficult to know what controllers are going to be enabled for rootless under all conditions we test"; that rings true. I don't think it's necessarily in anyone's best interests to force that level of policy today, on everyone, just to package our relatively simple program.
Document the problem and at least one simple solution, which seems to be delegating enough rights to the environment that pgnodemx finds one entry it can check it out; see https://rootlesscontaine.rs/getting-started/common/cgroup2/#enabling-cpu-cpuset-and-io-delegation for example. This will heavily push packagers toward a real test run of the program. That seems to me an even more inappropriately burdensome approach for us to dictate at this point in time.
We should probably provide a simple solution that doesn't punish Christoph for being the once to spot the problem; have our build/package group setup our own Rootless test environment to do further development; and then do the work to document The Right Way for CI testing that packagers should adopt. And if that goes well, then maybe we start removing ways to bypass the testing.
(I hope I'm not wrong about the root cause altogether, because that would mean I just wasted a lot of typing)
Thanks for the investigation.
In fact, the test never worked before: https://salsa.debian.org/postgresql/pgnodemx/-/pipelines https://salsa.debian.org/postgresql/pgnodemx/-/jobs/5836544 - that's with 16 and the same FATAL.
At the moment the problem isn't critical for me, the CI tests I care about are running on apt.postgresql.org, and there they work. The CI pipeline on salsa.debian.org is just a nice extra to have even more package-related checks running.
What made me write the issue is that it seems to prevent startup. Wouldn't it make more sense to throw an ERROR only when someone queries the stats? Not starting up could be a bad time bomb, perhaps people upgrade the kernel or some kernel settings change, and then on the next reboot or crash, the database suddenly doesn't start anymore.
Additional context appreciated. Knowing we're not causing you a serious CI issue is a relief.
I think the question you're right to raise is what about the person who starts their database without the stats there, then someone fixes the problem by granting the right permissions. Shouldn't the stats then start to work? They might not even be able to manually restart their pgnodemx if it gave up and died.
Since the implications of these rootless changes slipped by as something no one ever considered before, I think we need a little design review session that reconsiders error handling for a few of these use cases. Maybe even adjust our idle/sleeping behavior. Thanks for the input, we'll tag the issue when we do something about it.
In Debian's CI environment, the pgnodemx regression tests fail:
https://salsa.debian.org/postgresql/pgnodemx/-/jobs/6281060
Perhaps that problem should be a WARNING instead of preventing startup?