CrunchyData / pgnodemx

A PostgreSQL extension that provides SQL functions to allow capture of node OS metrics via SQL queries.
Apache License 2.0

FATAL: pgnodemx: expected 1, got 0, lines from file /sys/fs/cgroup/user.slice/user-1000.slice/session-c3.scope/cgroup.controllers #26

Open df7cb opened 4 days ago

df7cb commented 4 days ago

In Debian's CI environment, the pgnodemx regression tests fail:

### PostgreSQL 17 installcheck ###
Creating new PostgreSQL cluster 17/regress ...
Error: /usr/lib/postgresql/17/bin/pg_ctl /usr/lib/postgresql/17/bin/pg_ctl start -D /tmp/pg_virtualenv.eo7Un1/data/17/regress -l /tmp/pg_virtualenv.eo7Un1/log/postgresql-17-regress.log -c -s -o  -c config_file="/tmp/pg_virtualenv.eo7Un1/postgresql/17/regress/postgresql.conf"  exited with status 1: 
2024-09-14 18:00:33.606 UTC [3233] FATAL:  pgnodemx: expected 1, got 0, lines from file /sys/fs/cgroup/user.slice/user-1000.slice/session-c3.scope/cgroup.controllers
2024-09-14 18:00:33.606 UTC [3233] LOG:  database system is shut down
pg_ctl: could not start server

https://salsa.debian.org/postgresql/pgnodemx/-/jobs/6281060

Perhaps that problem should be a WARNING instead of preventing startup?

keithf4 commented 2 days ago

Is this only happening on 17?

gregscds commented 2 days ago

The most likely situation I've found where this file could be empty is a Docker environment set up as rootless. That seems more likely to be what changed recently than Keith's guess that it's PG17. pgnodemx is expecting a line like this:

cpu memory pids

And in a rootless environment, not only can you sometimes not monitor those things, you cannot necessarily even see what could be monitored. Our code expects the set of things to monitor to vary, but no one considered an environment so locked down that the list was empty.
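To make the failure mode concrete, here is a minimal Python sketch of a more tolerant read of `cgroup.controllers`. It is not pgnodemx's actual code (the extension is written in C), and the function name `read_controllers` is made up for illustration. It assumes the cgroup v2 layout, where the file holds a single space-separated line, and it treats an empty or unreadable file as "no controllers visible" rather than a fatal condition.

```python
from pathlib import Path


def read_controllers(path):
    """Return the controller names listed in a cgroup.controllers file.

    Hypothetical helper: an empty or unreadable file yields an empty
    list instead of raising, so the caller decides how severe that is.
    """
    try:
        text = Path(path).read_text()
    except OSError:
        # Missing file or no permission: nothing is visible to us.
        return []
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        # The locked-down case from this issue: the file exists but is empty.
        return []
    # cgroup v2 lists all controllers on one space-separated line.
    return lines[0].split()
```

With a file containing `cpu memory pids` this returns the three names; with the empty file from the report it returns an empty list, which the caller can log at whatever level is appropriate.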

I'm not sure if just degrading this to a WARNING gets us to an ideal place. The whole point of pgnodemx is to collect data like this, so if there's nothing there to collect, there's nothing for the program to do. Poking around at what's happening and what some other projects do, there seem to be a few equally sensible options with good and bad implications:

We should probably provide a simple solution that doesn't punish Christoph for being the one to spot the problem; have our build/package group set up our own rootless test environment to do further development; and then do the work to document The Right Way for CI testing that packagers should adopt. And if that goes well, then maybe we start removing ways to bypass the testing.

(I hope I'm not wrong about the root cause altogether, because that would mean I just wasted a lot of typing)

df7cb commented 1 day ago

Thanks for the investigation.

In fact, the test never worked before: https://salsa.debian.org/postgresql/pgnodemx/-/pipelines https://salsa.debian.org/postgresql/pgnodemx/-/jobs/5836544 - that's with 16 and the same FATAL.

At the moment the problem isn't critical for me: the CI tests I care about are running on apt.postgresql.org, and there they work. The CI pipeline on salsa.debian.org is just a nice extra to have even more package-related checks running.

What made me write the issue is that it seems to prevent startup. Wouldn't it make more sense to throw an ERROR only when someone queries the stats? Not starting up could be a bad time bomb, perhaps people upgrade the kernel or some kernel settings change, and then on the next reboot or crash, the database suddenly doesn't start anymore.
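As a rough illustration of that suggestion, the Python sketch below logs a warning at startup and raises an error only when a stat is actually requested. It is a model of the control flow, not pgnodemx code; the class name `CgroupStats`, its `query` method, and the logger name are all hypothetical.

```python
import logging
from pathlib import Path

log = logging.getLogger("cgroup_sketch")  # hypothetical logger name


class CgroupStats:
    """Hypothetical wrapper: warn at startup, raise only at query time."""

    def __init__(self, controllers_path):
        self.controllers_path = Path(controllers_path)
        try:
            # An empty file splits to an empty list.
            self.controllers = self.controllers_path.read_text().split()
        except OSError:
            self.controllers = []
        if not self.controllers:
            # Startup continues; the problem is only reported.
            log.warning("no cgroup controllers visible in %s",
                        self.controllers_path)

    def query(self, controller):
        # The error surfaces here, for the caller who asked for stats.
        if controller not in self.controllers:
            raise RuntimeError(f"cgroup controller {controller!r} is not available")
        return controller
```

In this shape, constructing `CgroupStats` against an empty file succeeds, and only a later `query("memory")` fails.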

gregscds commented 7 hours ago

Additional context appreciated. Knowing we're not causing you a serious CI issue is a relief.

I think the question you're right to raise is: what about the person who starts their database without the stats there, and then someone fixes the problem by granting the right permissions? Shouldn't the stats then start to work? They might not even be able to manually restart their pgnodemx if it gave up and died.
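One way to get that recovery behavior is to re-read the controller list on each request instead of caching the answer from startup. The Python sketch below shows the idea only; `LazyCgroupStats` and its methods are hypothetical names, and whether re-probing on every call is cheap enough in practice is an open question for the design review.

```python
from pathlib import Path


class LazyCgroupStats:
    """Hypothetical wrapper that probes cgroup.controllers on every call."""

    def __init__(self, controllers_path):
        # No file I/O here, so construction cannot fail on an empty file.
        self.controllers_path = Path(controllers_path)

    def available(self):
        try:
            return self.controllers_path.read_text().split()
        except OSError:
            return []

    def query(self, controller):
        # Re-probe each time so a later permissions fix is picked up
        # without restarting anything.
        if controller not in self.available():
            raise RuntimeError(f"cgroup controller {controller!r} is not available")
        return controller
```

Here a query fails while the file is empty and starts succeeding as soon as the file lists the controller, with no restart in between.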

Since the implications of these rootless changes slipped by as something no one ever considered before, I think we need a little design review session that reconsiders error handling for a few of these use cases. Maybe even adjust our idle/sleeping behavior. Thanks for the input, we'll tag the issue when we do something about it.