bucardo / check_postgres

Nagios check_postgres plugin for checking status of PostgreSQL databases
http://bucardo.org/wiki/Check_postgres

NRPE: Unable to read output #173

Closed Eloar closed 1 year ago

Eloar commented 4 years ago

I have two machines: one running PostgreSQL 10 (DB), and one running other services including Nagios Core (let's just call it Nagios). I've installed NRPE on the DB machine alongside some plugins. The commands configured in nrpe.cfg look like this:

# Misc

command[check_users]=/usr/lib64/nagios/plugins/check_users -w 5 -c 10
command[check_load]=/usr/lib64/nagios/plugins/check_load -r -w 1.75,1.50,1.00 -c 2.00,1.75,1.50
command[check_hdd]=/usr/lib64/nagios/plugins/check_disk -w 20% -c 10% -p /dev/vda
command[check_zombie_procs]=/usr/lib64/nagios/plugins/check_procs -w 5 -c 10 -s Z
command[check_total_procs]=/usr/lib64/nagios/plugins/check_procs -w 150 -c 200

# PostgreSQL

command[check_postgres_locks]=/usr/lib64/nagios/plugins/check_postgres_locks -w 2 -c 3
command[check_postgres_bloat]=/usr/lib64/nagios/plugins/check_postgres_bloat -w='100 M' -c='200 M'
command[check_postgres_connection]=/usr/lib64/nagios/plugins/check_postgres_connection --db=postgres
command[check_postgres_backends]=/usr/lib64/nagios/plugins/check_postgres_backends -w=70 -c=100

From the Nagios machine I can run any check from the Misc section, but none from the PostgreSQL section. Those fail with the error:

NRPE: Unable to read output

I installed nagios-plugin-nrpe on the DB machine to rule out a network problem, and even on localhost I can't get a valid response. Example:

$ /usr/lib64/nagios/plugins/check_nrpe -H localhost -c check_postgres_connection
NRPE: Unable to read output

Run directly, without NRPE, it works fine:

$ /usr/lib64/nagios/plugins/check_postgres_connection --db=postgres
POSTGRES_CONNECTION OK: DB "postgres" version 10.3 | time=0.01s

and other checks work just fine over NRPE:

$ /usr/lib64/nagios/plugins/check_nrpe -H 127.0.0.1 -c check_load
OK - load average per CPU: 0.01, 0.07, 0.07|load1=0.005;1.750;2.000;0; load5=0.075;1.500;1.750;0; load15=0.065;1.000;1.500;0;

I previously had connection timeouts when port 5666 was not allowed in iptables, but that is not the case this time.

NRPE is configured to run as user nrpe and group nagios. Permissions on all files (symlinks included) in /usr/lib64/nagios/plugins are set to 755, with ownership root:nagios. I've installed check_postgres version 2.23.0.

Any ideas how to approach this problem? I stumbled across this plugin because of this tutorial.

ubellavance commented 4 years ago

Have you created the nrpe user in your Postgresql cluster? Also, did you configure pg_hba.conf?

Eloar commented 4 years ago

The nrpe and nagios users and groups are created when NRPE is installed using yum. pg_hba.conf was modified; without that, the check would not work locally. The check works fine, but only when run locally on the DB machine. It doesn't work when invoked over NRPE. Other checks work fine, both locally and over the network from the Nagios machine.

ubellavance commented 4 years ago

You are talking about the OS user and group. I'm talking about the postgresql user. Can you show me your pg_hba.conf file?

ubellavance commented 4 years ago

I am pretty confident I can help you, I did that before going on vacation.

Eloar commented 4 years ago

It should not matter, but it gave me a clue, so I added the user parameter -u postgres to each command. Unfortunately the result is the same:

NRPE: Unable to read output

The most important part of pg_hba.conf is:

local    all    all    trust

ubellavance commented 4 years ago

I'm saying this off the top of my head, as I'm still on vacation, but I think you can try -h localhost.

Eloar commented 4 years ago

-h is the option for help; -H is for host. With no host given, the connection goes over the Unix socket, which is equivalent to running psql without a host. I don't want to open access for the postgres user over anything other than the Unix socket.

ubellavance commented 4 years ago

Makes sense, sorry. I'll try to VPN into the office in the next few days to give you details about my setup. Have you tried running the command directly from the shell as nrpe user? You may have to temporarily enable shell login by changing the shell for the nrpe user in /etc/passwd

Eloar commented 4 years ago

Yes, I just tried it again and it works: running the check_postgres action directly as the nrpe user does work.

# sudo -u nrpe -- /usr/lib64/nagios/plugins/check_postgres_connection --db=postgres
POSTGRES_CONNECTION OK: DB "postgres" version 10.3 | time=0.01s

ubellavance commented 4 years ago

Here is how I configured my nrpe actions:

command[check_postgres_action]=/usr/bin/check_postgres.pl -u nrpe --action=$ARG1$
command[check_postgres_action_db]=/usr/bin/check_postgres.pl -u nrpe --action=$ARG1$ --db=$ARG2$
command[check_postgres_action_warning_critical]=/usr/bin/check_postgres.pl -u nrpe --action=$ARG1$ -w=$ARG2$ -c=$ARG3$ $ARG4$ $ARG5$

And I call them this way from Nagios, for example:

/usr/lib/nagios/plugins/check_nrpe -H atqatld1 -c check_postgres_action -a backends

I have an nrpe user created on the postgres cluster and here is the relevant line of my pg_hba.conf file:

local   all     nrpe            peer

It is somewhat the equivalent of your trust but in your case I think that you allow any OS user to access any database in the PostgreSQL cluster, which is a bit risky.

Did you check for SELinux AVCs? What OS are you running? I had to create an SELinux module to make it work on RHEL 8.

Eloar commented 4 years ago

My pg_hba.conf line allows any local user to connect as any DB user. It is a bit risky, but not that much. I'm afraid the nrpe user on Postgres will need to have its access elevated.

I'm using CentOS 7, with SELinux enabled, but it is configured ok, as other checks work fine both local and over nrpe.

EDIT: I've modified my environment configuration to be similar to yours.

  1. added role nrpe: CREATE ROLE nrpe WITH LOGIN;
  2. modified pg_hba.conf: local all all peer
  3. removed user option from nrpe.cfg
    command[check_postgres_locks]=/usr/lib64/nagios/plugins/check_postgres_locks -w 2 -c 3
    command[check_postgres_bloat]=/usr/lib64/nagios/plugins/check_postgres_bloat -w='100 M' -c='200 M'
    command[check_postgres_connection]=/usr/lib64/nagios/plugins/check_postgres_connection --db=postgres
    command[check_postgres_backends]=/usr/lib64/nagios/plugins/check_postgres_backends
  4. tried locally
    # sudo -u nrpe -- /usr/lib64/nagios/plugins/check_postgres_connection -u nrpe --db=postgres
    POSTGRES_CONNECTION OK: DB "postgres" version 10.3 | time=0.04s
  5. tried over nrpe (locally)
    # /usr/lib64/nagios/plugins/check_nrpe -H 127.0.0.1 -c check_postgres_connection
    NRPE: Unable to read output

    Unfortunately the problem is not resolved.

ubellavance commented 4 years ago

I think you need the -u option. And I doubt SELinux's default configuration works for those checks. Do you log connections in your PostgreSQL logs? Do you see connection attempts in the logs when executing via NRPE?

Eloar commented 4 years ago

I thought SELinux was a good clue, so I checked the file contexts in /usr/lib64/nagios/plugins. Plugins installed with yum had the context system_u:object_r:nagios_unconfined_plugin_exec_t:s0, whereas check_postgres.pl and its symlinks had the context unconfined_u:object_r:lib_t:s0. Unfortunately, changing it to nagios_unconfined_plugin_exec_t did not resolve the issue.
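
The mismatch described here is in the type field, the third colon-separated part of an SELinux context. A minimal sketch for comparing that field, assuming the context strings are pasted in by hand; `context_type` is a hypothetical helper, not part of any SELinux tool:

```shell
# Hypothetical helper: print the type field (third colon-separated part)
# of an SELinux context string of the form user:role:type:level.
context_type() {
    printf '%s\n' "$1" | cut -d: -f3
}

# Context strings quoted from the comment above.
packaged='system_u:object_r:nagios_unconfined_plugin_exec_t:s0'
manual='unconfined_u:object_r:lib_t:s0'

if [ "$(context_type "$packaged")" = "$(context_type "$manual")" ]; then
    echo "types match"
else
    echo "types differ: $(context_type "$packaged") vs $(context_type "$manual")"
fi
```

On a live system the strings would typically come from `ls -Z`, and `chcon -t` or `restorecon` would be the usual ways to change or reset the type.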

ubellavance commented 4 years ago

Did you check if you had any avc entries in /var/log/audit/audit.log?

Eloar commented 4 years ago

I'm back on it after the weekend. Unfortunately I found nothing related to nrpe, nagios or postgres in /var/log/audit/audit.log.

ubellavance commented 4 years ago

sudo -i
/path/to/check_postgres args

Eloar commented 4 years ago

I don't use sudo for NRPE execution. I only used sudo to check whether check_postgres would run properly with the nrpe user's permissions. The NRPE process runs as the nrpe user.

I found this in syslog:

sie 10 14:07:05 dev-db nrpe[29748]: CONN_CHECK_PEER: checking if host is allowed: DEV-NAGIOS port 41682
sie 10 14:07:05 dev-db nrpe[29748]: Connection from DEV-NAGIOS port 41682
sie 10 14:07:05 dev-db nrpe[29748]: is_an_allowed_host (AF_INET): is host >DEV-NAGIOS< an allowed host >DEV-NAGIOS<
sie 10 14:07:05 dev-db nrpe[29748]: is_an_allowed_host (AF_INET): is host >DEV-NAGIOS< an allowed host >DEV-NAGIOS<
sie 10 14:07:05 dev-db nrpe[29748]: is_an_allowed_host (AF_INET): host is in allowed host list!
sie 10 14:07:05 dev-db nrpe[29748]: Host address is in allowed_hosts
sie 10 14:07:05 dev-db nrpe[29748]: Host DEV-NAGIOS is asking for command 'check_postgres_backends' to be run...
sie 10 14:07:05 dev-db nrpe[29748]: Running command: /usr/lib64/nagios/plugins/check_postgres_backends --user=nrpe -w=70 -c=100
sie 10 14:07:05 dev-db nrpe[29749]: WARNING: my_system() seteuid(0): Operation not permitted
sie 10 14:07:05 dev-db nrpe[29749]: Warning: Could not set effective GID=999
sie 10 14:07:05 dev-db nrpe[29748]: Command completed with return code 3 and output:
sie 10 14:07:05 dev-db nrpe[29748]: Return Code: 3, Output: NRPE: Unable to read output
sie 10 14:07:05 dev-db nrpe[29748]: Connection from DEV-NAGIOS closed.

ubellavance commented 4 years ago

Ok, you have hints now. What about the first point of my last comment?

Eloar commented 4 years ago

About the first hint: I've enabled logging of connections and disconnections in PostgreSQL. I see a connection when running check_postgres.pl --action connection directly, but not when trying to run it over NRPE. In journalctl, this is what I get for a check that works:

sie 10 16:38:32 dev-db nrpe[4393]: CONN_CHECK_PEER: checking if host is allowed: DEV-NAGIOS port 8919
sie 10 16:38:32 dev-db nrpe[4393]: Connection from DEV-NAGIOS port 8919
sie 10 16:38:32 dev-db nrpe[4393]: is_an_allowed_host (AF_INET): is host >DEV-NAGIOS< an allowed host >DEV-NAGIOS<
sie 10 16:38:32 dev-db nrpe[4393]: is_an_allowed_host (AF_INET): is host >DEV-NAGIOS< an allowed host >DEV-NAGIOS<
sie 10 16:38:32 dev-db nrpe[4393]: is_an_allowed_host (AF_INET): host is in allowed host list!
sie 10 16:38:32 dev-db nrpe[4393]: Host address is in allowed_hosts
sie 10 16:38:32 dev-db nrpe[4393]: Host DEV-NAGIOS is asking for command 'check_total_procs' to be run...
sie 10 16:38:32 dev-db nrpe[4393]: Running command: /usr/lib64/nagios/plugins/check_procs -w 150 -c 200
sie 10 16:38:32 dev-db nrpe[4394]: WARNING: my_system() seteuid(0): Operation not permitted
sie 10 16:38:32 dev-db nrpe[4394]: Warning: Could not set effective GID=999
sie 10 16:38:32 dev-db nrpe[4393]: Command completed with return code 0 and output: PROCS OK: 80 processes | procs=80;150;200;0;
sie 10 16:38:32 dev-db nrpe[4393]: Return Code: 0, Output: PROCS OK: 80 processes | procs=80;150;200;0;
sie 10 16:38:32 dev-db nrpe[4393]: Connection from DEV-NAGIOS closed.

For check_postgres.pl I've got:

sie 10 16:33:50 dev-db nrpe[4185]: CONN_CHECK_PEER: checking if host is allowed: DEV-NAGIOS port 64982
sie 10 16:33:50 dev-db nrpe[4185]: Connection from DEV-NAGIOS port 64982
sie 10 16:33:50 dev-db nrpe[4185]: is_an_allowed_host (AF_INET): is host >DEV-NAGIOS< an allowed host >DEV-NAGIOS<
sie 10 16:33:50 dev-db nrpe[4185]: is_an_allowed_host (AF_INET): is host >DEV-NAGIOS< an allowed host >DEV-NAGIOS<
sie 10 16:33:50 dev-db nrpe[4185]: is_an_allowed_host (AF_INET): host is in allowed host list!
sie 10 16:33:50 dev-db nrpe[4185]: Host address is in allowed_hosts
sie 10 16:33:50 dev-db nrpe[4185]: Host DEV-NAGIOS is asking for command 'check_postgres_locks' to be run...
sie 10 16:33:50 dev-db nrpe[4185]: Running command: /usr/lib64/nagios/plugins/check_postgres.pl --action locks --user=nrpe -w 2 -c 3
sie 10 16:33:50 dev-db nrpe[4186]: WARNING: my_system() seteuid(0): Operation not permitted
sie 10 16:33:50 dev-db nrpe[4186]: Warning: Could not set effective GID=999
sie 10 16:33:50 dev-db nrpe[4185]: Command completed with return code 2 and output:
sie 10 16:33:50 dev-db nrpe[4185]: Return Code: 3, Output: NRPE: Unable to read output
sie 10 16:33:50 dev-db nrpe[4185]: Connection from DEV-NAGIOS closed.

Apparently, when run by the nrpe process, check_postgres.pl exits with status 2, outputs nothing, and doesn't even connect to the local Postgres DB.
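
The numbers in that log are plugin exit codes. By the Nagios plugin convention, 0 is OK, 1 is WARNING, 2 is CRITICAL and 3 is UNKNOWN; the log suggests the script exited with 2 and NRPE reported 3 because it had no output to return. A small sketch for decoding a code when rerunning the command by hand (`nagios_state` is a hypothetical helper, and the `sh -c 'exit 2'` line is only a stand-in for the real command):

```shell
# Hypothetical helper: map a plugin exit code to its Nagios state name.
nagios_state() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        3) echo UNKNOWN ;;
        *) echo "non-standard exit code: $1" ;;
    esac
}

# Stand-in for the real command; replace it with the exact line NRPE
# logged after "Running command:".
code=0
sh -c 'exit 2' || code=$?

echo "exit code $code means $(nagios_state "$code")"
```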

When using check_postgres there is in fact a problem with SELinux. When installed to /usr/lib64/nagios/plugins it gets the context unconfined_u:object_r:lib_t or unconfined_u:object_r:unconfined_t. It needs the file context system_u:object_r:nagios_unconfined_plugin_exec_t to be run over NRPE. Somehow the symlinks to check_postgres.pl can't get the proper system_u setting, so I've switched from using symlinks to using the --action option. This way there is no denial in the audit log, yet it still does not work over NRPE.
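
For context, check_postgres.pl documents that when it is invoked through a link named check_postgres_<action>, it takes the action from that name, which is why the symlinks and the --action form are interchangeable. A rough shell illustration of that naming rule (this is not the script's own Perl code, and `action_from_name` is a hypothetical helper):

```shell
# Hypothetical helper: derive the action from a check_postgres_<action>
# style file name; fails if the name carries no action.
action_from_name() {
    name=$(basename "$1")
    case "$name" in
        check_postgres_*) printf '%s\n' "${name#check_postgres_}" ;;
        *) return 1 ;;
    esac
}

for path in /usr/lib64/nagios/plugins/check_postgres_locks \
            /usr/lib64/nagios/plugins/check_postgres.pl; do
    if action=$(action_from_name "$path"); then
        echo "$path -> --action $action"
    else
        echo "$path -> no action in name, pass --action explicitly"
    fi
done
```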

ubellavance commented 4 years ago

I suggest you try in permissive mode temporarily. I remember having denials without logs on RHEL 7.

Eloar commented 4 years ago

I'm not keen to do so, but I tried switching SELinux on the DB machine into permissive mode:

# sestatus
SELinux status:                 enabled
SELinuxfs mount:                /sys/fs/selinux
SELinux root directory:         /etc/selinux
Loaded policy name:             targeted
Current mode:                   enforcing
Mode from config file:          enforcing
Policy MLS status:              enabled
Policy deny_unknown status:     allowed
Max kernel policy version:      28
# setenforce 0
# getenforce
Permissive
-bash-4.2# sestatus
SELinux status:                 enabled
SELinuxfs mount:                /sys/fs/selinux
SELinux root directory:         /etc/selinux
Loaded policy name:             targeted
Current mode:                   permissive
Mode from config file:          enforcing
Policy MLS status:              enabled
Policy deny_unknown status:     allowed
Max kernel policy version:      28

Then I ran check_nrpe on the Nagios machine, without success. Both the output on the Nagios machine and the log on the DB machine are the same in enforcing and permissive mode.

ubellavance commented 4 years ago

I'm out of ideas, sorry

Eloar commented 4 years ago

Long story short I'm kind of an idiot.

So somewhere along the way SELinux was a problem due to an invalid file context. Whatever I did, I couldn't keep the proper file context on the symlinks, so I had to use check_postgres.pl directly with the --action option.

The actual cause of the error was an invalid option passed to check_postgres.pl. When I switched from the symlinks to running check_postgres.pl directly with --action, I also switched from the short option -u to the long one, but made a mistake: I wrote --user instead of --dbuser. Apparently the NRPE daemon doesn't log stderr from the invoked command. I was able to catch it after a suggestion to add 2>&1 to the end of the command definition in the nrpe.cfg file on the monitored remote system.

So the conclusion is: while debugging, try adding 2>&1 to the end of the command definition in nrpe.cfg.
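
A minimal sketch of why the redirection helps, assuming a command that complains only on stderr. `fake_plugin` and its message are hypothetical stand-ins, not real check_postgres output:

```shell
# Hypothetical stand-in for a plugin that rejects an option: it prints
# its complaint to stderr, prints nothing to stdout, and exits with 2.
fake_plugin() {
    echo "illustrative usage error" >&2
    return 2
}

# stdout only: stderr is dropped, so nothing is captured.
out_plain=$(fake_plugin 2>/dev/null) || true

# stderr merged into stdout, which is the effect of appending 2>&1.
out_merged=$(fake_plugin 2>&1) || true

echo "stdout only: '${out_plain}'"
echo "with 2>&1:   '${out_merged}'"

# In the NRPE command definition the redirection goes at the very end, e.g.
# command[check_postgres_locks]=/usr/lib64/nagios/plugins/check_postgres.pl --action locks --dbuser=nrpe -w 2 -c 3 2>&1
```

Once the real error is visible and fixed, the redirection can be removed again so that stray warnings don't end up in the check output.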

turnstep commented 1 year ago

Glad it all worked out! That was a tricky one to debug.