NagiosEnterprises / ncpa

Nagios Cross-Platform Agent

return code -11...this was working...then stopped #925

Open tonyguadagno opened 1 year ago

tonyguadagno commented 1 year ago

hi, i have ncpa client ( 2.4.1-1.el8) installed on my RHEL 8 (patch current) servers. on all of my servers, i am running a powershell script that checks php-fpm. i use a shell script to call the powershell script like so:

#!/usr/bin/env -S /usr/bin/pwsh -NoProfile
$fqdn = [System.Net.Dns]::GetHostEntry([string]$env:computername).HostName
/usr/local/bin/ps_PHPFPMStatusCheck.ps1 -statusurl http://$fqdn/php-fpm-status
exit $LASTEXITCODE

initially, this was working on all my servers. then, one stopped working, now a second has stopped. in both cases, ncpa returns this:

{ "returncode": -11, "stdout": "" }

i have logging at debug but not much is given:

2023-03-14 09:19:15,689 698786 DEBUG Initializing WebSocket
2023-03-14 09:19:15,689 698786 DEBUG Validating WebSocket request
2023-03-14 09:19:15,693 698786 DEBUG Running process with command line:/usr/local/ncpa/plugins/phpstatus
2023-03-14 09:19:16,182 698786 INFO ::ffff:172.19.5.8 - - [2023-03-14 09:19:16] "GET /api/plugins/phpstatus/?token=zIVIT5mI4FJQq9_eyJhG&check=1 HTTP/1.1" 200 262 0.492494

i am able to run the script locally using the nagios user like so:

su nagios
[nagios@server log]$ /usr/local/ncpa/plugins/phpstatus
OK: Pool Name: www, ProcMGR: static, Total Procs: 100, IdleProcs: 58, MaxChildReached: 0, SlowReq: 0, ActiveProcs: 42 |ProcessUtilization=42%

i know that something has changed on these servers, but i cannot see what that might be. also, i have not made any changes to the script etc.

can anyone tell me what these error codes mean? -11...what does that mean?

thanks

MrPippin66 commented 1 year ago

Errno -11, if using standard errno.h values, would be "EAGAIN" (errno 11, "resource temporarily unavailable"), which generally means an operation could not complete right now and should be retried.

That ASSUMES what the app returned was a standard errno.h value.

That also assumes this is coming from the actual command and not an internal error from NCPA itself.
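One other reading worth ruling out: NCPA is written in Python, and Python's `subprocess` module reports a child that was killed by signal N as return code -N. Under that convention, -11 would mean signal 11 (SIGSEGV on Linux) rather than an errno value. Whether NCPA passes that value through unchanged is an assumption here, not something confirmed in this thread. A minimal shell sketch of the two conventions follows; `decode_status` is a hypothetical helper, not part of NCPA.

```shell
# decode_status: hypothetical helper, not part of NCPA.
# Interprets a return code under two common conventions:
#   negative N -> child killed by signal N (Python subprocess convention)
#   above 128  -> child killed by signal (status - 128) (POSIX shell convention)
decode_status() {
    status=$1
    if [ "$status" -lt 0 ]; then
        echo "killed by signal $(( 0 - status ))"
    elif [ "$status" -gt 128 ]; then
        echo "killed by signal $(( status - 128 ))"
    else
        echo "exited with status $status"
    fi
}

decode_status -11    # value reported by the API in this thread
decode_status 139    # what a shell would report for the same signal
decode_status 0      # a normal, successful exit
```

To turn the signal number into a name, `kill -l 11` can be run on the affected host.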

tonyguadagno commented 1 year ago

ug, EAGAIN really doesn't help me troubleshoot this. do you have any suggestions on how to get more info from ncpa....debug logging was not very helpful.

thanks for your time

MrPippin66 commented 1 year ago

Have you validated the -11 return code is actually coming from your powershell script?

That should be somewhat easy to validate, since you state you're expressly calling it from a shell script.
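One way to run that check is to call the plugin outside NCPA and record stdout, stderr, and the exit status separately. This is only a rough sketch: `report_run` is a hypothetical helper, and the plugin path in the example comment is the one quoted earlier in the thread.

```shell
# report_run: hypothetical helper. Runs a command, keeps its stdout and
# stderr in temporary files, and prints the exit status the shell saw.
report_run() {
    out=$(mktemp) || return 1
    err=$(mktemp) || { rm -f "$out"; return 1; }
    "$@" >"$out" 2>"$err"
    rc=$?
    echo "exit status: $rc"
    echo "stdout: $(cat "$out")"
    echo "stderr: $(cat "$err")"
    rm -f "$out" "$err"
    return "$rc"
}

# Example (run as the nagios user so permissions match the agent):
#   report_run /usr/local/ncpa/plugins/phpstatus
```

If the status printed here is 0 while the API still reports -11, the difference is more likely in how the agent starts the process than in the script itself.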

tonyguadagno commented 1 year ago

well, i modified my wrapper script like so:

cat phpstatus
#!/usr/bin/env -S /usr/bin/pwsh -NoProfile
$fqdn = [System.Net.Dns]::GetHostEntry([string]$env:computername).HostName
/usr/local/bin/ps_PHPFPMStatusCheck.ps1 -statusurl http://$fqdn/php-fpm-status
write-host $LASTEXITCODE
exit $LASTEXITCODE

when i run it from the command line as nagios user, i get this:

id
uid=977(nagios) gid=974(nagios) groups=974(nagios) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
[nagios@ffcslwp1 plugins]$ ./phpstatus
OK: Pool Name: www, ProcMGR: static, Total Procs: 100, IdleProcs: 81, MaxChildReached: 0, SlowReq: 0, ActiveProcs: 19 |ProcessUtilization=19%
0

as you can see, 0 is returned but, when i call it by the api, i get the same error. using the api, i don't see the "0" but i am not sure i would...right? i use this command locally on the server:

wget --no-check-certificate -O /tmp/result.txt "https://172.19.5.8:5693/api/plugins/phpstatus/?token=mytoken&check=1"

and get this result file:

cat /tmp/result.txt
{ "returncode": -11, "stdout": "" }
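When comparing several servers, it can help to pull just the `returncode` field out of each saved response. A small sketch using `sed`, so it does not depend on `jq` being installed; `extract_returncode` is a hypothetical helper, and it assumes the key and its value sit on the same line of the file.

```shell
# extract_returncode: hypothetical helper. Prints the numeric value of the
# "returncode" field from an NCPA check response saved to a file.
# Assumes the key and value are on the same line and the key appears once.
extract_returncode() {
    sed -n 's/.*"returncode"[[:space:]]*:[[:space:]]*\(-\{0,1\}[0-9][0-9]*\).*/\1/p' "$1"
}

# Example, using the file written by the wget command above:
#   extract_returncode /tmp/result.txt
```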

tonyguadagno commented 1 year ago

what's really frustrating is that i can still do this exact same thing on one of my other servers....they should be identical...but clearly there is some difference

wget --no-check-certificate -O /tmp/result.txt "https://IP:5693/api/plugins/phpstatus/?token=token&check=1"
--2023-03-14 14:56:38-- https://IP:5693/api/plugins/phpstatus/?token=token&check=1
Connecting to IP:5693... connected.
WARNING: The certificate of ‘IP’ is not trusted.
WARNING: The certificate of ‘IP’ hasn't got a known issuer.
The certificate's owner does not match hostname ‘IP’
HTTP request sent, awaiting response... 200 OK
Length: 197 [application/json]
Saving to: ‘/tmp/result.txt’

/tmp/result.txt 100%[==================================================================>] 197 --.-KB/s in 0.04s

2023-03-14 14:56:39 (4.79 KB/s) - ‘/tmp/result.txt’ saved [197/197]

[root@host php-fpm.d]# cat /tmp/result.txt
{ "returncode": 0, "stdout": "OK: Pool Name: www, ProcMGR: static, Total Procs: 300, IdleProcs: 293, MaxChildReached: 0, SlowReq: 0, ActiveProcs: 7 |ProcessUtilization=2.33333333333333%" }

ccztux commented 1 year ago

Are the NCPA version and ncpa.cfg identical on those servers? Just a shot in the dark: is SELinux active on the affected server?

tonyguadagno commented 1 year ago

the version is the same. I had thought about selinux. i ran a sealert -a. nothing stood out, although there were some entries for python stuff? like this:

SELinux is preventing /usr/libexec/platform-python3.6 from write access on the file _etc_vdoconf.yml.lock.

*****  Plugin catchall (100. confidence) suggests   **************************

If you believe that platform-python3.6 should be allowed write access on the _etc_vdoconf.yml.lock file by default.
Then you should report this as a bug.
You can generate a local policy module to allow this access.
Do
allow this access for now by executing:
ausearch -c 'vdo' --raw | audit2allow -M my-vdo
semodule -X 300 -i my-vdo.pp

Additional Information:
Source Context                system_u:system_r:insights_client_t:s0
Target Context                system_u:object_r:var_lock_t:s0
Target Objects                _etc_vdoconf.yml.lock [ file ]
Source                        vdo
Source Path                   /usr/libexec/platform-python3.6
Port                          <Unknown>
Host                          <Unknown>
Source RPM Packages           platform-python-3.6.8-48.el8_7.1.x86_64
Target RPM Packages
SELinux Policy RPM            selinux-policy-targeted-3.14.3-108.el8_7.1.noarch
Local Policy RPM              selinux-policy-targeted-3.14.3-108.el8_7.1.noarch

and this

SELinux is preventing /usr/libexec/platform-python3.6 from write access on the directory /var/lib/selinux/targeted/active/modules.

*****  Plugin selinuxpolicy (91.4 confidence) suggests   *********************

If you do not think platform-python3.6 should try write access on modules.
Then you may be under attack by a hacker, since confined applications should not need this access.
Do
contact your security administrator and report this issue.

*****  Plugin catchall (9.59 confidence) suggests   **************************

If you believe that platform-python3.6 should be allowed write access on the modules directory by default.
Then you should report this as a bug.
You can generate a local policy module to allow this access.
Do
allow this access for now by executing:
ausearch -c 'semanage' --raw | audit2allow -M my-semanage
semodule -X 300 -i my-semanage.pp

Additional Information:
Source Context                system_u:system_r:insights_client_t:s0
Target Context                system_u:object_r:semanage_store_t:s0
Target Objects                /var/lib/selinux/targeted/active/modules [ dir ]
Source                        semanage

tonyguadagno commented 1 year ago

fyi, for testing, i set selinux to permissive and still had the issue...so i don't think it is selinux

sudo sestatus
SELinux status:                 enabled
SELinuxfs mount:                /sys/fs/selinux
SELinux root directory:         /etc/selinux
Loaded policy name:             targeted
Current mode:                   permissive
Mode from config file:          enforcing
Policy MLS status:              enabled
Policy deny_unknown status:     allowed
Memory protection checking:     actual (secure)
Max kernel policy version:      33

ccztux commented 1 year ago

Thanks for testing it.

tonyguadagno commented 1 year ago

ok, i have some more info...but i don't know how to interpret this: on the server that does not work:

Mar 15 15:37:22 server powershell[765022]: (7.3.3:1:80) [Perftrack_ConsoleStartupStart:PowershellConsoleStartup.WinStart.Informational] PowerShell console is starting up
Mar 15 15:37:22 server powershell[765022]: (7.3.3:B:80) [NamedPipeIPC_ServerListenerStarted:NamedPipe.Open.Informational] PowerShell has started an IPC listening thread on process: 765022 in AppDomain: None.
Mar 15 15:37:22 server powershell[765022]: (7.3.3:1:80) [Perftrack_ConsoleStartupStop:PowershellConsoleStartup.WinStop.Informational] PowerShell console is ready for user input
Mar 15 15:37:22 server kernel: traps: pwsh[765033] general protection fault ip:7f8db5d34929 sp:7f4c8b4d54c8 error:0 in libcoreclr.so[7f8db5c32000+4e0000]
Mar 15 15:37:22 server systemd[1]: Started Process Core Dump (PID 765042/UID 0).
Mar 15 15:37:22 server systemd-coredump[765043]: Resource limits disable core dumping for process 765022 (pwsh).
Mar 15 15:37:22 server systemd-coredump[765043]: Process 765022 (pwsh) of user 977 dumped core.
Mar 15 15:37:22 server systemd[1]: systemd-coredump@32-765042-0.service: Succeeded.

on the server that works:

Mar 15 15:38:17 server powershell[1758803]: (7.3.3:1:80) [Perftrack_ConsoleStartupStart:PowershellConsoleStartup.WinStart.Informational] PowerShell console is starting up
Mar 15 15:38:17 server powershell[1758803]: (7.3.3:B:80) [NamedPipeIPC_ServerListenerStarted:NamedPipe.Open.Informational] PowerShell has started an IPC listening thread on process: 1758803 in AppDomain: None.
Mar 15 15:38:17 server powershell[1758803]: (7.3.3:1:80) [Perftrack_ConsoleStartupStop:PowershellConsoleStartup.WinStop.Informational] PowerShell console is ready for user input

looks like powershell core dumps but only when called from the API, not from the commandline logged in as nagios. FYI, user 977 is the nagios user

tonyguadagno commented 1 year ago

not sure if this helps any, but i changed my wrapper script to be a bash script. like this

#!/usr/bin/bash
/usr/bin/pwsh -file /usr/local/bin/ps_PHPFPMStatusCheck.ps1 -statusuri /php-fpm-status
exit $?

running this from the command line works fine. but when the ncpa agent is told to call it via an api call like this:

wget --no-check-certificate -O /tmp/result1.txt "https://ipaddr:5693/api/plugins/phpstatus/?token=mytoken&check=1"

this happens:

Mar 16 10:34:08 server powershell[806740]: (7.3.3:1:80) [Perftrack_ConsoleStartupStart:PowershellConsoleStartup.WinStart.Informational] PowerShell console is starting up
Mar 16 10:34:08 server powershell[806740]: (7.3.3:B:80) [NamedPipeIPC_ServerListenerStarted:NamedPipe.Open.Informational] PowerShell has started an IPC listening thread on process: 806740 in AppDomain: None.
Mar 16 10:34:08 server powershell[806740]: (7.3.3:1:80) [Perftrack_ConsoleStartupStop:PowershellConsoleStartup.WinStop.Informational] PowerShell console is ready for user input
Mar 16 10:34:08 server kernel: traps: pwsh[806750] general protection fault ip:7f038d1d1929 sp:7ec262d824c8 error:0 in libcoreclr.so[7f038d0cf000+4e0000]
Mar 16 10:34:08 server systemd[1]: Started Process Core Dump (PID 806759/UID 0).
Mar 16 10:34:09 server systemd-coredump[806760]: Resource limits disable core dumping for process 806740 (pwsh).
Mar 16 10:34:09 server systemd-coredump[806760]: Process 806740 (pwsh) of user 977 dumped core.
Mar 16 10:34:09 server systemd[1]: systemd-coredump@38-806759-0.service: Succeeded.

MrPippin66 commented 1 year ago

You'll need to alter the resource limits for core dumping for the ID you have this running as on that server (I assume 'nagios') to allow a full core dump. You'll likely need to pursue the cause of the core dump with the PowerShell Core developers.
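The journal line "Resource limits disable core dumping" suggests the core-file size limit is 0 for processes started by the agent. One option, sketched below under that assumption, is to raise the soft limit inside the bash wrapper before pwsh starts. This only helps if the hard limit is above 0. `raise_core_limit` is a hypothetical helper; `ulimit -c` with `-S`/`-H` is available in bash and dash.

```shell
# raise_core_limit: hypothetical helper for the plugin wrapper.
# Raises the soft core-file size limit to whatever the hard limit allows,
# so that a crash in a child process can produce a full core dump.
raise_core_limit() {
    hard=$(ulimit -H -c) || return 1
    ulimit -S -c "$hard"
}

# In the bash wrapper from this thread it would go before the pwsh call:
#   raise_core_limit
#   /usr/bin/pwsh -file /usr/local/bin/ps_PHPFPMStatusCheck.ps1 -statusuri /php-fpm-status
```

If the hard limit itself is 0, the limit has to be raised where the agent is started instead, for example with a `LimitCORE=infinity` setting on its systemd unit (which unit that is depends on the NCPA install and is not confirmed here). Since systemd-coredump is already handling the crash on this host, `coredumpctl list pwsh` should show the dump once one is stored.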

Have you updated the PowerShell core version you're using to be the current version? They'll likely ask that, first.

tonyguadagno commented 1 year ago

hi, yes, powershell is current 7.3.3. can you explain what is different between having the nagios user run the script vs the api call the script? that might shed some light on what is going on.

MrPippin66 commented 1 year ago

That's hard to answer completely.

From a command line invocation, it's running from a user session, rather than from a daemon, which is how NCPA would call it.

Again, your best bet is to examine a full core dump.

You'd LIKELY get the same results if you ran this from an "at" job as the aforementioned ID.
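An `at` job is one way to test that. A lighter approximation, swapped in here as an assumption rather than anything NCPA documents, is to run the plugin with an almost empty environment, since a daemon typically does not inherit the variables a login session sets. `run_bare` is a hypothetical helper, and the `PATH` value is only a guess at a minimal daemon environment.

```shell
# run_bare: hypothetical helper. Runs a command with every inherited
# environment variable removed except a minimal PATH.
run_bare() {
    env -i PATH=/usr/bin:/bin "$@"
}

# Example (as the nagios user):
#   run_bare /usr/local/ncpa/plugins/phpstatus; echo "exit status: $?"
```

If the plugin crashes this way but not from a normal login shell, the difference is probably an environment variable, and adding variables back one at a time would narrow it down. If it still succeeds, the cause lies elsewhere (resource limits, working directory, or how the agent spawns the process), and the core dump remains the better lead.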