ConSol-Monitoring / snclient

SNClient+ - Cross platform monitoring agent
MIT License
50 stars 8 forks source link

Script run into timeout #96

Closed alexoalexo closed 7 months ago

alexoalexo commented 7 months ago

I have a number of external scripts that I run and all are running fine except one of them. I always get a timeout from snclient although the script is successfuly started and finishes prior to reaching the defined timeout.

For example timeout set to 666, script finishes on 1min but I still get: script run into timeout after xxx

[/settings/external scripts] ; timeout - The maximum time in seconds that a command can execute. (if more then this execution will be aborted). ; NOTICE this only affects external commands not internal ones. timeout = 666

`root@nagiosserver:/usr/lib/nagios/plugins# time ./check_nrpe -d 2 -2 -t 360:unknown -H 1.1.1.1 -c generate_report UNKNOWN - script run into timeout after 666s

real 1m0.479s user 0m0.006s sys 0m0.004s ` What should be my next troubleshooting steps? Thank you

alexoalexo commented 7 months ago

Here is my log file (level set to trace):

_[2024-02-27 13:06:54.429][Trace][pid:1192][listen_nrpe:105] nrpe v2 request: generate_rvtools_report []string{} [2024-02-27 13:06:54.432][Trace][pid:1192][snclient:706] command: generate_rvtools_report [2024-02-27 13:06:54.432][Trace][pid:1192][snclient:707] args: []string{} [2024-02-27 13:06:54.432][Debug][pid:1192][checkwrap:63] command before macros expanded: RVToolsBatchMultipleVCs.ps1 [2024-02-27 13:06:54.445][Trace][pid:1192][snclient:1327] command object: path: C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe args: - SysProcAttr: &{true powershell -WindowStyle hidden -NoLogo -NonInteractive -Command ". 'C:\Program Files\snclient\scripts\RVToolsBatchMultipleVCs.ps1' ; exit($LASTEXITCODE)" 0 0 false [] 0} [2024-02-27 13:07:58.375][Trace][pid:1192][snclient:1052] stdout: [2024-02-27 13:07:58.375][Trace][pid:1192][snclient:1053] stderr: [2024-02-27 13:07:58.375][Trace][pid:1192][snclient:1054] exitCode: 3 [2024-02-27 13:07:58.375][Trace][pid:1192][snclient:1055] timeout: 666 [2024-02-27 13:07:58.375][Trace][pid:1192][snclient:1056] error: &fmt.wrapError{msg:"timeout: context deadline exceeded", err:context.deadlineExceededError{}} [2024-02-27 13:07:58.375][Trace][pid:1192][snclient:1057] state: &os.ProcessState{pid:4556, status:syscall.WaitStatus{ExitCode:0x1}, rusage:(*syscall.Rusage)(0xc00018c560)} [2024-02-27 13:07:58.375][Debug][pid:1192][listener:326] nrpe connection from 1.2.3.4:51318 finished in 1m3.9885392s

It looks like my script "RVToolsBatchMultipleVCs.ps1" exits with code 3? But why snclient is sending back such message?

Thanks

alexoalexo commented 7 months ago

When I run my script manually it always exits with 0. It has 'exit 0' at the end. So I don't know I get error code 3?

sni commented 7 months ago

which snclient version is this? I have reworked the scripts/wrapped scripts part last week. Could you try the current build from https://github.com/ConSol-Monitoring/snclient/actions/runs/8052511606 (you need to scroll down to the bottom of that page)

alexoalexo commented 7 months ago

I'm running SNClient+ v0.18 (Build: 706454c, go1.21.7)

alexoalexo commented 7 months ago

Same thing with SNClient+ v0.18.0046 (Build: e3bc0fd, go1.21.7). I get error 3 I'm running it as .ps1 file and also calling it from a a .bat file and always get exit code 3

[2024-02-27 14:08:12.729][Trace][pid:3708][snclient:1066] error: &fmt.wrapError{msg:"timeout: context deadline exceeded", err:context.deadlineExceededError{}}

sni commented 7 months ago

i see, so it seems like the timeout isn't used at all and the default timeout is used instead?

alexoalexo commented 7 months ago

When the script finishes quickly - it returns 0 and all is fine. Can you revisit the timeout settings?

alexoalexo commented 7 months ago

When it takes more than a minute it always fail, so probably the timeout setting in the config file is not applied properly :


root@nagiosserver:/usr/lib/nagios/plugins# time ./check_nrpe 
C:\Program Files\snclient>start "RVTOOLS-Export-Script"/DC:\R

real    0m14.860s
user    0m0.009s
sys     0m0.001s
root@nagiosserver:/usr/lib/nagios/plugins# time ./check_nrpe 
C:\Program Files\snclient>start "RVTOOLS-Export-Script"/DC:\R

real    0m14.874s
user    0m0.009s
sys     0m0.000s
root@nagiosserver:/usr/lib/nagios/plugins# time ./check_nrpe 
C:\Program Files\snclient>start "RVTOOLS-Export-Script"/DC:\R

real    0m36.078s
user    0m0.007s
sys     0m0.003s
root@nagiosserver:/usr/lib/nagios/plugins# time ./check_nrpe 
UNKNOWN - script run into timeout after 666s

real    1m8.550s
user    0m0.005s
sys     0m0.005s
root@nagiosserver:/usr/lib/nagios/plugins#
sni commented 7 months ago

it should be better now, there was indeed a 60s default timeout and since the shortest one wins, it didn't matter how long the script timeout was. So now at least a debug log entry will be added in this case and it should work if you set both, the default timeout and the scripts timeout long enough:

[/settings/default]
timeout = 300

[/settings/external scripts]
timeout = 300

If that's a good idea is another story...

alexoalexo commented 7 months ago

Setting both now, but still it defaults to 60, here's the trace log:


[2024-02-28 09:54:58.799][Debug][pid:6416][check_wrap:81] reduced cmd timeout to 59s because of shorter context
[2024-02-28 09:54:58.807][Trace][pid:6416][snclient:1324] command object:
 path: C:\Windows\system32\cmd.exe
 args: (none)
 SysProcAttr: &{true C:\Windows\system32\cmd.exe /c C:\Program^ Files\snclient\scripts\generate_rvtools.bat.bat  0 0 <nil> <nil> false [] 0}
[2024-02-28 09:55:49.789][Trace][pid:6416][task_logrotate:93] check logfile rotation (threshold 10 MiB / current size: 2531 B)
[2024-02-28 09:55:57.814][Debug][pid:6416][snclient_windows:207] invalid argument
[2024-02-28 09:56:06.931][Trace][pid:6416][snclient:1063] stdout: 
[2024-02-28 09:56:06.931][Trace][pid:6416][snclient:1064] stderr: 
[2024-02-28 09:56:06.931][Trace][pid:6416][snclient:1065] exitCode: 3
[2024-02-28 09:56:06.931][Trace][pid:6416][snclient:1066] timeout: 59
[2024-02-28 09:56:06.931][Trace][pid:6416][snclient:1067] error: &fmt.wrapError{msg:"timeout: context deadline exceeded", err:context.deadlineExceededError{}}
alexoalexo commented 7 months ago

@sni , can you please re-open the issue and check. I'm ready to test and provide more info if needed. Thank you

For all scripts the timeout is set to 60, and I have

[/settings/default] timeout = 300

[2024-02-28 11:06:32.852][Trace][pid:2024][snclient:1066] timeout: 60

sni commented 7 months ago

i guess the issue is now related to the nrpe listener, i only checked the rest api yesterday.

sni commented 7 months ago

should be fixed in the latest build. Could you try again?

alexoalexo commented 7 months ago

Thanks, looks good now 👍

On longer checks 60+ seconds I get Ok/green in Nagios

Check Latency / Duration: | 0.000 / 94.094 seconds -- | --