Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0
2.03k stars 578 forks source link

API using – disconnected API connection (from Windows 10 OS) causing high CPU usage by Icinga process (epoll_ctl/epoll_wait loop) #6763

Closed jeka-r closed 6 years ago

jeka-r commented 6 years ago

API using – disconnected API connection (from Windows 10 OS) causing high CPU usage by Icinga process (epoll_ctl/epoll_wait loop)

Expected Behavior

Normal CPU usage by Icinga process.

Current Behavior

Icinga process loads CPU high after API connection has been disconnected (from Windows 10 OS).

Possible Solution

n/a

Steps to Reproduce (for bugs)

  1. Check that CPU load is normal on server side
  2. Start Windows 10 PC terminal – curl -k -u ******:****** -H "Accept: application/json" -X POST "https://***************:5665/v1/events?queue=TestCurl&types=StateChange"
  3. Terminate command after 1-2 minutes – Ctrl + C
  4. Check – high CPU usage by Nagios process on server side

Context

We use windows application as API client and found that when we use the application, after 10 – 120 minutes the server CPU shows high load by Nagios process

Example from "top -u nagios -H"

top - 16:43:09 up 10 days,  1:37,  4 users,  load average: 0,40, 0,12, 0,12
Threads: 172 total,   2 running, 170 sleeping,   0 stopped,   0 zombie
%Cpu(s): 35,0 us, 56,7 sy,  8,0 ni,  0,0 id,  0,0 wa,  0,0 hi,  0,3 si,  0,0 st
KiB Mem :  1015876 total,   173060 free,   302004 used,   540812 buff/cache
KiB Swap:        0 total,        0 free,        0 used.   526224 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 4756 nagios    20   0  645140  24024  15392 R 90,7  2,4   0:25.37 icinga2
 4734 nagios    20   0  645140  24024  15392 S  0,3  2,4   0:00.21 icinga2
 4709 nagios    20   0  645140  24024  15392 S  0,0  2,4   0:00.15 icinga2
 4732 nagios    20   0  645140  24024  15392 S  0,0  2,4   0:00.18 icinga2
 4733 nagios    20   0  645140  24024  15392 S  0,0  2,4   0:00.21 icinga2
 4735 nagios    20   0  645140  24024  15392 S  0,0  2,4   0:00.21 icinga2
 4736 nagios    20   0  645140  24024  15392 S  0,0  2,4   0:00.27 icinga2
 4744 nagios    20   0  645140  24024  15392 S  0,0  2,4   0:00.34 icinga2
 4747 nagios    20   0  645140  24024  15392 S  0,0  2,4   0:00.02 icinga2
 4748 nagios    20   0  645140  24024  15392 S  0,0  2,4   0:00.00 icinga2
 4751 nagios    20   0  645140  24024  15392 S  0,0  2,4   0:00.00 icinga2
 4752 nagios    20   0  645140  24024  15392 S  0,0  2,4   0:00.06 icinga2
 4753 nagios    20   0  645140  24024  15392 S  0,0  2,4   0:00.43 icinga2
 4754 nagios    20   0  645140  24024  15392 S  0,0  2,4   0:00.08 icinga2
 4755 nagios    20   0  645140  24024  15392 S  0,0  2,4   0:00.08 icinga2
 4757 nagios    20   0  645140  24024  15392 S  0,0  2,4   0:00.00 icinga2
 4758 nagios    20   0  645140  24024  15392 S  0,0  2,4   0:00.00 icinga2
 4759 nagios    20   0  645140  24024  15392 S  0,0  2,4   0:00.00 icinga2
 4760 nagios    20   0  645140  24024  15392 S  0,0  2,4   0:00.00 icinga2
 4761 nagios    20   0  645140  24024  15392 S  0,0  2,4   0:00.00 icinga2
 4762 nagios    20   0  645140  24024  15392 S  0,0  2,4   0:00.00 icinga2
 4763 nagios    20   0  645140  24024  15392 S  0,0  2,4   0:00.00 icinga2
 4764 nagios    20   0  645140  24024  15392 S  0,0  2,4   0:00.00 icinga2
 4765 nagios    20   0  645140  24024  15392 S  0,0  2,4   0:00.00 icinga2
 4766 nagios    20   0  645140  24024  15392 S  0,0  2,4   0:00.16 icinga2
 4956 nagios    20   0  645140  24024  15392 S  0,0  2,4   0:00.23 icinga2
 4957 nagios    20   0  645140  24024  15392 S  0,0  2,4   0:00.12 icinga2
 4958 nagios    20   0  645140  24024  15392 S  0,0  2,4   0:00.23 icinga2
 4959 nagios    20   0  645140  24024  15392 S  0,0  2,4   0:00.12 icinga2
 5959 nagios    20   0  645140  24024  15392 S  0,0  2,4   0:00.00 icinga2
 4743 nagios    20   0  427560   6028   4712 S  0,0  0,6   0:00.14 icinga2

Example strace from PID 4756:

……………………
epoll_ctl(29, EPOLL_CTL_MOD, 45, {0, {u32=45, u64=45}}) = 0
epoll_wait(29, [{EPOLLERR|EPOLLHUP, {u32=45, u64=45}}], 64, -1) = 1
epoll_ctl(29, EPOLL_CTL_MOD, 45, {0, {u32=45, u64=45}}) = 0
epoll_wait(29, [{EPOLLERR|EPOLLHUP, {u32=45, u64=45}}], 64, -1) = 1
epoll_ctl(29, EPOLL_CTL_MOD, 45, {0, {u32=45, u64=45}}) = 0
epoll_wait(29, [{EPOLLERR|EPOLLHUP, {u32=45, u64=45}}], 64, -1) = 1
epoll_ctl(29, EPOLL_CTL_MOD, 45, {0, {u32=45, u64=45}}) = 0
epoll_wait(29, [{EPOLLERR|EPOLLHUP, {u32=45, u64=45}}], 64, -1) = 1
epoll_ctl(29, EPOLL_CTL_MOD, 45, {0, {u32=45, u64=45}}) = 0
epoll_wait(29, [{EPOLLERR|EPOLLHUP, {u32=45, u64=45}}], 64, -1) = 1
……………………

There is no CPU load if curl command started and terminated from Ubuntu PC

Your Environment

dnsmichi commented 6 years ago

There is no nagios process here, both instances run Icinga 2. The client runs 2.10.1 and the master 2.9.0?

jeka-r commented 6 years ago

We have two servers with Icinga 2. Both are master, first r2.9.0-1 for test and second r2.10.1-1 as production. Same issue for both.

Api client starts on Windows 10 PC and connects to one of servers and collects StateChange

dnsmichi commented 6 years ago

Please run curl with the most verbose mode you can get, and also share the location on disk. I'd like to know whether this originates from PS, native or is an extra installed binary. This likely creates a different request than the Linux counterparts.

jeka-r commented 6 years ago

Good day.

I have performed some tests and collected info.

  1. check server CPU load
  2. start curl
  3. perform manual service check in Icinga Web
  4. interrupt curl by ctrl+C
  5. check server CPU load
  6. restart icinga

curl version

C:\Users\Evgeniy>curl -V
curl 7.55.1 (Windows) libcurl/7.55.1 WinSSL
Release-Date: [unreleased]
Protocols: dict file ftp ftps http https imap imaps pop3 pop3s smtp smtps telnet tftp
Features: AsynchDNS IPv6 Largefile SSPI Kerberos SPNEGO NTLM SSL

curl in verbose mode, tested 31 sec and interrupted by ctrl+C

C:\Users\Evgeniy>curl -v -k -u ******:******* -H "Accept: application/json" -X POST "https://*******.com:5665/v1/events?queue=TestCurl&types=StateChange" > test-mm
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying ***.***.***.***...
* TCP_NODELAY set
* Connected to ims.is-profit.com (***.***.***.***) port 5665 (#0)
* schannel: SSL/TLS connection with *******.com port 5665 (step 1/3)
* schannel: disabled server certificate revocation checks
* schannel: verifyhost setting prevents Schannel from comparing the supplied target name with the subject names in server certificates.
* schannel: sending initial handshake data: sending 177 bytes...
* schannel: sent initial handshake data: sent 177 bytes
* schannel: SSL/TLS connection with *******.com port 5665 (step 2/3)
* schannel: failed to receive handshake, need more data
* schannel: SSL/TLS connection with *******.com port 5665 (step 2/3)
* schannel: encrypted data got 2645
* schannel: encrypted data buffer: offset 2645 length 4096
* schannel: a client certificate has been requested
* schannel: SSL/TLS connection with *******.com port 5665 (step 2/3)
* schannel: encrypted data buffer: offset 2645 length 4096
* schannel: sending next handshake data: sending 581 bytes...
* schannel: SSL/TLS connection with *******.com port 5665 (step 2/3)
* schannel: encrypted data got 258
* schannel: encrypted data buffer: offset 258 length 4096
* schannel: SSL/TLS handshake complete
* schannel: SSL/TLS connection with *******.com port 5665 (step 3/3)
* schannel: stored credential handle in session cache
* Server auth using Basic with user '*****************'
> POST /v1/events?queue=TestCurl&types=StateChange HTTP/1.1
> Host: *******.com:5665
> Authorization: Basic *********************************************************=
> User-Agent: curl/7.55.1
> Accept: application/json
>
* schannel: client wants to read 102400 bytes
* schannel: encdata_buffer resized 103424
* schannel: encrypted data buffer: offset 0 length 103424
* schannel: encrypted data got 46
* schannel: encrypted data buffer: offset 46 length 103424
* schannel: decrypted data length: 17
* schannel: decrypted data added: 17
* schannel: decrypted data cached: offset 17 length 102400
* schannel: encrypted data buffer: offset 0 length 103424
* schannel: decrypted data buffer: offset 17 length 102400
* schannel: schannel_recv cleanup
* schannel: decrypted data returned 17
* schannel: decrypted data buffer: offset 0 length 102400
< HTTP/1.1 200 OK
  0     0    0     0    0     0      0      0 --:--:--  0:00:31 --:--:--     0^C

outputed test-mm is empty in current test, but some info from one of previous test

{"check_result":{"active":true,"check_source":"***********.com","command":["/usr/lib/nagios/plugins/check_apt"],"execution_end":1542100510.6113131046,"execution_start":1542100509.4333910942,"exit_status":0.0,"output":"APT OK: 0 packages available for upgrade (0 critical updates). ","performance_data":["available_upgrades=0;;;0","critical_updates=0;;;0"],"schedule_end":1542100510.6114239693,"schedule_start":1542100510.6114239693,"state":0.0,"ttl":0.0,"type":"CheckResult","vars_after":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0},"vars_before":{"attempt":1.0,"reachable":true,"state":2.0,"state_type":1.0}},"host":"*************.com","service":"apt","state":0.0,"state_type":1.0,"timestamp":1542100510.6188690662,"type":"StateChange"}

info from icinga log file during the test

[2018-11-13 13:48:13 +0300] information/ApiListener: New client connection from [***.***.***.***]:23902 (no client certificate)
[2018-11-13 13:48:13 +0300] information/HttpServerConnection: Request: POST /v1/events?queue=TestCurl&types=StateChange (from [***.***.***.***]:23902, user: ***************)
[2018-11-13 13:48:23 +0300] information/WorkQueue: #11 (HttpServerConnection) items: 1, rate:  0/s (0/min 0/5min 0/15min); empty in 17848 days, 10 hours, 48 minutes and 23 seconds
[2018-11-13 13:48:25 +0300] information/ExternalCommandListener: Executing external command: [1542106105] SCHEDULE_FORCED_SVC_CHECK;***********.com;apt;1542106105
[2018-11-13 13:48:33 +0300] information/WorkQueue: #11 (HttpServerConnection) items: 1, rate:  0/s (0/min 0/5min 0/15min); empty in -2147483648 days
[2018-11-13 13:48:34 +0300] information/ExternalCommandListener: Executing external command: [1542106114] SCHEDULE_FORCED_SVC_CHECK;***********.com;load;1542106114
[2018-11-13 13:48:38 +0300] information/ExternalCommandListener: Executing external command: [1542106118] SCHEDULE_FORCED_SVC_CHECK;***********.com;procs;1542106118
[2018-11-13 13:48:43 +0300] information/WorkQueue: #11 (HttpServerConnection) items: 1, rate:  0/s (0/min 0/5min 0/15min); empty in -2147483648 days
[2018-11-13 13:48:52 +0300] information/WorkQueue: #7 (IdoMysqlConnection, ido-mysql) items: 6, rate: 3.01667/s (181/min 881/5min 2241/15min);
[2018-11-13 13:48:53 +0300] information/WorkQueue: #11 (HttpServerConnection) items: 1, rate:  0/s (0/min 0/5min 0/15min); empty in -2147483648 days
[2018-11-13 13:48:58 +0300] information/Application: Received request to shut down.
[2018-11-13 13:48:59 +0300] information/Application: Shutting down...
[2018-11-13 13:48:59 +0300] information/ApiListener: 'api' stopped.
[2018-11-13 13:48:59 +0300] information/CheckerComponent: 'checker' stopped.
[2018-11-13 13:48:59 +0300] information/ExternalCommandListener: 'command' stopped.
[2018-11-13 13:48:59 +0300] information/ApiListener: 'api' started.

please inform if any test/info required

jeka-r commented 6 years ago

Yesterday, Icinga2 has been updated to r2.10.2-1 version in one of our servers. Problem no longer exists.

Second server, with version r2.9.0-1, problem is present.

Have any changes been made in the new version (r2.10.2-1) to the related issue?

dnsmichi commented 6 years ago

Yes, definitely. Check the release blog posts and changelog. I assume this issue is now solved.

jeka-r commented 6 years ago

I want to note, a week later, the issue is no longer repeated.

Perhaps the problem was solved with the installation of the version r2.10.2-1, but I did not find mention of this issue in the changelog.

Many thanks for the support and help.