Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0

High check latency after 2.12 upgrade / memory usage #8196

Closed — fbs closed this issue 4 years ago

fbs commented 4 years ago

Describe the bug

After upgrading to 2.12.0 we noticed very high latency on all checks. The icinga machine reports a large share of CPU time spent in system (kernel) mode: before the upgrade this was ~10%, now it's ~90%.

In our environment we have 2 icinga nodes doing checks; one is the "config leader" that pushes config (the documentation calls this Top Down configuration sync). Both nodes should execute around the same number of checks. To be able to compare the two we've downgraded the "config follower" to 2.11.5-1. We've only downgraded icinga; it runs the same kernel and other software.

Using bcc's syscount we can see how many syscalls icinga2 executes and how much time they take, printing the top calls (sorted by total time, in milliseconds) every 10 seconds:

# /usr/share/bcc/tools/syscount  -T 30 -i 10 -L -m -p 2093

On the 2.12 leader we see (saw similar numbers on follower before downgrade):

[15:33:52]
SYSCALL                   COUNT        TIME (ms)
clone                       299      9215.332333
recvmsg                     579       584.742680
sendto                      578        24.774361
wait4                       279         6.216611
recvfrom                    579         3.767298
close                       897         2.192999

[15:34:02]
SYSCALL                   COUNT        TIME (ms)
clone                       278      9261.375636
recvmsg                     561       579.717236
sendto                      562        24.195989
recvfrom                    561         7.249004
wait4                       284         6.667528
close                       834         1.823393

And on the follower:

[15:33:53]
SYSCALL                   COUNT        TIME (ms)
recvmsg                    1347      9477.930638
clone                       670       333.633895
wait4                       678        13.700700
sendto                     1348         6.523638
recvfrom                   1348         3.931384
close                      2010         2.529605

[15:34:03]
SYSCALL                   COUNT        TIME (ms)
recvmsg                     643      9734.994736
clone                       322       170.889077
wait4                       321         7.154626
sendto                      643         3.640546
recvfrom                    643         2.468145
close                       966         1.337561

The leader seems to max out around 300 clones/forks quite consistently (so 30/s), while I've seen the follower spike up to 2k (200/s). Also note how the leader takes almost 10s to do those 300 clones, while the follower spends all its time in recv.
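Dividing the samples out makes the gap concrete: the leader manages only ~30 clones per second at roughly 30 ms spent inside each clone. A small awk sketch over the first two rows of the leader's 15:33:52 sample (the 10 in the divisor is the `-i 10` interval):

```shell
# Sketch: convert one 10-second syscount sample into per-second rates
# and average time per call; sample rows are from the leader above.
awk 'NR > 1 { printf "%-8s %6.1f/s %9.3f ms/call\n", $1, $2 / 10, $3 / $2 }' <<'EOF'
SYSCALL COUNT TIME_MS
clone 299 9215.332333
recvmsg 579 584.742680
EOF
```

The same arithmetic on the follower's samples gives sub-millisecond clones, which is the behaviour one would expect.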

Using a flamegraph we can dig a bit deeper:

[Screenshot 2020-08-27 at 16:23:29]

With the leader on the left and the follower on the right we can see a clear difference: on the follower libc_fork is barely visible (it's the second column from the left), while it takes up half the screen on the leader.

Comparing the memory usage:

leader:

Groups: 988 990 991 1001
VmPeak:  3247944 kB
VmSize:  3247944 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:   1270624 kB
VmRSS:   1270624 kB
RssAnon:     1269784 kB
RssFile:         840 kB
RssShmem:          0 kB
VmData:  3099980 kB
VmStk:       132 kB
VmExe:     10936 kB
VmLib:     39376 kB
VmPTE:      2852 kB
VmSwap:        0 kB
Threads:    1

follower:

Groups: 988 990 991 1001
VmPeak:   735216 kB
VmSize:   735216 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:      2864 kB
VmRSS:      2864 kB
RssAnon:        1944 kB
RssFile:         920 kB
RssShmem:          0 kB
VmData:   595696 kB
VmStk:       132 kB
VmExe:     10288 kB
VmLib:     36168 kB
VmPTE:       268 kB
VmSwap:        0 kB
Threads:    1

So everything seems to point at the increased memory usage being the culprit.
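The tables above are the memory-related fields of `/proc/<pid>/status`. A sketch of how to capture them (using the current shell's PID as a stand-in for the icinga2 PID; redirecting two captures to files like memory.broken and memory.ok allows the diff shown further down):

```shell
# Sketch: dump the memory-related fields of a process's status file.
# $$ (this shell) stands in for the icinga2 PID when reproducing.
pid=$$
grep -E '^(Groups|Vm|Rss|Threads)' "/proc/$pid/status"
```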

After downgrading the leader:

Tracing syscalls, printing top 10... Ctrl+C to quit.
[16:30:51]
SYSCALL                   COUNT        TIME (ms)
recvmsg                    3807      7675.037480
clone                      1923      1408.668509
sendto                     3808        66.448496
recvfrom                   3808        58.546543
wait4                      1885        46.540410
close                      5769         8.401268

[16:31:01]
SYSCALL                   COUNT        TIME (ms)
recvmsg                    4533      7531.316624
clone                      2274      1475.591478
wait4                      2258        60.508798
sendto                     4532        56.665480
recvfrom                   4533        31.999594
close                      6822         9.985904

And memory:

VmPeak:   735216 kB
VmSize:   735216 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:      2864 kB
VmRSS:      2864 kB
RssAnon:        1944 kB
RssFile:         920 kB
RssShmem:          0 kB
VmData:   595696 kB
VmStk:       132 kB
VmExe:     10288 kB
VmLib:     36168 kB
VmPTE:       268 kB
VmSwap:        0 kB

Comparing the leader between broken and ok we see a huge memory diff:

join memory.broken memory.ok | awk '/kB/ { if ($2 == 0) {print} else {print ($0, $2/$4, "x")} }' | column -tx
VmPeak:    3247944  kB  735216  kB  4.41767   x
VmSize:    3247944  kB  735216  kB  4.41767   x
VmLck:     0        kB  0       kB
VmPin:     0        kB  0       kB
VmHWM:     1270624  kB  2864    kB  443.654   x
VmRSS:     1270624  kB  2864    kB  443.654   x
RssAnon:   1269784  kB  1944    kB  653.181   x
RssFile:   840      kB  920     kB  0.913043  x
RssShmem:  0        kB  0       kB
VmData:    3099980  kB  595696  kB  5.20396   x
VmStk:     132      kB  132     kB  1         x
VmExe:     10936    kB  10288   kB  1.06299   x
VmLib:     39376    kB  36168   kB  1.0887    x
VmPTE:     2852     kB  268     kB  10.6418   x
VmSwap:    0        kB  0       kB
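The VmPTE line hints at why the clones are so slow: on x86_64 every resident 4 KiB page needs an 8-byte page-table entry, and fork() has to copy those tables. A back-of-the-envelope check against the broken leader's VmRSS (a sketch; assumes 4 KiB pages and ignores non-RSS mappings):

```shell
# Sketch: estimate page-table size from RSS (4 KiB pages, 8-byte PTEs).
# 1270624 kB is the broken leader's VmRSS from the table above; the
# estimate lands in the neighbourhood of the measured VmPTE of 2852 kB.
awk 'BEGIN { rss_kb = 1270624; printf "~%.0f kB of PTEs\n", rss_kb / 4 * 8 / 1024 }'
```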

In ps we see a difference too: on 2.12 there is an extra icinga2 fork. Not sure how relevant that is.

I got a copy of /proc/<pid> for 2.12 too, so if any other info is required (like /proc/<pid>/maps) I can probably get it.

To Reproduce

Have quite a few hosts and checks.

Expected behavior

Low check latency and a higher clone rate.

Your Environment


Leader:

icinga2 - The Icinga 2 network monitoring daemon (version: 2.12.0-1)

Copyright (c) 2012-2020 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

System information:
  Platform: CentOS Linux
  Platform version: 7 (Core)
  Kernel: Linux
  Kernel version: 3.10.0-1127.18.2.el7.x86_64
  Architecture: x86_64

Build information:
  Compiler: GNU 4.8.5
  Build host: runner-hh8q3bz2-project-322-concurrent-0
  OpenSSL version: OpenSSL 1.0.2k-fips  26 Jan 2017

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

Follower:

icinga2 - The Icinga 2 network monitoring daemon (version: 2.11.5-1)

Copyright (c) 2012-2020 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

System information:
  Platform: CentOS Linux
  Platform version: 7 (Core)
  Kernel: Linux
  Kernel version: 3.10.0-1127.18.2.el7.x86_64
  Architecture: x86_64

Build information:
  Compiler: GNU 4.8.5
  Build host: runner-hh8q3bz2-project-322-concurrent-0

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

Both nodes run up-to-date (as of yesterday) CentOS 7.

Disabled features: compatlog debuglog elasticsearch gelf graphite icingadb influxdb opentsdb perfdata statusdata syslog
Enabled features: api checker command ido-mysql livestatus mainlog notification
information/cli: Icinga application loader (version: 2.12.0-1)
information/cli: Loading configuration file(s).
information/ConfigItem: Committing config item(s).
warning/CheckerComponent: Attribute 'concurrent_checks' for object 'checker' of type 'CheckerComponent' is deprecated and should not be used.
information/ApiListener: My API identity: n01.master
information/WorkQueue: #4 (DaemonUtility::LoadConfigFiles) items: 0, rate: 32.8167/s (1969/min 1969/5min 1969/15min);
information/WorkQueue: #5 (IdoMysqlConnection, ido-mysql) items: 0, rate:  0/s (0/min 0/5min 0/15min);
information/WorkQueue: #6 (ApiListener, RelayQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
information/WorkQueue: #7 (ApiListener, SyncQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
information/ConfigItem: Instantiated 1 NotificationComponent.
information/ConfigItem: Instantiated 1866 Hosts.
information/ConfigItem: Instantiated 427 Downtimes.
information/ConfigItem: Instantiated 4 NotificationCommands.
information/ConfigItem: Instantiated 1 FileLogger.
information/ConfigItem: Instantiated 2202 Comments.
information/ConfigItem: Instantiated 79360 Notifications.
information/ConfigItem: Instantiated 1 IcingaApplication.
information/ConfigItem: Instantiated 28 HostGroups.
information/ConfigItem: Instantiated 4 EventCommands.
information/ConfigItem: Instantiated 40359 Dependencies.
information/ConfigItem: Instantiated 1 CheckerComponent.
information/ConfigItem: Instantiated 2 Zones.
information/ConfigItem: Instantiated 2 Endpoints.
information/ConfigItem: Instantiated 1 ExternalCommandListener.
information/ConfigItem: Instantiated 1 IdoMysqlConnection.
information/ConfigItem: Instantiated 9 ApiUsers.
information/ConfigItem: Instantiated 1 ApiListener.
information/ConfigItem: Instantiated 241 CheckCommands.
information/ConfigItem: Instantiated 1 LivestatusListener.
information/ConfigItem: Instantiated 4 TimePeriods.
information/ConfigItem: Instantiated 2 Users.
information/ConfigItem: Instantiated 40359 Services.
information/ConfigItem: Instantiated 19 ScheduledDowntimes.
information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
information/cli: Finished validating the configuration file(s).
fbs commented 4 years ago

Our smaller acceptance (accp) setup shows similar behaviour; it has less config, so forking doesn't seem to suffer as much, but the memory diff is still visible:

information/ConfigItem: Instantiated 202 Hosts.
information/ConfigItem: Instantiated 3221 Dependencies.
information/ConfigItem: Instantiated 3221 Services.

2.12:

$ ps -g 21344 efl
F   UID   PID  PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND
4   994 21344     1  20   0 339556  7264 hrtime Ssl  ?          0:05 /usr/lib64/icinga2/sbin/icinga2
1   994 21365 21344  20   0 1197528 180236 hrtime Sl ?          8:05  \_ /usr/lib64/icinga2/sbin/icinga2
1   994 21391 21365  20   0 933824 87788 unix_s S    ?          8:54      \_ /usr/lib64/icinga2/sbin/icinga2

2.11:

$ ps -g 7057 efl
F   UID   PID  PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND
4   994  7057     1  20   0 338888  6544 hrtime Ssl  ?          0:04 /usr/lib64/icinga2/sbin/icinga2
1   994  7078  7057  20   0 338880  2700 unix_s S    ?          2:06  \_ /usr/lib64/icinga2/sbin/icinga2
1   994  7079  7057  20   0 1196488 145896 hrtime Sl ?          8:16  \_ /usr/lib64/icinga2/sbin/icinga2

On 2.11 the small process 7078 (RSS ~2.7 MB) forks new checks; on 2.12 the much larger 21391 (RSS ~88 MB) handles it.

Overall we don't see an increase in memory usage on our icinga machines, so it seems it's just a different arrangement of process forking causing this.

Al2Klimov commented 4 years ago

Hello @fbs and thank you for reporting!

Does this also occur with v2.12.0-rc1?

Best, AK

mdetrano commented 4 years ago

I tried installing 2.12.0-0.rc1.1 from the Testing repo and am seeing the same problem as with 2.12. My case is around 25,000 services: ~0 latency on 2.11.5, two minutes or more on 2.12.

Al2Klimov commented 4 years ago

I think I've reproduced the latency.

playbooks.d/icinga2.yml

---
- hosts: aklimov8196-helper
  become: yes
  become_method: sudo
  tasks:
  - name: Docker repo
    get_url:
      url: https://download.docker.com/linux/centos/docker-ce.repo
      dest: /etc/yum.repos.d/docker-ce.repo
  - name: Packages
    loop:
    - epel-release
    - nginx
    - git
    - docker-ce
    - createrepo
    yum:
      name: '{{ item }}'
  - name: Services
    loop:
    - nginx
    - docker
    service:
      name: '{{ item }}'
      state: started
      enabled: yes
  - name: SELinux
    selinux:
      state: disabled
  - name: icinga2.git
    git:
      repo: https://github.com/Icinga/icinga2.git
      bare: yes
      dest: /var/local/icinga2.git
  - name: icinga2-*
    loop:
    - v2.12.0
    - v2.11.5
    git:
      repo: file:///var/local/icinga2.git
      version: '{{ item }}'
      dest: '/var/local/icinga2-{{ item }}'
    register: icinga2
  - name: Upstream branch
    loop: '{{ icinga2.results }}'
    when: item.changed
    shell: >-
      set -exo pipefail;
      cd /var/local/icinga2-{{ item.item }};
      git branch -f master
    args:
      executable: /bin/bash
  - name: rpm-icinga2.git
    git:
      repo: https://git.icinga.com/packaging/rpm-icinga2.git
      bare: yes
      dest: /var/local/rpm-icinga2.git
  - name: rpm-icinga2-*
    loop: '{{ icinga2.results }}'
    git:
      repo: file:///var/local/rpm-icinga2.git
      dest: '/var/local/rpm-icinga2-{{ item.item }}'
  - name: chown rpm-icinga2-*
    loop: '{{ icinga2.results }}'
    file:
      path: '/var/local/rpm-icinga2-{{ item.item }}'
      state: directory
      recurse: yes
      owner: '1000'
      group: '1000'
  - name: Icinga 2 packages
    loop: '{{ icinga2.results }}'
    shell: >-
      set -exo pipefail;
      docker run --rm
      -v /var/local/icinga2-{{ item.item }}/.git:/icinga2.git:ro
      -v /var/local/rpm-icinga2-{{ item.item }}:/rpm-icinga2
      -w /rpm-icinga2
      -e ICINGA_BUILD_PROJECT=icinga2
      -e ICINGA_BUILD_TYPE=snapshot
      -e UPSTREAM_GIT_URL=file:///icinga2.git
      -e ICINGA_BUILD_UPSTREAM_BRANCH=master
      registry.icinga.com/build-docker/centos/7
      icinga-build-package
    args:
      executable: /bin/bash
      creates: '/var/local/rpm-icinga2-{{ item.item }}/build/rpmlint-icinga2-bin.log'
  - name: createrepo
    loop: '{{ icinga2.results }}'
    shell: >-
      set -exo pipefail;
      createrepo /var/local/rpm-icinga2-{{ item.item }}/build/RPMS/x86_64
    args:
      executable: /bin/bash
      creates: '/var/local/rpm-icinga2-{{ item.item }}/build/RPMS/x86_64/repodata'
  - name: Serve Icinga 2
    loop: '{{ icinga2.results }}'
    file:
      path: '/usr/share/nginx/html/icinga2-{{ item.item }}'
      state: link
      src: '/var/local/rpm-icinga2-{{ item.item }}/build/RPMS'
- hosts: 'all:!aklimov8196-helper'
  become: yes
  become_method: sudo
  vars:
    icinga2:
      aklimov8196-master1: v2.12.0
      aklimov8196-master2: v2.11.5
  tasks:
  - name: /etc/yum.repos.d/our.repo
    copy:
      dest: /etc/yum.repos.d/our.repo
      content: |
        [icinga2]
        baseurl=http://{{ hostvars['aklimov8196-helper'].ansible_default_ipv4.address }}/icinga2-{{ icinga2[inventory_hostname] }}/$basearch
        enabled=1
        gpgcheck=0
  - name: Packages
    loop:
    - epel-release
    - nagios-plugins-all
    - icinga2
    yum:
      name: '{{ item }}'
  - name: Start Icinga 2
    service:
      name: icinga2
      state: started
      enabled: yes

playbooks.d/influx.yml

---
- hosts: aklimov8196-helper
  become: yes
  become_method: sudo
  tasks:
  - name: Packages
    loop:
    - https://dl.influxdata.com/influxdb/releases/influxdb-1.8.2.x86_64.rpm
    - epel-release
    - python2-pip
    yum:
      name: '{{ item }}'
  - name: python-influxdb
    pip:
      name: influxdb
  - name: Services
    loop:
    - influxdb
    service:
      name: '{{ item }}'
      state: started
      enabled: yes
  - name: Influx DB
    influxdb_database:
      database_name: icinga2
  - name: Influx user
    influxdb_user:
      user_name: icinga2
      user_password: icinga2
      grants:
      - database: icinga2
        privilege: ALL
- hosts: 'all:!aklimov8196-helper'
  become: yes
  become_method: sudo
  tasks:
  - name: /etc/icinga2/features-available/influxdb.conf
    copy:
      dest: /etc/icinga2/features-available/influxdb.conf
      content: |
        object InfluxdbWriter "influxdb" {
          host = "{{ hostvars['aklimov8196-helper'].ansible_default_ipv4.address }}"
          port = 8086
          username = "icinga2"
          password = "icinga2"
          database = "icinga2"
          flush_threshold = 1024
          flush_interval = 10s
          host_template = {
            measurement = "$host.check_command$"
            tags = {
              hostname = "$host.name$"
            }
          }
          service_template = {
            measurement = "$service.check_command$"
            tags = {
              hostname = "$host.name$"
              service = "$service.name$"
            }
          }
        }
    notify: Icinga 2
  - name: Icinga 2 features
    loop:
    - influxdb
    file:
      path: '/etc/icinga2/features-enabled/{{ item }}.conf'
      state: link
      src: '/etc/icinga2/features-available/{{ item }}.conf'
    notify: Icinga 2
  handlers:
  - name: Icinga 2
    service:
      name: icinga2
      state: reloaded

playbooks.d/grafana.yml

---
- hosts: aklimov8196-helper
  become: yes
  become_method: sudo
  tasks:
  - name: Packages
    loop:
    - https://dl.grafana.com/oss/release/grafana-7.1.5-1.x86_64.rpm
    yum:
      name: '{{ item }}'
  - name: Services
    loop:
    - grafana-server
    service:
      name: '{{ item }}'
      state: started
      enabled: yes

playbooks.d/monitor.yml

---
- hosts: 'all:!aklimov8196-helper'
  become: yes
  become_method: sudo
  tasks:
  - name: Packages
    loop:
    - git
    yum:
      name: '{{ item }}'
  - name: /opt/justintime-plugins
    git:
      repo: https://github.com/justintime/nagios-plugins.git
      dest: /opt/justintime-plugins
  - name: check_mem.pl
    file:
      path: /usr/lib64/nagios/plugins/check_mem.pl
      state: link
      src: /opt/justintime-plugins/check_mem/check_mem.pl
  - name: /etc/icinga2/conf.d/my.conf
    copy:
      dest: /etc/icinga2/conf.d/my.conf
      content: |
        apply Service "mem" {
          import "generic-service"
          check_command = "mem"
          vars.mem_used = true
          vars.mem_cache = true
          vars.mem_warning = 80
          vars.mem_critical = 90
          assign where host.name == NodeName
        }
        apply Service "ping-" for (i in range(25000)) {
          import "generic-service"
          check_interval = 1.125m
          retry_interval = 1.125m
          check_command = "hostalive"
          assign where host.name == NodeName
        }
        const MaxConcurrentChecks = 40000
        template CheckerComponent default {
          concurrent_checks = 40000
        }
        apply Service "latency" {
          import "generic-service"
          check_command = "dummy"
          vars.dummy_text = () => {
            var latencies = get_objects(Service).map(s => s.last_check_result.execution_end - s.last_check_result.schedule_start)
            return " |latency=" + latencies.reduce((a, b) => a + b) / latencies.len() + "s"
          }
          assign where host.name == NodeName
        }
    notify: Icinga 2
  - name: check_interval
    replace:
      path: /etc/icinga2/conf.d/templates.conf
      regexp: '\bcheck_interval = .+'
      replace: check_interval = 1s
    notify: Icinga 2
  - name: retry_interval
    replace:
      path: /etc/icinga2/conf.d/templates.conf
      regexp: '\bretry_interval = .+'
      replace: retry_interval = 1s
    notify: Icinga 2
  - name: load vars.load_percpu
    blockinfile:
      path: /etc/icinga2/conf.d/services.conf
      insertafter: 'apply Service "load" {'
      block: |
        vars.load_percpu = true
    notify: Icinga 2
  handlers:
  - name: Icinga 2
    service:
      name: icinga2
      state: restarted

v2.11.5

[Screenshot 2020-09-10 at 13:08:32]

v2.12.0

[Screenshot 2020-09-10 at 13:08:37]
fbs commented 4 years ago

CPU state might still be interesting to look at to confirm: if it spends a much higher share of time in kernel/system mode, you've probably hit it.

Al2Klimov commented 4 years ago

I've hit 752c5998df3970 as first bad commit during bisecting... which is impossible.

Al2Klimov commented 4 years ago

#7984 to the rescue!

v2.12.0

[Screenshot 2020-09-16 at 12:20:42]

v2.11.5

[Screenshot 2020-09-16 at 12:20:49]

#7984

[Screenshot 2020-09-16 at 12:20:56]

N-o-X commented 4 years ago

Short update on this issue: we're considering https://github.com/Icinga/icinga2/pull/7984 for 2.13.0, but still need a fix for 2.12.x.

After a bit more bisecting by @Al2Klimov, we found https://github.com/Icinga/icinga2/pull/7606 to most likely be the cause of the problem:

master

latency=15.169704s
latency=30.069732s
latency=21.865360s
latency=24.975991s
latency=29.425227s

reverted #7606

latency=0.028997s
latency=0.035243s
latency=0.037570s
latency=0.047883s
latency=0.038054s

Just reverting that PR would leave us with our old zombie-process bug, so we're currently taking a deeper look into the process spawner to find a solution for both problems.
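For reference, the latency figures above come from the `latency` dummy service defined in monitor.yml, which averages `execution_end - schedule_start` over all services. The same mean over the five samples of the reverted run (a sketch, just redoing the arithmetic):

```shell
# Sketch: mean of the five latency samples from the "reverted #7606" run.
printf '%s\n' 0.028997 0.035243 0.037570 0.047883 0.038054 |
  awk '{ sum += $1 } END { printf "latency=%.6fs\n", sum / NR }'
```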

icinga-probot[bot] commented 4 years ago

Closed this issue because #8276 has been merged.

fbs commented 4 years ago

Thanks for the quick fix, all! I haven't had time to test it yet, but it looks promising.