Our smaller accp setup shows similar behaviour; it has less config, so forking doesn't seem to suffer as much, but the memory diff is still visible:
information/ConfigItem: Instantiated 202 Hosts.
information/ConfigItem: Instantiated 3221 Dependencies.
information/ConfigItem: Instantiated 3221 Services.
2.12:
$ ps -g 21344 efl
F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
4 994 21344 1 20 0 339556 7264 hrtime Ssl ? 0:05 /usr/lib64/icinga2/sbin/icinga2
1 994 21365 21344 20 0 1197528 180236 hrtime Sl ? 8:05 \_ /usr/lib64/icinga2/sbin/icinga2
1 994 21391 21365 20 0 933824 87788 unix_s S ? 8:54 \_ /usr/lib64/icinga2/sbin/icinga2
2.11:
$ ps -g 7057 efl
F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
4 994 7057 1 20 0 338888 6544 hrtime Ssl ? 0:04 /usr/lib64/icinga2/sbin/icinga2
1 994 7078 7057 20 0 338880 2700 unix_s S ? 2:06 \_ /usr/lib64/icinga2/sbin/icinga2
1 994 7079 7057 20 0 1196488 145896 hrtime Sl ? 8:16 \_ /usr/lib64/icinga2/sbin/icinga2
On 2.11, process 7078 forks the new checks. On 2.12, the much larger process 21391 handles it.
Overall we don't see an increase in memory usage on our Icinga machines, so it seems like it's just a different order of process forking causing this.
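For reference, a quick way to snapshot the resident memory of the icinga2 process tree is to read VmRSS from /proc; a rough sketch using the 2.12 PIDs from the listing above (adjust per host):
$ for pid in 21344 21365 21391; do printf '%s: ' "$pid"; grep VmRSS /proc/$pid/status; done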
Hello @fbs and thank you for reporting!
Does this also occur with v2.12.0-rc1?
Best, AK
I tried installing 2.12.0-0.rc1.1 from the Testing repo and am seeing the same problem as with 2.12. My case is around 25,000 services: roughly zero latency on 2.11.5, two minutes or more on 2.12.
I think I've reproduced the latency:
---
- hosts: aklimov8196-helper
become: yes
become_method: sudo
tasks:
- name: Docker repo
get_url:
url: https://download.docker.com/linux/centos/docker-ce.repo
dest: /etc/yum.repos.d/docker-ce.repo
- name: Packages
loop:
- epel-release
- nginx
- git
- docker-ce
- createrepo
yum:
name: '{{ item }}'
- name: Services
loop:
- nginx
- docker
service:
name: '{{ item }}'
state: started
enabled: yes
- name: SELinux
selinux:
state: disabled
- name: icinga2.git
git:
repo: https://github.com/Icinga/icinga2.git
bare: yes
dest: /var/local/icinga2.git
- name: icinga2-*
loop:
- v2.12.0
- v2.11.5
git:
repo: file:///var/local/icinga2.git
version: '{{ item }}'
dest: '/var/local/icinga2-{{ item }}'
register: icinga2
- name: Upstream branch
loop: '{{ icinga2.results }}'
when: item.changed
shell: >-
set -exo pipefail;
cd /var/local/icinga2-{{ item.item }};
git branch -f master
args:
executable: /bin/bash
- name: rpm-icinga2.git
git:
repo: https://git.icinga.com/packaging/rpm-icinga2.git
bare: yes
dest: /var/local/rpm-icinga2.git
- name: rpm-icinga2-*
loop: '{{ icinga2.results }}'
git:
repo: file:///var/local/rpm-icinga2.git
dest: '/var/local/rpm-icinga2-{{ item.item }}'
- name: chown rpm-icinga2-*
loop: '{{ icinga2.results }}'
file:
path: '/var/local/rpm-icinga2-{{ item.item }}'
state: directory
recurse: yes
owner: '1000'
group: '1000'
- name: Icinga 2 packages
loop: '{{ icinga2.results }}'
shell: >-
set -exo pipefail;
docker run --rm
-v /var/local/icinga2-{{ item.item }}/.git:/icinga2.git:ro
-v /var/local/rpm-icinga2-{{ item.item }}:/rpm-icinga2
-w /rpm-icinga2
-e ICINGA_BUILD_PROJECT=icinga2
-e ICINGA_BUILD_TYPE=snapshot
-e UPSTREAM_GIT_URL=file:///icinga2.git
-e ICINGA_BUILD_UPSTREAM_BRANCH=master
registry.icinga.com/build-docker/centos/7
icinga-build-package
args:
executable: /bin/bash
creates: '/var/local/rpm-icinga2-{{ item.item }}/build/rpmlint-icinga2-bin.log'
- name: createrepo
loop: '{{ icinga2.results }}'
shell: >-
set -exo pipefail;
createrepo /var/local/rpm-icinga2-{{ item.item }}/build/RPMS/x86_64
args:
executable: /bin/bash
creates: '/var/local/rpm-icinga2-{{ item.item }}/build/RPMS/x86_64/repodata'
- name: Serve Icinga 2
loop: '{{ icinga2.results }}'
file:
path: '/usr/share/nginx/html/icinga2-{{ item.item }}'
state: link
src: '/var/local/rpm-icinga2-{{ item.item }}/build/RPMS'
- hosts: 'all:!aklimov8196-helper'
become: yes
become_method: sudo
vars:
icinga2:
aklimov8196-master1: v2.12.0
aklimov8196-master2: v2.11.5
tasks:
- name: /etc/yum.repos.d/our.repo
copy:
dest: /etc/yum.repos.d/our.repo
content: |
[icinga2]
baseurl=http://{{ hostvars['aklimov8196-helper'].ansible_default_ipv4.address }}/icinga2-{{ icinga2[inventory_hostname] }}/$basearch
enabled=1
gpgcheck=0
- name: Packages
loop:
- epel-release
- nagios-plugins-all
- icinga2
yum:
name: '{{ item }}'
- name: Start Icinga 2
service:
name: icinga2
state: started
enabled: yes
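This first playbook builds the 2.12.0 and 2.11.5 packages on the helper and installs Icinga 2 on the two master nodes; a sketch of how it might be invoked (the inventory and playbook file names are assumptions, the host names are the ones used above):
$ ansible-playbook -i hosts.ini build-and-install.yml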
---
- hosts: aklimov8196-helper
become: yes
become_method: sudo
tasks:
- name: Packages
loop:
- https://dl.influxdata.com/influxdb/releases/influxdb-1.8.2.x86_64.rpm
- epel-release
- python2-pip
yum:
name: '{{ item }}'
- name: python-influxdb
pip:
name: influxdb
- name: Services
loop:
- influxdb
service:
name: '{{ item }}'
state: started
enabled: yes
- name: Influx DB
influxdb_database:
database_name: icinga2
- name: Influx user
influxdb_user:
user_name: icinga2
user_password: icinga2
grants:
- database: icinga2
privilege: ALL
- hosts: 'all:!aklimov8196-helper'
become: yes
become_method: sudo
tasks:
- name: /etc/icinga2/features-available/influxdb.conf
copy:
dest: /etc/icinga2/features-available/influxdb.conf
content: |
object InfluxdbWriter "influxdb" {
host = "{{ hostvars['aklimov8196-helper'].ansible_default_ipv4.address }}"
port = 8086
username = "icinga2"
password = "icinga2"
database = "icinga2"
flush_threshold = 1024
flush_interval = 10s
host_template = {
measurement = "$host.check_command$"
tags = {
hostname = "$host.name$"
}
}
service_template = {
measurement = "$service.check_command$"
tags = {
hostname = "$host.name$"
service = "$service.name$"
}
}
}
notify: Icinga 2
- name: Icinga 2 features
loop:
- influxdb
file:
path: '/etc/icinga2/features-enabled/{{ item }}.conf'
state: link
src: '/etc/icinga2/features-available/{{ item }}.conf'
notify: Icinga 2
handlers:
- name: Icinga 2
service:
name: icinga2
state: reloaded
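Once this play has run, the InfluxDB side can be sanity-checked on the helper with the stock influx CLI; a minimal sketch, assuming the default port 8086:
$ influx -execute 'SHOW DATABASES'
$ influx -username icinga2 -password icinga2 -database icinga2 -execute 'SHOW MEASUREMENTS'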
---
- hosts: aklimov8196-helper
become: yes
become_method: sudo
tasks:
- name: Packages
loop:
- https://dl.grafana.com/oss/release/grafana-7.1.5-1.x86_64.rpm
yum:
name: '{{ item }}'
- name: Services
loop:
- grafana-server
service:
name: '{{ item }}'
state: started
enabled: yes
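A quick sanity check that Grafana came up on the helper; a sketch assuming the stock port 3000:
$ curl -s http://localhost:3000/api/health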
---
- hosts: 'all:!aklimov8196-helper'
become: yes
become_method: sudo
tasks:
- name: Packages
loop:
- git
yum:
name: '{{ item }}'
- name: /opt/justintime-plugins
git:
repo: https://github.com/justintime/nagios-plugins.git
dest: /opt/justintime-plugins
- name: check_mem.pl
file:
path: /usr/lib64/nagios/plugins/check_mem.pl
state: link
src: /opt/justintime-plugins/check_mem/check_mem.pl
- name: /etc/icinga2/conf.d/my.conf
copy:
dest: /etc/icinga2/conf.d/my.conf
content: |
apply Service "mem" {
import "generic-service"
check_command = "mem"
vars.mem_used = true
vars.mem_cache = true
vars.mem_warning = 80
vars.mem_critical = 90
assign where host.name == NodeName
}
apply Service "ping-" for (i in range(25000)) {
import "generic-service"
check_interval = 1.125m
retry_interval = 1.125m
check_command = "hostalive"
assign where host.name == NodeName
}
const MaxConcurrentChecks = 40000
template CheckerComponent default {
concurrent_checks = 40000
}
apply Service "latency" {
import "generic-service"
check_command = "dummy"
vars.dummy_text = () => {
var latencies = get_objects(Service).map(s => s.last_check_result.execution_end - s.last_check_result.schedule_start)
return " |latency=" + latencies.reduce((a, b) => a + b) / latencies.len() + "s"
}
assign where host.name == NodeName
}
notify: Icinga 2
- name: check_interval
replace:
path: /etc/icinga2/conf.d/templates.conf
regexp: '\bcheck_interval = .+'
replace: check_interval = 1s
notify: Icinga 2
- name: retry_interval
replace:
path: /etc/icinga2/conf.d/templates.conf
regexp: '\bretry_interval = .+'
replace: retry_interval = 1s
notify: Icinga 2
- name: load vars.load_percpu
blockinfile:
path: /etc/icinga2/conf.d/services.conf
insertafter: 'apply Service "load" {'
block: |
vars.load_percpu = true
notify: Icinga 2
handlers:
- name: Icinga 2
service:
name: icinga2
state: restarted
CPU state might still be interesting to look at to confirm. If it spends a much higher share of time in kernel/system, you've probably hit it.
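One way to watch that is pidstat from sysstat against the child that forks the checks, since it splits %usr from %system per interval; a sketch with the PID from the 2.12 listing above as an example and a 10-second interval:
$ pidstat -u -p 21391 10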
I've hit 752c5998df3970 as the first bad commit during bisecting... which is impossible.
Short update on this issue: we're considering https://github.com/Icinga/icinga2/pull/7984 for 2.13.0, but we still need a fix for 2.12.x.
After a bit more bisecting by @Al2Klimov, we found https://github.com/Icinga/icinga2/pull/7606 to most likely be the cause of the problem:
master:
latency=15.169704s
latency=30.069732s
latency=21.865360s
latency=24.975991s
latency=29.425227s
reverted #7606:
latency=0.028997s
latency=0.035243s
latency=0.037570s
latency=0.047883s
latency=0.038054s
Just reverting that PR would leave us with our old zombie process bug, so we're currently having a deeper look into the process spawner to find a solution for both problems.
Closed this issue because #8276 has been merged.
Thanks for the quick fix, all! I haven't had time to test it yet, but it looks promising.
Describe the bug
After upgrading to 2.12.0 we noticed very high latency on all checks. The Icinga machine reports a high share of time spent in system; before the upgrade this was ~10%, now it's ~90%.
In our environment we have two Icinga nodes doing checks; one is the "config leader" that pushes config (the documentation calls it top-down configuration sync). Both nodes should execute around the same number of checks. To be able to compare the two, we've downgraded the "config follower" to 2.11.5-1. We've only downgraded Icinga; it runs the same kernel and other software.
Using bcc's syscount we can see how many syscalls icinga2 executes and how long they take, printing a summary every 10 seconds. On the 2.12 leader we see (we saw similar numbers on the follower before the downgrade):
And on the follower:
The leader seems to max out around 300 clones/forks quite consistently (so 30/s), while I've seen the follower spike up to 2k (200/s). Also note how the leader takes almost 10s to do those 300 clones, while the follower spends all its time in recv.
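For reference, the syscall counts and times discussed here were gathered with bcc's syscount; an invocation along these lines, where the PID (the check-forking child from ps) is just an example:
$ /usr/share/bcc/tools/syscount -L -i 10 -p 21391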
Using a flamegraph we can dig a bit deeper.
With the leader on the left and the follower on the right we can see a clear difference: on the follower libc_fork is barely visible (it's the 2nd column from the left), while it takes up half the screen on the leader.
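A sketch of how such a flamegraph can be captured, using perf together with the stackcollapse-perf.pl and flamegraph.pl scripts from Brendan Gregg's FlameGraph repository (the PID, sampling rate and duration are just examples):
$ perf record -F 99 -g -p 21391 -- sleep 60
$ perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > icinga2.svg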
Comparing the memory usage:
leader:
follower:
So everything seems to point at the increased memory usage being the culprit.
After downgrading the leader:
And memory:
Comparing the leader between the broken and OK versions we see a huge memory diff:
In ps we see a difference too. On 2.12 we see an extra icinga fork; not sure how relevant that is. I got a copy of /proc/<pid> for 2.12 too, so if any other info is required I can probably get it (like /proc/<pid>/maps).
To Reproduce
Have quite a few hosts and checks.
Expected behavior
Low check latency, more clones
Your Environment
Include as many relevant details about the environment you experienced the problem in:
Version used (icinga2 --version):
Leader:
up to date (as of yesterday) CentOS 7
Enabled features (icinga2 feature list):
Config validation (icinga2 daemon -C):