@lippserd: could you please schedule the essential part of this (getting get_host() working as before) for v2.9.2? Related issues are #6522 and Director #1579.
The problem appears to be how ConfigItem::CommitNewItems commits items: lines 405 through 435 generate the list of items to be committed. By virtue of how elements are ordered in the std::map `m_Items`, you end up with a list that is alphabetically sorted by type, e.g. `[ <all hosts>, ..., <all services> ]` (because 'H' comes before 'S' in the alphabet). And that is why this worked in < 2.9, i.e. mostly by accident: it was never guaranteed to behave like this, but it did.
2.9, or more specifically d9010c7b9f, introduced ParallelFor() for work queues, which made the order non-deterministic.
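A minimal standalone sketch of why the old behaviour held (not the actual Icinga 2 code; the key format and the comments about the parallel work queue are illustrative assumptions):

```cpp
#include <iostream>
#include <map>
#include <string>

int main()
{
    // Stand-in for ConfigItem::m_Items: a std::map iterates its keys in
    // sorted order, so entries keyed by type name always yield all Host
    // items before all Service items, regardless of insertion order.
    std::map<std::string, std::string> items{
        {"Service!test0", "..."},
        {"Host!test0", "..."},
        {"Service!test1", "..."},
        {"Host!test1", "..."},
    };

    // Sequential iteration: deterministic, Host!... entries come first,
    // so get_host() inside a Service definition finds a committed Host.
    for (const auto& kv : items)
        std::cout << kv.first << "\n";

    // A parallel work queue (like the ParallelFor() added in 2.9) hands
    // these entries to several threads instead, so a Service may be
    // committed before the Host it references, and get_host() can
    // return null.
}
```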
I suggest 7e0fb2dc3dc335c56668cb21679b21905a710bbb (branch fiddle/commit-order) as a possible fix.
Please keep in mind: I first tested this without the patch on 2.9.1, trying to reproduce the reported numbers.
Using this config for apply rules:
apply Service "test" {
check_command = "dummy"
var my_host = get_host(host_name)
assign where match("test*", host.name)
vars.my_host_override = my_host.vars.my_var
}
apply Service "testi" {
check_command = "dummy"
var my_host = get_host(host_name)
assign where match("test*", host.name)
vars.my_host_override = my_host.vars.my_var
}
apply Service "testo" {
check_command = "dummy"
var my_host = get_host(host_name)
assign where match("test*", host.name)
vars.my_host_override = my_host.vars.my_var
}
apply Service "testa" {
check_command = "dummy"
var my_host = get_host(host_name)
assign where match("test*", host.name)
vars.my_host_override = my_host.vars.my_var
}
and this one for a flat host_name:
```
for (i in range(5000)) {
  object Service "test" + i use(i) {
    check_command = "dummy"
    host_name = "test" + i
    var my_host = get_host(host_name)
    vars.my_host_override = my_host.vars.my_var
  }

  object Service "testi" + i use(i) {
    check_command = "dummy"
    host_name = "test" + i
    var my_host = get_host(host_name)
    vars.my_host_override = my_host.vars.my_var
  }

  object Service "testo" + i use(i) {
    check_command = "dummy"
    host_name = "test" + i
    var my_host = get_host(host_name)
    vars.my_host_override = my_host.vars.my_var
  }

  object Service "testa" + i use(i) {
    check_command = "dummy"
    host_name = "test" + i
    var my_host = get_host(host_name)
    vars.my_host_override = my_host.vars.my_var
  }
}
```
Both configs, with 5000 Hosts matching, created the expected 20000 Services with all custom vars in under four seconds. I am therefore unable to reproduce the issue at the moment. Did I miss some important detail?
Edit: The detail I missed is that the Director cannot create smart apply rules by itself and would create a single apply rule for each Service, i.e. 40000 apply rules, which do take a bit longer than using host.name ^_^
About fiddle/commit-order by @lazyfrosch: It works! To verify this I used the following config (10k hosts and 70k services):
```
for (i in range(10000)) {
  object Host "test" + i {
    check_command = "dummy"
    vars.my_var = "abc"
  }

  for (o in range(7)) {
    object Service "test" + i + "-" + o use(i) {
      vars.my_host_override = "nonono"
      check_command = "dummy"
      host_name = "test" + i
      var my_host = get_host(host_name)
      vars.my_host_override = my_host.vars.my_var
    }
  }
}
```
and compared the outputs of:

```
$ curl -k -s -u root:icinga 'https://localhost:5665/v1/objects/services' | python -m json.tool | grep "my_host_override" | grep -v "abc"
```

With 2.9.1 there were quite a few results of `my_host_override: null`, while the patched version had none.
As for performance: there is a clear performance loss with the patched version, as you can see in the table below.
(Please ignore these numbers, they were taken from a debug build.)
| Num Obj | 2.8.4 | 2.9.1 | patched |
|---|---|---|---|
| 5000 H, 20000 S | 11.47s | 3.10s | 3.97s |
| 10000 H, 40000 S | 22.93s | 6.58s | 7.26s |
| 10000 H, 70000 S | 38.45s | 10.36s | 13.18s |
| 20000 H, 160000 S | 85.04s | 23.23s | 30.64s |
| 20000 H, 300000 S | 159.39s | 44.54s | 58.32s |
Update on this: I'll have another go at it tomorrow. I'm not optimistic we can squeeze out better performance, but the hit may not be as heavy if we can resolve #6486.
I would be interested in numbers for 2.8.4 too, in order to weigh how much impact this change really has.
I can't easily apply the patch on 2.8.4 since the work queue was changed a great deal for 2.9.0, so all values would only be comparable to the unpatched 2.9.1.
I want to see the raw numbers from 2.8.4 without any patch :)
I just noticed I was doing those tests with debug builds :woman_facepalming:
Updated times! Now with user/system times, unit is seconds:
| Objects | 2.8.4 usr | 2.8.4 sys | 2.9.1 usr | 2.9.1 sys | patch usr | patch sys |
|---|---|---|---|---|---|---|
| 5000 H, 20000 S | 3.36 | 0.73 | 3.17 | 0.32 | 3.45 | 0.27 |
| 10000 H, 40000 S | 6.58 | 1.60 | 6.38 | 0.54 | 7.16 | 0.52 |
| 10000 H, 70000 S | 10.82 | 2.47 | 10.23 | 0.77 | 12.36 | 0.92 |
| 20000 H, 160000 S | 24.50 | 5.20 | 23.21 | 1.83 | 29.96 | 2.12 |
| 20000 H, 300000 S | 42.78 | 9.93 | 42.18 | 3.25 | 55.34 | 3.62 |
The patch slows down the config check more and more as the number of objects grows. I am worried what it might look like with 15k Hosts and a corresponding number of Services :(
Thanks for the updated numbers. This moves the problem with the scheduled downtimes slowdown into a different scope, as 2.8.4 and 2.9.0 scale nearly the same with just hosts and services. I am also worried about the additional time the patch adds: 13 seconds on the last run means roughly 30% slower.
We need to discuss how to proceed here, e.g. whether to attempt a different patch set, @bobapple. Meanwhile, please focus on other 2.9.2 tasks.
A combination of fixing the shuffling of parallel work queue tasks (#6581), visible in large-scale environments, and enhancing the commit order for dependent objects (#6568) keeps performance at an acceptable level compared to 2.8.x. This is what we've agreed on for 2.9.2.
The long-term code change should be a defined static load order for config objects, compiled into the binary similar to what we've already implemented with activation_priority. This removes the additional loops and always ensures a sorted list. In addition, technical documentation as well as examples for the commit order should be added to the docs. The follow-up issue is #6589.
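A rough sketch of that idea (hypothetical names and priorities, not the planned implementation): each config type carries a fixed load priority, analogous to the existing activation_priority, and pending items are sorted by it once before committing instead of relying on map iteration order.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical item with a static load priority per config type,
// e.g. Host = 10, Service = 20, so Hosts always commit first.
struct PendingItem {
    std::string name;
    int loadPriority;
};

void CommitInOrder(std::vector<PendingItem>& pending)
{
    // A stable sort keeps the relative order of items sharing a
    // priority; only the cross-type commit order is enforced.
    std::stable_sort(pending.begin(), pending.end(),
        [](const PendingItem& a, const PendingItem& b) {
            return a.loadPriority < b.loadPriority;
        });

    // Items within one priority group could still be committed in
    // parallel; only the groups themselves must run sequentially,
    // which avoids the extra sorting loops on every commit.
}
```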
Thanks to everyone involved in the analysis, benchmarks and implementation.
So, it took me a while to figure out how service sets in the Director work, and how to enforce this specific behaviour with object rendering. The docs need some love; I was glad to have the Icinga 2 book on my desk :)
I'm using support/2.9 for testing this.
Steps to reproduce in the Director:

1. Add a host template `linux-host`.
2. Add a host `mbmif`.
3. Add the service templates `disk-service` and `load-service`.
4. Add a new data field called `disk_partition` and add it to `disk-service`.
5. Open the host `mbmif`, click `Services` and add a service set, selecting the one created above.
6. Pick the generated `disk` service and set its `disk_partition` CV to a value you can remember for tests.

The rendered object then looks like this:

```
Object 'mbmif!disk-service' of type 'Service':
  % declared in '/usr/local/icinga2/var/lib/icinga2/api/packages/director/db5b8d91-5ce7-43ff-a55e-de0abd5f9336/zones.d/master/servicesets.conf', lines 3:1-3:29
  * __name = "mbmif!disk-service"
  * action_url = ""
  * check_command = "disk"
    % = modified in '/usr/local/icinga2/var/lib/icinga2/api/packages/director/db5b8d91-5ce7-43ff-a55e-de0abd5f9336/zones.d/director-global/service_templates.conf', lines 2:5-2:26
    % = modified in '/usr/local/icinga2/var/lib/icinga2/api/packages/director/db5b8d91-5ce7-43ff-a55e-de0abd5f9336/zones.d/master/servicesets.conf', lines 7:5-7:26
  * check_interval = 300
  * check_period = ""
  * check_timeout = null
  * command_endpoint = ""
  * display_name = "disk-service"
  * enable_active_checks = true
  * enable_event_handler = true
  * enable_flapping = false
  * enable_notifications = true
  * enable_passive_checks = true
  * enable_perfdata = true
  * event_command = ""
  * flapping_threshold = 0
  * flapping_threshold_high = 30
  * flapping_threshold_low = 25
  * groups = [ ]
  * host_name = "mbmif"
    % = modified in '/usr/local/icinga2/var/lib/icinga2/api/packages/director/db5b8d91-5ce7-43ff-a55e-de0abd5f9336/zones.d/master/servicesets.conf', lines 4:5-4:23
  * icon_image = ""
  * icon_image_alt = ""
  * max_check_attempts = 3
  * name = "disk-service"
  * notes = ""
  * notes_url = ""
  * package = "director"
  * retry_interval = 60
  * source_location
    * first_column = 1
    * first_line = 3
    * last_column = 29
    * last_line = 3
    * path = "/usr/local/icinga2/var/lib/icinga2/api/packages/director/db5b8d91-5ce7-43ff-a55e-de0abd5f9336/zones.d/master/servicesets.conf"
  * templates = [ "disk-service", "disk-service", "host var overrides (Director)" ]
    % = modified in '/usr/local/icinga2/var/lib/icinga2/api/packages/director/db5b8d91-5ce7-43ff-a55e-de0abd5f9336/zones.d/master/servicesets.conf', lines 3:1-3:29
    % = modified in '/usr/local/icinga2/var/lib/icinga2/api/packages/director/db5b8d91-5ce7-43ff-a55e-de0abd5f9336/zones.d/director-global/service_templates.conf', lines 1:0-1:30
    % = modified in '/usr/local/icinga2/var/lib/icinga2/api/packages/director/db5b8d91-5ce7-43ff-a55e-de0abd5f9336/zones.d/director-global/001-director-basics.conf', lines 28:3-28:43
  * type = "Service"
  * vars
    % = modified in '/usr/local/icinga2/var/lib/icinga2/api/packages/director/db5b8d91-5ce7-43ff-a55e-de0abd5f9336/zones.d/director-global/001-director-basics.conf', lines 41:7-41:51
    * disk_partition = "/overridden-partition-from-service-set"
      % = modified in '/usr/local/icinga2/var/lib/icinga2/api/packages/director/db5b8d91-5ce7-43ff-a55e-de0abd5f9336/zones.d/master/servicesets.conf', lines 8:5-8:32
  * volatile = false
  * zone = "master"
```

Copied from https://github.com/Icinga/icingaweb2-module-director/issues/1579#issuecomment-411753307
Tasks

- `time icinga2 daemon -C` with 2.8, 2.9, possible changes and different object counts (to be updated)

Test Config

Likewise, the icingadb.conf code slightly adapted for this one, once available.