Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0
1.99k stars 573 forks source link

Evaluate a fixed config compiler commit order #6540

Closed dnsmichi closed 6 years ago

dnsmichi commented 6 years ago

Copied from https://github.com/Icinga/icingaweb2-module-director/issues/1579#issuecomment-411753307

Hi,

We discussed this today and agreed that it is best to make get_host() working again. Though we may not have designed those accessor functions for this use case in the first place, it used to work without any problems in the past. At least, we did not receive any issue reports until now. Switching the Director to use Apply Rules is not really an option because of the huge performance impact. In our discussion we evaluated some other solutions which we may introduce in the future. For now they are just here for reference:

  1. get_host(), get_service() and alike should reliably work in our DSL. This could be achieved by using futures.

  2. Improve the performance of Apply Rules by optimizing specific lookups, e.g. host.name == x.

Thanks to all parties to sort this out. Of course we try our best to make get_host() working again asap.

Cheers, Eric

Tasks

To be updated.

Test Config

object Host "test" {
  check_command = "dummy"
  vars.my_var = "abc"
}
object Service "test" {
  check_command = "dummy"
  host_name = "test"
  var my_host = get_host(host_name)

  vars.my_host_override = my_host.vars.my_var
}

likewise the icingadb.conf code slightly adopted for this one, once available.

Thomas-Gelf commented 6 years ago

@lippserd: could you please schedule the essential part of this (get get_host() working as before) for v2.9.2? Related issues are #6522 and Director #1579.

gunnarbeutner commented 6 years ago

The problem appears to be how ConfigItem::CommitNewItems commits items:

Lines 405 through 435 generate a list of items that are to be committed. By virtue of how elements are ordered in the std::map m_Items you end up with a list that's alphabetically sorted by the type.

E.g.:

[ \<all hosts>, ..., \<all services> ]

(because 'H' comes before 'S' in the alphabet).

And that's why that worked in \<2.9. I.e., mostly by accident. It was never guaranteed to behave like this, but it did.

2.9 - or more specifically d9010c7b9f - introduced ParallelFor() for work queues which made the order non-deterministic.

lazyfrosch commented 6 years ago

I suggest 7e0fb2dc3dc335c56668cb21679b21905a710bbb fiddle/commit-order as a possible fix.

Please keep in mind:

Crunsher commented 6 years ago

I first tested this without the patch on 2.9.1, trying to reproduce the numbers from this Using this config for apply rules:

apply Service "test" {
      check_command = "dummy"
      var my_host = get_host(host_name)

      assign where match("test*", host.name)
      vars.my_host_override = my_host.vars.my_var
}

apply Service "testi" {
      check_command = "dummy"
      var my_host = get_host(host_name)

      assign where match("test*", host.name)
      vars.my_host_override = my_host.vars.my_var
}

apply Service "testo" {
      check_command = "dummy"
      var my_host = get_host(host_name)

      assign where match("test*", host.name)
      vars.my_host_override = my_host.vars.my_var
}

apply Service "testa" {
      check_command = "dummy"
      var my_host = get_host(host_name)

      assign where match("test*", host.name)
      vars.my_host_override = my_host.vars.my_var
}

and this for flat host_name:

for (i in range(5000)) {
    object Service "test" + i use(i) {
      check_command = "dummy"
      host_name = "test" + i
      var my_host = get_host(host_name)

      vars.my_host_override = my_host.vars.my_var
    }
    object Service "testi" + i use(i) {
      check_command = "dummy"
      host_name = "test" + i
      var my_host = get_host(host_name)

      vars.my_host_override = my_host.vars.my_var
    }
    object Service "testo" + i use(i) {
      check_command = "dummy"
      host_name = "test" + i
      var my_host = get_host(host_name)

      vars.my_host_override = my_host.vars.my_var
    }
    object Service "testa" + i use(i) {
      check_command = "dummy"
      host_name = "test" + i
      var my_host = get_host(host_name)

      vars.my_host_override = my_host.vars.my_var
    }
}

Both, with 5000 Hosts matching, created the expected 20000 Services with all custom vars in under four seconds. I am therefore unable to reproduce the issue at this moment. Did I miss some important detail?

Edit: The detail I missed is that the Director can not create smart apply rules by itself and would create a single apply rule for each Service, i.e 40000 apply rules. Which do take a bit longer than using host.name ^_^

Crunsher commented 6 years ago

About fiddle/commit-order by @lazyfrosch: It works! To verify this I used the following config (10k hosts and 70k services):

for (i in range(10000)) {
    object Host "test" + i { 
        check_command = "dummy"
        vars.my_var = "abc"
    }   

    for (o in range(7)) {
        object Service "test-" + o use(i) {
            vars.my_host_override = "nonono"
            check_command = "dummy"
            host_name = "test" + i 
            var my_host = get_host(host_name)

            vars.my_host_override = my_host.vars.my_var
        }
    }   
}

And compared the outputs of:

$ curl -k -s -u root:icinga 'https://localhost:5665/v1/objects/services' | python -m json.tool | grep "my_host_override" | grep -v "abc"

With 2.9.1 having quite a few results of my_host_override: null, and the patch having none. As for performance: There is clear performance loss with the patched version, as you can see in the table below.

((Please ignore, these have been done in a debug build))

Num Obj 2.8.4 2.9.1 patched
5000 H 20000 S 11.47s 3.10s 3.97s
10000 H 40000 S 22.93s 6.58s 7.26s
10000 H 70000 S 38.45s 10.36s 13.18s
20000 H 160000 S 85.04s 23.23s 30.64s
20000 H 300000 S 159.39s 44.54s 58.32s
Crunsher commented 6 years ago

Update on this: I'll have another go at it tomorrow. I'm not optimistic we can squeeze out a better performance, but this may not be a heavy change hit if we can resolve #6486

dnsmichi commented 6 years ago

I would be interested in numbers for 2.8.4 too, in order to weigh how much impact this change really has.

Crunsher commented 6 years ago

I can't easily apply the patch on 2.8.4 since the workqueue was changed a great deal for 2.9.0. So all values would only be comparable to the unpatched 2.9.1

dnsmichi commented 6 years ago

I want to see the raw numbers from 2.8.4 without any patch :)

Crunsher commented 6 years ago

I just noticed I was doing those tests with debug builds :woman_facepalming:

Crunsher commented 6 years ago

Updated times! Now with user/system times, unit is seconds:

Objects 2.8.4 usr 2.8.4 sys 2.9.1 usr 2.9.1 sys patch usr patch sys
5000H 20000S 3.36 0.73 3.17 0.32 3.45 0.27
10000H 40000S 6.58 1.60 6.38 0.54 7.16 0.52
10000H 70000S 10.82 2.47 10.23 0.77 12.36 0.92
20000H 160000S 24.50 5.20 23.21 1.83 29.96 2.12
20000H 300000S 42.78 9.93 42.18 3.25 55.34 3.62

The patch increasingly slows down the config check by number of objects. I am worried what it might look like with 15k Hosts and a corresponding number of Services :(

dnsmichi commented 6 years ago

Thanks for the updated numbers. This moves the problem with the scheduled downtimes slowdown into a different scope, as 2.8.4 and 2.9.0 nearly scale the same with just hosts and services. I am also worried about the additional time the patch generates, 13 seconds from the last run mean round about 30% slower.

We need to discuss how to proceed here, e.g. with an attempt of a different patch set or not @bobapple. Meanwhile please focus on other 2.9.2 tasks.

dnsmichi commented 6 years ago

A combined set of fixing the parallel work queue tasks with shuffling (#6581) visible in large scale environments plus enhancing the commit order for dependent objects (#6568) keeps the performance on an acceptable same level compared to 2.8.x. This is something we've agreed on for 2.9.2.

The long term code change should be a defined static load order for config objects, compiled into binary code similar to what we've implemented with the activation_priority already. This removes the additional loops and always ensures a sorted list. In addition to that, technical documentation as well as examples for the commit order should be added to the docs too. Here's the follow-up issue with #6589.

Thanks everyone involved in the analysis, benchmarks and implementation.

dnsmichi commented 5 years ago

Director Tests

So, it took me a while to figure out how service sets in the Director work, and how to enforce this specific behaviour with object rendering. The docs need some love, I was glad to have the Icinga 2 book on my desk :)

I'm using support/2.9 for testing this.

Steps

screen shot 2018-09-25 at 16 55 36

screen shot 2018-09-25 at 16 54 58

screen shot 2018-09-25 at 15 53 41

screen shot 2018-09-25 at 16 36 50

screen shot 2018-09-25 at 16 51 47

Result

Object 'mbmif!disk-service' of type 'Service':
  % declared in '/usr/local/icinga2/var/lib/icinga2/api/packages/director/db5b8d91-5ce7-43ff-a55e-de0abd5f9336/zones.d/master/servicesets.conf', lines 3:1-3:29
  * __name = "mbmif!disk-service"
  * action_url = ""
  * check_command = "disk"
    % = modified in '/usr/local/icinga2/var/lib/icinga2/api/packages/director/db5b8d91-5ce7-43ff-a55e-de0abd5f9336/zones.d/director-global/service_templates.conf', lines 2:5-2:26
    % = modified in '/usr/local/icinga2/var/lib/icinga2/api/packages/director/db5b8d91-5ce7-43ff-a55e-de0abd5f9336/zones.d/master/servicesets.conf', lines 7:5-7:26
  * check_interval = 300
  * check_period = ""
  * check_timeout = null
  * command_endpoint = ""
  * display_name = "disk-service"
  * enable_active_checks = true
  * enable_event_handler = true
  * enable_flapping = false
  * enable_notifications = true
  * enable_passive_checks = true
  * enable_perfdata = true
  * event_command = ""
  * flapping_threshold = 0
  * flapping_threshold_high = 30
  * flapping_threshold_low = 25
  * groups = [ ]
  * host_name = "mbmif"
    % = modified in '/usr/local/icinga2/var/lib/icinga2/api/packages/director/db5b8d91-5ce7-43ff-a55e-de0abd5f9336/zones.d/master/servicesets.conf', lines 4:5-4:23
  * icon_image = ""
  * icon_image_alt = ""
  * max_check_attempts = 3
  * name = "disk-service"
  * notes = ""
  * notes_url = ""
  * package = "director"
  * retry_interval = 60
  * source_location
    * first_column = 1
    * first_line = 3
    * last_column = 29
    * last_line = 3
    * path = "/usr/local/icinga2/var/lib/icinga2/api/packages/director/db5b8d91-5ce7-43ff-a55e-de0abd5f9336/zones.d/master/servicesets.conf"
  * templates = [ "disk-service", "disk-service", "host var overrides (Director)" ]
    % = modified in '/usr/local/icinga2/var/lib/icinga2/api/packages/director/db5b8d91-5ce7-43ff-a55e-de0abd5f9336/zones.d/master/servicesets.conf', lines 3:1-3:29
    % = modified in '/usr/local/icinga2/var/lib/icinga2/api/packages/director/db5b8d91-5ce7-43ff-a55e-de0abd5f9336/zones.d/director-global/service_templates.conf', lines 1:0-1:30
    % = modified in '/usr/local/icinga2/var/lib/icinga2/api/packages/director/db5b8d91-5ce7-43ff-a55e-de0abd5f9336/zones.d/director-global/001-director-basics.conf', lines 28:3-28:43
  * type = "Service"
  * vars
    % = modified in '/usr/local/icinga2/var/lib/icinga2/api/packages/director/db5b8d91-5ce7-43ff-a55e-de0abd5f9336/zones.d/director-global/001-director-basics.conf', lines 41:7-41:51
    * disk_partition = "/overridden-partition-from-service-set"
      % = modified in '/usr/local/icinga2/var/lib/icinga2/api/packages/director/db5b8d91-5ce7-43ff-a55e-de0abd5f9336/zones.d/master/servicesets.conf', lines 8:5-8:32
  * volatile = false
  * zone = "master"