Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0

Feature Request: Combine remote command endpoint cluster messages #7021

Open Obihoernchen opened 5 years ago

Obihoernchen commented 5 years ago

Many services on a given endpoint often share the same check interval. For instance, take a host with 20 services attached to it, all with a check interval of 1 minute. Other services may use different check intervals, but at least some services usually share one.

Current Behavior

Currently, Icinga 2 sends 20 cluster messages to the remote endpoint every minute to get the results. When a satellite node has many hosts, each with many services, this slows down the API significantly. ~This is also related to the current API issues which should be improved/fixed with the new 2.11 network stack, but this idea might be interesting anyway.~ See comments below.

Expected Behavior + Possible Solution

In my opinion, it would be a nice feature if Icinga 2 tried to combine these 20 cluster messages into a single "batch" cluster message, so that all services of a host with the same check interval are covered by one message. Going one step further, it could even combine services with 1-minute and 2-minute check intervals (the latter on every 2nd API call), and so on. I think such a feature would lower the number of cluster messages significantly and improve scalability.

Edit: better description from @dnsmichi:

Possible Issues

This probably needs to be an opt-in feature because it affects timeouts and check duration.
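To make the proposal concrete, here is a minimal sketch of the grouping step, assuming checks are batched per endpoint, host, and check interval. The `Check` class, the `batch_checks` function, and the batch layout are hypothetical illustrations and not Icinga 2's actual data model or wire format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    """Hypothetical stand-in for a command endpoint check."""
    endpoint: str   # remote endpoint that executes the check
    host: str
    service: str
    interval: int   # check interval in seconds


def batch_checks(checks):
    """Group checks by (endpoint, host, interval); one batch per group."""
    groups = defaultdict(list)
    for check in checks:
        groups[(check.endpoint, check.host, check.interval)].append(check)
    # Each batch is a plain dict; this layout is illustrative only.
    return [
        {"endpoint": endpoint, "host": host, "interval": interval,
         "services": [c.service for c in members]}
        for (endpoint, host, interval), members in groups.items()
    ]


# 20 services with a 1-minute interval plus one with a 5-minute interval.
checks = [Check("satellite1", "web01", f"svc{i}", 60) for i in range(20)]
checks.append(Check("satellite1", "web01", "disk", 300))
batches = batch_checks(checks)
print(len(checks), "checks ->", len(batches), "batch messages")
```

In this sketch the 21 checks collapse into 2 batches. A real implementation would also have to decide how per-check timeouts apply to a batch, which is the reason for the opt-in suggestion above.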

Your Environment

@dnsmichi We talked about this at Icinga Camp Berlin 2019 (Markus) ;-)

Al2Klimov commented 5 years ago

Note that the network stack (#7005) has nothing to do with scalability like requests/s. It's about not denying the service in general in a large environment.

Al2Klimov commented 5 years ago

@Obihoernchen Have you tried to re-use HTTP/1.1 connections? This would increase requests/s because the TLS handshake would not be repeated for every request.

dnsmichi commented 5 years ago

I think that's the one feature we've talked about at Icinga Camp Berlin, but the description is a bit confusing. It is not about API calls, but about cluster messages fired for command endpoint checks.

The logic should be sort of

I'm not sure how to exactly achieve this, but it sounds like a good idea to discuss.

Cheers, Michael

Al2Klimov commented 5 years ago

Not sure how much speed this will add (especially after #7005)... but the proof of the pudding is in the eating.

htriem commented 4 years ago

#7160 might be relevant for this. @lippserd

julianbrost commented 2 years ago

Has anyone ever done any profiling on this? I think the overhead of JSON-RPC itself shouldn't be high enough for combining messages to make a significant difference (if it is, maybe that should be optimized).

I think it's more likely that the actual action behind the JSON-RPC message is expensive, and all actions would still have to be split into the individual checkables, so I have little hope that this change would make a huge improvement.

But all of this is my gut feeling, so prove me wrong if you like :)
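As a cheap first step before real profiling, the sketch below compares the bytes on the wire for 20 individual messages against one combined message, assuming JSON-RPC style messages with netstring framing. The method names `example::ExecuteCheck` and `example::ExecuteChecks` and the parameter layout are made up for illustration; real cluster messages carry more fields. It only quantifies framing and serialization size, and says nothing about the cost of the action behind each message.

```python
import json


def netstring(payload: bytes) -> bytes:
    """Frame a payload as a netstring: b'<length>:<payload>,'."""
    return str(len(payload)).encode() + b":" + payload + b","


def message(method: str, params: dict) -> bytes:
    """Serialize one JSON-RPC style message and frame it."""
    body = json.dumps({"jsonrpc": "2.0", "method": method, "params": params})
    return netstring(body.encode())


# Hypothetical parameters, chosen only to compare message sizes.
services = [f"svc{i}" for i in range(20)]
individual = [
    message("example::ExecuteCheck", {"host": "web01", "service": s})
    for s in services
]
batched = message("example::ExecuteChecks", {"host": "web01", "services": services})

individual_bytes = sum(len(m) for m in individual)
batched_bytes = len(batched)
print("individual:", len(individual), "messages,", individual_bytes, "bytes")
print("batched: 1 message,", batched_bytes, "bytes")
```

The batched message is smaller because the envelope is sent once instead of 20 times. Whether that saving matters next to the cost of executing the checks is exactly what profiling would have to show.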