Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0

Make replay log content configurable #7625

Closed widhalmt closed 4 years ago

widhalmt commented 4 years ago

Hi,

While this request might sound weird at first, it could be helpful with huge setups, especially when thinking about IcingaDB. Maybe it's total BS, but at least it should be considered as an option.

Today we tested on a reasonably sized setup (about 70k services) how fast the API log grows when one of the two masters is down.

We talked some time about the content of the log and about what goes into the database and so on.

What we came up with was the idea to make syncing of states and state changes optional and keep only the items which are triggered from outside (downtimes, acknowledgements, sent notifications, etc.). This would reduce the size of the API log significantly.
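To make the idea concrete, here is a minimal Python sketch (not Icinga 2 code) that estimates how much of a replay log such a filter would keep. The message dicts, the method names, and the `EXTERNAL_METHODS` set are assumptions for illustration only; the real cluster message names and the on-disk log format would have to be checked against the Icinga 2 version in use.

```python
import json

# Hypothetical set of methods treated as "triggered from outside".
# These names are illustrative assumptions, not a verified list.
EXTERNAL_METHODS = {
    "event::SetAcknowledgement",
    "event::ClearAcknowledgement",
    "event::NotificationSentUser",
    "config::UpdateObject",
}


def serialized_size(message):
    """Size in bytes of one message encoded as compact JSON."""
    return len(json.dumps(message, separators=(",", ":")).encode("utf-8"))


def estimate_savings(messages):
    """Return (total_bytes, kept_bytes) if only external events are logged."""
    total = sum(serialized_size(m) for m in messages)
    kept = sum(
        serialized_size(m)
        for m in messages
        if m.get("method") in EXTERNAL_METHODS
    )
    return total, kept


# Made-up sample messages, only to exercise the function.
sample = [
    {"method": "event::CheckResult", "params": {"host": "web1", "state": 0}},
    {"method": "event::CheckResult", "params": {"host": "web2", "state": 2}},
    {"method": "event::SetAcknowledgement", "params": {"host": "web2", "author": "admin"}},
]

total, kept = estimate_savings(sample)
print(f"kept {kept} of {total} bytes")
```

On a real setup the ratio depends on how many check results arrive per externally triggered event, so the numbers from such a sketch are only a rough indication.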

Of course there are some pitfalls, like what reply the freshly restarted master should give via API (or in Redis) before it has rerun some checks. Maybe sync the current state? Forward the request? Reply with "I'm not fully resynced yet."?

I consider this some sort of "expert feature" which should not be promoted a lot, because you can definitely break things with it. But maybe keeping it out of the shipped default config and having warnings in the documentation might be enough?

Cheers, Thomas

widhalmt commented 4 years ago

ref/NC/631975

dnsmichi commented 4 years ago

You're talking about the replay log here, right?

widhalmt commented 4 years ago

Yes. /var/lib/icinga2/api/log

dnsmichi commented 4 years ago

I don't know whether we can remove certain events, as it would break the cluster, or rather the endpoints expecting them. The storage format should be changed in the long term, removing bottlenecks. That's to be discussed / prioritized with @lippserd then.

Al2Klimov commented 4 years ago

@N-o-X @lippserd What about letting the user configure blacklists of patterns per endpoint, i.e.:

object Endpoint "master1" {
  replay_if = m => !match("event::*", m.method)
}
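Note that `replay_if` is only a proposed attribute here, not something Icinga 2 provides. The intended semantics can be sketched in Python as a per-endpoint predicate that drops every message whose method matches a blacklisted glob pattern. The helper name `make_replay_filter` and the sample messages are hypothetical.

```python
from fnmatch import fnmatchcase


def make_replay_filter(blacklist):
    """Build a predicate that is True for messages that should be replayed.

    `blacklist` is a list of glob patterns matched against the message's
    "method" field (hypothetical semantics of the proposed replay_if).
    """
    def should_replay(message):
        method = message.get("method", "")
        return not any(fnmatchcase(method, pattern) for pattern in blacklist)

    return should_replay


# Equivalent of: replay_if = m => !match("event::*", m.method)
replay_if = make_replay_filter(["event::*"])

# Made-up messages; the method names are illustrative assumptions.
messages = [
    {"method": "event::CheckResult"},
    {"method": "event::SetNextCheck"},
    {"method": "config::UpdateObject"},
]

kept = [m["method"] for m in messages if replay_if(m)]
print(kept)
```

With the pattern `event::*` every `event::` message is dropped, including acknowledgements and notifications, so a usable blacklist would need finer-grained patterns than this one.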

@widhalmt @lippserd Will the customer pay for this?

lippserd commented 4 years ago

This is about the growth of the API log, if I'm not mistaken, and that should be fixed with a proper implementation of our replay log, where we also have to check the necessity of each and every cluster message. The replay log is an internal feature and must not be configurable in any way.

lippserd commented 4 years ago

#7752

widhalmt commented 4 years ago

Yes, the whole point of this issue was a customer's project, @lippserd. They have a rather huge setup and they have to deal with frequent connection loss. So they wanted to explicitly forgo sending state changes and just keep things like downtimes or other events sent via API.

The whole purpose was to reduce the size of the whole replay log to an absolute minimum.

But I can very well imagine that that would break the whole cluster communication, so I'm completely ok with this not being implemented. And I think the same goes for the customer.

Thanks for evaluating, though.