Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0

Notification history cleanup takes way too long #7753

Closed: Thomas-Gelf closed this issue 2 years ago

Thomas-Gelf commented 4 years ago

idx_statehistory_cleanup exists, but a corresponding index is missing on other tables like icinga_notifications and icinga_contactnot*, maybe others. In a real-life environment with 500k involved rows, such an index drops the raw query time from 10 seconds to zero. Interestingly, when triggered by Icinga itself, the very same query has been running for 10+ minutes(!). We haven't been able to figure out why, but in any case: the index is absolutely necessary.
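
For context, the cleanup in question is essentially a periodic time-based bulk delete per instance. The exact statement generated by the IDO is not quoted in this thread; the following is only an illustrative sketch of its shape, with placeholder values:

  -- illustrative sketch of the periodic notification cleanup, not the literal generated SQL
  DELETE FROM icinga_notifications
   WHERE instance_id = 1
     AND start_time < '2020-01-01 00:00:00';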

Al2Klimov commented 4 years ago

Yet another reason to wait for IcingaDB (which will make all the things great again). 😉

abaumgaertner commented 4 years ago

While waiting for Icinga DB, please add these two indexes for MySQL to resolve this issue:

create index idx_notifications_cleanup on icinga_notifications (instance_id, start_time);
create index idx_contactnotifications_cleanup on icinga_contactnotifications (instance_id, start_time);

"educated" guess for root cause - skip index scan of with unsorted start_time

table icinga_contactnotifications index instance_id table icinga_notifications index notification_idx
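
One way to sanity-check this guess, assuming access to a MySQL shell on the IDO database (the instance_id and cutoff values are placeholders), is to compare the query plan before and after creating the indexes above:

  -- inspect which index MySQL picks for the cleanup predicate
  EXPLAIN
  SELECT COUNT(*)
    FROM icinga_notifications
   WHERE instance_id = 1
     AND start_time < '2020-01-01 00:00:00';

With idx_notifications_cleanup in place, the key column of the plan should switch to the new index.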

Thomas-Gelf commented 4 years ago

@lippserd: the index proposed by @abaumgaertner fixes this issue, and his assumption is correct. icinga_statehistory carries the exact same index for exactly this reason; it's just missing on those other tables. Icinga runs this query again and again, probably once a minute or so. That is fine as long as it completes in zero time, but terrible when every execution takes minutes. I'd really love to see this merged ASAP, preferably for 2.12.

@Al2Klimov: that's fantastic to hear, I'm looking forward to those great things. Nonetheless, it will take many years until everybody and his dog has migrated to IcingaDB. There is no reason not to fix IDO issues in the meantime, especially not when we're talking about simple quick wins for an obvious error with an impact like this one.

Thanks, Thomas

Al2Klimov commented 4 years ago

IMAO it is yet another reason – to speed up the users' migration.

dnsmichi commented 4 years ago

I talked to @lippserd yesterday and we agreed to approve the index provided by @abaumgaertner. That is to say, you are safe to include it in your schema now; the obligatory schema update file will be delivered in 2.13.

For PostgreSQL, we should also investigate whether the index is required.
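
If that investigation shows PostgreSQL needs the same fix, the statements would presumably be near-identical; the following is a sketch only, not a shipped schema change:

  -- hypothetical PostgreSQL counterparts of the MySQL indexes above
  CREATE INDEX idx_notifications_cleanup ON icinga_notifications (instance_id, start_time);
  CREATE INDEX idx_contactnotifications_cleanup ON icinga_contactnotifications (instance_id, start_time);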

Cheers, Michael

dnsmichi commented 4 years ago

@htriem Since you haven't done this before, here are some notes for implementing this:

Repeat the same for PostgreSQL with the same schema version; the index creation may differ slightly.

Test these things with

I'll assist you with that in the coming weeks when necessary.
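
For illustration only, a minimal sketch of what such a schema upgrade snippet could contain; the target version string and the exact icinga_dbversion bookkeeping shown here are assumptions, not taken from this thread:

  -- add the cleanup indexes (hypothetical upgrade file content)
  CREATE INDEX idx_notifications_cleanup ON icinga_notifications (instance_id, start_time);
  CREATE INDEX idx_contactnotifications_cleanup ON icinga_contactnotifications (instance_id, start_time);
  -- record the new schema version (version value assumed)
  UPDATE icinga_dbversion SET version = '1.15.0', modify_time = NOW() WHERE name = 'idoutils';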

lippserd commented 4 years ago

Both indices should be (start_time, instance_id).
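
For clarity, the earlier statements with the column order suggested here (index names reused from the proposal above) would read:

  create index idx_notifications_cleanup on icinga_notifications (start_time, instance_id);
  create index idx_contactnotifications_cleanup on icinga_contactnotifications (start_time, instance_id);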

Caez83 commented 3 years ago

There is another missing index, on the downtime table. The WHERE clause filters on instance_id and entry_time, but the existing indexes only cover other columns:

  KEY `sla_idx_dohist` (`object_id`,`actual_start_time`,`actual_end_time`),
  KEY `idx_downtimehistory_remove` (`object_id`,`entry_time`,`scheduled_start_time`,`scheduled_end_time`),
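
A matching cleanup index along the lines of the ones discussed above might look like this; the index name and the exact column choice are assumptions based on the WHERE clause described here, and the table is assumed to be icinga_downtimehistory:

  -- hypothetical cleanup index for the downtime history table
  CREATE INDEX idx_downtimehistory_cleanup ON icinga_downtimehistory (instance_id, entry_time);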

Another problem concerns the cleanup option. When I try to delete a long range of days (about one year) on an icinga_notifications table with a lot of data, about 30M rows, the query doesn't finish correctly and restarts every time. After a manual cleanup the query works fine. I think it would be great to add a limit on the number of rows deleted per run, or to limit the time period covered by each cleanup query.
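
A rough illustration of that batching idea, assuming a manual maintenance run on MySQL (batch size and cutoff are placeholders):

  -- delete old notification rows in bounded chunks instead of one huge statement;
  -- re-run until it reports zero affected rows
  DELETE FROM icinga_notifications
   WHERE instance_id = 1
     AND start_time < '2020-01-01 00:00:00'
   LIMIT 10000;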

Al2Klimov commented 3 years ago

@htriem You're not working on this, right?

htriem commented 3 years ago

No.

Al2Klimov commented 2 years ago

At the moment we're prioritizing Icinga DB. So: PRs welcome.

Al2Klimov commented 2 years ago

* I'm looking at you, @abaumgaertner :)

julianbrost commented 2 years ago

Icinga DB was released, so this is unlikely to receive any more attention.

abaumgaertner commented 2 years ago

Since this issue was left open for a long time, we have implemented a local data maintenance solution outside of icinga2 for high-throughput systems with a large MySQL Icinga schema (> 100 GB). Contact me if you are interested.