Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0
2.03k stars 578 forks source link

Slow and extremely huge commenthistory table on large datasets with MySQL #5498

Closed robbsie closed 6 years ago

robbsie commented 7 years ago

The auto deletion of old comments, which is triggered automatically every few seconds or even more often (don't know from where exactly, but we saw this permanently on our MySQL proc stat) is extremely slow with a large commenthistory.

The query looks like this:

DELETE FROM icinga_commenthistory WHERE instance_id = 1 AND entry_time < FROM_UNIXTIME(1499787760);

Reason seems to be a missing index on entry_time so this does always a full table scan which took ~30s on our commenthistory table with 7mio+ entrys.

After adding an index to the entry_time row the problem should be gone:

ALTER TABLE icinga_commenthistory ADD INDEX(entry_time);

The other point is, that the commenthistory table is incredible large! We've set commenthistory_age = 30d and got more than 7mio entry in this table. The real number of made comments is only a few hundred or something.

It seems every Icinga2 reload fills the table with new data from the files stored in local disc, so we have all entrys many multiple times there. Is this normal and wanted behavior?

lazyfrosch commented 7 years ago

So DbConnection::CleanUpHandler seems to use unindexed queries, which is pretty bad.

Comments and general history grow pretty huge, especially on larger systems.

commenthistory only has one Index on MySQL:

  UNIQUE KEY instance_id (instance_id,object_id,comment_time,internal_comment_id)

Same for PgSQL:

  CONSTRAINT UQ_commenthistory UNIQUE (instance_id,object_id,comment_time,internal_comment_id)

@N-o-X something for you? We should make sure that the other tables we cleanup also have an index on entry_time

dnsmichi commented 7 years ago

We'll discuss this with @lippserd as there might be more schema fixes in terms of performance coming.

Thomas-Gelf commented 7 years ago

All who are involved, please make one step back and try to look at the whole picture. There is nothing wrong with trying to optimize the DB for millions of comment history entries. Just, it makes no sense. The original comment points to the main issue:

We've set commenthistory_age = 30d and got more than 7mio entry in this table. The real number of made comments is only a few hundred or something.

This should tell us that this table contains 99% garbage. No need for tuning here, please find out where the waste comes from.

Cheers, Thomas

lazyfrosch commented 7 years ago

@Thomas-Gelf valid point, technically 2 issues here...

robbsie commented 7 years ago

We figured out that 6,9mio entrys (!) are quite the same and have been made within 3 hours:

mysql> select COUNT(*) from icinga_commenthistory  where name="<servername>!Elasticsearch 2local!monitor.<our_hostname>.de-1501852997-2";

+----------+
| COUNT(*) |
+----------+
|  6963264 |
+----------+
1 row in set (0.00 sec)

We have now cleared the table but made a dump for a deeper look, if needed. Now there are only 417 valid comments left in this table.

We have no idea why the table was so huge. Maybe it has something to do with cluster logic, because the table was so huge on both cluster nodes.

dnsmichi commented 6 years ago

I would believe this originates from an old bug in < 2.8 where the cluster downtime object creation plays "ping pong" (seen and fixed at a customer's site). Manual cleanup is the only possible method here. For the cleanup handlers - if they happen to be wrong, please open a separate issue.

Cheers, Michael