Cacti / cacti

Cacti ™
http://www.cacti.net
GNU General Public License v2.0
1.65k stars 406 forks source link

Moving hosts between data collectors is slow #2978

Closed eschoeller closed 5 years ago

eschoeller commented 5 years ago

Not sure if there's anything that can be done for this, but it takes a very long time to move multiple devices between data collectors. On the device Management page I select numerous devices and then select the "Change Device Settings" option. I then change the Poller Association. There is a query that runs for every single device that takes 6 seconds to run:

# Time: 190925 18:59:04
# User@Host: apache[apache] @ localhost []  Id: 16810339
# Query_time: 5.228184  Lock_time: 0.000449 Rows_sent: 191  Rows_examined: 1031541
SET timestamp=1569459544;
SELECT did.* FROM data_input_data AS did INNER JOIN data_template_data AS dtd ON did.data_template_data_id=dtd.id INNER JOIN data_local AS dl ON dl.id=dtd.local_data_id INNER JOIN host AS h ON h.id=dl.host_id WHERE h.id = '728';

So, moving 50 devices from one collector to another takes a very long time.

cigamit commented 5 years ago

Run an explain on that same query:

explain SELECT did.* FROM data_input_data AS did INNER JOIN data_template_data AS dtd ON did.data_template_data_id=dtd.id INNER JOIN data_local AS dl ON dl.id=dtd.local_data_id INNER JOIN host AS h ON h.id=dl.host_id WHERE h.id = '728';
eschoeller commented 5 years ago
+----+-------------+-------+--------+-------------------------------+---------+---------+---------------------------------+--------+-------------+
| id | select_type | table | type   | possible_keys                 | key     | key_len | ref                             | rows   | Extra       |
+----+-------------+-------+--------+-------------------------------+---------+---------+---------------------------------+--------+-------------+
|  1 | SIMPLE      | h     | const  | PRIMARY                       | PRIMARY | 3       | const                           |      1 | Using index |
|  1 | SIMPLE      | did   | ALL    | NULL                          | NULL    | NULL    | NULL                            | 446616 | NULL        |
|  1 | SIMPLE      | dtd   | eq_ref | PRIMARY,local_data_id         | PRIMARY | 3       | cacti.did.data_template_data_id |      1 | NULL        |
|  1 | SIMPLE      | dl    | eq_ref | PRIMARY,host_id_snmp_query_id | PRIMARY | 3       | cacti.dtd.local_data_id         |      1 | Using where |
+----+-------------+-------+--------+-------------------------------+---------+---------+---------------------------------+--------+-------------+
4 rows in set (0.00 sec)
cigamit commented 5 years ago

Well, that is a problem.

cigamit commented 5 years ago

Try this, and then run that query again:

ALTER TABLE data_input_data ADD INDEX data_template_data_id (data_template_data_id);
eschoeller commented 5 years ago
mysql> ALTER TABLE data_input_data ADD INDEX data_template_data_id (data_template_data_id);
Query OK, 0 rows affected (3.91 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql> explain SELECT did.* FROM data_input_data AS did INNER JOIN data_template_data AS dtd ON did.data_template_data_id=dtd.id INNER JOIN data_local AS dl ON dl.id=dtd.local_data_id INNER JOIN host AS h ON h.id=dl.host_id WHERE h.id = '728';
+----+-------------+-------+-------+-------------------------------+-----------------------+---------+--------------+------+--------------------------+
| id | select_type | table | type  | possible_keys                 | key                   | key_len | ref          | rows | Extra                    |
+----+-------------+-------+-------+-------------------------------+-----------------------+---------+--------------+------+--------------------------+
|  1 | SIMPLE      | h     | const | PRIMARY                       | PRIMARY               | 3       | const        |    1 | Using index              |
|  1 | SIMPLE      | dl    | ref   | PRIMARY,host_id_snmp_query_id | host_id_snmp_query_id | 3       | const        |   25 | Using where; Using index |
|  1 | SIMPLE      | dtd   | ref   | PRIMARY,local_data_id         | local_data_id         | 3       | cacti.dl.id  |    1 | Using index              |
|  1 | SIMPLE      | did   | ref   | data_template_data_id         | data_template_data_id | 3       | cacti.dtd.id |    4 | NULL                     |
+----+-------------+-------+-------+-------------------------------+-----------------------+---------+--------------+------+--------------------------+
4 rows in set (0.00 sec)
cigamit commented 5 years ago

No, just run that same query without the explain. How long?

eschoeller commented 5 years ago

191 rows in set (0.01 sec)

cigamit commented 5 years ago

Bingo! Now, try to move a Device.

eschoeller commented 5 years ago

Yup. Moved 47 devices, nothing in the mysql-slow.log and it completed much faster. Cool! So, as per usual, somehow I missed a database update along the way ...

cigamit commented 5 years ago

Nope, I think we optimized out some indexes thinking things like just having the primary hold the value would be fast. However, that's not how it works.

eschoeller commented 5 years ago

Sadly, I think this broke my "hacked" version of graph_image.php that I use for some dashboards. I'm poking around at it right now.

cigamit commented 5 years ago

Keep us posted. I doubt the index change had anything to do with it.

eschoeller commented 5 years ago

Somehow it did. The dashboard went blank right after I ran that alter command. Or maybe it was the two boost updates ... but I find that even more remote of a possibility. Beyond me why at this point. But, this version of graph_image.php was from 2013. After every upgrade I've been shocked that it continues to work. I just copied over the most current graph_image and made two small changes to restore it. Works just fine, so this is for the better in the end. One of those ticking timebombs:)

eschoeller commented 5 years ago

Another thing I noticed... for each device that is moved, I see this activity:

2019/09/25 20:46:20 - WEBUI NOTE: Table host Replicated to Poller With 1 Rows Updated
2019/09/25 20:46:20 - WEBUI NOTE: Table host_snmp_cache Replicated to Poller With 92 Rows Updated
2019/09/25 20:46:20 - WEBUI NOTE: Table data_local Replicated to Poller With 85 Rows Updated
2019/09/25 20:46:20 - WEBUI NOTE: Table graph_local Replicated to Poller With 77 Rows Updated
2019/09/25 20:46:20 - WEBUI NOTE: Table data_template_data Replicated to Poller With 85 Rows Updated
2019/09/25 20:46:20 - WEBUI NOTE: Table data_template_rrd Replicated to Poller With 319 Rows Updated
2019/09/25 20:46:21 - WEBUI NOTE: Table graph_templates_item Replicated to Poller With 1114 Rows Updated
2019/09/25 20:46:21 - WEBUI NOTE: Table data_input_data Replicated to Poller With 846 Rows Updated

Seems like perhaps the replication should happen just once at the end, instead of for each device?

cigamit commented 5 years ago

Hmm. I'll have to check. I think it's just be replicating the impacted rows. I guess we should turn down the logging a bit.

Log a separate issue so that we don't loose it.