Data too long for service_state.check_commandline

mrdsam commented 2 days ago

Describe the bug

After upgrading to icingadb-1.2.0_2 (from packages, as usual), icingadb frequently crashes (every 10 to 30 minutes). The error message is always the same, on both nodes (except for "\<n>", which is a always a different number)

Error 1406 (22001): Data too long for column 'check_commandline' at row <n>

See attached log (from start until crash) for details! start-end-1.txt

To Reproduce

apply schema upgrade mysql/upgrades/1.2.0.sql
upgrading to icingadb-1.2.0_2 on both systems
icinadb restart on both systems

After that, the problems started. Anyway, I proceeded with

icingadb stop (both systems)
apply schema upgrade ysql/upgrades/optional/1.2.0-history.sql (took a while, but worked fine)
icingadb start (both systems)

The problmes didn't get away after the optional schema upgrade, as expected.

I then restarted everything: icinga2, redis, icingadb, the MariaDB-Cluster (rolling method, node by node). Problem persists.

Expected behavior

Should not crash with fatal error.

Your Environment

Icinga2 HA-Cluster (with many satellites) on FreeBSD 13.3, with redis and icingadb on each cluster node.

Database is a 3-Node MariaDB Galera-Cluster.

Icinga DB version: 1.2.0
Icinga 2 version: r2.14.3-1
Operating System and version: 13.3-RELEASE-p7
MariaDB: 10.11.9 / Galera 26.4.16

Additional context

Workaround

Here is what I did as a workaround, so I didn't have to rollback the database to it's old state:

Uninstalled icingadb-1.2.0_2
Installed icingadb-1.1.1_14 again
mysql: delete from icingadb_schema where version=5;
service icingadb restart

Since then, everything looks fine.

No problem with postgresql?!

Interesting fact: I also have icingadb 1.2.0 running on some of the satellites; the problem does NOT occur there! Only difference is that I use postgres as database backend there and they are single instance each.

oxzi commented 2 days ago

Thanks for posting this issue.

One related change in 1.2.0 was enabling strict mode for MySQL in #699. Now, MySQL will fail for data being too big instead of silently truncating it. A similar issue was solved in #792, addressing other columns.

However, based on your log, the data for service_state.check_commandline is too big. This column is of type text, allowing 64K characters.

Could you please check your configuration if there are such huge values for some command line?

mrdsam commented 2 days ago

Thank you for your valuable input. Indeed "check_interfaces", which uses the performance data of the last check result to build it's new command line, exceeds this limit on hosts with many interfaces. I removed the checks and now I get...

+-------------------------------------+
| max(char_length(check_commandline)) |
+-------------------------------------+
|                               28213 |
+-------------------------------------+

...what should be safe for now. I'll try 1.2.0 again ASAP and will report the outcome here.

Nevertheless, wouldn't it be nicer to check / truncate the string before trying to write it to the database and issue a warning instead?

oxzi commented 1 day ago

Indeed "check_interfaces", which uses the performance data of the last check result to build it's new command line, exceeds this limit on hosts with many interfaces.

That's quite interesting, as having a total length of arguments exceeding 64K characters is a bit unexpected. Speaking of expectations, that's where the limit most likely came from, as it was there since the beginning and Icinga 2's IDO used the same type.

However, longer arguments are technically possible, as nicely outlined on this webpage which I have referred before. For example, on a Linux I get an ARG_MAX of 2,097,152 and on an OpenBSD there is 524,288.

Now the question is how likely it is to have such huge values, exceeding the TEXT type capacity. As seen in this very issue, it may happen.

However, to fit both ARG_MAX listed above, the type must be at least MEDIUMTEXT. Based on this table from the MySQL documentation, the required storage is based on the actual length of the string plus 2 or 3 bytes for TEXT or MEDIUMTEXT, respectively.

Raising the type should not be too expensive, but nonetheless I would like to have another opinion on that.

Nevertheless, wouldn't it be nicer to check / truncate the string before trying to write it to the database and issue a warning instead?

Definitely something should be done. Unless the type will be increased to MEDIUMTEXT, I would vote for raising a warning as truncating data is dangerous, imo.

mrdsam commented 1 day ago

First, icingadb 1.2.0 is running without problems now, thanks @oxzi for the information about strict mode and the hint that my commands could be too long!

I would not take my case as an argument to increase the size of check_commandline, because monitoring all ports of a 48 port switch without filters or splitting them up is not good practice in the first place.

My most important concern is that icingadb-daemon should keep running in such cases. IMO the best solution would be when icingadb check raises a warn or crit. So things keep running but you get notified that there's something wrong.

Icinga / icingadb