Closed Al2Klimov closed 2 months ago
Good catch! I completely forgot the GHA!
Then why is the `username` column of `contact` not case-insensitive in either schema yet?
Because it's not mandatory for porting the schema. Just an accidental finding.
Then open a bug, referencing the Web issue. Don't reference it here, if you don't plan to change it here.
Whether I change something here depends on whether this PR or https://github.com/Icinga/icinga-notifications-web/issues/196 is completed first.
That's fine. But then create a separate issue! It doesn't magically change otherwise. This PR is about the MySQL port, not about fixes in the Postgres schema.
Why are some tables created with `ROW_FORMAT=DYNAMIC` and others without?
Otherwise you get this error:

```
$ perl -pi -e 's/ ROW_FORMAT=DYNAMIC//' schema/mysql/schema.sql
$ mysql -h127.0.0.1 -P6233 -uroot -p1 idb <schema/mysql/schema.sql
ERROR 1709 (HY000) at line 1: Index column size too large. The maximum column size is 767 bytes.
```
But it's not always necessary, so I don't always specify it. (https://github.com/Icinga/icingaweb2/pull/5134#issuecomment-2043145062)
With which MySQL/MariaDB version did you test that? Is that version-specific? With MariaDB 11.4.2, I could import the schema without issues after removing all `ROW_FORMAT=DYNAMIC`. That also matches my expectation if `ROW_FORMAT=DYNAMIC` is the default (as you said in https://github.com/Icinga/icingaweb2/pull/5134#issuecomment-2044479106 and as the MySQL documentation states for at least versions 5.7 and 9.0). Was that different in even older versions?
Exactly. It doesn't work for me up to and including v5.6.
That would have been easier if you said that from the beginning.
From the MySQL 5.7.9 release notes:

> Important Change; InnoDB: DYNAMIC replaces COMPACT as the implicit default row format for InnoDB tables. A new configuration option, innodb_default_row_format, specifies the default InnoDB row format. Permitted values include DYNAMIC (the default), COMPACT, and REDUNDANT.
>
> The COMPACT row format remained the default row format until this release to ensure compatibility with older versions of InnoDB in MySQL 5.1 and earlier. Now that MySQL 5.1 has reached the end of its product lifecycle, the newer DYNAMIC row format becomes the default. For information about advantages of the DYNAMIC row format, see DYNAMIC Row Format.
>
> Newly created tables use the row format defined by innodb_default_row_format when a ROW_FORMAT option is not specified explicitly or when ROW_FORMAT=DEFAULT is used.
>
> Existing tables retain their current row format if a ROW_FORMAT option was specified explicitly. If a ROW_FORMAT option was not specified explicitly or if ROW_FORMAT=DEFAULT was used, any operation that rebuilds a table also silently changes the row format of the table to the format defined by innodb_default_row_format. For more information, see Defining the Row Format of a Table. (WL #8307)
tl;dr: 5.7.9 changed the default from `ROW_FORMAT=COMPACT` to `ROW_FORMAT=DYNAMIC`.
So why don't we want to specify `ROW_FORMAT=DYNAMIC` everywhere? Is there a specific reason to avoid this row format with older versions if possible? Otherwise, when upgrading the MySQL version, this would keep the `COMPACT` format until some operation rebuilds the table anyway, which differs from the format that would be used if the table were initially created with the newer MySQL version.
My motivation was:
> It is not needed. And consistency is no reason to add it. New code should reflect how it's done properly and not how it was done ages ago.
>
> – @lippserd https://github.com/Icinga/icingaweb2/pull/5134#issuecomment-2044228663
But now as you say it:
> So why don't we want to specify `ROW_FORMAT=DYNAMIC` everywhere? (...) Otherwise, when upgrading the MySQL version, this would keep the `COMPACT` format until some operation rebuilds the table anyways, which differs from the format that would be used if the table was initially created with the newer MySQL version.
This sounds legit (and was even a reason for #226).
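For the record, pinning the row format explicitly just means repeating the option on every table definition. A minimal illustrative sketch (hypothetical table, not the actual schema):

```sql
CREATE TABLE example (
  id bigint NOT NULL,
  name varchar(255) NOT NULL,

  PRIMARY KEY (id)
) ENGINE=InnoDB ROW_FORMAT=DYNAMIC DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;
```

With the explicit option, the table keeps `DYNAMIC` across server upgrades and rebuilds regardless of `innodb_default_row_format`.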
If this raises the minimal required version (as we discussed), then the documentation has to be updated as well: https://github.com/Icinga/icinga-notifications/blob/59861289ce06d4b264f1b42e941d3b520cba332a/doc/02-Installation.md#L24
Continuing https://github.com/Icinga/icinga-notifications/pull/203#pullrequestreview-2180268854.
Turns out, the `DELETE` query sets the lock. When removing it, my test does not run into a deadlock.

To be continued... again.
Continuing https://github.com/Icinga/icinga-notifications/pull/203#pullrequestreview-2180268854, https://github.com/Icinga/icinga-notifications/pull/203#issuecomment-2231175110.
The `object_extra_tag` table has a composite primary key over the `object_id` and `tag` columns, being a `binary(32)` and a `varchar(255)` respectively.

A `SELECT` query with a `WHERE` clause against a specific `object_id` results in a `ref` query. However, when creating a `DELETE` query with the same clause, `range` is chosen. Note that the latter has no `ref`.
```
MariaDB [notifications]> explain select * from object_extra_tag where object_id = 0xc08011a318c4479d2db5b9268b3ff0b7e7c3ea53b2da1aaaabb3378b4d98;
+------+-------------+------------------+------+---------------+---------+---------+-------+------+-------------+
| id   | select_type | table            | type | possible_keys | key     | key_len | ref   | rows | Extra       |
+------+-------------+------------------+------+---------------+---------+---------+-------+------+-------------+
|    1 | SIMPLE      | object_extra_tag | ref  | PRIMARY       | PRIMARY | 32      | const | 1    | Using where |
+------+-------------+------------------+------+---------------+---------+---------+-------+------+-------------+
1 row in set (0.001 sec)
```

```
MariaDB [notifications]> explain delete from object_extra_tag where object_id = 0xc08011a318c4479d2db5b9268b3ff0b7e7c3ea53b2da1aaaabb3378b4d98;
+------+-------------+------------------+-------+---------------+---------+---------+------+------+-------------+
| id   | select_type | table            | type  | possible_keys | key     | key_len | ref  | rows | Extra       |
+------+-------------+------------------+-------+---------------+---------+---------+------+------+-------------+
|    1 | SIMPLE      | object_extra_tag | range | PRIMARY       | PRIMARY | 32      | NULL | 1    | Using where |
+------+-------------+------------------+-------+---------------+---------+---------+------+------+-------------+
1 row in set (0.001 sec)
```
While I am not completely certain, I would guess based on `SHOW ENGINE INNODB STATUS` that the `DELETE` query creates a next-key lock. The locked interval seems to start from the largest value up to infinity, also known as the supremum.
```
RECORD LOCKS space id 4595 page no 3 n bits 8 index PRIMARY of table "notifications"."object_extra_tag" trx id 696018 lock_mode X insert intention waiting
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex 73757072656d756d; asc supremum;;
```
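As an aside for anyone reproducing this: on MySQL 8.0+ the same lock details can be queried directly instead of parsing the status output (on MariaDB, `SHOW ENGINE INNODB STATUS` as used above remains the way to go). A sketch, assuming `performance_schema` is enabled:

```sql
-- Lists record locks held or waited for, including next-key locks on the
-- supremum pseudo-record (MySQL 8.0+ only).
SELECT engine_transaction_id, index_name, lock_mode, lock_status, lock_data
FROM performance_schema.data_locks
WHERE object_name = 'object_extra_tag' AND lock_type = 'RECORD';
```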
After some tinkering, I was able to reproduce this with two conflicting sessions. First, I selected the maximum `object_id` in order to insert two values beyond it afterwards, running into the supremum interval.
```
MariaDB [notifications]> select max(id) from object;
+--------------------------------------------------------------------+
| max(id)                                                            |
+--------------------------------------------------------------------+
| 0xFF5916B73A281B19D0F57696F8973FBF8AD6A7DFB85EF62CD6A64641E4DA2DFF |
+--------------------------------------------------------------------+
1 row in set (0.000 sec)
```
| Transaction 1 | Transaction 2 |
|---|---|
| `start transaction;` | |
| | `start transaction;` |
| `insert into object (id, source_id, name) values (0xFF5916B73A281B19D0F57696F8973FBF8AD6A7DFB85EF62CD6A64641E4DAF00F, 1, 'foo');` | |
| | `insert into object (id, source_id, name) values (0xFF5916B73A281B19D0F57696F8973FBF8AD6A7DFB85EF62CD6A64641E4DAF0FF, 1, 'bar');` |
| `delete from object_extra_tag where object_id = 0xFF5916B73A281B19D0F57696F8973FBF8AD6A7DFB85EF62CD6A64641E4DAF00F;` | |
| | `delete from object_extra_tag where object_id = 0xFF5916B73A281B19D0F57696F8973FBF8AD6A7DFB85EF62CD6A64641E4DAF0FF;` |
| `insert into object_extra_tag values (0xFF5916B73A281B19D0F57696F8973FBF8AD6A7DFB85EF62CD6A64641E4DAF00F, 'foo', '');` | |
| | `insert into object_extra_tag values (0xFF5916B73A281B19D0F57696F8973FBF8AD6A7DFB85EF62CD6A64641E4DAF0FF, 'foo', '');`<br>`ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction` |
Furthermore, I tried adding a new `INDEX` on `object_extra_tag(object_id)`, but this changed nothing. I tried some other options, but those only made things worse. I even achieved an `ALL` type once.
For comparison, the query plan generated by PostgreSQL looks identical, at least for the selection part.
```
notifications=# explain select * from object_extra_tag where object_id = '\xefb1df788174bb134c82056ce6a850916e030ed0aefe1f2d7be659e5c1bcf131';
                                                   QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on object_extra_tag  (cost=4.45..23.19 rows=5 width=56)
   Recheck Cond: (object_id = '\xefb1df788174bb134c82056ce6a850916e030ed0aefe1f2d7be659e5c1bcf131'::bytea)
   ->  Bitmap Index Scan on pk_object_extra_tag  (cost=0.00..4.45 rows=5 width=0)
         Index Cond: (object_id = '\xefb1df788174bb134c82056ce6a850916e030ed0aefe1f2d7be659e5c1bcf131'::bytea)
(4 rows)
```

```
notifications=# explain delete from object_extra_tag where object_id = '\xefb1df788174bb134c82056ce6a850916e030ed0aefe1f2d7be659e5c1bcf131';
                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Delete on object_extra_tag  (cost=4.45..23.19 rows=0 width=0)
   ->  Bitmap Heap Scan on object_extra_tag  (cost=4.45..23.19 rows=5 width=6)
         Recheck Cond: (object_id = '\xefb1df788174bb134c82056ce6a850916e030ed0aefe1f2d7be659e5c1bcf131'::bytea)
         ->  Bitmap Index Scan on pk_object_extra_tag  (cost=0.00..4.45 rows=5 width=0)
               Index Cond: (object_id = '\xefb1df788174bb134c82056ce6a850916e030ed0aefe1f2d7be659e5c1bcf131'::bytea)
(5 rows)
```
If MySQL isn't able to handle what we do in concurrent transactions, we have to change something anyways. Even without understanding the exact cause yet, I think there are some obvious things that we can try:
1. The transaction currently isn't retried on such a failure. That can be added. It sounds like something we'd want anyway, but of course it would be nicer if we didn't have to rely on it often.
Even though I am not the biggest fan of retrying everything, it might be a strategy here. Especially since situations like this could be lurking anywhere in the codebase.
2. The current implementation was pretty much the easiest way to update the extra tags: delete all existing ones of the object and then (re)insert all of them. Alternative implementations could be tried to experimentally check whether they show better behavior. Something like reading what's currently stored and just applying the delta. Or maybe add something like `and "tag" not in (?, ?, ...)` to the `delete` statement (however, I have low hopes for that one).
After brainstorming together with @yhabteab, he suggested the same. Unfortunately, this did not work either.
Thus, I just moved the logic from the database into the code, which worked, but isn't something I am proud of.
```diff
diff --git a/internal/object/object.go b/internal/object/object.go
index fc15c26..3a3e9c9 100644
--- a/internal/object/object.go
+++ b/internal/object/object.go
@@ -114,14 +114,26 @@ func FromEvent(ctx context.Context, db *database.DB, ev *event.Event) (*Object,
 		return nil, fmt.Errorf("failed to upsert object id tags: %w", err)
 	}
 
-	extraTag := &ExtraTagRow{ObjectId: newObject.ID}
-	_, err = tx.NamedExecContext(ctx, `DELETE FROM "object_extra_tag" WHERE "object_id" = :object_id`, extraTag)
+	var tags []string
+	stmt = db.Rebind(`SELECT "tag" FROM "object_extra_tag" WHERE "object_id" = ?`)
+	err = tx.SelectContext(ctx, &tags, stmt, newObject.ID)
 	if err != nil {
-		return nil, fmt.Errorf("failed to delete object extra tags: %w", err)
+		return nil, fmt.Errorf("failed to fetch object_extra_tags: %w", err)
+	}
+
+	stmt = db.Rebind(`DELETE FROM "object_extra_tag" WHERE "object_id" = ? AND "tag" = ?`)
+	for _, tag := range tags {
+		if _, ok := ev.ExtraTags[tag]; ok {
+			continue
+		}
+		_, err := tx.ExecContext(ctx, stmt, newObject.ID, tag)
+		if err != nil {
+			return nil, fmt.Errorf("failed to delete object extra tags: %w", err)
+		}
 	}
 
 	if len(ev.ExtraTags) > 0 {
-		stmt, _ := db.BuildInsertStmt(extraTag)
+		stmt, _ := db.BuildUpsertStmt(&ExtraTagRow{ObjectId: newObject.ID})
 		_, err = tx.NamedExecContext(ctx, stmt, mapToTagRows(newObject.ID, ev.ExtraTags))
 		if err != nil {
 			return nil, fmt.Errorf("failed to insert object extra tags: %w", err)
```
Edit: This might just work because the `DELETE` query is never being executed. In our test cases, the extra tags do not change... D'oh!
> Edit: This might just work because the `DELETE` query is never being executed. In our test cases, the extra tags do not change... D'oh!

In general, those should stay unchanged most of the time. So in combination with retrying on deadlocks, this might be good enough.
Out of curiosity, I changed the code to `DELETE` each object extra tag directly, addressed by both of its primary key fields. Well, this also results in a deadlock, as `EXPLAIN` already tells us.
```
MariaDB [notifications]> explain select * from object_extra_tag where object_id = 0xFFFEE3DA6EE137D96E7FDD6757EE7518E7B1A6687884F7A993397DD74DC4D683 and tag = 'servicegroup/webserver';
+------+-------------+------------------+-------+---------------+---------+---------+-------------+------+-------+
| id   | select_type | table            | type  | possible_keys | key     | key_len | ref         | rows | Extra |
+------+-------------+------------------+-------+---------------+---------+---------+-------------+------+-------+
|    1 | SIMPLE      | object_extra_tag | const | PRIMARY       | PRIMARY | 1054    | const,const | 1    |       |
+------+-------------+------------------+-------+---------------+---------+---------+-------------+------+-------+
1 row in set (0.001 sec)
```

```
MariaDB [notifications]> explain delete from object_extra_tag where object_id = 0xFFFEE3DA6EE137D96E7FDD6757EE7518E7B1A6687884F7A993397DD74DC4D683 and tag = 'servicegroup/webserver';
+------+-------------+------------------+-------+---------------+---------+---------+------+------+-------------+
| id   | select_type | table            | type  | possible_keys | key     | key_len | ref  | rows | Extra       |
+------+-------------+------------------+-------+---------------+---------+---------+------+------+-------------+
|    1 | SIMPLE      | object_extra_tag | range | PRIMARY       | PRIMARY | 1054    | NULL | 1    | Using where |
+------+-------------+------------------+-------+---------------+---------+---------+------+------+-------------+
1 row in set (0.001 sec)
```
While starting to question what "database engine" even means[^1], I would second the idea to retry deadlocks.
[^1]: Maybe just three CSV files in a trench coat.
I don't know exactly how that's even possible, but here you are.
```
2024-07-17T16:02:11.160+0200    INFO    incident        Contact "Icinga Admin" role changed from subscriber to manager {"object": "docker-master!swap", "incident": "#5"}
2024-07-17T16:02:11.185+0200    ERROR   incident        Can't insert incident event to the database     {"object": "docker-master!swap", "incident": "#5", "error": "Error 1062 (23000): Duplicate entry '5-348' for key 'PRIMARY'"}
2024-07-17T16:02:11.185+0200    ERROR   icinga2 Cannot process event    {"source-id": 1, "event": "[time=2024-07-17 16:02:11.150006 +0200 CEST m=+0.912499501 type=\"acknowledgement-set\" severity=]", "error": "can't insert incident event to the database"}

---

2024-07-17T16:08:16.259+0200    ERROR   incident        Can't insert incident event to the database     {"object": "docker-master!swap", "incident": "#5", "error": "Error 1452 (23000): Cannot add or update a child row: a foreign key constraint fails (\"notifications\".\"incident_event\", CONSTRAINT \"fk_incident_event_event\" FOREIGN KEY (\"event_id\") REFERENCES \"event\" (\"id\"))"}
2024-07-17T16:08:16.260+0200    ERROR   icinga2 Cannot process event    {"source-id": 1, "event": "[time=2024-07-17 16:08:16.231294 +0200 CEST m=+0.969004418 type=\"acknowledgement-set\" severity=]", "error": "can't insert incident event to the database"}
```
Fixed in https://github.com/Icinga/icinga-notifications/pull/243 🙈
```
failed to insert object extra tags: Error 1213 (40001): Deadlock found when trying to get lock; try restarting transaction
```
Regarding this error, @oxzi pointed the following out to me:
It should currently only be possible to get this error within the test cases. The queries running into a conflict are actually issued while holding a mutex:
Thus, an Icinga Notifications daemon won't deadlock with itself here. However, when running the tests, `go test` can run the tests for different packages in parallel within different processes. Note that `t.Parallel()` actually only affects the parallelism between tests within one package.
So I took the liberty to prepare a commit (8a824f1f1795c87d6feaee98e53eab8ad8150a72) that would be ready to use here that disables this behavior. And indeed, with it, the tests work reliably: https://github.com/Icinga/icinga-notifications/actions/runs/9977349345
Only running a single test case against a given database sounds like a good idea in general, so I consider setting `-p 1` there a good idea regardless of this PR.
(There's still the issue that we probably want to be able to perform these database queries in parallel at some point, but that's a story for another day and PR.)
@Al2Klimov Feel free to adopt this commit in this PR, that will save you a rebase (otherwise, I'd create it as an independent PR).
Apart from https://github.com/Icinga/icinga-notifications/pull/203#discussion_r1677993630, this looks fine to me. I don't know exactly why you replaced `text` with `mediumtext` throughout the schema, even though it's not necessary in most cases.
One last thought regarding https://github.com/Icinga/icinga-notifications/pull/203#issuecomment-2232917159:
I just had the idea to explicitly use index hints and gave it a last try. First, I added another `INDEX` identical to the primary key.

```sql
CREATE INDEX idx_object_extra_tag_object_id_tag ON object_extra_tag(object_id, tag);
```
Then, I crafted a `DELETE` statement using this index, which requires the quirky multi-table `DELETE` syntax:
```
MariaDB [notifications]> explain delete object_extra_tag.* from object_extra_tag use index(idx_object_extra_tag_object_id_tag) where object_id = 0x00D1C60C34E2B0ADF413EAAFCF31D29CE23A632B57FBC3E7F30312A3F8343571\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: object_extra_tag
         type: ref
possible_keys: idx_object_extra_tag_object_id_tag
          key: idx_object_extra_tag_object_id_tag
      key_len: 32
          ref: const
         rows: 1
        Extra: Using where
1 row in set (0.000 sec)
```
As one can see, this query would now be of the `ref` type. Not `const`, but something[^0]. Thus, I updated Icinga Notifications, only to run into the same deadlock again. After manually debugging the query and locks[^1], I found out that even this `ref` query creates a next-key lock, potentially hitting the supremum. So much for that.
```diff
diff --git a/internal/object/object.go b/internal/object/object.go
index 9e54253..be0591e 100644
--- a/internal/object/object.go
+++ b/internal/object/object.go
@@ -114,8 +114,22 @@ func FromEvent(ctx context.Context, db *database.DB, ev *event.Event) (*Object,
 		return nil, fmt.Errorf("failed to upsert object id tags: %w", err)
 	}
 
+	if tx.DriverName() == database.MySQL {
+		// Without referring an index, the query will choose the suboptimal "range" join type query strategy. Doing so
+		// will result in a next-key lock, resulting in a potential deadlock.
+		// https://github.com/Icinga/icinga-notifications/pull/203#issuecomment-2232740619
+		//
+		// Note: MySQL/MariaDB only supports indexes for DELETEs when using multi-table statements.
+		// https://dev.mysql.com/doc/refman/8.0/en/index-hints.html
+		stmt = `DELETE object_extra_tag.* FROM "object_extra_tag" USE INDEX (idx_object_extra_tag_object_id_tag)`
+	} else {
+		// PostgreSQL just does the right thing by default :)
+		stmt = `DELETE FROM "object_extra_tag"`
+	}
+	stmt += ` WHERE "object_id" = :object_id`
+
 	extraTag := &ExtraTagRow{ObjectId: newObject.ID}
-	_, err = tx.NamedExecContext(ctx, `DELETE FROM "object_extra_tag" WHERE "object_id" = :object_id`, extraTag)
+	_, err = tx.NamedExecContext(ctx, stmt, extraTag)
 	if err != nil {
 		return nil, fmt.Errorf("failed to delete object extra tags: %w", err)
 	}
diff --git a/schema/mysql/schema.sql b/schema/mysql/schema.sql
index 37040cf..8810f14 100644
--- a/schema/mysql/schema.sql
+++ b/schema/mysql/schema.sql
@@ -269,6 +269,10 @@ CREATE TABLE object_extra_tag (
     CONSTRAINT fk_object_extra_tag_object FOREIGN KEY (object_id) REFERENCES object(id)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;
 
+-- This index is identical to the primary key, but is allowed to be used as an index hint in both MySQL and MariaDB.
+-- Using it for DELETE statements optimizes the query's "join type" from "range" to "ref".
+CREATE INDEX idx_object_extra_tag_object_id_tag ON object_extra_tag(object_id, tag);
+
 CREATE TABLE event (
     id bigint NOT NULL AUTO_INCREMENT,
     time bigint NOT NULL,
```
[^0]: Creating an index just for `object_id` without `key` does not change a thing.

[^1]: Enabling the InnoDB Monitor and Lock Monitor does help and shows more output in `SHOW ENGINE INNODB STATUS`, https://mariadb.com/kb/en/innodb-monitors/.
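For anyone wanting to repeat the lock debugging: enabling those monitors is just a matter of two server variables (they reset on server restart):

```sql
-- Enable the standard InnoDB Monitor and the Lock Monitor so that
-- SHOW ENGINE INNODB STATUS includes per-transaction lock details.
SET GLOBAL innodb_status_output = ON;
SET GLOBAL innodb_status_output_locks = ON;
```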
> First, I have added another `INDEX` being identical to the primary key.

I'm not entirely sure what the purpose of that second identical index would have been here. Did it serve some purpose, or did you just miss the following sentence in the documentation? :sweat_smile:

> To refer to a primary key, use the name `PRIMARY`.
Honestly, I was not quite sure if this would work in MariaDB, thus I wanted to be certain. For example, MariaDB's "documentation" does not mention `USE KEY` or keys at all: https://mariadb.com/kb/en/use-index/, https://mariadb.com/kb/en/index-hints-how-to-force-query-plans/.
I just tested it, and both `USE KEY (PRIMARY)` and `USE INDEX (PRIMARY)` do work on MariaDB. However, it "works" as well as the duplicate index does: not very well.
fixes #154