Icinga / icingadb

Icinga configuration and state database supporting multiple environments
https://icinga.com
GNU General Public License v2.0
59 stars 21 forks source link

Crash due to MySQL deadlock in retention query #485

Closed julianbrost closed 2 years ago

julianbrost commented 2 years ago

While testing #484 (I was using 2532d06e0ae92547f1019a49e3fc47558a9f53c3 to be precise), I was randomly greeted by the following crash. As that PR does nothing with retention at all, I presume the same may happen on the current master as well.

2022-05-16T16:25:52.784+0200    FATAL   icingadb    Error 1213: Deadlock found when trying to get lock; try restarting transaction
can't perform "DELETE FROM sla_history_state WHERE environment_id = :environment_id AND event_time < :time\nORDER BY event_time LIMIT 5000"
github.com/icinga/icingadb/internal.CantPerformQuery
    /home/jbrost/dev/icingadb/internal/internal.go:30
github.com/icinga/icingadb/pkg/icingadb.(*DB).CleanupOlderThan
    /home/jbrost/dev/icingadb/pkg/icingadb/cleanup.go:51
github.com/icinga/icingadb/pkg/icingadb/history.(*Retention).Start.func1
    /home/jbrost/dev/icingadb/pkg/icingadb/history/retention.go:188
github.com/icinga/icingadb/pkg/periodic.Start.func1
    /home/jbrost/dev/icingadb/pkg/periodic/periodic.go:78
runtime.goexit
    /usr/lib/go/src/runtime/asm_amd64.s:1571
exit status 1

Not sure why that happened or how to reproduce it, but that shouldn't bring down the process for sure.

julianbrost commented 2 years ago

I'm able to reproduce this somewhat reliably with the following steps:

  1. Stop Icinga DB
  2. Generate some history events (I waited for XLEN icinga:history:stream:state to be about 1000, should contain enough hard state changes so that it writes to the SLA table)
  3. Start Icinga DB with the following retention config:
    retention:
      sla-days: 400
      interval: 1ns
lippserd commented 2 years ago

@julianbrost Can you please run SHOW ENGINE INNODB STATUS to get more information about the deadlock?

julianbrost commented 2 years ago
MariaDB [icingadb]> SHOW ENGINE INNODB STATUS \G
*************************** 1. row ***************************
  Type: InnoDB
  Name: 
Status: 
=====================================
2022-06-13 12:12:26 0x7fb92c4e5700 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 6 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 0 srv_active, 0 srv_shutdown, 1754 srv_idle
srv_master_thread log flush and writes: 1754
----------
SEMAPHORES
----------
------------------------
LATEST DETECTED DEADLOCK
------------------------
2022-06-13 12:12:25 0x7fb918152700
*** (1) TRANSACTION:
TRANSACTION 99495, ACTIVE 0 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 3 lock struct(s), heap size 1128, 2 row lock(s), undo log entries 41
MariaDB thread id 47, OS thread handle 140432949716736, query id 297591 172.24.0.1 icingadb-test Update
INSERT INTO "sla_history_state" ("environment_id", "id", "event_time", "previous_hard_state", "endpoint_id", "object_type", "host_id", "service_id", "hard_state") VALUES (?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,
*** WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 63 page no 5 n bits 128 index PRIMARY of table "icingadb"."sla_history_state" trx id 99495 lock_mode X locks gap before rec insert intention waiting
Record lock, heap no 3 PHYSICAL RECORD: n_fields 11; compact format; info bits 0
 0: len 20; hex 008806e58c5b9e40660e0db068ac610055207d58; asc      [ @f   h a U }X;;
 1: len 6; hex 000000000000; asc       ;;
 2: len 7; hex 80000000000000; asc        ;;
 3: len 20; hex 6cc4013ece4ffdbb275a4a763460507d17203146; asc l  > O  'ZJv4`P}  1F;;
 4: len 20; hex 2bea57085609682477bde443bff3e4081558cd15; asc + W V h$w  C     X  ;;
 5: len 1; hex 02; asc  ;;
 6: len 20; hex 1b74f41cfbd98af27a2c59e544c54b30814837aa; asc  t      z,Y D K0 H7 ;;
 7: len 20; hex fce7d8b6f0543ae57219c8df5dfbf18d481a49ec; asc      T: r   ]   H I ;;
 8: len 8; hex 000001815ce0cf20; asc     \   ;;
 9: len 1; hex 00; asc  ;;
 10: len 1; hex 02; asc  ;;

*** CONFLICTING WITH:
RECORD LOCKS space id 63 page no 5 n bits 128 index PRIMARY of table "icingadb"."sla_history_state" trx id 99497 lock_mode X
Record lock, heap no 2 PHYSICAL RECORD: n_fields 11; compact format; info bits 0
 0: len 20; hex 0027dc5609266f29318a613799f518f99cbe13ce; asc  ' V &o)1 a7        ;;
 1: len 6; hex 000000000000; asc       ;;
 2: len 7; hex 80000000000000; asc        ;;
 3: len 20; hex 6cc4013ece4ffdbb275a4a763460507d17203146; asc l  > O  'ZJv4`P}  1F;;
 4: len 20; hex 2bea57085609682477bde443bff3e4081558cd15; asc + W V h$w  C     X  ;;
 5: len 1; hex 02; asc  ;;
 6: len 20; hex 1b74f41cfbd98af27a2c59e544c54b30814837aa; asc  t      z,Y D K0 H7 ;;
 7: len 20; hex 9fcfc38de65faf92c2db15f571ff58a747f86d0f; asc      _      q X G m ;;
 8: len 8; hex 000001815ce0b917; asc     \   ;;
 9: len 1; hex 00; asc  ;;
 10: len 1; hex 02; asc  ;;

Record lock, heap no 3 PHYSICAL RECORD: n_fields 11; compact format; info bits 0
 0: len 20; hex 008806e58c5b9e40660e0db068ac610055207d58; asc      [ @f   h a U }X;;
 1: len 6; hex 000000000000; asc       ;;
 2: len 7; hex 80000000000000; asc        ;;
 3: len 20; hex 6cc4013ece4ffdbb275a4a763460507d17203146; asc l  > O  'ZJv4`P}  1F;;
 4: len 20; hex 2bea57085609682477bde443bff3e4081558cd15; asc + W V h$w  C     X  ;;
 5: len 1; hex 02; asc  ;;
 6: len 20; hex 1b74f41cfbd98af27a2c59e544c54b30814837aa; asc  t      z,Y D K0 H7 ;;
 7: len 20; hex fce7d8b6f0543ae57219c8df5dfbf18d481a49ec; asc      T: r   ]   H I ;;
 8: len 8; hex 000001815ce0cf20; asc     \   ;;
 9: len 1; hex 00; asc  ;;
 10: len 1; hex 02; asc  ;;

Record lock, heap no 4 PHYSICAL RECORD: n_fields 11; compact format; info bits 0
 0: len 20; hex 01854f209355f7974779acc437c3f2d8879258d7; asc   O  U  Gy  7     X ;;
 1: len 6; hex 000000000000; asc       ;;
 2: len 7; hex 80000000000000; asc        ;;
 3: len 20; hex 6cc4013ece4ffdbb275a4a763460507d17203146; asc l  > O  'ZJv4`P}  1F;;
 4: len 20; hex 2bea57085609682477bde443bff3e4081558cd15; asc + W V h$w  C     X  ;;
 5: len 1; hex 02; asc  ;;
 6: len 20; hex 1b74f41cfbd98af27a2c59e544c54b30814837aa; asc  t      z,Y D K0 H7 ;;
 7: len 20; hex 12ec27868959d05a000f5942bbcc5a66f49f7f6f; asc   '  Y Z  YB  Zf   o;;
 8: len 8; hex 000001815ce0cab5; asc     \   ;;
 9: len 1; hex 00; asc  ;;
 10: len 1; hex 02; asc  ;;

*** (2) TRANSACTION:
TRANSACTION 99497, ACTIVE 0 sec fetching rows
mysql tables in use 1, locked 1
LOCK WAIT 3 lock struct(s), heap size 1128, 4 row lock(s)
MariaDB thread id 45, OS thread handle 140433288701696, query id 297631 172.24.0.1 icingadb-test init for update
DELETE FROM sla_history_state WHERE environment_id = ? AND event_time < ?
ORDER BY event_time LIMIT 5000
*** WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 63 page no 5 n bits 128 index PRIMARY of table "icingadb"."sla_history_state" trx id 99497 lock_mode X waiting
Record lock, heap no 59 PHYSICAL RECORD: n_fields 11; compact format; info bits 0
 0: len 20; hex 04b6a7f3357101a31a7bb8fd366c73a421b012fa; asc     5q   {  6ls !   ;;
 1: len 6; hex 0000000184a7; asc       ;;
 2: len 7; hex 9e000001530523; asc     S #;;
 3: len 20; hex 6cc4013ece4ffdbb275a4a763460507d17203146; asc l  > O  'ZJv4`P}  1F;;
 4: len 20; hex 2bea57085609682477bde443bff3e4081558cd15; asc + W V h$w  C     X  ;;
 5: len 1; hex 02; asc  ;;
 6: len 20; hex 1b74f41cfbd98af27a2c59e544c54b30814837aa; asc  t      z,Y D K0 H7 ;;
 7: len 20; hex 16017521506035ace937bfed360a9186d9490f5c; asc   u!P`5  7  6    I \;;
 8: len 8; hex 000001815ce1da81; asc     \   ;;
 9: len 1; hex 02; asc  ;;
 10: len 1; hex 00; asc  ;;

*** CONFLICTING WITH:
RECORD LOCKS space id 63 page no 5 n bits 128 index PRIMARY of table "icingadb"."sla_history_state" trx id 99495 lock_mode X locks rec but not gap
Record lock, heap no 59 PHYSICAL RECORD: n_fields 11; compact format; info bits 0
 0: len 20; hex 04b6a7f3357101a31a7bb8fd366c73a421b012fa; asc     5q   {  6ls !   ;;
 1: len 6; hex 0000000184a7; asc       ;;
 2: len 7; hex 9e000001530523; asc     S #;;
 3: len 20; hex 6cc4013ece4ffdbb275a4a763460507d17203146; asc l  > O  'ZJv4`P}  1F;;
 4: len 20; hex 2bea57085609682477bde443bff3e4081558cd15; asc + W V h$w  C     X  ;;
 5: len 1; hex 02; asc  ;;
 6: len 20; hex 1b74f41cfbd98af27a2c59e544c54b30814837aa; asc  t      z,Y D K0 H7 ;;
 7: len 20; hex 16017521506035ace937bfed360a9186d9490f5c; asc   u!P`5  7  6    I \;;
 8: len 8; hex 000001815ce1da81; asc     \   ;;
 9: len 1; hex 02; asc  ;;
 10: len 1; hex 00; asc  ;;

*** WE ROLL BACK TRANSACTION (0)
------------
TRANSACTIONS
------------
Trx id counter 99523
Purge done for trx's n:o < 99521 undo n:o < 255 state: running
History list length 32
LIST OF TRANSACTIONS FOR EACH SESSION:
--------
FILE I/O
--------
Pending flushes (fsync) log: 0; buffer pool: 0
172 OS file reads, 291 OS file writes, 1289 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 3.33 writes/s, 45.66 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 1, free list len 0, seg size 2, 0 merges
merged operations:
 insert 0, delete mark 0, delete 0
discarded operations:
 insert 0, delete mark 0, delete 0
0.00 hash searches/s, 0.00 non-hash searches/s
---
LOG
---
Log sequence number 18737471
Log flushed up to   18737471
Pages flushed up to 42197
Last checkpoint at  42185
0 pending log flushes, 0 pending chkp writes
293 log i/o's done, 3.33 log i/o's/second
----------------------
BUFFER POOL AND MEMORY
----------------------
Total large memory allocated 167772160
Dictionary memory allocated 994744
Buffer pool size   8112
Free buffers       6615
Database pages     1497
Old database pages 532
Modified db pages  1357
Percent of dirty pages(LRU & free pages): 16.726
Max dirty pages percent: 90.000
Pending reads 0
Pending writes: LRU 0, flush list 0
Pages made young 0, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 158, created 1339, written 0
0.00 reads/s, 74.49 creates/s, 0.00 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 1497, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
--------------
ROW OPERATIONS
--------------
0 read views open inside InnoDB
Process ID=0, Main thread ID=0, state: sleeping
Number of rows inserted 26572, updated 526, deleted 1, read 8044688
2468.92 inserts/s, 86.32 updates/s, 0.00 deletes/s, 50662.72 reads/s
Number of system rows inserted 0, updated 0, deleted 0, read 0
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
----------------------------
END OF INNODB MONITOR OUTPUT
============================

1 row in set (0.000 sec)