Icinga / icingadb

Icinga configuration and state database supporting multiple environments
https://icinga.com
GNU General Public License v2.0

"Aborted initial state sync" / "Aborted config sync" because of "Lock wait timeout exceeded" #820

Open oxzi opened 3 days ago

oxzi commented 3 days ago

Describe the bug

In a bigger Icinga setup, as built by @yhabteab last week, Icinga DB may abort during the initial state and config sync due to an exceeded lock wait timeout. The setup consists of two Icinga DB instances running in HA mode.

config-sync: Aborted initial state sync after 5m37.981933179s
config-sync: Aborted config sync after 5m37.982500933s
Error 1205 (HY000): Lock wait timeout exceeded; try restarting transaction
can't perform "INSERT INTO \"service_state\" (\"soft_state\", \"last_update\", \"id\", \"last_comment_id\", \"check_commandline>
github.com/icinga/icinga-go-library/database.CantPerformQuery
        /Users/yhabteab/Workspace/go/icinga-go-library/database/utils.go:15
github.com/icinga/icinga-go-library/database.(*DB).NamedBulkExec.func1.(*DB).NamedBulkExec.func1.1.2.1
        /Users/yhabteab/Workspace/go/icinga-go-library/database/db.go:505
github.com/icinga/icinga-go-library/retry.WithBackoff
        /Users/yhabteab/Workspace/go/icinga-go-library/retry/retry.go:64
github.com/icinga/icinga-go-library/database.(*DB).NamedBulkExec.func1.(*DB).NamedBulkExec.func1.1.2
        /Users/yhabteab/Workspace/go/icinga-go-library/database/db.go:500
golang.org/x/sync/errgroup.(*Group).Go.func1
        /Users/yhabteab/Workspace/go/pkg/mod/golang.org/x/sync@v0.8.0/errgroup/errgroup.go:78
runtime.goexit
        /opt/homebrew/opt/go/libexec/src/runtime/asm_amd64.s:1700
retry deadline exceeded
github.com/icinga/icinga-go-library/retry.WithBackoff
        /Users/yhabteab/Workspace/go/icinga-go-library/retry/retry.go:99
github.com/icinga/icinga-go-library/database.(*DB).NamedBulkExec.func1.(*DB).NamedBulkExec.func1.1.2
        /Users/yhabteab/Workspace/go/icinga-go-library/database/db.go:500
golang.org/x/sync/errgroup.(*Group).Go.func1
        /Users/yhabteab/Workspace/go/pkg/mod/golang.org/x/sync@v0.8.0/errgroup/errgroup.go:78
runtime.goexit
        /opt/homebrew/opt/go/libexec/src/runtime/asm_amd64.s:1700

The Icinga DB version in use was, afaik, the latest release, v1.2.0, with #800 cherry-picked on top. Please note that this error is not related to HA; furthermore, no two instances were active at the same time.
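
The trace above appears to show the bulk INSERT being retried by retry.WithBackoff until the retry deadline is hit, at which point the whole sync is aborted. For illustration, here is a minimal sketch of that general retry-until-deadline pattern in plain Go; the function name, timings, and errors are made up and this is not the icinga-go-library API:

package main

import (
    "context"
    "errors"
    "fmt"
    "time"
)

// retryWithDeadline keeps calling op with (capped) exponential backoff until
// it succeeds or the context expires. This only mirrors the general shape of
// retry.WithBackoff, not its actual signature or behavior.
func retryWithDeadline(ctx context.Context, op func(context.Context) error) error {
    backoff := 100 * time.Millisecond
    for {
        err := op(ctx)
        if err == nil {
            return nil
        }
        select {
        case <-ctx.Done():
            // Comparable to the "retry deadline exceeded" in the trace above.
            return fmt.Errorf("retry deadline exceeded: %w", err)
        case <-time.After(backoff):
        }
        if backoff < 5*time.Second {
            backoff *= 2
        }
    }
}

func main() {
    // The real deadline seems to be around five minutes; shortened here so
    // the example finishes quickly. Every attempt fails with a lock wait
    // timeout, as in the logs above.
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()

    err := retryWithDeadline(ctx, func(context.Context) error {
        return errors.New("Error 1205 (HY000): Lock wait timeout exceeded; try restarting transaction")
    })
    fmt.Println(err)
}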

Detailed logs for this run were acquired by some version of:

$ journalctl -u icingadb -o json | jq '._SOURCE_REALTIME_TIMESTAMP |= (tonumber | . / 1000000 | todate) | {"PRIORITY", "_SOURCE_REALTIME_TIMESTAMP", "MESSAGE", "ICINGADB_REASON", "ICINGADB_ERROR", "ICINGADB_RECOVERED_ERROR", "ICINGADB_ATTEMPTS", "ICINGADB_AFTER"} | del(..|nulls)'

Please note that the first node crashed, but the other one then took over and worked without any issues. This happened three times last Friday, but, of course, today I was not able to reproduce it.

To Reproduce

  1. Create a huge Icinga environment, as documented (internally) by @yhabteab.
  2. ???
  3. Profit!

Expected behavior

Icinga DB should not crash.

Your Environment

Include as many relevant details about the environment you experienced the problem in

Additional context

N/A

oxzi commented 3 days ago

Oh wait, the prior errors logged as "database: Can't execute query. Retrying" are in part also PRIMARY KEY conflicts. For example:

"ICINGADB_ERROR": "can't perform \"INSERT INTO \\"service_state\\" (\\"soft_state\\", \\"last_update\\", \\"id\\", \\"last_comment_id\\", \\"check_commandline\\", \\"last_state_change\\", \\"severity\\", \\"check_attempt\\", \\"is_handled\\", \\"is_problem\\", \\"next_update\\", \\"execution_time\\", \\"is_acknowledged\\", \\"hard_state\\", \\"output\\", \\"state_type\\", \\"host_id\\", \\"scheduling_source\\", \\"latency\\", \\"previous_hard_state\\", \\"check_timeout\\", \\"acknowledgement_comment_id\\", \\"next_check\\", \\"is_flapping\\", \\"environment_id\\", \\"in_downtime\\", \\"is_reachable\\", \\"long_output\\", \\"properties_checksum\\", \\"performance_data\\", \\"normalized_performance_data\\", \\"previous_soft_state\\", \\"service_id\\", \\"check_source\\") VALUES (:soft_state, :last_update, :id, :last_comment_id, :check_commandline, :last_state_change, :severity, :check_attempt, :is_handled, :is_problem, :next_update, :execution_time, :is_acknowledged, :hard_state, :output, :state_type, :host_id, :scheduling_source, :latency, :previous_hard_state, :check_timeout, :acknowledgement_comment_id, :next_check, :is_flapping, :environment_id, :in_downtime, :is_reachable, :long_output, :properties_checksum, :performance_data, :normalized_performance_data, :previous_soft_state, :service_id, :check_source)\": Error 1062 (23000): Duplicate entry '\x8Er\xF3\xD4\xAD \xCB\xC1\x94\xDAr9\x0C\xE6\xEA:\xEA\xACw2' for key 'service_state.PRIMARY'"

"ICINGADB_ERROR": "can't perform \"INSERT INTO \\"downtime\\" (\\"host_id\\", \\"scheduled_end_time\\", \\"start_time\\", \\"end_time\\", \\"scheduled_start_time\\", \\"zone_id\\", \\"parent_id\\", \\"author\\", \\"comment\\", \\"scheduled_duration\\", \\"object_type\\", \\"service_id\\", \\"flexible_duration\\", \\"duration\\", \\"name\\", \\"id\\", \\"entry_time\\", \\"is_in_effect\\", \\"scheduled_by\\", \\"properties_checksum\\", \\"triggered_by_id\\", \\"is_flexible\\", \\"environment_id\\", \\"name_checksum\\") VALUES (:host_id, :scheduled_end_time, :start_time, :end_time, :scheduled_start_time, :zone_id, :parent_id, :author, :comment, :scheduled_duration, :object_type, :service_id, :flexible_duration, :duration, :name, :id, :entry_time, :is_in_effect, :scheduled_by, :properties_checksum, :triggered_by_id, :is_flexible, :environment_id, :name_checksum)\": Error 1062 (23000): Duplicate entry '\xEAq'&\xA2v\xD4!\xD5\xCE=\xC5\xB7z\x0F0\xD7\x14\x8D\xD2' for key 'downtime.PRIMARY'"

"ICINGADB_ERROR": "can't perform \"INSERT INTO \\"downtime\\" (\\"host_id\\", \\"scheduled_end_time\\", \\"start_time\\", \\"end_time\\", \\"scheduled_start_time\\", \\"zone_id\\", \\"parent_id\\", \\"author\\", \\"comment\\", \\"scheduled_duration\\", \\"object_type\\", \\"service_id\\", \\"flexible_duration\\", \\"duration\\", \\"name\\", \\"id\\", \\"entry_time\\", \\"is_in_effect\\", \\"scheduled_by\\", \\"properties_checksum\\", \\"triggered_by_id\\", \\"is_flexible\\", \\"environment_id\\", \\"name_checksum\\") VALUES (:host_id, :scheduled_end_time, :start_time, :end_time, :scheduled_start_time, :zone_id, :parent_id, :author, :comment, :scheduled_duration, :object_type, :service_id, :flexible_duration, :duration, :name, :id, :entry_time, :is_in_effect, :scheduled_by, :properties_checksum, :triggered_by_id, :is_flexible, :environment_id, :name_checksum)\": Error 1062 (23000): Duplicate entry '\xF6\xF6U'\xEC\x7FW\xCC\xF4\x0F5\x9BgGJ\xCB[oIz' for key 'downtime.PRIMARY'"

"ICINGADB_ERROR": "can't perform \"INSERT INTO \\"host_customvar\\" (\\"host_id\\", \\"customvar_id\\", \\"environment_id\\", \\"id\\") VALUES (:host_id, :customvar_id, :environment_id, :id)\": Error 1062 (23000): Duplicate entry '\xF0\xBAyn\x8F\x9F\xE52:\xD4\xBE-\x86\xD8O\xE2\x03\x05\xD1\xC0' for key 'host_customvar.PRIMARY'"

"ICINGADB_ERROR": "can't perform \"INSERT INTO \\"downtime\\" (\\"host_id\\", \\"scheduled_end_time\\", \\"start_time\\", \\"end_time\\", \\"scheduled_start_time\\", \\"zone_id\\", \\"parent_id\\", \\"author\\", \\"comment\\", \\"scheduled_duration\\", \\"object_type\\", \\"service_id\\", \\"flexible_duration\\", \\"duration\\", \\"name\\", \\"id\\", \\"entry_time\\", \\"is_in_effect\\", \\"scheduled_by\\", \\"properties_checksum\\", \\"triggered_by_id\\", \\"is_flexible\\", \\"environment_id\\", \\"name_checksum\\") VALUES (:host_id, :scheduled_end_time, :start_time, :end_time, :scheduled_start_time, :zone_id, :parent_id, :author, :comment, :scheduled_duration, :object_type, :service_id, :flexible_duration, :duration, :name, :id, :entry_time, :is_in_effect, :scheduled_by, :properties_checksum, :triggered_by_id, :is_flexible, :environment_id, :name_checksum)\": Error 1062 (23000): Duplicate entry '\xF1\x9B\x15zc'C\xEA\xDEF\xCBA\xAB\x82y!(\xACI\xE3' for key 'downtime.PRIMARY'"

As the other Icinga DB instance is HA-passive - as the logs show - it should not be responsible for primary key constraint violations in those tables. Can this be due to a prior, aborted sync, or is something racing itself? The latter would be supported by the multiple "Deadlock found when trying to get lock; try restarting transaction" errors.

By the way, all 35 occurrences of "Error 1213 (40001): Deadlock found when trying to get lock; try restarting transaction" happened for INSERTs into service_state.

oxzi commented 2 days ago

After stopping both Icinga DB nodes, truncating the service_state table and restarting the nodes, the active node complained multiple times about an "Error 1213 (40001): Deadlock found when trying to get lock; try restarting transaction". However, these errors were recoverable and Icinga DB survived.

I just went over the whole synchronization code and am now more confused than before: how could the primary key constraint be violated from one single node? The delta is calculated based on what is currently within the SQL database and within Redis. If something is part of both, it shouldn't be in delta.Create.
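
To spell out my reading of it, here is a heavily simplified sketch of how I understand the delta computation; this is my own approximation and not the actual icingadb code, which works on streamed entities and compares properties checksums:

package main

import "fmt"

// checksum stands in for the properties_checksum compared during sync.
type checksum string

// computeDelta partitions entity IDs into create/update/delete sets:
// only in Redis -> create, in both but changed -> update, only in SQL -> delete.
// Simplified sketch under my assumptions, not the real implementation.
func computeDelta(redis, sql map[string]checksum) (create, update, del []string) {
    for id, redisSum := range redis {
        sqlSum, inSQL := sql[id]
        switch {
        case !inSQL:
            create = append(create, id)
        case sqlSum != redisSum:
            update = append(update, id)
        }
    }
    for id := range sql {
        if _, inRedis := redis[id]; !inRedis {
            del = append(del, id)
        }
    }
    return
}

func main() {
    redis := map[string]checksum{"a": "1", "b": "2"}
    sql := map[string]checksum{"b": "1", "c": "3"}
    create, update, del := computeDelta(redis, sql)
    // An ID present in both stores can never land in create with this logic,
    // which is why duplicate key errors from a single node are so puzzling.
    fmt.Println(create, update, del) // [a] [b] [c]
}

Under that reading, an INSERT could only hit a duplicate key if the row showed up in the table after the delta was computed, e.g. left over from a prior, aborted sync or written concurrently.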

The logs from the crashed node posted above contain:

config-sync: Inserting 245858 items of type service state
config-sync: Updating 39 items of type service state

At the moment (which may not be comparable), the system has the following state:

127.0.0.1:6380> HLEN icinga:service:state
(integer) 307538

mysql> select count(*) from service_state\G
*************************** 1. row ***************************
count(*): 307538
1 row in set (0.03 sec)

Under the assumption that this number hasn't changed, there might have been a race between a flushed Redis being populated by Icinga 2 while being read by Icinga DB. Unless I have missed something, there is no code preventing this case. As evidence, there were multiple occurrences of FLUSHALL in the ~/.rediscli_history file.

oxzi commented 2 days ago

there might have been a race between a flushed Redis being populated by Icinga 2 while being read by Icinga DB

Oh wait, that would be the wrong way round. If something is missing in Redis, but available in the SQL database, it would be set as delta.Delete and not .Create.

I tried reproducing the behavior with some FLUSHALLs, but this did not work out. However, I was able to crash Icinga DB in another way when flushing during the sync, but this is fine, I guess.

redis: HMGET icinga:downtime: field "4bd67fe5e10611a5c312ba1b05d8cfe6d7317662" missing
redis: HMGET icinga:downtime: field "5aacf4988f2686df77cc5c5660855d2deafd1185" missing
redis: HMGET icinga:downtime: field "6e97e19d87853b00477c575a2ed25c0c6204ac2e" missing
redis: HMGET icinga:downtime: field "8d77e6661693adba73c4802c98a0bff07ad07126" missing
redis: HMGET icinga:downtime: field "d3537844bfa9e7b4027c106d5ade0eb9c02720d8" missing
redis: HMGET icinga:downtime: field "d7ae5700b04633be4ccd6d52daa397fbdc1d1c14" missing
redis: HMGET icinga:downtime: field "0ce73ba7968ec4350c7a7d0ca4b00bc811336356" missing
redis: HMGET icinga:downtime: field "2c0bbf8054855095870030ce9a92bfaefea7a1fc" missing
redis: HMGET icinga:downtime: field "3d028f5662ddf7c3a9152c6bef4461a57eb993cb" missing
redis: HMGET icinga:downtime: field "4924e173f3ed5ea0001ef275818e85c3e827cccd" missing
redis: HMGET icinga:downtime: field "6444a66dbc37c526d03c2bc2e921cf6bf0ca6c92" missing
redis: HMGET icinga:downtime: field "bc4652319400f81d232884583cd6f96173146d14" missing
redis: HMGET icinga:downtime: field "c658c0430fa9e59a8e1b7b452c2782c299605e2b" missing
redis: Finished fetching from icinga:downtime with 0 items in 367.842935ms
database: Finished executing "SELECT \"properties_checksum\", \"id\" FROM \"host_state\" WHERE \"environment_id\" = :environment_id" with 15944>
database: Finished executing "UPDATE \"host_state\" SET \"is_overdue\" = :is_overdue, \"id\" = :id WHERE id = :id" with 6207 rows in 5.41943569>
database: Finished executing "SELECT \"id\" FROM \"host_customvar\" WHERE \"environment_id\" = :environment_id" with 125000 rows in 3.172386254s
redis: Finished fetching from icinga:host:customvar with 249905 items in 3.172477344s
config-sync: Aborted config sync after 8.016455329s
config-sync: Aborted initial state sync after 8.016473497s
can't copy from closed channel
github.com/icinga/icinga-go-library/com.CopyFirst[...]
        /Users/yhabteab/Workspace/go/icinga-go-library/com/com.go:64
github.com/icinga/icinga-go-library/database.(*DB).UpsertStreamed
        /Users/yhabteab/Workspace/go/icinga-go-library/database/db.go:708
github.com/icinga/icingadb/pkg/icingadb.Sync.ApplyDelta.func2
        /Users/yhabteab/Workspace/go/icingadb/pkg/icingadb/sync.go:158
golang.org/x/sync/errgroup.(*Group).Go.func1
        /Users/yhabteab/Workspace/go/pkg/mod/golang.org/x/sync@v0.8.0/errgroup/errgroup.go:78
runtime.goexit
        /opt/homebrew/opt/go/libexec/src/runtime/asm_amd64.s:1700
can't copy first entity
github.com/icinga/icinga-go-library/database.(*DB).UpsertStreamed
        /Users/yhabteab/Workspace/go/icinga-go-library/database/db.go:710
github.com/icinga/icingadb/pkg/icingadb.Sync.ApplyDelta.func2
        /Users/yhabteab/Workspace/go/icingadb/pkg/icingadb/sync.go:158
golang.org/x/sync/errgroup.(*Group).Go.func1
        /Users/yhabteab/Workspace/go/pkg/mod/golang.org/x/sync@v0.8.0/errgroup/errgroup.go:78
runtime.goexit
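
The "can't copy from closed channel" error looks like a downstream stage asking for the first entity from an input channel that was already closed because the surrounding sync had been aborted. A minimal sketch of that general pattern follows; the names are mine and this is not the actual com.CopyFirst implementation:

package main

import (
    "context"
    "errors"
    "fmt"

    "golang.org/x/sync/errgroup"
)

// copyFirst mimics the general idea of reading the first entity from a stream:
// if the producer has already closed the channel (e.g. its Redis fetch was
// aborted), there is nothing left to copy and an error is returned.
func copyFirst(ctx context.Context, in <-chan string) (string, error) {
    select {
    case <-ctx.Done():
        return "", ctx.Err()
    case v, ok := <-in:
        if !ok {
            return "", errors.New("can't copy from closed channel")
        }
        return v, nil
    }
}

func main() {
    g, ctx := errgroup.WithContext(context.Background())

    entities := make(chan string)
    close(entities) // the producer gave up before sending anything

    g.Go(func() error {
        first, err := copyFirst(ctx, entities)
        if err != nil {
            return err
        }
        fmt.Println("first entity:", first)
        return nil
    })

    fmt.Println(g.Wait()) // prints: can't copy from closed channel
}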

Another observation: during experiments with the other node stopped, no SQL deadlock errors occurred. When I wanted to verify this, I re-ran the same scenario with both nodes active, and again no deadlocks occurred.