oxzi opened 3 days ago
Oh wait, the prior errors logged as `database: Can't execute query. Retrying`
are also partially PRIMARY KEY conflicts. For example:
"ICINGADB_ERROR": "can't perform \"INSERT INTO \\"service_state\\" (\\"soft_state\\", \\"last_update\\", \\"id\\", \\"last_comment_id\\", \\"check_commandline\\", \\"last_state_change\\", \\"severity\\", \\"check_attempt\\", \\"is_handled\\", \\"is_problem\\", \\"next_update\\", \\"execution_time\\", \\"is_acknowledged\\", \\"hard_state\\", \\"output\\", \\"state_type\\", \\"host_id\\", \\"scheduling_source\\", \\"latency\\", \\"previous_hard_state\\", \\"check_timeout\\", \\"acknowledgement_comment_id\\", \\"next_check\\", \\"is_flapping\\", \\"environment_id\\", \\"in_downtime\\", \\"is_reachable\\", \\"long_output\\", \\"properties_checksum\\", \\"performance_data\\", \\"normalized_performance_data\\", \\"previous_soft_state\\", \\"service_id\\", \\"check_source\\") VALUES (:soft_state, :last_update, :id, :last_comment_id, :check_commandline, :last_state_change, :severity, :check_attempt, :is_handled, :is_problem, :next_update, :execution_time, :is_acknowledged, :hard_state, :output, :state_type, :host_id, :scheduling_source, :latency, :previous_hard_state, :check_timeout, :acknowledgement_comment_id, :next_check, :is_flapping, :environment_id, :in_downtime, :is_reachable, :long_output, :properties_checksum, :performance_data, :normalized_performance_data, :previous_soft_state, :service_id, :check_source)\": Error 1062 (23000): Duplicate entry '\x8Er\xF3\xD4\xAD \xCB\xC1\x94\xDAr9\x0C\xE6\xEA:\xEA\xACw2' for key 'service_state.PRIMARY'"
"ICINGADB_ERROR": "can't perform \"INSERT INTO \\"downtime\\" (\\"host_id\\", \\"scheduled_end_time\\", \\"start_time\\", \\"end_time\\", \\"scheduled_start_time\\", \\"zone_id\\", \\"parent_id\\", \\"author\\", \\"comment\\", \\"scheduled_duration\\", \\"object_type\\", \\"service_id\\", \\"flexible_duration\\", \\"duration\\", \\"name\\", \\"id\\", \\"entry_time\\", \\"is_in_effect\\", \\"scheduled_by\\", \\"properties_checksum\\", \\"triggered_by_id\\", \\"is_flexible\\", \\"environment_id\\", \\"name_checksum\\") VALUES (:host_id, :scheduled_end_time, :start_time, :end_time, :scheduled_start_time, :zone_id, :parent_id, :author, :comment, :scheduled_duration, :object_type, :service_id, :flexible_duration, :duration, :name, :id, :entry_time, :is_in_effect, :scheduled_by, :properties_checksum, :triggered_by_id, :is_flexible, :environment_id, :name_checksum)\": Error 1062 (23000): Duplicate entry '\xEAq'&\xA2v\xD4!\xD5\xCE=\xC5\xB7z\x0F0\xD7\x14\x8D\xD2' for key 'downtime.PRIMARY'"
"ICINGADB_ERROR": "can't perform \"INSERT INTO \\"downtime\\" (\\"host_id\\", \\"scheduled_end_time\\", \\"start_time\\", \\"end_time\\", \\"scheduled_start_time\\", \\"zone_id\\", \\"parent_id\\", \\"author\\", \\"comment\\", \\"scheduled_duration\\", \\"object_type\\", \\"service_id\\", \\"flexible_duration\\", \\"duration\\", \\"name\\", \\"id\\", \\"entry_time\\", \\"is_in_effect\\", \\"scheduled_by\\", \\"properties_checksum\\", \\"triggered_by_id\\", \\"is_flexible\\", \\"environment_id\\", \\"name_checksum\\") VALUES (:host_id, :scheduled_end_time, :start_time, :end_time, :scheduled_start_time, :zone_id, :parent_id, :author, :comment, :scheduled_duration, :object_type, :service_id, :flexible_duration, :duration, :name, :id, :entry_time, :is_in_effect, :scheduled_by, :properties_checksum, :triggered_by_id, :is_flexible, :environment_id, :name_checksum)\": Error 1062 (23000): Duplicate entry '\xF6\xF6U'\xEC\x7FW\xCC\xF4\x0F5\x9BgGJ\xCB[oIz' for key 'downtime.PRIMARY'"
"ICINGADB_ERROR": "can't perform \"INSERT INTO \\"host_customvar\\" (\\"host_id\\", \\"customvar_id\\", \\"environment_id\\", \\"id\\") VALUES (:host_id, :customvar_id, :environment_id, :id)\": Error 1062 (23000): Duplicate entry '\xF0\xBAyn\x8F\x9F\xE52:\xD4\xBE-\x86\xD8O\xE2\x03\x05\xD1\xC0' for key 'host_customvar.PRIMARY'"
"ICINGADB_ERROR": "can't perform \"INSERT INTO \\"downtime\\" (\\"host_id\\", \\"scheduled_end_time\\", \\"start_time\\", \\"end_time\\", \\"scheduled_start_time\\", \\"zone_id\\", \\"parent_id\\", \\"author\\", \\"comment\\", \\"scheduled_duration\\", \\"object_type\\", \\"service_id\\", \\"flexible_duration\\", \\"duration\\", \\"name\\", \\"id\\", \\"entry_time\\", \\"is_in_effect\\", \\"scheduled_by\\", \\"properties_checksum\\", \\"triggered_by_id\\", \\"is_flexible\\", \\"environment_id\\", \\"name_checksum\\") VALUES (:host_id, :scheduled_end_time, :start_time, :end_time, :scheduled_start_time, :zone_id, :parent_id, :author, :comment, :scheduled_duration, :object_type, :service_id, :flexible_duration, :duration, :name, :id, :entry_time, :is_in_effect, :scheduled_by, :properties_checksum, :triggered_by_id, :is_flexible, :environment_id, :name_checksum)\": Error 1062 (23000): Duplicate entry '\xF1\x9B\x15zc'C\xEA\xDEF\xCBA\xAB\x82y!(\xACI\xE3' for key 'downtime.PRIMARY'"
As the other Icinga DB instance is HA passive - as the logs show - it should not be responsible for primary key constraint violations in those tables. Can this be due to a prior, aborted sync, or is something racing itself? The latter would be supported by the multiple "Deadlock found when trying to get lock; try restarting transaction" errors.
By the way, all 35 occurrences of "Error 1213 (40001): Deadlock found when trying to get lock; try restarting transaction" happened for INSERTs into `service_state`.
After stopping both Icinga DB nodes, truncating the `service_state` table, and restarting the nodes, the active node complained multiple times about an "Error 1213 (40001): Deadlock found when trying to get lock; try restarting transaction". However, these errors were recoverable and Icinga DB survived.
I just went over the whole synchronization code and am now more confused than before: how could the primary key constraint be violated by one single node? The delta is calculated based on what is currently in the SQL database and in Redis. If something is part of both, it shouldn't be in `delta.Create`.
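The delta calculation described above can be sketched as a plain set comparison (a minimal sketch, not the actual Icinga DB implementation, which additionally compares checksums to decide what needs updating):

```go
package main

import "fmt"

// delta compares the IDs known to the SQL database with those in Redis:
// IDs only in Redis are created, IDs in both are update candidates, and
// IDs only in SQL are deleted. By construction, an ID present in both
// stores can never land in the create set.
func delta(sqlIDs, redisIDs map[string]bool) (create, update, del []string) {
	for id := range redisIDs {
		if sqlIDs[id] {
			update = append(update, id)
		} else {
			create = append(create, id)
		}
	}
	for id := range sqlIDs {
		if !redisIDs[id] {
			del = append(del, id)
		}
	}
	return
}

func main() {
	sqlIDs := map[string]bool{"a": true, "b": true}
	redisIDs := map[string]bool{"b": true, "c": true}
	create, update, del := delta(sqlIDs, redisIDs)
	// "c" is created, "b" is an update candidate, "a" is deleted.
	fmt.Println(create, update, del)
}
```

Which is exactly why a single node producing duplicate-key errors on `INSERT` is so puzzling: it implies rows that were in the table but not in the `SELECT` snapshot the delta was built from.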
The logs from the crashed node posted above contain:
config-sync: Inserting 245858 items of type service state
config-sync: Updating 39 items of type service state
At the moment (which may not be comparable), the system has the following state:
127.0.0.1:6380> HLEN icinga:service:state
(integer) 307538
mysql> select count(*) from service_state\G
*************************** 1. row ***************************
count(*): 307538
1 row in set (0.03 sec)
Under the assumption that this number hasn't changed, there might have been a race between a flushed Redis being populated by Icinga 2 while being read by Icinga DB. Unless I have missed something, there is no code preventing this case. And, as evidence, there were multiple occurrences of `FLUSHALL` in the `~/.rediscli_history` file.
> there might have been a race between a flushed Redis being populated by Icinga 2 while being read by Icinga DB
Oh wait, that would be the wrong way round. If something is missing in Redis but available in the SQL database, it would be put into `delta.Delete`, not `delta.Create`.
I tried reproducing the behavior with some `FLUSHALL`s, but this did not work out. However, I was able to crash Icinga DB in another way when flushing during the sync, but this is fine, I guess.
redis: HMGET icinga:downtime: field "4bd67fe5e10611a5c312ba1b05d8cfe6d7317662" missing
redis: HMGET icinga:downtime: field "5aacf4988f2686df77cc5c5660855d2deafd1185" missing
redis: HMGET icinga:downtime: field "6e97e19d87853b00477c575a2ed25c0c6204ac2e" missing
redis: HMGET icinga:downtime: field "8d77e6661693adba73c4802c98a0bff07ad07126" missing
redis: HMGET icinga:downtime: field "d3537844bfa9e7b4027c106d5ade0eb9c02720d8" missing
redis: HMGET icinga:downtime: field "d7ae5700b04633be4ccd6d52daa397fbdc1d1c14" missing
redis: HMGET icinga:downtime: field "0ce73ba7968ec4350c7a7d0ca4b00bc811336356" missing
redis: HMGET icinga:downtime: field "2c0bbf8054855095870030ce9a92bfaefea7a1fc" missing
redis: HMGET icinga:downtime: field "3d028f5662ddf7c3a9152c6bef4461a57eb993cb" missing
redis: HMGET icinga:downtime: field "4924e173f3ed5ea0001ef275818e85c3e827cccd" missing
redis: HMGET icinga:downtime: field "6444a66dbc37c526d03c2bc2e921cf6bf0ca6c92" missing
redis: HMGET icinga:downtime: field "bc4652319400f81d232884583cd6f96173146d14" missing
redis: HMGET icinga:downtime: field "c658c0430fa9e59a8e1b7b452c2782c299605e2b" missing
redis: Finished fetching from icinga:downtime with 0 items in 367.842935ms
database: Finished executing "SELECT \"properties_checksum\", \"id\" FROM \"host_state\" WHERE \"environment_id\" = :environment_id" with 15944>
database: Finished executing "UPDATE \"host_state\" SET \"is_overdue\" = :is_overdue, \"id\" = :id WHERE id = :id" with 6207 rows in 5.41943569>
database: Finished executing "SELECT \"id\" FROM \"host_customvar\" WHERE \"environment_id\" = :environment_id" with 125000 rows in 3.172386254s
redis: Finished fetching from icinga:host:customvar with 249905 items in 3.172477344s
config-sync: Aborted config sync after 8.016455329s
config-sync: Aborted initial state sync after 8.016473497s
can't copy from closed channel
github.com/icinga/icinga-go-library/com.CopyFirst[...]
/Users/yhabteab/Workspace/go/icinga-go-library/com/com.go:64
github.com/icinga/icinga-go-library/database.(*DB).UpsertStreamed
/Users/yhabteab/Workspace/go/icinga-go-library/database/db.go:708
github.com/icinga/icingadb/pkg/icingadb.Sync.ApplyDelta.func2
/Users/yhabteab/Workspace/go/icingadb/pkg/icingadb/sync.go:158
golang.org/x/sync/errgroup.(*Group).Go.func1
/Users/yhabteab/Workspace/go/pkg/mod/golang.org/x/sync@v0.8.0/errgroup/errgroup.go:78
runtime.goexit
/opt/homebrew/opt/go/libexec/src/runtime/asm_amd64.s:1700
can't copy first entity
github.com/icinga/icinga-go-library/database.(*DB).UpsertStreamed
/Users/yhabteab/Workspace/go/icinga-go-library/database/db.go:710
github.com/icinga/icingadb/pkg/icingadb.Sync.ApplyDelta.func2
/Users/yhabteab/Workspace/go/icingadb/pkg/icingadb/sync.go:158
golang.org/x/sync/errgroup.(*Group).Go.func1
/Users/yhabteab/Workspace/go/pkg/mod/golang.org/x/sync@v0.8.0/errgroup/errgroup.go:78
runtime.goexit
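The trace suggests what happened: `com.CopyFirst` tries to take the first entity off a channel, and the aborted sync had already closed that channel, so there was no first entity to copy. A simplified sketch of that semantic, based only on the stack trace (hypothetical `copyFirst`, not the actual `icinga-go-library` source):

```go
package main

import (
	"errors"
	"fmt"
)

// copyFirst takes the first value off in so it can be inspected, then
// returns a channel that re-emits that value followed by the rest of the
// stream. If in is already closed and drained, there is no first value,
// matching the "can't copy from closed channel" error in the trace.
func copyFirst(in <-chan int) (first int, out <-chan int, err error) {
	first, ok := <-in
	if !ok {
		return 0, nil, errors.New("can't copy from closed channel")
	}
	ch := make(chan int)
	go func() {
		defer close(ch)
		ch <- first // re-emit the inspected value
		for v := range in {
			ch <- v // forward the remainder
		}
	}()
	return first, ch, nil
}

func main() {
	closed := make(chan int)
	close(closed) // simulates the channel after an aborted sync
	_, _, err := copyFirst(closed)
	fmt.Println(err) // can't copy from closed channel
}
```

Under this reading, the crash is a secondary symptom: the sync was aborted first (closing the producer channel), and `UpsertStreamed` then tripped over the closed channel.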
Another observation: during experiments with the other node stopped, no SQL deadlock errors occurred. When I wanted to verify this, I re-ran the same scenario with both nodes active, and again no deadlocks occurred.
Describe the bug
In a bigger Icinga setup, as built by @yhabteab last week, Icinga DB may abort during the initial state and config sync due to an exceeded lock wait timeout. The setup consists of two Icinga DB instances running in HA mode.
The Icinga DB version used was, afaik, the latest release, v1.2.0, with #800 cherry-picked on top. Please note that this error is not related to HA; furthermore, no two instances were active at the same time.
Detailed logs for this run, acquired by some version of:
Please note that the first node crashed, but the other then took over and worked without any issues. This happened three times last Friday, but, of course, today I was not able to reproduce it.
To Reproduce
Expected behavior
Icinga DB should not crash.
Your Environment
Additional context
N/A