icingadb process crashes with Error 1452: Cannot add or update a child row: a foreign key constraint fails

patrickesser commented 1 year ago

The icingadb process crashes at irregular intervals with the following output:

Error 1452: Cannot add or update a child row: a foreign key constraint fails ("icingadb"."history", CONSTRAINT "fk_history_state_history" FOREIGN KEY ("state_history_id") REFERENCES "state_history" ("id") ON DELETE CASCADE) can't perform "INSERT INTO \"history\" (\"environment_id\", \"endpoint_id\", \"host_id\", \"service_id\", \"event_time\", \"object_type\", \"event_type\", \"id\", \"state_history_id\") VALUES (:environment_id,:endpoint_id,:host_id,:service_id,:event_time,:object_type,:event_type,:id,:state_history_id) ON DUPLICATE KEY UPDATE \"id\" = VALUES(\"id\")" github.com/icinga/icingadb/internal.CantPerformQuery github.com/icinga/icingadb/internal/internal.go:30 github.com/icinga/icingadb/pkg/icingadb.(DB).NamedBulkExec.func1.1.1.1 github.com/icinga/icingadb/pkg/icingadb/db.go:394 github.com/icinga/icingadb/pkg/retry.WithBackoff github.com/icinga/icingadb/pkg/retry/retry.go:45 github.com/icinga/icingadb/pkg/icingadb.(DB).NamedBulkExec.func1.1.1 github.com/icinga/icingadb/pkg/icingadb/db.go:389 golang.org/x/sync/errgroup.(Group).Go.func1 golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57 runtime.goexit runtime/asm_amd64.s:1594 can't retry github.com/icinga/icingadb/pkg/retry.WithBackoff github.com/icinga/icingadb/pkg/retry/retry.go:64 github.com/icinga/icingadb/pkg/icingadb.(DB).NamedBulkExec.func1.1.1 github.com/icinga/icingadb/pkg/icingadb/db.go:389 golang.org/x/sync/errgroup.(*Group).Go.func1 golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57 runtime.goexit runtime/asm_amd64.s:1594

__

I can't find more detailed log or any posibillity to figure out which entry causes the crash.

Your Environment

Icinga DB version: v1.1.0

Build information: Go version: go1.19.3 (linux, amd64) Git commit: a0093d1df6a227bfc923717072d8cb0e3331799f

System information: Platform: Debian GNU/Linux Platform version: 11 (bullseye)

Al2Klimov commented 1 year ago

@patrickesser Are there other kinds of errors/warnings in the log? Which magnitude of irregular intervals are we talking about? Btw., what are your typical state change intervals? And, btw., do the crashes correlate with state changes of particular objects? What's your DB type/version/cluster?

@julianbrost Your family name dominates in IDE annotation of history sync code. Any speculative idea what could happen here? As far as I see we execute bulks outside of transactions. Once committed(!) we inform via onSuccess that the Redis message may pass to the next stage which obviously hits the FK.

julianbrost commented 1 year ago

Any speculative idea what could happen here?

Not really, that code should be written to prevents this, so I'm eager to know what is going wrong here.

Some more questions for @patrickesser:

After this crash happened, did a restart suffice to get things running again? (If there was something wrong with a specific history entry, it should retry that entry and would fail again, so if that didn't happen it's probably our code or the database doing something unexpected here.)
Do you run Icinga DB in a high-availability configuration? (Now I'm really getting into wild speculation: maybe if one connection updates a row, another connection might get a foreign key violation when referencing that specific row at the very time the other update is happening?)

Al2Klimov commented 1 year ago

Btw. for such cases, shall we re-try (at least a limited amount of times) serialization failures?

julianbrost commented 1 year ago

The database doesn't report a serialization failure here. I wouldn't retry here without any idea what's actually going on.

Al2Klimov commented 1 year ago

Oh, sorry. I've mixed up GH issues. I meant FK failures of course.

julianbrost commented 1 year ago

Sounds too much like "no idea what's going on, let's try this and hope for the best". If we knew that was some bug/limitation of the database, that could be a workaround, but not really the current state of information.

NavidSassan commented 1 year ago

we also see the following sql errors on a master-master setup, with a mariadb galera cluster as the database:

Error 1452: Cannot add or update a child row: a foreign key constraint fails ("icingadb"."user_notification_history", CONSTRAINT "fk_user_notification_history_notification_history">
    can't perform "INSERT INTO \"user_notification_history\" (\"id\", \"notification_history_id\", \"user_id\", \"environment_id\") VALUES (:id,:notification_history_id,:user_id,:envir>
    github.com/icinga/icingadb/internal.CantPerformQuery
            github.com/icinga/icingadb/internal/internal.go:30
    github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1.1
            github.com/icinga/icingadb/pkg/icingadb/db.go:394
    github.com/icinga/icingadb/pkg/retry.WithBackoff
            github.com/icinga/icingadb/pkg/retry/retry.go:45
    github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1
            github.com/icinga/icingadb/pkg/icingadb/db.go:389
    golang.org/x/sync/errgroup.(*Group).Go.func1
            golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57
    runtime.goexit
            runtime/asm_amd64.s:1594
    can't retry
    github.com/icinga/icingadb/pkg/retry.WithBackoff
            github.com/icinga/icingadb/pkg/retry/retry.go:64
    github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1
            github.com/icinga/icingadb/pkg/icingadb/db.go:389
    golang.org/x/sync/errgroup.(*Group).Go.func1
            golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57
    runtime.goexit
            runtime/asm_amd64.s:1594

Error 1452: Cannot add or update a child row: a foreign key constraint fails ("icingadb"."user_notification_history", CONSTRAINT "fk_user_notification_history_notification_history" F>
    can't perform "INSERT INTO \"user_notification_history\" (\"notification_history_id\", \"user_id\", \"environment_id\", \"id\") VALUES (:notification_history_id,:user_id,:environment>
    github.com/icinga/icingadb/internal.CantPerformQuery
          github.com/icinga/icingadb/internal/internal.go:30
    github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1.1
          github.com/icinga/icingadb/pkg/icingadb/db.go:394
    github.com/icinga/icingadb/pkg/retry.WithBackoff
          github.com/icinga/icingadb/pkg/retry/retry.go:45
    github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1
          github.com/icinga/icingadb/pkg/icingadb/db.go:389
    golang.org/x/sync/errgroup.(*Group).Go.func1
          golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57
    runtime.goexit
          runtime/asm_amd64.s:1594
    can't retry
    github.com/icinga/icingadb/pkg/retry.WithBackoff
          github.com/icinga/icingadb/pkg/retry/retry.go:64
    github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1
          github.com/icinga/icingadb/pkg/icingadb/db.go:389
    golang.org/x/sync/errgroup.(*Group).Go.func1
          golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57
    runtime.goexit
          runtime/asm_amd64.s:1594

how can we further troubleshoot this problem?

julianbrost commented 1 year ago

What are the answers to the questions I asked in https://github.com/Icinga/icingadb/issues/577#issuecomment-1563878265 in your setup?

After this crash happened, did a restart suffice to get things running again?

Do you run Icinga DB in a high-availability configuration?

NavidSassan commented 1 year ago

yes, a restart was enough
yes, there are two instances of redis and icingadb which are sharing one mariadb database. both icinga2 master have the icingadb feature enabled. the environment id is the same on both servers

log1-c commented 1 year ago

Jul 11 05:08:59 <hostname> icingadb[2250476]: pq: unexpected message 'E'; expected ReadyForQuery
                                               can't perform "INSERT INTO \"state_history\" (\"previous_soft_state\", \"output\", \"soft_state\", \"state_type\", \"previous_hard_state\", \"check_attempt\",>
                                               github.com/icinga/icingadb/internal.CantPerformQuery
                                                       github.com/icinga/icingadb/internal/internal.go:30
                                               github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1.1
                                                       github.com/icinga/icingadb/pkg/icingadb/db.go:394
                                               github.com/icinga/icingadb/pkg/retry.WithBackoff
                                                       github.com/icinga/icingadb/pkg/retry/retry.go:45
                                               github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1
                                                       github.com/icinga/icingadb/pkg/icingadb/db.go:389
                                               golang.org/x/sync/errgroup.(*Group).Go.func1
                                                       golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57
                                               runtime.goexit
                                                       runtime/asm_amd64.s:1594
                                               can't retry
                                               github.com/icinga/icingadb/pkg/retry.WithBackoff
                                                       github.com/icinga/icingadb/pkg/retry/retry.go:64
                                               github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1
                                                       github.com/icinga/icingadb/pkg/icingadb/db.go:389
                                               golang.org/x/sync/errgroup.(*Group).Go.func1
                                                       golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57
                                               runtime.goexit
                                                       runtime/asm_amd64.s:1594
Jul 11 05:08:59 <hostname> systemd[1]: icingadb.service: Main process exited, code=exited, status=1/FAILURE
Jul 11 05:08:59 <hostname> systemd[1]: icingadb.service: Failed with result 'exit-code'.

seeing a pretty similar error message. setup: 2 icinga masters, both with icingadb installed/enabled pgsql as database

crash happend on master1, on master2 it is still running.

a restart of the daemon was possible and it is running again.

Jul 13 11:06:09 <hostname> systemd[1]: Starting Icinga DB...
Jul 13 11:06:09 <hostname> icingadb[3569944]: Starting Icinga DB
Jul 13 11:06:09 <hostname> icingadb[3569944]: Connecting to database at 'localhost:5432'
Jul 13 11:06:09 <hostname> systemd[1]: Started Icinga DB.
Jul 13 11:06:09 <hostname> icingadb[3569944]: Connecting to Redis at 'localhost:6380'
Jul 13 11:06:09 <hostname> icingadb[3569944]: Starting history sync
Jul 13 11:06:10 <hostname> icingadb[3569944]: heartbeat: Received Icinga heartbeat
Jul 13 11:06:10 <hostname> icingadb[3569944]: high-availability: Another instance is active

log1-c commented 1 year ago

Just a follow up. This weekend the daemon crashed on both masters. Again with the "INSERT_INTO" error message. This time the second master was rebooted due to automatic updates at that time and then also failed (same message).

We are running Postgres as a backend for IcingaDB. Postgres is running as a Patroni cluster with 3 nodes. On the Icinga Masters we use pgbouncer and haproxy for the connection to the database. pgbounder listens for the connections from icinga and haproxy is setup to present on of the servers on localhost to pgbouncer.

After some testing we experienced the following:

If a config deployment (or presumably any action that triggers a reload of the icinga2 service) happens at the time of a switch-over of the leading node, we get the "insert into" error:

Jul 17 15:42:49 ma02 pgbouncer[875]: S-0x64d46597b190: icingadb/icingadb@127.0.0.1:6432 closing because: server conn crashed? (age=25138s)
Jul 17 15:42:49 ma02 pgbouncer[875]: C-0x64d4659739a0: icingadb/icingadb@[::1]:56192 closing because: server conn crashed? (age=25138s)
Jul 17 15:42:49 ma02 pgbouncer[875]: C-0x64d4659739a0: icingadb/icingadb@[::1]:56192 pooler error: server conn crashed?
Jul 17 15:42:49 ma02 pgbouncer[875]: S-0x64d46597acd0: icingadb/icingadb@127.0.0.1:6432 closing because: server conn crashed? (age=1499s)
Jul 17 15:42:49 ma02 pgbouncer[875]: C-0x64d4659734e0: icingadb/icingadb@[::1]:45092 closing because: server conn crashed? (age=1499s)
Jul 17 15:42:49 ma02 pgbouncer[875]: C-0x64d4659734e0: icingadb/icingadb@[::1]:45092 pooler error: server conn crashed?
Jul 17 15:42:49 ma02 pgbouncer[875]: S-0x64d46597aa70: icingadb/icingadb@127.0.0.1:6432 closing because: server conn crashed? (age=4173s)
Jul 17 15:42:49 ma02 pgbouncer[875]: C-0x64d465973020: icingadb/icingadb@[::1]:35834 closing because: server conn crashed? (age=4173s)
Jul 17 15:42:49 ma02 pgbouncer[875]: C-0x64d465973020: icingadb/icingadb@[::1]:35834 pooler error: server conn crashed?
Jul 17 15:42:50 ma02 pgbouncer[875]: C-0x64d465973020: icingadb/icingadb@[::1]:34438 login attempt: db=icingadb user=icingadb tls=no
Jul 17 15:42:50 ma02 pgbouncer[875]: S-0x64d46597aa70: icingadb/icingadb@127.0.0.1:6432 new connection to server (from 127.0.0.1:43752)
Jul 17 15:42:52 ma02 pgbouncer[875]: S-0x64d46597aa70: icingadb/icingadb@127.0.0.1:6432 closing because: server conn crashed? (age=2s)
Jul 17 15:42:52 ma02 pgbouncer[875]: stats: 0 xacts/s, 1 queries/s, in 556 B/s, out 75 B/s, xact 37939 us, query 9738 us, wait 0 us
Jul 17 15:42:53 ma02 icingadb[205409]: pq: unexpected message 'E'; expected ReadyForQuery#012can't perform "INSERT INTO \"state_history\" (\"state_type\", \"check_source\", \"id\", \"previous_soft_state\", \"max_check_attempts\", \"previous_hard_state\", \"check_attempt\", \"scheduling_source\", \"endpoint_id\", \"host_id\", \"object_type\", \"service_id\", \"event_time\", \"soft_state\", \"hard_state\", \"output\", \"long_output\", \"environment_id\") VALUES (:state_type,:check_source,:id,:previous_soft_state,:max_check_attempts,:previous_hard_state,:check_attempt,:scheduling_source,:endpoint_id,:host_id,:object_type,:service_id,:event_time,:soft_state,:hard_state,:output,:long_output,:environment_id) ON CONFLICT ON CONSTRAINT pk_state_history DO UPDATE SET \"id\" = EXCLUDED.\"id\""#012github.com/icinga/icingadb/internal.CantPerformQuery#012#011github.com/icinga/icingadb/internal/internal.go:30#012github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1.1#012#011github.com/icinga/icingadb/pkg/icingadb/db.go:394#012github.com/icinga/icingadb/pkg/retry.WithBackoff#012#011github.com/icinga/icingadb/pkg/retry/retry.go:45#012github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1#012#011github.com/icinga/icingadb/pkg/icingadb/db.go:389#012golang.org/x/sync/errgroup.(*Group).Go.func1#012#011golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57#012runtime.goexit#012#011runtime/asm_amd64.s:1594#012can't retry#012github.com/icinga/icingadb/pkg/retry.WithBackoff#012#011github.com/icinga/icingadb/pkg/retry/retry.go:64#012github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1#012#011github.com/icinga/icingadb/pkg/icingadb/db.go:389#012golang.org/x/sync/errgroup.(*Group).Go.func1#012#011golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57#012runtime.goexit#012#011runtime/asm_amd64.s:1594
Jul 17 15:42:53 ma02 systemd[1]: icingadb.service: Main process exited, code=exited, status=1/FAILURE
Jul 17 15:42:53 ma02 systemd[1]: icingadb.service: Failed with result 'exit-code'.

Without a config deploy at switch-over time the error message changes to:

Jul 17 15:48:06 ma02 icingadb[348949]: pq: cannot use serializable mode in a hot standby#012can't start transaction#012github.com/icinga/icingadb/pkg/icingadb.(*HA).realize.func1#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:253#012github.com/icinga/icingadb/pkg/retry.WithBackoff#012#011github.com/icinga/icingadb/pkg/retry/retry.go:45#012github.com/icinga/icingadb/pkg/icingadb.(*HA).realize#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:245#012github.com/icinga/icingadb/pkg/icingadb.(*HA).controller#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:211#012runtime.goexit#012#011runtime/asm_amd64.s:1594#012can't retry#012github.com/icinga/icingadb/pkg/retry.WithBackoff#012#011github.com/icinga/icingadb/pkg/retry/retry.go:64#012github.com/icinga/icingadb/pkg/icingadb.(*HA).realize#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:245#012github.com/icinga/icingadb/pkg/icingadb.(*HA).controller#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:211#012runtime.goexit#012#011runtime/asm_amd64.s:1594#012HA aborted#012github.com/icinga/icingadb/pkg/icingadb.(*HA).abort.func1#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:128#012sync.(*Once).doSlow#012#011sync/once.go:74#012sync.(*Once).Do#012#011sync/once.go:65#012github.com/icinga/icingadb/pkg/icingadb.(*HA).abort#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:126#012github.com/icinga/icingadb/pkg/icingadb.(*HA).controller#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:218#012runtime.goexit#012#011runtime/asm_amd64.s:1594#012HA exited with an error#012main.run#012#011github.com/icinga/icingadb/cmd/icingadb/main.go:335#012main.main#012#011github.com/icinga/icingadb/cmd/icingadb/main.go:37#012runtime.main#012#011runtime/proc.go:250#012runtime.goexit#012#011runtime/asm_amd64.s:1594

Looks like the icingadb daemon doesn't like it if the db cluster isn't available for even a short time.

julianbrost commented 1 year ago

@log1-c That's a different issue from what was originally reported here. Please open a new issue for that.

Al2Klimov commented 1 year ago

We have several options:

Delay progress by X between stages such as state_history and history, so that the cluster has enough time to sync the rows
Re-try FK errors
Drop our FKs
Connect with the MySQL variable wsrep_sync_wait=4, catch the error 1193 unknown system variable and if it occurs, connect without it

julianbrost commented 1 year ago

Delay progress by X between stages such as state_history and history, so that the cluster has enough time to sync the rows

Would there be some time that is guaranteed to be enough? (I doubt it, we're not doing real-time stuff here.)

Drop our FKs

That would break the history cleanup.

What about the idea/suggestion you mentioned at some point: pinning all inserts for the same history event to the same connection? (1. If we can do that, would that work reliably? 2. Can this be implemented feasibly with Go sql/sqlx?)

Al2Klimov commented 1 year ago

What about the idea/suggestion you mentioned at some point: pinning all inserts for the same history event to the same connection? (1. If we can do that, would that work reliably? 2. Can this be implemented feasibly with Go sql/sqlx?)

It would break re-trials of broken connections if not done via transactions.

julianbrost commented 1 year ago

It would break re-trials of broken connections if not done via transactions.

What would be the problem? Shouldn't (1) get a single database connection (2) send query A (3) send query B referencing row created by A do the job if it's retried from the beginning if the connection fails at any point?

I haven't looked into what code changes would exactly be necessary to support this, but conceptually this sounds like my preferred solution.

The next best solution would probably be wsrep_sync_wait, but that feels more like a quick and dirty fix compared to a proper solution. What would be interesting though: can you set this on the server side as an immediate workaround?

Al2Klimov commented 1 year ago

Now as you say it:

@NavidSassan @patrickesser Please try wsrep_sync_wait=4 in your MySQL config. It works for me with https://github.com/Al2Klimov/provoke-galera , but -as you may guess- these are laboratory conditions.

While I'm on it: quick Galera test cluster guide for my colleagues

On both Debian 12 nodes:

apt install mariadb-server galera-4
vim /etc/mysql/mariadb.conf.d/60-galera.cnf
- uncomment Mandatory settings and bind-address
- put the first node's IP after gcomm://
- add wsrep_provider = /usr/lib/galera/libgalera_smm.so

On first node run:

galera_new_cluster
iptables -A OUTPUT -d SECOND_NODE_IP -m limit --limit 100/sec -j ACCEPT
iptables -A OUTPUT -d SECOND_NODE_IP -j DROP

On second node run: systemctl restart mariadb.service

wsrep_sync_wait, but that feels more like a quick and dirty fix compared to a proper solution.

I don't agree. E.g. Eric wants Vitess support to be implemented as (1) are you Vitess? Yes? (2) Tolerate unfiltered SELECTs! Do it now!! My approach would more/less join those ranks. After all we also check the schema version once at startup. The same-connection solution would be dirty, but not quick. I'd prefer not to fiddle DB conns around, but just say the database (or even recommend the users?) what we want: Please do wsrep_sync_wait=4, thank you. (Why isn't an even stricter wsrep_sync_wait the default btw.?) Also, with only one/few DB conn(s) allowed, you'd get a general performance problem.

julianbrost commented 1 year ago

(Why isn't an even stricter wsrep_sync_wait the default btw.?)

Because it isn't free and comes with its own drawbacks. Excerpt from the documentation:

While the check is taking place, new queries are blocked on the node to allow the server to catch up with all updates made in the cluster up to the point where the check was begun. Once reached, the original query is executed on the node. This can result in higher latency.

So doing the change in the way you suggest, this will incur a latency penalty for all inserts. So we should at least consider only setting it for queries that require it.

E.g. Eric wants Vitess support to be implemented as (1) are you Vitess? Yes? (2) Tolerate unfiltered SELECTs! Do it now!!

I haven't looked into the details of that, so I can't say whether this makes sense. One has to look in the advantages and drawbacks there as well.

Al2Klimov commented 1 year ago

E.g. Eric wants Vitess support to be implemented as (1) are you Vitess? Yes? (2) Tolerate unfiltered SELECTs! Do it now!!

I haven't looked into the details of that, so I can't say whether this makes sense. One has to look in the advantages and drawbacks there as well.

I mean, I also could do SHOW VARIABLES LIKE if you consider that less dirty, should I?

While the check is taking place, new queries are blocked on the node to allow the server to catch up with all updates made in the cluster up to the point where the check was begun. Once reached, the original query is executed on the node. This can result in higher latency.

So doing the change in the way you suggest, this will incur a latency penalty for all inserts. So we should at least consider only setting it for queries that require it.

Well, we could set that variable only for (history) rows which are referenced by others using a separate sql.DB instance. But a second sql.DB would either circumvent our overall connection limit (significantly?) if not restricted to one connection (per table?). Do you prefer this over setting that variable everywhere?

julianbrost commented 1 year ago

I mean, I also could do SHOW VARIABLES LIKE if you consider that less dirty, should I?

It would be nicer if we didn't have to set that variable at all. Would sending inserts for state_history and history over the connection have downsides for a non-clustered database? I don't see any, so that would be a fix without any special-casing involved.

Well, we could set that variable only for (history) rows which are referenced by others using a separate sql.DB instance. But a second sql.DB would either circumvent our overall connection limit (significantly?) if not restricted to one connection (per table?). Do you prefer this over setting that variable everywhere?

Why a separate sql.DB? What about: get connection from pool, set variable, perform insert, reset variable, return connection to pool?

Al2Klimov commented 1 year ago

Do we set any of our vars like this? No. (AFAIK Eric neither would support Vitess like this.) I don't like the idea of fiddling around with connections or even max connections (per table) semaphores. Neither the idea that different DB connections work different. Unnecessary to mention that I still prefer my solution. Let's wait for both Eric's opinion as professional and our two users' test results.

Apropos. I guess one can also SET GLOBAL w/o restarting the cluster.

julianbrost commented 1 year ago

Do we set any of our vars like this?

What other variables do we set? Do they have downsides as well like adding latency to many queries (where most don't gain anything from that latency)?

Al2Klimov commented 1 year ago

sql_mode=ANSI_QUOTES
No. But as I already mentioned, IMAO wsrep_sync_wait=4 or even stricter should be the opt-out default, so that clusters behave as expected by clients (like ours), like single nodes. Also history -the only subsystem that benefits from this- has the largest tables and the most stuff to sync, so I consider OK rolling out that variable on everything. Additional advantage: we'd only delete anything from any(!) history stream once we're absolutely sure that it got synced over the whole(!) cluster.

julianbrost commented 10 months ago

sql_mode=ANSI_QUOTES

This option is intended to affect each query as that's the quoting we currently use in all queries (though that will probably also go away with #606). Apart from that, this should not affect any query performance significantly.

No. But as I already mentioned, IMAO wsrep_sync_wait=4 or even stricter should be the opt-out default, so that clusters behave as expected by clients (like ours), like single nodes.

Maybe yes, but given that they've chosen the default differently and the documentation explicitly mentions query performance, I wonder how big the impact is, so some numbers would be interesting. It wouldn't help if it's correct but significantly reduces history throughput. On the other hand, inserting all rows for one event over the same connection should have no impact (shouldn't this even be pretty what would happen at the moment if the connection pool would magically choose the same connection again?).

Additional advantage: we'd only delete anything from any(!) history stream once we're absolutely sure that it got synced over the whole(!) cluster.

I'm not sure if that would actually have an impact on that. My understanding of the documentation is that it would wait before starting the INSERT that anything before was synced to the local node, not that the operation was performed on all nodes before reporting success.

Al2Klimov commented 10 months ago

ref/IP/46635

oxzi commented 6 months ago

Locally, I set up a small Galera cluster to test the error and the solutions mentioned here.

In a nutshell, the setup looked like this:

Three MariaDB Nodes, building a Galera cluster
HAProxy as a L4 Load Balancer to access the Galera cluster (allows selected traffic distribution)
Two Icinga 2 masters with their own Icinga DB and Redis in an HA scenario
Multiple dummy hosts to be monitored by Icinga 2

The HAProxy was configured

to randomly balance each connection to one of the three MariaDB nodes and
to enable its statistics reporting web interface to inspect how many connections are hitting which backend.

Due to the fact that Icinga DB utilizes multiple database connections, max_connections defaulting to 16, and HAProxy balances connections randomly, Icinga DB established connections to different MariaDB cluster nodes.

After setting up, the system operated for an estimated hour without any problems.

However, after a restart, the first Icinga node's Icinga DB crashed with a FOREIGN KEY constraint violation directly after starting up. This error looks very similar to the one in the original post. Fortunately, the error can be reproduced fairly well from this point onwards.

2024-02-20T14:34:11.256Z    INFO    icingadb        Starting Icinga DB
2024-02-20T14:34:11.256Z    INFO    icingadb        Connecting to database at 'mysql:3306'
2024-02-20T14:34:11.259Z    INFO    icingadb        Connecting to Redis at 'redis-master1:6380'
2024-02-20T14:34:11.260Z    DEBUG   icingadb        Redis schema version is correct
2024-02-20T14:34:11.260Z    DEBUG   icingadb        Checking Icinga 2 and Icinga DB compatibility
2024-02-20T14:34:11.260Z    DEBUG   high-availability       Starting HA     {"instance_id": "701bc13acd8d4fa88296ffee2b705a39"}
2024-02-20T14:34:11.260Z    INFO    icingadb        Starting history sync
2024-02-20T14:34:11.260Z    DEBUG   history-sync    Starting notification history sync
2024-02-20T14:34:11.260Z    DEBUG   history-sync    Starting state history sync
2024-02-20T14:34:11.260Z    DEBUG   history-sync    Starting downtime history sync
2024-02-20T14:34:11.260Z    DEBUG   history-sync    Starting comment history sync
2024-02-20T14:34:11.260Z    DEBUG   history-sync    Starting flapping history sync
2024-02-20T14:34:11.260Z    DEBUG   history-sync    Starting acknowledgement history sync
2024-02-20T14:34:11.448Z    DEBUG   database        Finished executing "INSERT INTO \"history\" (\"environment_id\", \"service_id\", \"event_type\", \"object_type\", \"host_id\", \"id\", \"state_history_id\", \"event_time\", \"endpoint_id\") VALUES (:environment_id,:service_id,:event_type,:object_type,:host_id,:id,:state_history_id,:event_time,:endpoint_id) ON DUPLICATE KEY UPDATE \"id\" = VALUES(\"id\")" with 0 rows in 118.491144ms
2024-02-20T14:34:11.448Z    DEBUG   database        Finished executing "INSERT INTO \"state_history\" (\"event_time\", \"state_type\", \"check_attempt\", \"hard_state\", \"long_output\", \"max_check_attempts\", \"scheduling_source\", \"endpoint_id\", \"id\", \"object_type\", \"service_id\", \"environment_id\", \"host_id\", \"soft_state\", \"previous_soft_state\", \"previous_hard_state\", \"output\", \"check_source\") VALUES (:event_time,:state_type,:check_attempt,:hard_state,:long_output,:max_check_attempts,:scheduling_source,:endpoint_id,:id,:object_type,:service_id,:environment_id,:host_id,:soft_state,:previous_soft_state,:previous_hard_state,:output,:check_source) ON DUPLICATE KEY UPDATE \"id\" = VALUES(\"id\")" with 5005 rows in 149.674765ms
2024-02-20T14:34:11.451Z    FATAL   icingadb        Error 1452 (23000): Cannot add or update a child row: a foreign key constraint fails ("icingadb"."history", CONSTRAINT "fk_history_state_history" FOREIGN KEY ("state_history_id") REFERENCES "state_history" ("id") ON DELETE CASCADE)
can't perform "INSERT INTO \"history\" (\"environment_id\", \"service_id\", \"event_type\", \"object_type\", \"host_id\", \"id\", \"state_history_id\", \"event_time\", \"endpoint_id\") VALUES (:environment_id,:service_id,:event_type,:object_type,:host_id,:id,:state_history_id,:event_time,:endpoint_id) ON DUPLICATE KEY UPDATE \"id\" = VALUES(\"id\")"
github.com/icinga/icingadb/internal.CantPerformQuery
    /go/src/github.com/Icinga/icingadb/internal/internal.go:30
github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.(*DB).NamedBulkExec.func1.1.2.1
    /go/src/github.com/Icinga/icingadb/pkg/icingadb/db.go:391
github.com/icinga/icingadb/pkg/retry.WithBackoff
    /go/src/github.com/Icinga/icingadb/pkg/retry/retry.go:49
github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.(*DB).NamedBulkExec.func1.1.2
    /go/src/github.com/Icinga/icingadb/pkg/icingadb/db.go:386
golang.org/x/sync/errgroup.(*Group).Go.func1
    /go/pkg/mod/golang.org/x/sync@v0.5.0/errgroup/errgroup.go:75
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1695
can't retry
github.com/icinga/icingadb/pkg/retry.WithBackoff
    /go/src/github.com/Icinga/icingadb/pkg/retry/retry.go:68
github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.(*DB).NamedBulkExec.func1.1.2
    /go/src/github.com/Icinga/icingadb/pkg/icingadb/db.go:386
golang.org/x/sync/errgroup.(*Group).Go.func1
    /go/pkg/mod/golang.org/x/sync@v0.5.0/errgroup/errgroup.go:75
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1695
exit status 1

The MariaDB node receiving the faulty query stored the following error:

MariaDB [(none)]> show engine innodb status\G
[. . .]
------------------------
LATEST FOREIGN KEY ERROR
------------------------
2024-02-20 14:34:11 0x7f023a2fe640 Transaction:
TRANSACTION 22514, ACTIVE 0 sec inserting
mysql tables in use 1, locked 1
4 lock struct(s), heap size 1128, 2 row lock(s), undo log entries 1
MariaDB thread id 9, OS thread handle 139647542879808, query id 31 192.168.0.8 icingadb Update
INSERT INTO "history" ("environment_id", "service_id", "event_type", "object_type", "host_id", "id", "state_history_id", "event_time", "endpoin
t_id") VALUES (?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,
?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?
,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,
?,?),(?,?,?,?,?,?,?,?,?),(?,
Foreign key constraint fails for table "icingadb"."history":
,
  CONSTRAINT "fk_history_state_history" FOREIGN KEY ("state_history_id") REFERENCES "state_history" ("id") ON DELETE CASCADE in parent table, i
n index idx_history_state tuple:
DATA TUPLE: 2 fields;
 0: len 20; hex 48037240963505d5062d1afa0391f3d382931c01; asc H r@ 5   -          ;;
 1: len 20; hex 2b5047c920f5a76634967ac64be62d59ada90afc; asc +PG    f4 z K -Y    ;;

But in parent table "icingadb"."state_history", in index PRIMARY,
the closest match we can find is record:
PHYSICAL RECORD: n_fields 20; compact format; info bits 0
 0: len 20; hex 480f2d2d601202226dcbf9b07858a210d3dc067d; asc H --`  "m   xX     };;
 1: len 6; hex 000000000000; asc       ;;
 2: len 7; hex 80000000000000; asc        ;;
 3: len 20; hex df8a471cac09c1f48168e45050d98a26632b6253; asc   G      h PP  &c+bS;;
 4: len 20; hex cca97d78c2ded0024d2999a2e42f891bc1f4d953; asc   }x    M)   /     S;;
 5: len 1; hex 01; asc  ;;
 6: len 20; hex a01d2130bdb887a845c411cc21130970a264cbe1; asc   !0    E   !  p d  ;;
 7: SQL NULL;
 8: len 8; hex 0000018dc6a34cc6; asc       L ;;
 9: len 1; hex 01; asc  ;;
 10: len 1; hex 00; asc  ;;
 11: len 1; hex 00; asc  ;;
 12: len 1; hex 63; asc c;;
 13: len 1; hex 63; asc c;;
 14: len 1; hex 01; asc  ;;
 15: len 30; hex 596f752077696c6c2072656d656d6265722c20576174736f6e2c20686f77; asc You will remember, Watson, how; (total 59 bytes);
 16: len 30; hex 416265726e657474792066616d696c79207761732066697273742062726f; asc Abernetty family was first bro; (total 140 bytes);
 17: len 4; hex 00000003; asc     ;;
 18: len 15; hex 6963696e6761322d6d617374657231; asc icinga2-master1;;
 19: len 15; hex 6963696e6761322d6d617374657231; asc icinga2-master1;;
[. . .]

These logs provide a number of insights and allow us to draw conclusions and make assumptions.

There are two SQL queries being logged, resulting into violating fk_history_state_history.
Both queries are being DEBUG logged as finished at the exact same time. This, however, is mostly misleading and the consumed time, 118.491144ms and 149.674765ms respectively, should be considered.
The first query inserting into state_history (logged second) created 5005 rows in 149.674765ms.
The second query inserting into history (logged first) created zero rows and produces fk_history_state_history violation after 118.491144ms.
Thus, the first query actually started before the second. I would guess that after the second query failed, the context was canceled, resulting in both queries being defer db.loged.

The history synchronization logic is a pipeline reading from Redis, piping through stageFuncs, and eventually being removed from Redis again:

https://github.com/Icinga/icingadb/blob/6c8b52f2033cd94466863c92d3df632e3c87743c/pkg/icingadb/history/sync.go#L51-L69

https://github.com/Icinga/icingadb/blob/6c8b52f2033cd94466863c92d3df632e3c87743c/pkg/icingadb/history/sync.go#L361-L365

Going down the rabbit hole^W^W call stack, writeOneEntityStage → writeMultiEntityStage → DB.UpsertStreamed → DB.NamedBulkExec where, eventually, each query is being submitted.

Due to the pipeline architecture, this should not be a problem as the queries are submitted sequentially one after the other. But this assumption is incorrect if the database is no longer a single server, but rather a cluster which is not always fully synchronized.

Several solutions have already been mentioned in this thread - some more, some less hacky - of which I have tried out some. Below I am going to list what I have tried and also some further ideas, including what I would not choose.

Configuring wsrep_sync_wait=4 through @Al2Klimov's PR #665 for every DB connection seems to work fine. However, enforcing a synchronization delay for all INSERT and REPLACE queries seems to be an overkill.
As this only affects certain INSERTs, those types could, e.g., implement some attribute or an interface, resulting in wsrep_sync_wait being set accordingly only for the corresponding queries. Fortunately, the dependencies are noted as comments in the syncPipelines map.
Purely time-based waiting until the cluster is synchronized is not feasible in my opinion, as it is not possible to estimate when the cluster will be synchronized, especially under high load.
A synchronization stage in the pipeline could be considered, which waits for the previous stage to be completed, then requests a single synchronization and then continues with the next stage.

However, this would have (at least) two disadvantages:
- In the worst case, the synchronization does not affect the next used node, becoming useless.
- This contradicts the pipeline concept and costs additional time and buffer memory.
I would therefore be reluctant to interrupt or buffer the pipeline.
Go's database/sql - what sqlx builds on - does connection pooling. However, it should be possible to get a single distinct connection, which can be passed through the pipeline. With a certain degree of refactoring, this should work.
Otherwise, the number of connections could be reduced to just a single one through Icinga DB's max_connections configuration settings. However, this comes at the expense of performance and may still fail if the connection is disconnected and then reestablished to another MariaDB node.
A more reasonable version of this idea would be to always promote the same node for all connections, independent of Icinga DB. With a load balancer - such as HAProxy - this is possible by distributing the load to the same node at all times.

I altered my HAProxy setup
- to balance source, resulting in the same source IP hitting the same backend, and
- to balance first, resulting in the same backend being used by everyone.
The latter is essentially equivalent to assigning a virtual IP as @epinter did in #651. But, to put this into context, this degrades the cluster to a glorified primary-(multi-)replica setup.

BTW, both configurations have mitigated the error and were working for me.

To put things into perspective, both Go's sql library and Galera are clustering connections. As a result, even a single Icinga DB node might hold connections to different MariaDB nodes and, if SQL transactions are not explicitly used, can cause an out-of-order error. The devil is in the details, I guess.

The points 1. and 2. from the list above are addressing this issue by explicitly telling Galera nodes to synchronize before performing an INSERT. This is at the expense of a higher latency.

While @Al2Klimov's approach includes every INSERT, 2. would only address history and user_notification_history. On the downside, 2. would require a bigger change, which itself might introduce new bugs.

As a non-representative data point, here are my table sizes (rough sizes, because InnoDB):

MariaDB [icingadb]> SELECT SUM(TABLE_ROWS) FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'icingadb';
+-----------------+
| SUM(TABLE_ROWS) |
+-----------------+
|          431933 |
+-----------------+
1 row in set (0.001 sec)

MariaDB [icingadb]> SELECT TABLE_NAME, TABLE_ROWS FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'icingadb' ORDER BY TABLE_ROWS DESC LIMIT 10;
+---------------------+------------+
| TABLE_NAME          | TABLE_ROWS |
+---------------------+------------+
| history             |      64533 |
| servicegroup_member |      54885 |
| state_history       |      54371 |
| host_customvar      |      47975 |
| service_customvar   |      45562 |
| hostgroup_member    |      38799 |
| sla_history_state   |      33485 |
| service             |      30866 |
| service_state       |      29397 |
| host                |      10287 |
+---------------------+------------+
10 rows in set (0.001 sec)

MariaDB [icingadb]> SELECT TABLE_NAME, TABLE_ROWS FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'icingadb' AND TABLE_NAME IN ('history', 'user_notification_history');
+---------------------------+------------+
| TABLE_NAME                | TABLE_ROWS |
+---------------------------+------------+
| user_notification_history |          5 |
| history                   |      64540 |
+---------------------------+------------+
2 rows in set (0.001 sec)

Even with history being the most populated table, the issue should only affect roughly 15%, (64540+5)/431933 = 0.1494, of all rows. Whether a more selective wsrep_sync_wait is worth the extra effort, completely depends on how expensive the forced synchronization itself is.

Points 3. and 4. have already been rejected in the arguments above.
I think the 5. point is worth noting and I would further investigate this approach. At the moment, I cannot even say if this would work - even after a successfully refactoring - and how this might affect performance.
Point 6. can also be ignored.
Finally, point 7. is a mitigation, which requires additional software and also might degrade the Galera cluster's performance.

To summarize, I would see if the costs of wsrep_sync_wait can be quantified and evaluate whether distinct connections can be integrated into the pipeline.

julianbrost commented 6 months ago

Go's database/sql - what sqlx builds on - does connection pooling. However, it should be possible to get a single distinct connection, which can be passed through the pipeline. With a certain degree of refactoring, this should work.

Feel free to give it a try. The tricky part will be the error handling for when that connection fails (the connection you'd retry it on could be to another node, so that might not have seen the previous stage yet).

oxzi commented 6 months ago

While reducing the connection pool to a distinct connection seemed like a good idea in the first place, it comes with multiple issues when being implemented. For example, one has to reimplement the reconnection logic and there is even some missing sqlx support, jmoiron/sqlx#780.

Thus, I have eventually dropped this idea and went with another approach: using a distinct DB with only one allowed DB connection for the history sync. Feel free to take a look at #681.

oxzi commented 6 months ago

As foreshadowed in my other post, I tried to get some somehow reliable numbers. Thus, I benchmarked the SQL query execution time from within Icinga DB in different scenarios.

There were three different database scenarios:

single-ha-none is a single MariaDB node without any Load Balancer or clustering whatsoever.
galera-ha-balance-first is a three node MariaDB Galera cluster with a Load Balancer, always using the same node, acting like a virtual IP.
galera-ha-balance-random is the same three node MariaDB Galera cluster with a Load Balancer, but now assigning connections to random database servers.

Then, different Icinga DB flavors were used:

main is the current main branch at 0e9810c1e7d5b619fec48849698ff14a0526176f.
pr665-pr679 is #665 and #679.
pr681-pr679 is #681 and #679.

Furthermore, to simulate a real network, some delays were added: delay-none, delay-10ms, and delay-100ms. The delay was set manually after starting the setup through a command like the following:

for i in $(seq 1 3); do docker exec mysql-n${i} tc qdisc add dev eth0 root netem delay 100ms; done

Thus, it might happen that some of the first connections were not limited. Note that this applies to both incoming and outgoing traffic, i.e., a 10ms delay leads to 20ms for a ping.

The time was taken by debug logging the execution.

Logging Patch

```diff diff --git a/pkg/icingadb/db.go b/pkg/icingadb/db.go index 4ff3e0d..e5f451f 100644 --- a/pkg/icingadb/db.go +++ b/pkg/icingadb/db.go @@ -386,10 +386,12 @@ func (db *DB) NamedBulkExec( return retry.WithBackoff( ctx, func(ctx context.Context) error { + t0 := time.Now() _, err := db.NamedExecContext(ctx, query, b) if err != nil { return internal.CantPerformQuery(err, query) } + db.logger.Debugf("Executed Query\t%q\t%d\t%v", query, len(b), time.Since(t0)) counter.Add(uint64(len(b))) ```

This resulted in different runs, each roughly five minutes long, and named like ${database-scenario}-${icingadb-flavor}-${delay}. I only tested those combinations that I considered reasonable.

A small statistic was extracted from the logs, showing mean execution time, median execution time, and the standard deviation.

Script

This AWK script requires the `gawk` interpreter. ```awk # time_normalize a Go time.Duration to a number representing microseconds. function time_normalize(t) { if (match(t, /([0-9]+\.[0-9]+)s$/, a)) return a[1] * 1000.0 else if (match(t, /([0-9]+\.[0-9]+)ms$/, a)) return a[1] else if (match(t, /([0-9]+)\.[0-9]+µs$/, a)) return a[1] / 1000.0 printf("%s:%d: cannot parse time duration: %s\n", FILENAME, FNR, t) > "/dev/stderr" exit 1 } BEGIN { print "| Filename | Mean | Median | SD |" print "| -------- | ---- | ------ | -- |" } /database\tExecuted Query/ { times[times_no++] = time_normalize($NF) } ENDFILE { asort(times) sum = 0 for (i in times) sum += times[i] mean = sum / times_no median = times[int(times_no/2)] variance = 0 for (i in times) variance += (times[i] - mean)^2 variance = sqrt(variance/times_no) sd = sqrt(variance) print "| " FILENAME " | " mean "ms | " median "ms | " sd "ms |" delete times times_no = 0 } ```

Filename	Mean	Median	SD
single-ha-none-main-delay-none	3.94283ms	3.168657ms	2.09901ms
galera-ha-balance-first-main-delay-none	3.69792ms	2.582401ms	1.95272ms
galera-ha-balance-random-pr665-pr679-delay-none	6.27922ms	3.017348ms	3.1379ms
galera-ha-balance-random-pr681-pr679-delay-none	6.47338ms	3.014271ms	3.57157ms
single-ha-none-main-delay-10ms	25.9425ms	23.304029ms	4.14129ms
galera-ha-balance-first-main-delay-10ms	64.7782ms	53.42998ms	9.38948ms
galera-ha-balance-random-pr665-pr679-delay-10ms	72.4412ms	54.419461ms	9.09667ms
galera-ha-balance-random-pr681-pr679-delay-10ms	48.0431ms	53.711958ms	6.05208ms
single-ha-none-main-delay-100ms	247.345ms	204.465222ms	12.5254ms
galera-ha-balance-first-main-delay-100ms	597.217ms	503.405229ms	24.2195ms
galera-ha-balance-random-pr665-pr679-delay-100ms	685.79ms	644.591792ms	19.3462ms
galera-ha-balance-random-pr681-pr679-delay-100ms	622.714ms	505.723366ms	21.7324ms

A first look shows a significant increase in time for a Galera cluster when a longer delay is set. When compared to a single MariaDB, this is particularly noticeable: 204.5ms vs 503.4ms/644.6ms/505.7ms in the median.

Within the Galera measurements, the pr665-pr679 stands out with a bigger delay, when a delay between the nodes was configured. This is not really surprising, as setting wsrep_sync_wait=4 forces a cluster synchronization before executing the query. The other two Galera options - either to the same node through a LB or with a distinct connection - are roughly similar.

What absolutely should be mentioned is that #681 has errors, at least the log looked suspicious and the CI of the PR still fails. Furthermore, this adds additional code with more logic, without having a relevant time advantage over a load balancer or a virtual IP.

Thus, even if this is not visible from the data at first glance, I would recommend @Al2Klimov's PR #665 over #681, but change it to make the functionality optionally configurable. Thus, a DBA can decide whether Icinga DB enforces synchronization - costing some additional time - or whether a load balancer, a virtual IP or the like should be configured. One should be aware that a Galera Cluster is not always a drop-in replacement for a single database server or a classic replication.

Al2Klimov commented 6 months ago

a Galera Cluster is not always a drop-in replacement for a single database server

Unfortunately. I try to emulate exactly that with

665

But if the delay is too bad, we could also re-try FK errors (for 5 minutes) instead.

oxzi commented 6 months ago

But if the delay is too bad, we could also re-try FK errors (for 5 minutes) instead.

I am not quite sure if I am in favor of the "retry almost everything after/for n minutes"-idea. For example, #651 experienced frequent Galera-related deadlock errors. Retrying everything might stutter the query out of the undesired situation, but would unnecessary slow down everything.

Thus, as I have written before, I would support your wsrep_sync_wait=4 PR, but make it configurable in the config file.

Al2Klimov commented 6 months ago

Not everything, just specific X for the clear reason Y. I mean, if it's an actual FK error, it will persist and crash the whole thing after 5m. If it's a cluster thing, it'll go away. No configuration needed and no performance decrease. 👍

julianbrost commented 6 months ago

When compared to a single MariaDB, this is particularly noticeable: 204.5ms vs 503.4ms/644.6ms/505.7ms in the median.

Within the Galera measurements, the pr665-pr679 stands out with a bigger delay, when a delay between the nodes was configured. This is not really surprising, as setting wsrep_sync_wait=4 forces a cluster synchronization before executing the query.

One thing that confuses me: the numbers suggest that the impact of wsrep_sync_wait=4 depends on the latency but on the other hand, it's not explainable by an extra round-trip (as that should add 20ms or 200ms with the different latency values).

Also, with reasonable latency (remember, 200ms is in the once around the whole globe order of magnitude), the impact of wsrep_sync_wait=4 seems quite insignificant actually. So I don't see much reason against just setting it.

I don't really understand Galera, I'd have expected a more drastic difference for it to be worth to sacrifice some consistency guarantees.

oxzi commented 6 months ago

One thing that confuses me: the numbers suggest that the impact of wsrep_sync_wait=4 depends on the latency but on the other hand, it's not explainable by an extra round-trip (as that should add 20ms or 200ms with the different latency values).

I took another look at the logs, even cleaned out the first 20% of entries - to really make sure that the tc command was already executed - and the last 20% - because I stopped the Galera cluster node by node to avoid conflicts on the next start. However, the fastest query with a 100ms delay on pr665-pr679 took 200.976409ms. Thus, it was directly executed on the database node.

The log line in question was:

icingadb-master1 | 2024-03-05T14:53:31.375Z DEBUG database Executed Query"INSERT INTO \"state_history\" (\"object_type\", \"id\", \"event_time\", \"soft_state\", \"previous_soft_state\", \"previous_hard_state\", \"check_attempt\", \"long_output\", \"scheduling_source\", \"environment_id\", \"state_type\", \"host_id\", \"endpoint_id\", \"check_source\", \"hard_state\", \"max_check_attempts\", \"service_id\", \"output\") VALUES (:object_type,:id,:event_time,:soft_state,:previous_soft_state,:previous_hard_state,:check_attempt,:long_output,:scheduling_source,:environment_id,:state_type,:host_id,:endpoint_id,:check_source,:hard_state,:max_check_attempts,:service_id,:output) ON DUPLICATE KEY UPDATE \"id\" = VALUES(\"id\")" 1 200.976409ms

This was just a guess, but this INSERT query ends with ON DUPLICATE KEY UPDATE \"id\" = VALUES(\"id\")" and setting wsrep_sync_wait=4 only covers INSERT and REPLACE, not UPDATE.

Thus, I changed the wsrep_sync_wait value to 6 and even 15.. and the fastest queries still took round about 200ms. The other metrics have also not changed relevantly.

I don't really understand Galera, I'd have expected a more drastic difference for it to be worth to sacrifice some consistency guarantees.

I'll second that.

What kind of magic does it do? I would really like to find some technical documentation about Galera, describing how it works internally.

slalomsk8er commented 6 months ago

Not everything, just specific X for the clear reason Y. I mean, if it's an actual FK error, it will persist and crash the whole thing after 5m. If it's a cluster thing, it'll go away. No configuration needed and no performance decrease. 👍

Even with wsrep_sync_wait, I've seen icingadb choke on something like ERROR 1047 (08S01): WSREP has not yet prepared node for application use so a retry might by in order for some, if not all errors. I propose usage of the SMTP retry logic, do a couple of them but in increasing intervals.

oxzi commented 6 months ago

@slalomsk8er: There is #679 addressing especially this error.

But, out of curiosity, does this only happen during cluster startup times or even sometimes in between?

slalomsk8er commented 6 months ago

@oxzi thanks for the link - I requested this issue via Linuxfabrik. I think, ERROR 1047 was independent of the icinga2 cluster startup and caused by network and/or maintenance work on the Galera nodes. Can't find this particular error in the current logs with journalctl -p 2 -u icingadb.service atm. I found a timeline of the error in graylog as recorded by the icingacli vpheredb checks and no correlation with icinga2 uptime: The bars are very wide and in reality a event looks more like this:

yhabteab commented 5 months ago

Fixed by https://github.com/Icinga/icingadb/pull/711

Icinga / icingadb

icingadb process crashes with Error 1452: Cannot add or update a child row: a foreign key constraint fails #577

Your Environment

While I'm on it: quick Galera test cluster guide for my colleagues

665