flatcar / nebraska

Update monitor & manager for applications using the Omaha protocol, optimized for Flatcar Container Linux.
https://kinvolk.io/docs/nebraska/latest
Apache License 2.0

Issues when running more than a replica for HA #388

Open JesusRo opened 3 years ago

JesusRo commented 3 years ago

Description

Errors occur randomly when the deployment has more than one replica.

Impact

Operations are not performed (example: create group)

Environment and steps to reproduce

  1. Set-up: Deployed via the Helm chart with replicaCount: 2 (a values sketch along these lines is shown below)
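
For reference, a minimal values file along these lines would reproduce the setup; only replicaCount: 2 comes from this report, and everything else is assumed to stay at the chart defaults:

# values.yaml (sketch; all other chart values left at their defaults)
replicaCount: 2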

Expected behavior

Operations are correctly applied.

Additional information

runtime error: invalid memory address or nil pointer dereference
runtime/panic.go:199 (0x4451eb)
runtime/signal_unix.go:394 (0x445028)
github.com/kinvolk/nebraska@/cmd/nebraska/controller.go:339 (0xda8809)
github.com/gin-gonic/gin@v1.6.3/context.go:161 (0x999fba)
github.com/kinvolk/nebraska@/cmd/nebraska/controller.go:167 (0xda66ee)
github.com/gin-gonic/gin@v1.6.3/context.go:161 (0x999fba)
github.com/Depado/ginprom@v1.7.0/prom.go:323 (0xda4f7a)
github.com/gin-gonic/gin@v1.6.3/context.go:161 (0x999fba)
github.com/gin-gonic/gin@v1.6.3/recovery.go:83 (0x9ac873)
github.com/gin-gonic/gin@v1.6.3/context.go:161 (0x999fba)
github.com/gin-gonic/gin@v1.6.3/gin.go:409 (0x9a3efc)
github.com/gin-gonic/gin@v1.6.3/gin.go:367 (0x9a35fd)
net/http/server.go:2831 (0x72dc53)
net/http/server.go:1919 (0x7294f4)
runtime/asm_amd64.s:1357 (0x45d360)

<nil> ERR addGroup - adding group &{e4657731-2362-404e-a45c-ad14e9ee990c flatcar-k8s.test flatcar-k8s.test 0001-01-01 00:00:00 +0000 UTC false e96281a6-d1af-4bde-9a0a-97b76e56dc57 {{e06064ad-4414-4904-9a6e-fd465593d1b2 true}} true false false {{Europe/Madrid true}} 1 hours 999999 60 days <nil> flatcar-k8s.test} error="ERROR: duplicate key value violates unique constraint \"groups_application_id_name_key\" (SQLSTATE 23505)" context=nebraska

[...]

<nil> ERR processUpdate, adding package error="ERROR: duplicate key value violates unique constraint \"package_appid_version_arch_unique\" (SQLSTATE 23505)" arch=amd64 channel=edge context=syncer

[...]

<nil> INF processEvent eventError 0 appID={e96281a6-d1af-4bde-9a0a-97b76e56dc57} context=omaha event="update complete.success reboot" group=3c7828b9-94b6-4b64-ba88-7e9c1bffe23a previousVersion=
<nil> ERR RegisterEvent - could not get instance (propagates as ErrInvalidInstance) error="sql: no rows in result set" context=api
<nil> WRN processEvent error nebraska: invalid instance context=omaha
<nil> WRN processPing error sql: no rows in result set context=omaha

[...]

[GIN-debug] redirecting request 307: /v1/update --> /v1/update
<nil> INF processEvent eventError 0 appID={e96281a6-d1af-4bde-9a0a-97b76e56dc57} context=omaha event="update download started.success" group=3c7828b9-94b6-4b64-ba88-7e9c1bffe23a previousVersion=
<nil> INF processEvent eventError 0 appID={e96281a6-d1af-4bde-9a0a-97b76e56dc57} context=omaha event="update download finished.success" group=3c7828b9-94b6-4b64-ba88-7e9c1bffe23a previousVersion=
<nil> INF processEvent eventError 0 appID={e96281a6-d1af-4bde-9a0a-97b76e56dc57} context=omaha event="update complete.success" group=3c7828b9-94b6-4b64-ba88-7e9c1bffe23a previousVersion=
<nil> INF processEvent eventError 0 appID={e96281a6-d1af-4bde-9a0a-97b76e56dc57} context=omaha event="update complete.success reboot" group=3c7828b9-94b6-4b64-ba88-7e9c1bffe23a previousVersion=
<nil> ERR RegisterEvent - could not get instance (propagates as ErrInvalidInstance) error="sql: no rows in result set" context=api
<nil> WRN processEvent error nebraska: invalid instance context=omaha
<nil> WRN processPing error sql: no rows in result set context=omaha
<nil> INF processEvent eventError 268435490 appID={e96281a6-d1af-4bde-9a0a-97b76e56dc57} context=omaha event="update complete.error" group=3c7828b9-94b6-4b64-ba88-7e9c1bffe23a previousVersion=
<nil> WRN processEvent error nebraska: no update in progress context=omaha
joaquimrocha commented 2 years ago

@mkilchhofer , maybe you have insights on what we need to make multiple replicas behave better. It seems it's trying to insert the group multiple times.
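
For illustration only, one database-level way to tolerate two replicas racing on the same group is an idempotent insert against that unique constraint. This is a standalone sketch, not Nebraska's actual code; the column list is inferred from the constraint name groups_application_id_name_key in the log above, and the real table will have more required columns:

// Sketch: an idempotent group insert using Postgres ON CONFLICT, so a
// second replica racing on the same (application_id, name) pair does not
// fail with SQLSTATE 23505. Not Nebraska's actual code; column names are
// inferred from the constraint name seen in the logs.
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // any Postgres driver would do
)

func addGroupIdempotent(db *sql.DB, appID, name string) error {
	// DO NOTHING turns the statement into a no-op when another replica
	// has already inserted the same (application_id, name) combination.
	_, err := db.Exec(`
		INSERT INTO groups (application_id, name)
		VALUES ($1, $2)
		ON CONFLICT (application_id, name) DO NOTHING`,
		appID, name)
	return err
}

func main() {
	// Placeholder connection string.
	db, err := sql.Open("postgres", "postgres://nebraska:nebraska@localhost/nebraska?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if err := addGroupIdempotent(db, "e96281a6-d1af-4bde-9a0a-97b76e56dc57", "flatcar-k8s.test"); err != nil {
		log.Fatal(err)
	}
}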

mkilchhofer commented 2 years ago

I think the software developers need to make sure everything on the software side supports multiple replicas (leader election, app clustering, etc.). Nebraska is using an OR mapper, right? Maybe this component keeps stuff in memory and already returns 200 OK to the client?

The only thing we can do IMHO on the Kubernetes side to mitigate the most critical pain points is to use sticky sessions on the Ingress, but the config depends on which Ingress controller the end users are running. For ingress-nginx:

ingress:
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "INGRESSCOOKIE"
    nginx.ingress.kubernetes.io/session-cookie-expires: "172800"
    nginx.ingress.kubernetes.io/session-cookie-max-age: "172800"

Edit: Oh, the main problem here seems to be the syncer. I think we need app clustering here, or something that ensures only one instance syncs from the upstream repo.
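
For what it's worth, one lightweight way to get "only one instance syncs" without a separate clustering component is a Postgres session-level advisory lock, since every replica already talks to the same database. The sketch below is not Nebraska's actual code; the lock key, connection string, and sync callback are all placeholders:

// Sketch: gate the syncer behind a Postgres session-level advisory lock
// so that, with multiple replicas, only the lock holder syncs from the
// upstream repo. Not Nebraska's actual code; the lock key is arbitrary.
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq"
)

const syncerLockKey = 388001 // arbitrary app-wide constant

func runSyncerIfLeader(ctx context.Context, db *sql.DB, sync func(context.Context) error) {
	// Advisory locks are held per session, so pin a single connection.
	conn, err := db.Conn(ctx)
	if err != nil {
		log.Println("leader election: cannot get connection:", err)
		return
	}
	defer conn.Close()

	var haveLock bool
	if err := conn.QueryRowContext(ctx,
		"SELECT pg_try_advisory_lock($1)", syncerLockKey).Scan(&haveLock); err != nil {
		log.Println("leader election query failed:", err)
		return
	}
	if !haveLock {
		log.Println("another replica holds the syncer lock; skipping sync")
		return
	}
	defer conn.ExecContext(ctx, "SELECT pg_advisory_unlock($1)", syncerLockKey)

	if err := sync(ctx); err != nil {
		log.Println("sync failed:", err)
	}
}

func main() {
	db, err := sql.Open("postgres", "postgres://nebraska:nebraska@localhost/nebraska?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	ticker := time.NewTicker(time.Hour)
	defer ticker.Stop()
	for range ticker.C {
		runSyncerIfLeader(context.Background(), db, func(ctx context.Context) error {
			// placeholder for the actual upstream sync
			return nil
		})
	}
}

A Kubernetes Lease would work too, but the advisory lock keeps the coordination inside the database the replicas already share.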

JesusRo commented 2 years ago

Hi @mkilchhofer, I was testing sticky sessions (nginx ingress controller) with 3 replicas of Nebraska and it was fine for a while, but after more intense usage (creating groups, adding VMs, etc.) the DB blew up. I started to see locked queries/inserts/updates, warnings/errors about foreign keys, and similar "duplicate key" messages from Nebraska.

It might be that everything was fine until the syncer kicked in? I will try to bring up a new environment without it and retest.

I'm also curious whether I could set up some kind of active/standby configuration on the ingress controller, so that even without load balancing it would improve availability if a node has trouble.

Anyhow, if the app is not actually ready for HA, I would consider this more of an enhancement than an issue per se.

thanks!