debate-map / app

Monorepo for the client, server, etc. of the Debate Map website.
https://debatemap.app
MIT License

Fix that the NGF pod/deployment can occasionally "disappear"! #281

Closed: Venryx closed this issue 7 months ago

Venryx commented 7 months ago

Summary

General:

Possibly related:

Occurrences

Discovered: 2024-03-17 11:59am (PT, by Venryx)
Discovered: 2024-03-18 9:15pm (PT, by Jamie)
Discovered: 2024-03-19 3:44am (PT, by Venryx)
Discovered: 2024-03-19 5:34am (PT, by Venryx)
Discovered: 2024-03-19 8:29pm (PT, by Venryx)
Venryx commented 7 months ago

Below is some information copied while debugging the issue. (Search Slack DMs from around 2024-03-18 for more details.)


Okay, I checked the logs of the internal "systemd-journal" and the logs of the "loki visible" pods in each namespace, and the very first log I was able to find (for the sequence that ends in the NGF destruction) is this line from above:

{"level":"info","ts":"2024-03-19T02:28:26Z","msg":"Reconciling the resource","controller":"secret","controllerGroup":"","controllerKind":"Secret","Secret":{"name":"sh.helm.release.v1.ngf.v11","namespace":"default"},"namespace":"default","name":"sh.helm.release.v1.ngf.v11","reconcileID":"be7ade12-dced-4459-a203-ee0a8433426c"}

With these fields on the log entry:

app: nginx-gateway-fabric
container: nginx-gateway
filename: /var/log/pods/default_ngf-nginx-gateway-fabric-7948556d5f-wzhnx_5428d529-0c7a-4ad0-9d22-7a02f69ec33b/nginx-gateway/0.log
instance: ngf
job: default/nginx-gateway-fabric
namespace: default
node_name: pool-15gb-node-651007
pod: ngf-nginx-gateway-fabric-7948556d5f-wzhnx
stream: stderr

That is, the first log entry in that sequence was made by the "nginx-gateway-fabric" pod.
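
For reference, here's a minimal sketch (in Python) of the kind of Loki `query_range` call that surfaces this entry. The Loki URL and time window are placeholders; the label names mirror the fields shown above:

```python
import time
import requests

LOKI_URL = "http://localhost:3100"  # placeholder, e.g. a port-forwarded Loki service

# LogQL: select the NGF pod's logs in the "default" namespace and filter
# for the reconcile message seen above.
query = '{namespace="default", app="nginx-gateway-fabric"} |= "Reconciling the resource"'

now_ns = time.time_ns()
params = {
    "query": query,
    "start": now_ns - 24 * 60 * 60 * 10**9,  # last 24 hours, in Unix nanoseconds
    "end": now_ns,
    "limit": 100,
    "direction": "forward",  # oldest first, so the first hit is the start of the sequence
}

resp = requests.get(f"{LOKI_URL}/loki/api/v1/query_range", params=params)
resp.raise_for_status()
for stream in resp.json()["data"]["result"]:
    for ts_ns, line in stream["values"]:
        print(ts_ns, line)
```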

So the NGF pod is apparently watching that Secret and trying to reconcile the cluster state against it (to "upsert to it"), for some reason. Maybe it's on a timer, maybe it's something else.

To try to figure that out, I tracked down the line that produces that log entry (I think this is it, anyway): https://github.com/nginxinc/nginx-gateway-fabric/blob/e1d6ebb5065bab73af3a89faba4f49c7a5b971cd/internal/framework/controller/reconciler.go#L76

Venryx commented 7 months ago

While it is a "workaround" rather than a "proper fix", I ended up changing the Tiltfiles from using "helm_resource" to the older "helm_remote", and this seems to have resolved the issue (no recurrence in the last several days). Basically: something was calling "uninstall" on the Helm charts registered in the remote cluster. By using "helm_remote" instead, we deploy the individual resources directly rather than under a "chart" entry, so this unwanted top-down uninstall can no longer happen. A sketch of the change is below.
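
Here's a rough sketch of the shape of that change, written in Tilt's Starlark (Python-syntax) Tiltfile dialect. The chart name, repo URL, and release name are placeholders, not the actual values from our Tiltfiles:

```python
# Before (sketch): helm_resource installs the chart as a single Helm release,
# which a later top-down "helm uninstall" can remove wholesale.
# load('ext://helm_resource', 'helm_resource', 'helm_repo')
# helm_repo('ngf-repo', 'https://example.com/charts')  # placeholder repo
# helm_resource('ngf', 'ngf-repo/nginx-gateway-fabric', namespace='default')

# After (sketch): helm_remote templates the chart locally and deploys the
# resulting resources individually, so there is no Helm release object for
# an unwanted uninstall to target.
load('ext://helm_remote', 'helm_remote')
helm_remote(
    'nginx-gateway-fabric',                 # chart name (placeholder)
    repo_url='https://example.com/charts',  # placeholder repo URL
    release_name='ngf',
    namespace='default',
)
```

The tradeoff is that Helm no longer tracks the deployment as a release, so release-level operations (upgrade, rollback, uninstall) don't apply; Tilt just manages the rendered resources.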

I'll close this issue for now, since "helm_remote" resolves the problem and works fine. But of course, if the root cause of this unwanted uninstall is ever discovered, it's preferable to resolve that rather than having to rely on this (semi) workaround.