debate-map / app

Monorepo for the client, server, etc. of the Debate Map website.
https://debatemap.app
MIT License

Fix that the NGF pod/deployment can occasionally "disappear"! #281

Closed: Venryx closed this issue 7 months ago

Venryx commented 7 months ago

Summary

General:

Possibly related:

Occurrences

Discovered: 2024-03-17 11:59am (PT, by Venryx)
Discovered: 2024-03-18 9:15pm (PT, by Jamie)
Discovered: 2024-03-19 3:44am (PT, by Venryx)
Discovered: 2024-03-19 5:34am (PT, by Venryx)
Discovered: 2024-03-19 8:29pm (PT, by Venryx)
Venryx commented 7 months ago

Below is some information copied while debugging the issue. (Search Slack DMs from around 2024-03-18 for more details.)


Okay, I checked the logs of the internal "systemd-journal" and the logs of the "loki visible" pods in each namespace, and the very first log I was able to find (for the sequence that ends in the NGF destruction) is this line from above:

{"level":"info","ts":"2024-03-19T02:28:26Z","msg":"Reconciling the resource","controller":"secret","controllerGroup":"","controllerKind":"Secret","Secret":{"name":"sh.helm.release.v1.ngf.v11","namespace":"default"},"namespace":"default","name":"sh.helm.release.v1.ngf.v11","reconcileID":"be7ade12-dced-4459-a203-ee0a8433426c"}

With these fields on the log entry:

app: nginx-gateway-fabric
container: nginx-gateway
filename: /var/log/pods/default_ngf-nginx-gateway-fabric-7948556d5f-wzhnx_5428d529-0c7a-4ad0-9d22-7a02f69ec33b/nginx-gateway/0.log
instance: ngf
job: default/nginx-gateway-fabric
namespace: default
node_name: pool-15gb-node-651007
pod: ngf-nginx-gateway-fabric-7948556d5f-wzhnx
stream: stderr

That is, the first log entry in that sequence was made by the "nginx-gateway-fabric" pod.
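
For reference, here's a minimal sketch (in Python) of the kind of Loki `query_range` call that surfaces this entry. The Loki URL and time window are placeholders; the label names mirror the fields shown above:

```python
import time
import requests

LOKI_URL = "http://localhost:3100"  # placeholder, e.g. a port-forwarded Loki service

# LogQL: select the NGF pod's logs in the "default" namespace and filter
# for the reconcile message seen above.
query = '{namespace="default", app="nginx-gateway-fabric"} |= "Reconciling the resource"'

now_ns = time.time_ns()
params = {
    "query": query,
    "start": now_ns - 24 * 60 * 60 * 10**9,  # last 24 hours, in Unix nanoseconds
    "end": now_ns,
    "limit": 100,
    "direction": "forward",  # oldest first, so the first hit is the start of the sequence
}

resp = requests.get(f"{LOKI_URL}/loki/api/v1/query_range", params=params)
resp.raise_for_status()
for stream in resp.json()["data"]["result"]:
    for ts_ns, line in stream["values"]:
        print(ts_ns, line)
```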

So the NGF pod is apparently watching that Secret and trying to reconcile the cluster state against it (to "upsert to it"), for some reason. Maybe it's on a timer, maybe it's something else.

To try to figure that out, I tracked down the line that produces that log entry (I think this is it, anyway): https://github.com/nginxinc/nginx-gateway-fabric/blob/e1d6ebb5065bab73af3a89faba4f49c7a5b971cd/internal/framework/controller/reconciler.go#L76

Venryx commented 7 months ago

While it is a "workaround" rather than a "proper fix", I ended up changing the Tiltfiles from using "helm_resource" to the older "helm_remote", and this seems to have resolved the issue (no recurrence in the last several days). Basically: something was calling "uninstall" on the Helm charts registered in the remote cluster. By using "helm_remote" instead, we deploy the individual resources directly rather than under a "chart" entry, so this unwanted top-down uninstall can no longer happen. A sketch of the change is below.
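
Here's a rough sketch of the shape of that change, written in Tilt's Starlark (Python-syntax) Tiltfile dialect. The chart name, repo URL, and release name are placeholders, not the actual values from our Tiltfiles:

```python
# Before (sketch): helm_resource installs the chart as a single Helm release,
# which a later top-down "helm uninstall" can remove wholesale.
# load('ext://helm_resource', 'helm_resource', 'helm_repo')
# helm_repo('ngf-repo', 'https://example.com/charts')  # placeholder repo
# helm_resource('ngf', 'ngf-repo/nginx-gateway-fabric', namespace='default')

# After (sketch): helm_remote templates the chart locally and deploys the
# resulting resources individually, so there is no Helm release object for
# an unwanted uninstall to target.
load('ext://helm_remote', 'helm_remote')
helm_remote(
    'nginx-gateway-fabric',                 # chart name (placeholder)
    repo_url='https://example.com/charts',  # placeholder repo URL
    release_name='ngf',
    namespace='default',
)
```

The tradeoff is that Helm no longer tracks the deployment as a release, so release-level operations (upgrade, rollback, uninstall) don't apply; Tilt just manages the rendered resources.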

I'll close this issue for now, since "helm_remote" resolves the problem and works fine. But of course, if the root cause of this unwanted uninstall is ever discovered, it's preferable to resolve that rather than having to rely on this (semi) workaround.