Closed: Venryx closed this issue 7 months ago
Below is some information copied from the debugging session for this issue (search Slack DMs around 2024-03-18 for more details).
Okay, I checked the logs of the internal "systemd-journal" and the logs for the "loki visible" pods in each namespace. The very first log entry that I was able to find for the sequence that ends in the NGF destruction is this line from above:
{"level":"info","ts":"2024-03-19T02:28:26Z","msg":"Reconciling the resource","controller":"secret","controllerGroup":"","controllerKind":"Secret","Secret":{"name":"sh.helm.release.v1.ngf.v11","namespace":"default"},"namespace":"default","name":"sh.helm.release.v1.ngf.v11","reconcileID":"be7ade12-dced-4459-a203-ee0a8433426c"}
With these fields on the log entry:
app: nginx-gateway-fabric
container: nginx-gateway
filename: /var/log/pods/default_ngf-nginx-gateway-fabric-7948556d5f-wzhnx_5428d529-0c7a-4ad0-9d22-7a02f69ec33b/nginx-gateway/0.log
instance: ngf
job: default/nginx-gateway-fabric
namespace: default
node_name: pool-15gb-node-651007
pod: ngf-nginx-gateway-fabric-7948556d5f-wzhnx
stream: stderr
That is, the first log entry in that sequence was made by the "nginx-gateway-fabric" pod.
So the NGF pod is apparently looking at the Secret and trying to reconcile the cluster state to "upsert to it", for some reason. Maybe it's a timer, maybe it's something else.
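For context on what that Secret is: with its default storage driver, Helm 3 keeps each release revision as a Secret named `sh.helm.release.v1.<release>.v<revision>`. So the entry above refers to Helm's record for revision 11 of the `ngf` release, not to an application secret. The snippet below is a minimal sketch of pulling that out of such a log line. The helper name `parse_helm_release_secret` is made up for illustration, and the log line is trimmed to the fields used here.

```python
import json
import re

# Log line copied from above, trimmed to the fields this sketch uses.
LOG_LINE = (
    '{"level":"info","ts":"2024-03-19T02:28:26Z",'
    '"msg":"Reconciling the resource","controllerKind":"Secret",'
    '"Secret":{"name":"sh.helm.release.v1.ngf.v11","namespace":"default"}}'
)

# Helm 3 release-record naming: sh.helm.release.v1.<release>.v<revision>
HELM_SECRET_RE = re.compile(
    r"^sh\.helm\.release\.v1\.(?P<release>.+)\.v(?P<revision>\d+)$"
)


def parse_helm_release_secret(name):
    """Return (release, revision) for a Helm release Secret name, else None.

    Hypothetical helper for this note; not part of Helm or NGF.
    """
    match = HELM_SECRET_RE.match(name)
    if match is None:
        return None
    return match.group("release"), int(match.group("revision"))


entry = json.loads(LOG_LINE)
release_info = parse_helm_release_secret(entry["Secret"]["name"])
print(entry["ts"], entry["Secret"]["namespace"], release_info)
```

If that reading is right, the NGF controller may simply have been reacting to Helm's release record changing, rather than initiating the change itself. This is an interpretation, not something the logs confirm.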
To try to figure that out, I tracked down the line that is causing that log line (I think this is it, anyway): https://github.com/nginxinc/nginx-gateway-fabric/blob/e1d6ebb5065bab73af3a89faba4f49c7a5b971cd/internal/framework/controller/reconciler.go#L76
While it is a "workaround" rather than a "proper fix", I ended up changing the tiltfiles from using "helm_resource" to the older "helm_remote", and this seems to have resolved the issue (no recurrence in the last several days). Basically: Something was calling "uninstall" on the helm charts marked within the remote cluster. By using "helm_remote" instead, we just deploy the individual resources rather than under a "chart" entry, making this "unwanted top-down uninstall" unable to happen.
I'll close this issue for now, since "helm_remote" resolves the problem and works fine. But of course, if the "root cause" of this unwanted uninstall is ever discovered, it's preferable to resolve that rather than having to use this (semi) workaround.
Summary
General:
Possibly related:
Occurrences
Discovered: 2024-03-17 11:59am (PT, by Venryx)
Discovered: 2024-03-18 9:15pm (PT, by Jamie)
Discovered: 2024-03-19 3:44am (PT, by Venryx)
Discovered: 2024-03-19 5:34am (PT, by Venryx)
Discovered: 2024-03-19 8:29pm (PT, by Venryx)
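The log timestamp quoted earlier is in UTC, while the occurrences above are in PT, so lining them up takes a conversion. The sketch below assumes PT here means PDT (UTC-7), since US daylight saving time began on 2024-03-10. It also assumes the log entry belongs to the next occurrence discovered after it, which the notes do not state explicitly.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "PT" in the occurrence list is PDT (UTC-7) on these dates.
PDT = timezone(timedelta(hours=-7), "PDT")

# Timestamp of the first "Reconciling the resource" log entry (UTC).
log_ts = datetime.strptime(
    "2024-03-19T02:28:26Z", "%Y-%m-%dT%H:%M:%SZ"
).replace(tzinfo=timezone.utc)
log_ts_pt = log_ts.astimezone(PDT)

# Discovery times copied from the Occurrences list.
discovered = [
    datetime(2024, 3, 17, 11, 59, tzinfo=PDT),
    datetime(2024, 3, 18, 21, 15, tzinfo=PDT),
    datetime(2024, 3, 19, 3, 44, tzinfo=PDT),
    datetime(2024, 3, 19, 5, 34, tzinfo=PDT),
    datetime(2024, 3, 19, 20, 29, tzinfo=PDT),
]

# First discovery at or after the log entry, and the lag between the two.
next_discovery = min(d for d in discovered if d >= log_ts_pt)
lag = next_discovery - log_ts_pt

# Gaps between consecutive discoveries, as a rough check on the timer idea.
gaps = [later - earlier for earlier, later in zip(discovered, discovered[1:])]

print("log entry (PT):", log_ts_pt.isoformat())
print("next discovery:", next_discovery.isoformat(), "lag:", lag)
print("gaps between discoveries:", [str(g) for g in gaps])
```

Under those assumptions the log entry falls at 7:28pm PT on 2024-03-18, a little under two hours before the 9:15pm discovery. The gaps between discoveries are irregular, but these are discovery times rather than the times the uninstalls happened, so they neither confirm nor rule out a timer.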