emissary-ingress / emissary

Open source Kubernetes-native API gateway for microservices built on the Envoy Proxy
https://www.getambassador.io
Apache License 2.0

emissary fails to recover after SIGTERM - possibly due to frequent reconfig #5725

Open fs185143 opened 1 month ago

fs185143 commented 1 month ago

**Describe the bug**

```
ACCESS [2024-07-19T09:59:29.493Z] "GET HTTP/1.1" 403 UAEX 0 3984 0 - ", ,, , ," "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36 Edg/126.0.0.0" " "." "-"
```


**To Reproduce**
Can work on providing a more concrete example if needed, but we are observing this after around 12-24h on a cluster as described.

**Expected behaviour**
Should be able to recover after being replaced. We have observed the SIGTERM due to a promotion which is expected, but also SIGKILL due to memory usage - the latter seems to have been fixed after increasing our resource limits.

**Versions (please complete the following information):**
 - Ambassador: 3.9.1
 - Kubernetes environment: GKE
 - Version: v1.27.14-gke.1042001

**Additional context**
Possibly related:
- https://github.com/emissary-ingress/emissary/issues/3684
- https://github.com/emissary-ingress/emissary/issues/3264

Some relevant configuration options:
```yaml
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 90
      containers:
      - name: ambassador
        lifecycle:
          preStop:
            exec:
              command:
              - sleep 60
        livenessProbe:
          failureThreshold: 6
          httpGet:
            path: /ambassador/v0/check_alive
            port: admin
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 6
          successThreshold: 1
          timeoutSeconds: 1
        readinessProbe:
          failureThreshold: 6
          httpGet:
            path: /ambassador/v0/check_ready
            port: admin
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 6
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: "1"
            memory: 900Mi
          requests:
            cpu: 200m
            memory: 600Mi
```
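One detail worth double-checking in the config above (editor's note, not raised in the thread): Kubernetes `exec` lifecycle hooks do not run through a shell, and each argument of `command` must be a separate list item, so a single `sleep 60` element makes the kubelet look for an executable literally named `sleep 60` and the hook fails. The usual form is:

```yaml
lifecycle:
  preStop:
    exec:
      # argv is passed directly to exec(); each argument is its own list item
      command:
      - sleep
      - "60"
```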
kflynn commented 1 month ago

@fs185143, just to confirm, is this with Emissary or Ambassador Edge Stack? (It's a bug either way, just want to make sure of what I'm looking at. 🙂)

cindymullins-dw commented 1 month ago

Failing to recover after the SIGTERM sounds like a bug. But you can take some measures to address large reconfigurations or high CPU/memory usage with some of these specs. You can also try reducing AMBASSADOR_AMBEX_SNAPSHOT_COUNT.

> A common reason for high CPU and memory is a high number of automated snapshots of your configuration. Snapshots are only stored until the limit of AMBASSADOR_AMBEX_SNAPSHOT_COUNT is met.

fs185143 commented 1 month ago

> Failing to recover after the SIGTERM sounds like a bug. But you can take some measures to address large reconfigurations or high CPU/memory usage with some of these specs. You can also try reducing AMBASSADOR_AMBEX_SNAPSHOT_COUNT.
>
> > A common reason for high CPU and memory is a high number of automated snapshots of your configuration. Snapshots are only stored until the limit of AMBASSADOR_AMBEX_SNAPSHOT_COUNT is met.

thanks - we will try this on our emissary/emissary-ingress deployment

```yaml
spec:
  template:
    spec:
      containers:
      - name: ambassador
        env:
        - name: AMBASSADOR_AMBEX_SNAPSHOT_COUNT
          value: "5" # default 30
        - name: AMBASSADOR_FAST_RECONFIGURE
          value: "false"
        - name: AMBASSADOR_DRAIN_TIME
          value: "300" # default 600
```
fs185143 commented 1 month ago

> @fs185143, just to confirm, is this with Emissary or Ambassador Edge Stack? (It's a bug either way, just want to make sure of what I'm looking at. 🙂)

I believe this comes under Emissary, as we are not using the enterprise product.

fs185143 commented 1 month ago

We are now seeing these ambassador logs:

```
Memory Usage 0.85Gi (96%)
    PID 1, 0.14Gi: busyambassador entrypoint
    PID 16, 0.04Gi: /usr/bin/python /usr/bin/diagd /ambassador/snapshots /ambassador/bootstrap-ads.json /ambassador/envoy/envoy.json --notices /ambassador/notices.json --port 8004 --kick kill -HUP 1
    PID 18, 0.34Gi: /usr/bin/python /usr/bin/diagd /ambassador/snapshots /ambassador/bootstrap-ads.json /ambassador/envoy/envoy.json --notices /ambassador/notices.json --port 8004 --kick kill -HUP 1
    PID 29, 0.11Gi: envoy -c /ambassador/bootstrap-ads.json --base-id 0 --drain-time-s 300 -l error
    PID 246459, 0.10Gi (exited): envoy --service-node test-id --service-cluster ambassador-emissary --config-path /ambassador/snapshots/econf-tmp.json --mode validate
    PID 246489, 0.08Gi (exited): envoy --service-node test-id --service-cluster ambassador-emissary --config-path /ambassador/snapshots/econf-tmp.json --mode validate
    PID 246525, 0.09Gi: envoy --service-node test-id --service-cluster ambassador-emissary --config-path /ambassador/snapshots/econf-tmp.json --mode validate
...
time="2024-08-04 13:55:43.4550" level=warning msg="Memory Usage: throttling reconfig v41082 due to constrained memory with 15 stale reconfigs (1 max)" func=github.com/emissary-ingress/emissary/v3/pkg/ambex.updaterWithTicker file="/go/pkg/ambex/ratelimit.go:140" CMD=entrypoint PID=1 THREAD=/ambex/updater
```

which caused the emissary-ingress pod to restart after running for 3 days, and post-restart it only gives 403 errors.
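As a side note (editor's addition, not from the thread): the "96%" in that log line is usage measured against the container's memory limit, so 0.85Gi against the 900Mi limit below lines up, with the small discrepancy explained by rounding of the reported usage:

```python
# Back-of-the-envelope check of the "Memory Usage 0.85Gi (96%)" log line:
# usage is reported relative to the container's memory limit (900Mi).
usage_gi = 0.85        # reported usage, rounded to two decimals in the log
limit_gi = 900 / 1024  # the 900Mi limit expressed in Gi
pct = 100 * usage_gi / limit_gi
print(f"{pct:.0f}%")   # ~97%; the log's 96% reflects rounding of 0.85Gi
```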

This is with the following config

```yaml
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 90
      containers:
      - name: ambassador
        env:
        - name: AMBASSADOR_AMBEX_SNAPSHOT_COUNT
          value: "5"
        - name: AMBASSADOR_FAST_RECONFIGURE
          value: "false"
        - name: AMBASSADOR_DRAIN_TIME
          value: "300"
        - name: SCOUT_DISABLE
          value: "1"
        lifecycle:
          preStop:
            exec:
              command:
              - sleep 60
        livenessProbe:
          failureThreshold: 6
          httpGet:
            path: /ambassador/v0/check_alive
            port: admin
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 6
          successThreshold: 1
          timeoutSeconds: 1
        readinessProbe:
          failureThreshold: 6
          httpGet:
            path: /ambassador/v0/check_ready
            port: admin
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 6
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: "1"
            memory: 900Mi
          requests:
            cpu: 200m
            memory: 600Mi
```

Would appreciate any advice/support - thanks.

I am wondering if we might just need to increase the memory limit, but I'm also not sure why it returns the ACCESS 403 errors after it restarts.

fs185143 commented 1 month ago

Going to try:

Not sure if it's worth changing AMBASSADOR_SNAPSHOT_COUNT?

fs185143 commented 1 month ago

Could the 403s be due to some cert/auth issue that occurs after restart perhaps?

Maybe related to:

fs185143 commented 4 weeks ago

OK so I think it's actually more related to #5564

We are seeing these logs after enabling debug:

```
[2024-08-07 11:02:15.136][31][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:152] dns resolution for authserver.authserver failed with c-ares status 12
[2024-08-07 11:02:15.136][31][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:245] DNS request timed out 4 times
[2024-08-07 11:02:15.136][31][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:278] dns resolution for authserver.authserver completed with status 1
[2024-08-07 11:02:15.136][31][debug][upstream] [source/extensions/clusters/strict_dns/strict_dns_cluster.cc:184] DNS refresh rate reset for authserver.authserver, (failure) refresh rate 5000 ms
```
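For context (editor's addition, not from the thread): c-ares status 12 is `ARES_ETIMEOUT`, i.e. the resolver's queries to the DNS server timed out, and `UAEX` in the 403 access logs is Envoy's response flag for a request denied by the external authorization service. That fits a picture where the `authserver.authserver` cluster loses its endpoints after DNS failure and the AuthService then rejects everything. A rough illustrative sketch (not Emissary's or Envoy's actual code) of why a STRICT_DNS-style cluster goes dark when resolution fails:

```python
import socket

def refresh_endpoints(hostname: str, port: int = 80) -> list[str]:
    """Illustrative sketch: a STRICT_DNS-style cluster re-resolves its
    hostname on each refresh and serves only the addresses it just got
    back. If resolution keeps timing out, the cluster can be left with
    no usable endpoints, so an AuthService behind it effectively denies
    every request (the 403/UAEX pattern seen above)."""
    try:
        infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
        return sorted({info[4][0] for info in infos})
    except socket.gaierror:
        return []  # resolution failed: no endpoints this cycle

# "authserver.authserver" is the in-cluster name from the logs above; it
# only resolves inside the cluster, which mimics the failure mode here.
print(refresh_endpoints("localhost"))              # resolvable: endpoints
print(refresh_endpoints("authserver.authserver"))  # unresolvable: []
```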

Do we think this may have been fixed by a version of Envoy > 1.27.2? e.g.,

kflynn commented 4 weeks ago

@fs185143 That's definitely the first thing to try, which is part of why I'm trying to do a new 3.10 release. 🙂

Note that there's a development branch of Emissary 4 that's already running Envoy 1.30: if you're interested in testing that to see how it behaves here, let me know.