emissary-ingress / emissary

Open source Kubernetes-native API gateway for microservices built on the Envoy Proxy
https://www.getambassador.io
Apache License 2.0

emissary fails to recover after SIGTERM - possibly due to frequent reconfig #5725

Open fs185143 opened 1 month ago

fs185143 commented 1 month ago

**Describe the bug**

```
ACCESS [2024-07-19T09:59:29.493Z] "GET HTTP/1.1" 403 UAEX 0 3984 0 - ", ,, , ," "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36 Edg/126.0.0.0" " "." "-"
```


**To Reproduce**
Can work on providing a more concrete example if needed, but we are observing this after around 12-24h on a cluster as described.

**Expected behaviour**
Should be able to recover after being replaced. We have observed the SIGTERM due to a promotion which is expected, but also SIGKILL due to memory usage - the latter seems to have been fixed after increasing our resource limits.

**Versions (please complete the following information):**
 - Ambassador: 3.9.1
 - Kubernetes environment: GKE
 - Version: v1.27.14-gke.1042001

**Additional context**
Possibly related:
- https://github.com/emissary-ingress/emissary/issues/3684
- https://github.com/emissary-ingress/emissary/issues/3264

Some relevant configuration options:
```yaml
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 90
      containers:
      - name: ambassador
        lifecycle:
          preStop:
            exec:
              command:
              - sleep 60
        livenessProbe:
          failureThreshold: 6
          httpGet:
            path: /ambassador/v0/check_alive
            port: admin
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 6
          successThreshold: 1
          timeoutSeconds: 1
        readinessProbe:
          failureThreshold: 6
          httpGet:
            path: /ambassador/v0/check_ready
            port: admin
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 6
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: "1"
            memory: 900Mi
          requests:
            cpu: 200m
            memory: 600Mi
```
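One detail worth double-checking in the config above (editor's note, not raised in the thread): Kubernetes `exec` lifecycle hooks do not run through a shell, and each argument of `command` must be a separate list item, so a single `sleep 60` element makes the kubelet look for an executable literally named `sleep 60` and the hook fails. The usual form is:

```yaml
lifecycle:
  preStop:
    exec:
      # argv is passed directly to exec(); each argument is its own list item
      command:
      - sleep
      - "60"
```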
kflynn commented 1 month ago

@fs185143, just to confirm, is this with Emissary or Ambassador Edge Stack? (It's a bug either way, just want to make sure of what I'm looking at. 🙂)

cindymullins-dw commented 1 month ago

Failing to recover after the SIGTERM sounds like a bug. But you can take some measures to address large reconfigurations or high CPU/memory usage with some of these specs. You can also try reducing AMBASSADOR_AMBEX_SNAPSHOT_COUNT.

> A common reason for high CPU and memory is a high number of automated snapshots of your configuration. Snapshots are only stored until the limit of AMBASSADOR_AMBEX_SNAPSHOT_COUNT is met.

fs185143 commented 1 month ago

> Failing to recover after the SIGTERM sounds like a bug. But you can take some measures to address large reconfigurations or high CPU/memory usage with some of these specs. You can also try reducing AMBASSADOR_AMBEX_SNAPSHOT_COUNT.
>
> > A common reason for high CPU and memory is a high number of automated snapshots of your configuration. Snapshots are only stored until the limit of AMBASSADOR_AMBEX_SNAPSHOT_COUNT is met.

thanks - we will try this on our emissary/emissary-ingress deployment

```yaml
spec:
  template:
    spec:
      containers:
      - name: ambassador
        env:
        - name: AMBASSADOR_AMBEX_SNAPSHOT_COUNT
          value: "5" # default 30
        - name: AMBASSADOR_FAST_RECONFIGURE
          value: "false"
        - name: AMBASSADOR_DRAIN_TIME
          value: "300" # default 600
```
fs185143 commented 1 month ago

> @fs185143, just to confirm, is this with Emissary or Ambassador Edge Stack? (It's a bug either way, just want to make sure of what I'm looking at. 🙂)

I believe this comes under Emissary, as we are not using the enterprise product.

fs185143 commented 1 month ago

We are now seeing these ambassador logs:

```
Memory Usage 0.85Gi (96%)
    PID 1, 0.14Gi: busyambassador entrypoint
    PID 16, 0.04Gi: /usr/bin/python /usr/bin/diagd /ambassador/snapshots /ambassador/bootstrap-ads.json /ambassador/envoy/envoy.json --notices /ambassador/notices.json --port 8004 --kick kill -HUP 1
    PID 18, 0.34Gi: /usr/bin/python /usr/bin/diagd /ambassador/snapshots /ambassador/bootstrap-ads.json /ambassador/envoy/envoy.json --notices /ambassador/notices.json --port 8004 --kick kill -HUP 1
    PID 29, 0.11Gi: envoy -c /ambassador/bootstrap-ads.json --base-id 0 --drain-time-s 300 -l error
    PID 246459, 0.10Gi (exited): envoy --service-node test-id --service-cluster ambassador-emissary --config-path /ambassador/snapshots/econf-tmp.json --mode validate
    PID 246489, 0.08Gi (exited): envoy --service-node test-id --service-cluster ambassador-emissary --config-path /ambassador/snapshots/econf-tmp.json --mode validate
    PID 246525, 0.09Gi: envoy --service-node test-id --service-cluster ambassador-emissary --config-path /ambassador/snapshots/econf-tmp.json --mode validate
...
time="2024-08-04 13:55:43.4550" level=warning msg="Memory Usage: throttling reconfig v41082 due to constrained memory with 15 stale reconfigs (1 max)" func=github.com/emissary-ingress/emissary/v3/pkg/ambex.updaterWithTicker file="/go/pkg/ambex/ratelimit.go:140" CMD=entrypoint PID=1 THREAD=/ambex/updater
```

which caused the emissary-ingress pod to restart after running for 3 days, and post-restart it only gives 403 errors.
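As a side note (editor's addition, not from the thread): the "96%" in that log line is usage measured against the container's memory limit, so 0.85Gi against the 900Mi limit below lines up, with the small discrepancy explained by rounding of the reported usage:

```python
# Back-of-the-envelope check of the "Memory Usage 0.85Gi (96%)" log line:
# usage is reported relative to the container's memory limit (900Mi).
usage_gi = 0.85        # reported usage, rounded to two decimals in the log
limit_gi = 900 / 1024  # the 900Mi limit expressed in Gi
pct = 100 * usage_gi / limit_gi
print(f"{pct:.0f}%")   # ~97%; the log's 96% reflects rounding of 0.85Gi
```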

This is with the following config

```yaml
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 90
      containers:
      - name: ambassador
        env:
        - name: AMBASSADOR_AMBEX_SNAPSHOT_COUNT
          value: "5"
        - name: AMBASSADOR_FAST_RECONFIGURE
          value: "false"
        - name: AMBASSADOR_DRAIN_TIME
          value: "300"
        - name: SCOUT_DISABLE
          value: "1"
        lifecycle:
          preStop:
            exec:
              command:
              - sleep 60
        livenessProbe:
          failureThreshold: 6
          httpGet:
            path: /ambassador/v0/check_alive
            port: admin
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 6
          successThreshold: 1
          timeoutSeconds: 1
        readinessProbe:
          failureThreshold: 6
          httpGet:
            path: /ambassador/v0/check_ready
            port: admin
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 6
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: "1"
            memory: 900Mi
          requests:
            cpu: 200m
            memory: 600Mi
```

Would appreciate any advice/support - thanks.

I am wondering if we might just need to increase the memory limit, but I'm also not sure why it returns the ACCESS 403 errors after it restarts.

fs185143 commented 1 month ago

Going to try:

Not sure if it's worth changing AMBASSADOR_SNAPSHOT_COUNT?

fs185143 commented 1 month ago

Could the 403s be due to some cert/auth issue that occurs after restart perhaps?

Maybe related to:

fs185143 commented 4 weeks ago

OK so I think it's actually more related to #5564

We are seeing these logs after enabling debug:

```
[2024-08-07 11:02:15.136][31][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:152] dns resolution for authserver.authserver failed with c-ares status 12
[2024-08-07 11:02:15.136][31][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:245] DNS request timed out 4 times
[2024-08-07 11:02:15.136][31][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:278] dns resolution for authserver.authserver completed with status 1
[2024-08-07 11:02:15.136][31][debug][upstream] [source/extensions/clusters/strict_dns/strict_dns_cluster.cc:184] DNS refresh rate reset for authserver.authserver, (failure) refresh rate 5000 ms
```
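For context (editor's addition, not from the thread): c-ares status 12 is `ARES_ETIMEOUT`, i.e. the resolver's queries to the DNS server timed out, and `UAEX` in the 403 access logs is Envoy's response flag for a request denied by the external authorization service. That fits a picture where the `authserver.authserver` cluster loses its endpoints after DNS failure and the AuthService then rejects everything. A rough illustrative sketch (not Emissary's or Envoy's actual code) of why a STRICT_DNS-style cluster goes dark when resolution fails:

```python
import socket

def refresh_endpoints(hostname: str, port: int = 80) -> list[str]:
    """Illustrative sketch: a STRICT_DNS-style cluster re-resolves its
    hostname on each refresh and serves only the addresses it just got
    back. If resolution keeps timing out, the cluster can be left with
    no usable endpoints, so an AuthService behind it effectively denies
    every request (the 403/UAEX pattern seen above)."""
    try:
        infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
        return sorted({info[4][0] for info in infos})
    except socket.gaierror:
        return []  # resolution failed: no endpoints this cycle

# "authserver.authserver" is the in-cluster name from the logs above; it
# only resolves inside the cluster, which mimics the failure mode here.
print(refresh_endpoints("localhost"))              # resolvable: endpoints
print(refresh_endpoints("authserver.authserver"))  # unresolvable: []
```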

Do we think this may have been fixed by a version of Envoy > 1.27.2? e.g.,

kflynn commented 4 weeks ago

@fs185143 That's definitely the first thing to try, which is part of why I'm trying to do a new 3.10 release. 🙂

Note that there's a development branch of Emissary 4 that's already running Envoy 1.30: if you're interested in testing that to see how it behaves here, let me know.