emissary-ingress / emissary

open source Kubernetes-native API gateway for microservices built on the Envoy Proxy
https://www.getambassador.io
Apache License 2.0
4.37k stars 687 forks source link

Getting error `Chain context mismatch` which results in high memory usage and crashes #4916

Closed robcxyz closed 1 year ago

robcxyz commented 1 year ago

Describe the bug

I am getting the following error messages from the admin console and emissary pod logs:

Chain context mismatch: Host tracker-v2-governance-apex-host cannot combine with foo.bar.baz.com
Chain context mismatch: Host tracker-v2-governance-apex-host cannot combine with foo.bar.baz.com
Chain context mismatch: Host tracker-v2-governance-host cannot combine with region.v1.foo.bar.baz.com
Chain context mismatch: Host tracker-v2-governance-host cannot combine with region.v1.foo.bar.baz.com

The emissary pods are also running near their memory limit (%70) and crashing frequently especially when any configuration change is applied to them. Talked with @cindymullins-dw on slack and she mentioned to post the issue here.

For my configuration I am trying to serve the same service on two endpoints. Here is a simplified version of my config:

---
# Apex
apiVersion: getambassador.io/v2
kind: Host
metadata:
  name: "tracker-v2-governance-apex-host"
spec:
  hostname: "foo.bar.baz.com"
  tlsSecret:
    name: "community-origin-tls"
  requestPolicy:
    insecure:
      action: Redirect
      additionalPort: 8080

---
# Regional
apiVersion: getambassador.io/v2
kind: Host
metadata:
  name: "tracker-v2-governance-host"
spec:
  hostname: "region.v1.foo.bar.baz.com"
  tlsSecret:
    name: "community-regional-origin-tls"
  requestPolicy:
    insecure:
      action: Redirect
      additionalPort: 8080

---
apiVersion: getambassador.io/v2
kind:  Mapping
metadata:
  name: tracker-v2-governance-rest
spec:
  hostname: foo.bar.baz.com
  prefix: /api/v1/governance
  rewrite: /api/v1/governance
  service: tracker-v2-governance.tracker-v2.svc

  timeout_ms: 10000
  connect_timeout_ms: 10000
  retry_policy:
    retry_on: "5xx"
    num_retries: 10
  circuit_breakers:
    - max_connections: 4096
      max_pending_requests: 4096
  add_response_headers:
    x-sb-region: "sng"

---

apiVersion: getambassador.io/v2
kind:  Mapping
metadata:
  name: tracker-regional-v2-governance-rest
  labels:
    service_name: tracker-regional-v2-governance
spec:
  hostname: region.v1.foo.bar.baz.com
  prefix: /api/v1/governance
  rewrite: /api/v1/governance
  service: tracker-v2-governance.tracker-v2.svc

  timeout_ms: 10000
  connect_timeout_ms: 10000
  retry_policy:
    retry_on: "5xx"
    num_retries: 10
  circuit_breakers:
    - max_connections: 4096
      max_pending_requests: 4096
  add_response_headers:
    x-sb-region: "sng"

And I get this log on a crash, sorry if this is excessive but don't know if this will help.

time="2023-03-11 04:50:58.8891" level=info msg="started command [\"envoy\" \"-c\" \"/ambassador/bootstrap-ads.json\" \"--base-id\" \"0\" \"--drain-time-s\" \"600\" \"-l\" \"error\" \"--bootstrap-version\" \"3\"]" func="github.com/datawire/dlib/dexec.(*Cmd).Start" file="/go/vend
time="2023-03-11 04:50:58.8893" level=info msg="not logging input read from file \"/dev/stdin\"" func="github.com/datawire/dlib/dexec.(*Cmd).Start" file="/go/vendor/github.com/datawire/dlib/dexec/cmd.go:185" CMD=entrypoint PID=1 THREAD=/envoy dexec.pid=31 dexec.stream=stdin
time="2023-03-11 04:50:58.8893" level=info msg="not logging output written to file \"/dev/stdout\"" func="github.com/datawire/dlib/dexec.(*Cmd).Start" file="/go/vendor/github.com/datawire/dlib/dexec/cmd.go:188" CMD=entrypoint PID=1 THREAD=/envoy dexec.pid=31 dexec.stream=stdout
time="2023-03-11 04:50:58.8894" level=info msg="not logging output written to file \"/dev/stderr\"" func="github.com/datawire/dlib/dexec.(*Cmd).Start" file="/go/vendor/github.com/datawire/dlib/dexec/cmd.go:191" CMD=entrypoint PID=1 THREAD=/envoy dexec.pid=31 dexec.stream=stderr
time="2023-03-11 04:50:58.9755" level=info msg="Memory Usage 0.22Gi (37%)\n    PID 1, 0.17Gi: busyambassador entrypoint \n    PID 14, 0.04Gi: /usr/bin/python /usr/bin/diagd /ambassador/snapshots /ambassador/bootstrap-ads.json /ambassador/envoy/envoy.json --notices /ambassador/n
2023-03-11 04:50:59 diagd 2.2.2 [P18TThreadPoolExecutor-0_1] INFO: F1FA12FC-2AA5-4E84-9B29-05ECE084398E: 127.0.0.1 "GET /metrics" 12ms 200 success
time="2023-03-11 04:50:59.9536" level=info msg="Loaded file /ambassador/envoy/envoy.json" func=github.com/datawire/ambassador/v2/cmd/ambex.Decode file="/go/cmd/ambex/main.go:279" CMD=entrypoint PID=1 THREAD=/ambex
time="2023-03-11 04:51:00.6680" level=info msg="Saved snapshot v1" func=github.com/datawire/ambassador/v2/cmd/ambex.csDump file="/go/cmd/ambex/main.go:369" CMD=entrypoint PID=1 THREAD=/ambex
time="2023-03-11 04:51:00.6685" level=info msg="Pushing snapshot v1" func=github.com/datawire/ambassador/v2/cmd/ambex.updaterWithTicker file="/go/cmd/ambex/ratelimit.go:159" CMD=entrypoint PID=1 THREAD=/ambex
2023-03-11 04:51:12 diagd 2.2.2 [P18TThread-3] WARNING: EnvoyStats.update_log_levels failed: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
2023-03-11 04:51:12 diagd 2.2.2 [P18TThread-3] WARNING: EnvoyStats.update failed: HTTPConnectionPool(host='127.0.0.1', port=8001): Max retries exceeded with url: /stats (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f427daa64c0>: Failed to establ
2023/03/11 04:51:12 http: proxy error: context canceled
time="2023-03-11 04:51:12.4550" level=info msg="finished with error: signal: killed" func="github.com/datawire/dlib/dexec.(*Cmd).Wait" file="/go/vendor/github.com/datawire/dlib/dexec/cmd.go:257" CMD=entrypoint PID=1 THREAD=/envoy dexec.pid=31
time="2023-03-11 04:51:12.4561" level=info msg="Memory Usage 0.35Gi (58%)\n    PID 1, 0.28Gi: busyambassador entrypoint \n    PID 14, 0.04Gi: /usr/bin/python /usr/bin/diagd /ambassador/snapshots /ambassador/bootstrap-ads.json /ambassador/envoy/envoy.json --notices /ambassador/n
time="2023-03-11 04:51:12.4641" level=error msg="goroutine \"/envoy\" exited with error: signal: killed" func="github.com/datawire/dlib/dgroup.(*Group).goWorkerCtx.func1.1" file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:380" CMD=entrypoint PID=1 THREAD=/envoy
time="2023-03-11 04:51:12.4793" level=info msg="Memory Usage 0.35Gi (58%)\n    PID 1, 0.28Gi: busyambassador entrypoint \n    PID 14, 0.04Gi: /usr/bin/python /usr/bin/diagd /ambassador/snapshots /ambassador/bootstrap-ads.json /ambassador/envoy/envoy.json --notices /ambassador/n
time="2023-03-11 04:51:12.4805" level=info msg="shutting down (gracefully)..." func="github.com/datawire/dlib/dgroup.(*Group).launchSupervisors.func1" file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:238" CMD=entrypoint PID=1 THREAD=":shutdown_logger"
time="2023-03-11 04:51:12.4806" level=info msg="sending SIGINT" func="github.com/datawire/dlib/dexec.(*Cmd).Start.func1" file="/go/vendor/github.com/datawire/dlib/dexec/cmd.go:211" CMD=entrypoint PID=1 THREAD=/diagd
time="2023-03-11 04:51:12.4807" level=info msg="shutting down (gracefully)..." func="github.com/datawire/dlib/dgroup.(*Group).launchSupervisors.func1" file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:238" CMD=entrypoint PID=1 THREAD="/watcher:shutdown_logger"
time="2023-03-11 04:51:12.4807" level=info msg=Done func=github.com/datawire/ambassador/v2/cmd/ambex.Main2 file="/go/cmd/ambex/main.go:948" CMD=entrypoint PID=1 THREAD=/ambex
[2023-03-11 04:51:12 +0000] [14] [INFO] Handling signal: int
2023-03-11 04:51:12 diagd 2.2.2 [P18TThreadPoolExecutor-0_0] WARNING: EnvoyStats.get_prometheus_state failed: HTTPConnectionPool(host='127.0.0.1', port=8001): Max retries exceeded with url: /stats/prometheus (Caused by NewConnectionError('<urllib3.connection.HTTPConnection obje
2023-03-11 04:51:12 diagd 2.2.2 [P18TThreadPoolExecutor-0_0] INFO: 9ADC6366-C5AA-489B-BFB0-61F6CF7FDED4: 127.0.0.1 "GET /metrics" 5ms 200 success
[2023-03-11 04:51:12 +0000] [18] [INFO] Worker exiting (pid: 18)
[2023-03-11 04:51:13 +0000] [14] [INFO] Shutting down: Master
time="2023-03-11 04:51:13.3869" level=info msg="finished successfully: exit status 0" func="github.com/datawire/dlib/dexec.(*Cmd).Wait" file="/go/vendor/github.com/datawire/dlib/dexec/cmd.go:255" CMD=entrypoint PID=1 THREAD=/diagd dexec.pid=14
time="2023-03-11 04:51:13.3890" level=info msg="  final goroutine statuses:" func=github.com/datawire/dlib/dgroup.logGoroutineStatuses file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:84" CMD=entrypoint PID=1 THREAD=":shutdown_status"

To Reproduce Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Sorry, but this is very specific to my environment. Let me know if I should work more on reproducing this on minikube.

Expected behavior

No crashes.

Versions (please complete the following information):

Additional context

Cindy asked me to verify that the TLS secrets are in the same namespace -> they are. She also linked me to this SO question but not sure what to do to fix it.

More than willing to upgrade to the newer emissary as the LOC that is causing the issue doesn't seem to exist in newer versions -> https://github.com/emissary-ingress/emissary/blob/dddd983be480b02465da4f396a29ec9ce75557fd/python/ambassador/envoy/v2/v2listener.py#L89

Thank you very much for your help.

cindymullins-dw commented 1 year ago

Is it possible that this is a certificate mismatch, as in this question?