isala404 opened 1 month ago
I ran some additional tests to fully understand the behavior we're observing. One of these tests involved chaining two Envoy instances to see how they interact. From this test, I discovered that Envoy is not designed to propagate a FIN packet to the downstream service. Instead, when a new request arrives over an already established downstream connection but the connection to the upstream service has been terminated, Envoy is expected to open a new connection to the upstream backend.
Please review the following Hubble log from the two chained Envoy servers:
As illustrated in the logs, Envoy-1 establishes a connection with Envoy-2 using port 52522. When the pod is deleted, the same port gets reused.
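The stale-connection behavior described above can be reproduced at the plain TCP level, independent of Envoy or Cilium. The following is a toy sketch in Python (sockets only, hypothetical port and request bytes, not the actual proxy code): a "workload" serves one request and then terminates, and the "proxy" side reuses its pooled connection afterwards, which surfaces as a reset rather than a FIN-driven teardown:

```python
import queue
import socket
import threading
import time

port_q = queue.Queue()

def upstream():
    # Toy "workload pod": serve exactly one request, then terminate.
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    port_q.put(srv.getsockname()[1])
    conn, _ = srv.accept()
    conn.recv(1024)
    conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n")
    conn.close()  # pod termination: FIN goes to the proxy's upstream connection
    srv.close()

t = threading.Thread(target=upstream)
t.start()

# The "proxy" side: one pooled connection to the upstream.
pooled = socket.socket()
pooled.connect(("127.0.0.1", port_q.get()))
pooled.sendall(b"GET / HTTP/1.1\r\nHost: foo\r\n\r\n")
first_line = pooled.recv(1024).decode().splitlines()[0]
t.join()

# Reuse the pooled connection after the peer is gone. The first write is
# accepted locally but draws an RST from the dead peer; the follow-up
# write then fails with a reset/broken-pipe error -- the raw form of the
# "connection termination" reset seen in the 503.
try:
    pooled.sendall(b"GET / HTTP/1.1\r\nHost: foo\r\n\r\n")
    time.sleep(0.2)
    pooled.sendall(b"GET / HTTP/1.1\r\nHost: foo\r\n\r\n")
    reuse_result = "reuse unexpectedly succeeded"
except (ConnectionResetError, BrokenPipeError) as exc:
    reuse_result = f"reuse failed: {type(exc).__name__}"

print(first_line)
print(reuse_result)
```

The sketch only illustrates why the *first* request after the restart fails; it does not model the kube-proxy-replacement case where subsequent requests also fail.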
Is there an existing issue for this?
Version
higher than v1.14.13 and lower than v1.15.0
What happened?
We've encountered an issue with Cilium Envoy in our Kubernetes environment. Our setup consists of a request chain involving three services:
We're using Cilium proxy annotations for the workload container port, configured as ingress HTTP, primarily for observability purposes.
The problem occurs when the workload pod restarts. Immediately after the restart, the next request receives an error:
upstream connect error or disconnect/reset before headers. reset reason: connection termination
(HTTP 503). Upon investigation, we discovered the following:
Cilium Envoy acts as an intermediary, creating two connections: one to the downstream API management Envoy and another to the upstream workload pod.
When the workload pod terminates, it sends a FIN packet. Cilium Envoy consumes this packet but does not terminate the corresponding downstream connection (the one from the API management Envoy). That connection remains open until it hits the idle timeout.
If a new request is sent immediately after the workload pod restart, our Envoy attempts to reuse the existing connection, resulting in a 503 error.
The initial failed request causes the connection to terminate, allowing subsequent requests to succeed.
However, we noticed that with kube-proxy replacement enabled, even subsequent requests continue to fail until we restart our Envoy to break the connection.
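One client-side mitigation we could sketch (an assumption on our side, not a fix for the Cilium behavior itself) is to have the intermediate Envoy retry once when a pooled upstream connection is reset, so the first request after a pod restart does not surface as a 503. `retry_on` and `num_retries` are standard fields of an Envoy route `retry_policy`; the route and cluster names here match the reproduction config below:

```yaml
routes:
- match:
    prefix: "/"
  route:
    cluster: foo_service
    retry_policy:
      # Retry once when a stale pooled connection is reset or the
      # connect fails, instead of returning the 503 to the client.
      retry_on: reset,connect-failure
      num_retries: 1
```

This only papers over the first failed request; it does not address the kube-proxy-replacement case where retries on a new connection also fail.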
Hubble outputs
With kube proxy
Without kube proxy
How can we reproduce the issue?
cilium install --version 1.14.13
on an AKS cluster with BYOCNI.

envoy.yaml
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: apim
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: envoy-config
  namespace: apim
data:
  envoy.yaml: |
    static_resources:
      listeners:
      - address:
          socket_address:
            address: 0.0.0.0
            port_value: 8080
        filter_chains:
        - filters:
          - name: envoy.filters.network.http_connection_manager
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
              stat_prefix: ingress_http
              route_config:
                name: local_route
                virtual_hosts:
                - name: backend
                  domains:
                  - "*"
                  routes:
                  - match:
                      prefix: "/"
                    route:
                      cluster: foo_service
              http_filters:
              - name: envoy.filters.http.router
                typed_config:
                  "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
      clusters:
      - name: foo_service
        connect_timeout: 0.25s
        type: STRICT_DNS
        lb_policy: ROUND_ROBIN
        load_assignment:
          cluster_name: foo_service
          endpoints:
          - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: foo.bar.svc.cluster.local
                    port_value: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: envoy-proxy
  namespace: apim
spec:
  replicas: 1
  selector:
    matchLabels:
      app: envoy-proxy
  template:
    metadata:
      labels:
        app: envoy-proxy
    spec:
      containers:
      - name: envoy
        image: envoyproxy/envoy:v1.24.7
        ports:
        - containerPort: 8080
        volumeMounts:
        - name: envoy-config
          mountPath: /etc/envoy
      volumes:
      - name: envoy-config
        configMap:
          name: envoy-config
---
apiVersion: v1
kind: Service
metadata:
  name: envoy-proxy
  namespace: apim
spec:
  selector:
    app: envoy-proxy
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: envoy-ingress
  namespace: apim
spec:
  ingressClassName: nginx
  rules:
  - host: envoy.cilium.internal
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: envoy-proxy
            port:
              number: 8080
```

workload.yaml
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: bar
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: foo
  namespace: bar
spec:
  replicas: 1
  selector:
    matchLabels:
      app: foo
  template:
    metadata:
      labels:
        app: foo
      annotations:
        policy.cilium.io/proxy-visibility: "
```

curl LB-IP/healthz -H "Host: envoy.cilium.internal"

kubectl delete pods --all -n bar
Cilium Version
1.14.13
Kernel Version
5.15.0-1067-azure
Kubernetes Version
v1.28.10
Regression
Sysdump
With Kube Proxy
cilium-sysdump-20240717-182934.zip
Without Kube Proxy
cilium-sysdump-20240717-185333.zip
Relevant log output
No response
Anything else?
The issue occurred regardless of whether WireGuard encryption was enabled.
Cilium Users Document
Code of Conduct