envoyproxy / gateway

Manages Envoy Proxy as a Standalone or Kubernetes-based Application Gateway
https://gateway.envoyproxy.io
Apache License 2.0

504 - upstream connect error or disconnect/reset before headers. reset reason: connection timeout #4114

Closed. amalic closed this issue 4 days ago.

amalic commented 4 weeks ago

Using Envoy Gateway 1.0.1 on our development cluster. All HTTPRoutes work except one route for a React frontend. Clients get a 504 error with the body upstream connect error or disconnect/reset before headers. reset reason: connection timeout, which does not look like an error our app would output.

What we have already tested:

Here's our setup

Please note I have replaced IPs, domains, and namespaces with placeholders I can share publicly. If you spot a naming error, it is likely a typo introduced while replacing our internal names.

curl

# curl -v webapp.mydomain.mytld

* Host webapp.mydomain.mytld:80 was resolved.
* IPv6: (none)
* IPv4: xxx.xxx.xxx.xxx, yyy.yyy.yyy.yyy
*   Trying xxx.xxx.xxx.xxx:80...
* Connected to webapp.mydomain.mytld (xxx.xxx.xxx.xxx) port 80
> GET / HTTP/1.1
> Host: webapp.mydomain.mytld
> User-Agent: curl/8.7.1
> Accept: */*
>
* Request completely sent off
< HTTP/1.1 504 Gateway Timeout
< content-length: 24
< content-type: text/plain
< date: Mon, 26 Aug 2024 11:09:27 GMT
<
* Connection #0 to host webapp.mydomain.mytld left intact
upstream request timeout%

Error Message from envoy gateway logs

{
    "start_time": "2024-08-23T09:27:27.939Z",
    "method": "GET",
    "x-envoy-origin-path": "/",
    "protocol": "HTTP/2",
    "response_code": "503",
    "response_flags": "UF",
    "response_code_details": "upstream_reset_before_response_started{connection_timeout}",
    "connection_termination_details": "-",
    "upstream_transport_failure_reason": "-",
    "bytes_received": "0",
    "bytes_sent": "91",
    "duration": "9998",
    "x-envoy-upstream-service-time": "-",
    "x-forwarded-for": "111.222.333.444",
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "x-request-id": "ed5937ea-6a79-4639-98c7-a84f4471b94c",
    ":authority": "webapp.mydomain.mytld",
    "upstream_host": "1.2.3.4:80",
    "upstream_cluster": "httproute/dev-stage/webapp/rule/0",
    "upstream_local_address": "-",
    "downstream_local_address": "11.22.33.44:10443",
    "downstream_remote_address": "111.222.333.444:44107",
    "requested_server_name": "webapp.mydomain.mytld",
    "route_name": "httproute/dev-stage/webapp/rule/0/match/0/webapp_mydomain_mytld"
}

Webapp Manifests

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
  namespace: dev-stage
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 2 # the old replica must be kept running until new replica is fully operational
      maxSurge: 1 # 1 old and 1 new replica can be active at the same time during deployments
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:

      affinity:
        podAffinity:
          # prefer to schedule related pods on same host
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values: ["api", "webapp", "swagger-ui"]
              topologyKey: kubernetes.io/hostname
        podAntiAffinity:
          # require to *not* schedule pods on the same *host* where we are already running again
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values: ["webapp"]
            topologyKey: kubernetes.io/hostname

      terminationGracePeriodSeconds: 10
      containers:
      - image: <myregistry>/<myimage>:<tag>
        name: webapp
        ports:
        - containerPort: 80
          name: webapp
---
kind: Service
apiVersion: v1
metadata:
  name: webapp
  namespace: dev-stage
spec:
  selector:
    app: webapp
  ports:
  # public
  - name: webapp
    protocol: TCP
    port: 80
    targetPort: 80
---
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: webapp
  namespace: dev-stage
spec:
  parentRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: envoy-gw
      namespace: gwapi-system
  hostnames:
   - "webapp.mydomain.mytld"
  rules:
    - backendRefs:
        - name: webapp
          kind: Service
          namespace: dev-stage
          port: 80
          weight: 1
      matches:
        - path:
            type: PathPrefix
            value: /

Gateway Manifests

---
apiVersion: gateway.networking.k8s.io/v1beta1
kind: GatewayClass
metadata:
  name: envoy-gc
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
  parametersRef:
    group: gateway.envoyproxy.io
    kind: EnvoyProxy
    name: custom-proxy-config
    namespace: gwapi-system
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: custom-proxy-config
  namespace: gwapi-system
spec:
  provider:
    type: Kubernetes
    kubernetes:
      envoyDeployment:
        replicas: 3
      envoyService:
        annotations:
          service.beta.kubernetes.io/aws-load-balancer-type: external
          service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: instance
          service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
---
apiVersion: gateway.networking.k8s.io/v1beta1
kind: Gateway
metadata:
  name: envoy-gw
  namespace: gwapi-system
spec:
  gatewayClassName: envoy-gc
  listeners:
  - allowedRoutes:
      namespaces:
        from: All
    hostname: '*.mydomain.mytld'
    name: http
    port: 80
    protocol: HTTP
  - allowedRoutes:
      namespaces:
        from: All
    hostname: '*.mydomain.mytld'
    name: https
    port: 443
    protocol: HTTPS
    tls:
      certificateRefs:
      - group: ""
        kind: Secret
        name: envoy-gw-tls-cert
      mode: Terminate
arkodg commented 4 weeks ago

@amalic

amalic commented 4 weeks ago

> is the error consistently seen, or only sometimes after a duration?

yes

> can you try v1.1.0 instead, does anything change with that Helm chart?

not at the moment

> the access log shows `"protocol": "HTTP/2"`, but you are not using a GRPCRoute nor are you setting any appProtocol field on the Service, so it's odd that Envoy is trying to connect to the upstream over HTTP/2

It's very strange.

The Dockerfile is based on an nginx:alpine image. I even tried increasing timeouts, forcing the HTTP/1 protocol through a ClientTrafficPolicy, and adding 5 retries on any 5xx error through a BackendTrafficPolicy. Still the same result. And like I already said, when I port-forward the service or pod I get the expected response.
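For reference, the retry policy I tried looked roughly like this. This is a sketch: the policy name is made up, and the exact field names (`targetRefs`, `retryOn.triggers`) follow my reading of the gateway.envoyproxy.io/v1alpha1 API and may differ between Envoy Gateway versions.

```yaml
# Sketch of a retry-on-5xx policy attached to the webapp route.
# Field names may vary between Envoy Gateway versions; check
# `kubectl explain backendtrafficpolicy.spec` for your install.
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: webapp-retries        # hypothetical name
  namespace: dev-stage
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: webapp
  retry:
    numRetries: 5
    retryOn:
      triggers: ["5xx"]
```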

nginx default.conf

server {
    listen 80;
    server_name _;
   #...
}
arkodg commented 4 weeks ago

@amalic the issue is that

kind: HTTPRoute
metadata:
  name: webapp

is in the default ns and your backend is in dev-stage, and there isn't any ReferenceGrant to allow linking the route and backend. Can you either add a ReferenceGrant or move the route into the backend ns? The status field on the resource should be surfacing this.
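For anyone hitting the same thing, a ReferenceGrant for that case would look roughly like this (a sketch; the grant name is made up). The grant lives in the backend's namespace and names the route's namespace in `from`:

```yaml
# A ReferenceGrant is created in the *backend* namespace (dev-stage)
# and allows HTTPRoutes from the route's namespace (here: default)
# to reference Services in dev-stage.
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-routes-to-webapp   # hypothetical name
  namespace: dev-stage
spec:
  from:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      namespace: default
  to:
    - group: ""
      kind: Service
```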

amalic commented 3 weeks ago

@arkodg Thanks for pointing that out. I actually copied the manifest from the YAML file, which is applied with kubectl using the specific namespace. I double-checked that it is in the correct namespace on the cluster, and fixed it in the samples I provided.

amalic commented 3 weeks ago

@arkodg Thanks to your HTTP/2 comment I expanded my research and came across this on the Istio Traffic Management Problems page:

> Envoy requires HTTP/1.1 or HTTP/2 traffic for upstream services. For example, when using NGINX for serving traffic behind Envoy, you will need to set the proxy_http_version directive in your NGINX configuration to be "1.1", since the NGINX default is 1.0.

https://istio.io/latest/docs/ops/common-problems/network-issues/#envoy-wont-connect-to-my-http10-service

What do you think?
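For context, if I understand the Istio docs correctly, that directive only applies where NGINX itself proxies onward to an upstream; a minimal sketch (the upstream name is hypothetical):

```nginx
# proxy_http_version only affects requests NGINX proxies onward;
# NGINX's default for proxied requests is HTTP/1.0.
location /api/ {
    proxy_pass http://backend-app:8080;   # hypothetical upstream
    proxy_http_version 1.1;
    proxy_set_header Connection "";       # clear Connection for HTTP/1.1 keep-alive
}
```

In our case NGINX only serves static files, so there is no proxied leg where this default would apply.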

amalic commented 2 weeks ago

@arkodg When I run nginx -T in a shell within the container I get the following output. This means the server is responding via HTTP/1.1, which I can confirm by curling the port-forwarded service and pods. I will try updating to the latest Envoy version to see if it fixes the error.

nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
# configuration file /etc/nginx/nginx.conf:

user  nginx;
worker_processes  auto;

error_log  /var/log/nginx/error.log notice;
pid        /var/run/nginx.pid;

events {
    worker_connections  1024;
}

http {
    include       /etc/nginx/mime.types;
    default_type  application/octet-stream;

    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

    access_log  /var/log/nginx/access.log  main;

    sendfile        on;
    #tcp_nopush     on;

    keepalive_timeout  65;

    #gzip  on;

    include /etc/nginx/conf.d/*.conf;
}

# configuration file /etc/nginx/mime.types:

types {
    text/html                                        html htm shtml;
    text/css                                         css;
    text/xml                                         xml;
    image/gif                                        gif;
    image/jpeg                                       jpeg jpg;
    application/javascript                           js;
    application/atom+xml                             atom;
    application/rss+xml                              rss;

    text/mathml                                      mml;
    text/plain                                       txt;
    text/vnd.sun.j2me.app-descriptor                 jad;
    text/vnd.wap.wml                                 wml;
    text/x-component                                 htc;

    image/avif                                       avif;
    image/png                                        png;
    image/svg+xml                                    svg svgz;
    image/tiff                                       tif tiff;
    image/vnd.wap.wbmp                               wbmp;
    image/webp                                       webp;
    image/x-icon                                     ico;
    image/x-jng                                      jng;
    image/x-ms-bmp                                   bmp;

    font/woff                                        woff;
    font/woff2                                       woff2;

    application/java-archive                         jar war ear;
    application/json                                 json;
    application/mac-binhex40                         hqx;
    application/msword                               doc;
    application/pdf                                  pdf;
    application/postscript                           ps eps ai;
    application/rtf                                  rtf;
    application/vnd.apple.mpegurl                    m3u8;
    application/vnd.google-earth.kml+xml             kml;
    application/vnd.google-earth.kmz                 kmz;
    application/vnd.ms-excel                         xls;
    application/vnd.ms-fontobject                    eot;
    application/vnd.ms-powerpoint                    ppt;
    application/vnd.oasis.opendocument.graphics      odg;
    application/vnd.oasis.opendocument.presentation  odp;
    application/vnd.oasis.opendocument.spreadsheet   ods;
    application/vnd.oasis.opendocument.text          odt;
    application/vnd.openxmlformats-officedocument.presentationml.presentation
                                                     pptx;
    application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
                                                     xlsx;
    application/vnd.openxmlformats-officedocument.wordprocessingml.document
                                                     docx;
    application/vnd.wap.wmlc                         wmlc;
    application/wasm                                 wasm;
    application/x-7z-compressed                      7z;
    application/x-cocoa                              cco;
    application/x-java-archive-diff                  jardiff;
    application/x-java-jnlp-file                     jnlp;
    application/x-makeself                           run;
    application/x-perl                               pl pm;
    application/x-pilot                              prc pdb;
    application/x-rar-compressed                     rar;
    application/x-redhat-package-manager             rpm;
    application/x-sea                                sea;
    application/x-shockwave-flash                    swf;
    application/x-stuffit                            sit;
    application/x-tcl                                tcl tk;
    application/x-x509-ca-cert                       der pem crt;
    application/x-xpinstall                          xpi;
    application/xhtml+xml                            xhtml;
    application/xspf+xml                             xspf;
    application/zip                                  zip;

    application/octet-stream                         bin exe dll;
    application/octet-stream                         deb;
    application/octet-stream                         dmg;
    application/octet-stream                         iso img;
    application/octet-stream                         msi msp msm;

    audio/midi                                       mid midi kar;
    audio/mpeg                                       mp3;
    audio/ogg                                        ogg;
    audio/x-m4a                                      m4a;
    audio/x-realaudio                                ra;

    video/3gpp                                       3gpp 3gp;
    video/mp2t                                       ts;
    video/mp4                                        mp4;
    video/mpeg                                       mpeg mpg;
    video/quicktime                                  mov;
    video/webm                                       webm;
    video/x-flv                                      flv;
    video/x-m4v                                      m4v;
    video/x-mng                                      mng;
    video/x-ms-asf                                   asx asf;
    video/x-ms-wmv                                   wmv;
    video/x-msvideo                                  avi;
}

# configuration file /etc/nginx/conf.d/default.conf:
server {
    listen 80;
    server_name _;

    location / {
        port_in_redirect off;
        alias /etc/nginx/html/;
        proxy_http_version 1.1;
        try_files $uri $uri/ //index.html;

        # don't cache anything by default
        add_header Cache-Control "no-store, no-cache, must-revalidate";
    }

    location //static {
        port_in_redirect off;
        alias /etc/nginx/html/static;
        proxy_http_version 1.1;
        expires 1y;

        # cache create react app generated files because they all have a hash in the name and are therefore automatically invalidated after a change
        add_header Cache-Control "public";
    }
}
amalic commented 2 weeks ago

Strangest thing. I did another nginx test deployment, and I accidentally got a response when trying another reload. I found out that reloading multiple times eventually leads to a successful response. Thanks to the nginxdemos/hello image I could see that the successful response was always coming from the same container. After scaling the deployment up and down, I found that the container delivering a successful response was always running on the same node.

After adding a nodeAffinity to the deployment template spec, I was able to get a response from all replicas.

Update: The nginx container is not available any more. When I deploy it on all nodes it now sometimes works on some other random node.

Here's the deployment I used:

---
apiVersion: v1
kind: Namespace
metadata:
  name: nginx
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  namespace: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: mylabel
                operator: In
                values:
                - myvalue
      containers:
      - name: nginx
        image: nginxdemos/hello:latest
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-service
  namespace: nginx
spec:
  selector:
    app: nginx
  ports:
  - name: http
    port: 80
    targetPort: 80
  type: ClusterIP
---
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: nginx-test
  namespace: nginx
spec:
  parentRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: envoy-gw
      namespace: gwapi-system
  hostnames:
   - "ngx-test.mydomain.mytld"
  rules:
    - backendRefs:
        - name: nginx-service
          kind: Service
          namespace: nginx
          port: 80
          weight: 1
      timeouts:
        backendRequest: 0s
        request: 0s
      matches:
        - path:
            type: PathPrefix
            value: /
arkodg commented 4 days ago

Closing this one since it looks like it was related to the backend and was resolved.

amalic commented 4 days ago

Update: My previous solution was not correct and did not fix the problem.

Turns out that since I am running the Karpenter autoscaler, I had to make sure the Envoy Proxy pods run on Karpenter nodes by adding a node affinity to the pod spec of the custom-proxy-config EnvoyProxy resource.

This is what ended up working for me.

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: custom-proxy-config
  namespace: gwapi-system
spec:
  logging:
    level:
      default: warn
  provider:
    kubernetes:
      envoyDeployment:
        pod:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: autoscaler
                    operator: In
                    values:
                    - karpenter
        replicas: 3
      envoyService:
        annotations:
          service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: instance
          service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
          service.beta.kubernetes.io/aws-load-balancer-type: external
        externalTrafficPolicy: Cluster
        type: LoadBalancer
    type: Kubernetes

I think this is a workaround for my issue. Once I find the root cause, I will update this issue.