emissary-ingress / emissary

open source Kubernetes-native API gateway for microservices built on the Envoy Proxy
https://www.getambassador.io
Apache License 2.0

terminationGracePeriodSeconds is not working #4112

Open eilonmonday opened 2 years ago

eilonmonday commented 2 years ago

Describe the bug: No matter what value we set for terminationGracePeriodSeconds, the pod is terminated after 2-3 seconds, even if it still has open connections...

To Reproduce: kubectl delete pod

Expected behavior: I expect the pod to stay alive for the time I set, or to be terminated early once it has no open connections. At the moment it is killed while it still has open connections...

chart: edge-stack-7.2.2
ambassador version: 2.1.2
EKS version: 1.20

Deployment yaml:

apiVersion: v1
items:
- apiVersion: apps/v1
  kind: Deployment
  metadata:
    annotations:
      deployment.kubernetes.io/revision: "9"
      meta.helm.sh/release-name: stage-emissary
      meta.helm.sh/release-namespace: emissary
    generation: 14
    labels:
      app: ambassador-v2
      app.kubernetes.io/instance: stage-emissary
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: emissary
      app.kubernetes.io/part-of: stage-emissary
      env: staging
      helm.sh/chart: emissary-ingress-7.2.2
      product: aes
      region: use1
    name: stage-emissary
    namespace: emissary
    resourceVersion: "799867470"
    uid: d1812dbd-1533-4879-9ea8-4449ffc397be
  spec:
    progressDeadlineSeconds: 600
    replicas: 2
    revisionHistoryLimit: 10
    selector:
      matchLabels:
        app.kubernetes.io/instance: stage-emissary
        app.kubernetes.io/name: emissary
    strategy:
      rollingUpdate:
        maxSurge: 25%
        maxUnavailable: 1
      type: RollingUpdate
    template:
      metadata:
        annotations:
          checksum/config: 01ba4719c80b6fe911b091a7c05124b64eeece964e09c058ef8f9805daca546b
          co.elastic.logs/enabled: "true"
        creationTimestamp: null
        labels:
          app: ambassador-v2
          app.kubernetes.io/instance: stage-emissary
          app.kubernetes.io/managed-by: Helm
          app.kubernetes.io/name: emissary
          app.kubernetes.io/part-of: stage-emissary
          env: staging
          helm.sh/chart: emissary-ingress-7.2.2
          product: aes
          profile: main
          region: use1
      spec:
        containers:
        - env:
          - name: HOST_IP
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: status.hostIP
          - name: AMBASSADOR_NAMESPACE
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: metadata.namespace
          - name: AES_ACME_LEADER_DISABLE
            value: "true"
          - name: AMBASSADOR_DRAIN_TIME
            value: "180"
          - name: AMBASSADOR_ENVOY_API_VERSION
            value: V2
          - name: AMBASSADOR_ID
            value: emissary
          - name: REDIS_URL
            value: stage-emissary-redis:6379
          image: docker.io/emissaryingress/emissary:2.1.0
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 6
            httpGet:
              path: /ambassador/v0/check_alive
              port: admin
              scheme: HTTP
            initialDelaySeconds: 15
            periodSeconds: 6
            successThreshold: 1
            timeoutSeconds: 1
          name: emissary-ingress
          ports:
          - containerPort: 8080
            name: http
            protocol: TCP
          - containerPort: 8877
            name: admin
            protocol: TCP
          readinessProbe:
            failureThreshold: 6
            httpGet:
              path: /ambassador/v0/check_ready
              port: admin
              scheme: HTTP
            initialDelaySeconds: 15
            periodSeconds: 6
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            limits:
              cpu: 1300m
              memory: 1Gi
            requests:
              cpu: 750m
              memory: 600Mi
          securityContext:
            allowPrivilegeEscalation: true
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
        - env:
          - name: SIGSCI_ACCESSKEYID
            value: XXXXXXXXX
          - name: SIGSCI_SECRETACCESSKEY
            value: XXXXXXXXX
          - name: SIGSCI_ENVOY_EXPECT_RESPONSE_DATA
            value: "1"
          - name: SIGSCI_ENVOY_GRPC_ADDRESS
            value: localhost:9999
          image: signalsciences/sigsci-agent:latest
          imagePullPolicy: Always
          name: sigsci-agent
          ports:
          - containerPort: 9999
            name: grpc
            protocol: TCP
          resources:
            limits:
              cpu: 700m
              memory: 900Mi
            requests:
              cpu: 120m
              memory: 640Mi
          securityContext:
            readOnlyRootFilesystem: true
            runAsUser: 100
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /sigsci/tmp
            name: sigsci-tmp
        dnsPolicy: ClusterFirst
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext:
          runAsUser: 8888
        serviceAccount: stage-emissary
        serviceAccountName: stage-emissary
        terminationGracePeriodSeconds: 180

Pod logs right after deleting it:

time="2022-02-10 16:39:37.8023" level=error msg="goroutine \":signal_handler:0\" exited with error: received signal terminated (triggering graceful shutdown)" func="github.com/datawire/dlib/dgroup.(*Group).goWorkerCtx.func1.1" file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:380" CMD=entrypoint PID=1 THREAD=":signal_handler:0"
time="2022-02-10 16:39:37.8025" level=info msg="sending SIGINT" func="github.com/datawire/dlib/dexec.(*Cmd).Start.func1" file="/go/vendor/github.com/datawire/dlib/dexec/cmd.go:211" CMD=entrypoint PID=1 THREAD=/envoy
time="2022-02-10 16:39:37.8033" level=info msg="sending SIGINT" func="github.com/datawire/dlib/dexec.(*Cmd).Start.func1" file="/go/vendor/github.com/datawire/dlib/dexec/cmd.go:211" CMD=entrypoint PID=1 THREAD=/diagd
time="2022-02-10 16:39:37.8034" level=info msg="shutting down (gracefully)..." func="github.com/datawire/dlib/dgroup.(*Group).launchSupervisors.func1" file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:238" CMD=entrypoint PID=1 THREAD=":shutdown_logger"
time="2022-02-10 16:39:37.8034" level=info msg=Done func=github.com/datawire/ambassador/v2/cmd/ambex.Main2 file="/go/cmd/ambex/main.go:948" CMD=entrypoint PID=1 THREAD=/ambex
[2022-02-10 16:39:37 +0000] [18] [INFO] Handling signal: int
time="2022-02-10 16:39:37.8122" level=info msg="Memory Usage 0.40Gi (40%)\n    PID 1, 0.24Gi: busyambassador entrypoint \n    PID 18, 0.04Gi: /usr/bin/python /usr/bin/diagd /ambassador/snapshots /ambassador/bootstrap-ads.json /ambassador/envoy/envoy.json --notices /ambassador/notices.json --port 8004 --kick kill -HUP 1 \n    PID 23, 0.06Gi: /usr/bin/python /usr/bin/diagd /ambassador/snapshots /ambassador/bootstrap-ads.json /ambassador/envoy/envoy.json --notices /ambassador/notices.json --port 8004 --kick kill -HUP 1 \n    PID 34, 0.11Gi: envoy -c /ambassador/bootstrap-ads.json --base-id 0 --drain-time-s 180 -l error --bootstrap-version 2 " func="github.com/datawire/ambassador/v2/pkg/memory.(*MemoryUsage).Watch" file="/go/pkg/memory/memory.go:43" CMD=entrypoint PID=1 THREAD=/memory
time="2022-02-10 16:39:37.8398" level=info msg="finished successfully: exit status 0" func="github.com/datawire/dlib/dexec.(*Cmd).Wait" file="/go/vendor/github.com/datawire/dlib/dexec/cmd.go:255" CMD=entrypoint PID=1 THREAD=/envoy dexec.pid=34
[2022-02-10 16:39:37 +0000] [23] [INFO] Worker exiting (pid: 23)
[2022-02-10 16:39:38 +0000] [18] [INFO] Shutting down: Master
time="2022-02-10 16:39:38.0786" level=info msg="finished successfully: exit status 0" func="github.com/datawire/dlib/dexec.(*Cmd).Wait" file="/go/vendor/github.com/datawire/dlib/dexec/cmd.go:255" CMD=entrypoint PID=1 THREAD=/diagd dexec.pid=18
time="2022-02-10 16:39:38.0787" level=info msg="  final goroutine statuses:" func=github.com/datawire/dlib/dgroup.logGoroutineStatuses file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:84" CMD=entrypoint PID=1 THREAD=":shutdown_status"
time="2022-02-10 16:39:38.0788" level=info msg="    /ambex                   : exited without error" func=github.com/datawire/dlib/dgroup.logGoroutineStatuses file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:95" CMD=entrypoint PID=1 THREAD=":shutdown_status"
time="2022-02-10 16:39:38.0788" level=info msg="    /diagd                   : exited without error" func=github.com/datawire/dlib/dgroup.logGoroutineStatuses file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:95" CMD=entrypoint PID=1 THREAD=":shutdown_status"
time="2022-02-10 16:39:38.0788" level=info msg="    /envoy                   : exited without error" func=github.com/datawire/dlib/dgroup.logGoroutineStatuses file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:95" CMD=entrypoint PID=1 THREAD=":shutdown_status"
time="2022-02-10 16:39:38.0788" level=info msg="    /external_snapshot_server: exited without error" func=github.com/datawire/dlib/dgroup.logGoroutineStatuses file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:95" CMD=entrypoint PID=1 THREAD=":shutdown_status"
time="2022-02-10 16:39:38.0788" level=info msg="    /healthchecks            : exited without error" func=github.com/datawire/dlib/dgroup.logGoroutineStatuses file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:95" CMD=entrypoint PID=1 THREAD=":shutdown_status"
time="2022-02-10 16:39:38.0789" level=info msg="    /memory                  : exited without error" func=github.com/datawire/dlib/dgroup.logGoroutineStatuses file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:95" CMD=entrypoint PID=1 THREAD=":shutdown_status"
time="2022-02-10 16:39:38.0789" level=info msg="    /snapshot_server         : exited without error" func=github.com/datawire/dlib/dgroup.logGoroutineStatuses file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:95" CMD=entrypoint PID=1 THREAD=":shutdown_status"
time="2022-02-10 16:39:38.0789" level=info msg="    /watcher                 : exited without error" func=github.com/datawire/dlib/dgroup.logGoroutineStatuses file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:95" CMD=entrypoint PID=1 THREAD=":shutdown_status"
time="2022-02-10 16:39:38.0790" level=info msg="    :signal_handler:0        : exited with error" func=github.com/datawire/dlib/dgroup.logGoroutineStatuses file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:95" CMD=entrypoint PID=1 THREAD=":shutdown_status"
time="2022-02-10 16:39:38.0790" level=error msg="shut down with error error: received signal terminated (triggering graceful shutdown)" func=github.com/datawire/ambassador/v2/pkg/busy.Main file="/go/pkg/busy/busy.go:87" CMD=entrypoint PID=1
rpc error: code = Unknown desc = Error: No such container: ca7e605a89a4f3583c7d4ea5b0949d3fa1524f9100ebd70b16a000cf2e478de4
cindymullins-dw commented 2 years ago

Comment from support: I did some testing on my end with a similar configuration, and was also unable to get terminationGracePeriodSeconds to behave the way you expected. It appears that terminationGracePeriodSeconds does not delay the SIGTERM your pods receive; instead, it allows a longer window after the SIGTERM is sent for the underlying application to exit gracefully. The problem is that Envoy starts shutting down and closing connections as soon as it receives a SIGTERM, so you will need a preStop hook to add a delay between when the pod is marked unrouteable (i.e. it enters the Terminating state) and when the SIGTERM is actually sent.

Luckily, a lifecycle option for the main container, which allows for a preStop hook, was added to the latest version of AES. You can find resources for it here:

https://github.com/emissary-ingress/emissary/pull/4030
https://github.com/emissary-ingress/emissary/blob/5f01e923a297182a08217b7651bfc1a77735391f/charts/emissary-ingress/values.yaml.in#L403

In the PR, there is an example that should work if the main container is running behind a load balancer that is slow to deregister pods.

Could you please try that and let us know?
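
For reference, a minimal sketch of the preStop pattern described above, applied to the main container in the pod template spec. The 60-second sleep is illustrative (not taken from the thread) and assumes the image ships a sleep binary; in practice it should roughly match how long your load balancer takes to deregister a pod, and terminationGracePeriodSeconds must be at least that long:

spec:
  terminationGracePeriodSeconds: 180
  containers:
  - name: emissary-ingress
    image: docker.io/emissaryingress/emissary:2.1.0
    lifecycle:
      preStop:
        exec:
          # Keep the container running (and still serving) before the
          # SIGTERM is delivered, giving the load balancer time to
          # deregister the pod. Illustrative value; tune as needed.
          command: ["sleep", "60"]

With this in place, deleting the pod runs the preStop sleep first, so Envoy only receives SIGTERM (and starts closing connections) after the sleep finishes.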

Eugen-Kuzovlev commented 1 year ago

chart: edge-stack-7.5.2
ambassador version: 2.4.2
EKS version: 1.23

I can confirm that after adding a lifecycle section for the main container, terminationGracePeriodSeconds works as intended, whatever value you put there.
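
For reference, the corresponding Helm override would look roughly like the following in values.yaml. The key path is an assumption based on the chart values file linked in the support comment above (the emissary-ingress chart exposed as a subchart of edge-stack), and the sleep duration is illustrative:

emissary-ingress:
  # Assumed: the subchart renders this lifecycle block onto the main
  # emissary-ingress container (see charts/emissary-ingress/values.yaml.in
  # linked above for the chart's own example).
  lifecycle:
    preStop:
      exec:
        command: ["sleep", "60"]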