concourse / hush-house

Concourse k8s-based environment
https://hush-house.pivotal.io

Worker termination stuck for 47d #128

Open xtremerui opened 4 years ago

xtremerui commented 4 years ago

While @jomsie and I were deploying hush-house/ci, we noticed that make deploy-ci was stuck, and kubectl get pods showed a worker that had been in the Terminating state for 47 days.
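Roughly the commands involved, for anyone who wants to check their own deployment (the label selector here is an assumption; the namespace and pod name match the describe output below):

```
# List the worker pods in the ci namespace; the stuck one shows STATUS=Terminating
kubectl get pods -n ci -l app=ci-worker

# Inspect the stuck pod, in particular the Events section at the bottom
kubectl describe pod ci-worker-0 -n ci
```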

The kubectl describe output:

Name:                      ci-worker-0
Namespace:                 ci
Priority:                  0
Node:                      gke-hush-house-ci-workers-79a0ea06-2crm/10.10.0.30
Start Time:                Tue, 24 Mar 2020 19:28:26 -0400
Labels:                    app=ci-worker
                           controller-revision-hash=ci-worker-67c499b88
                           release=ci
                           statefulset.kubernetes.io/pod-name=ci-worker-0
Annotations:               cni.projectcalico.org/podIP: 10.11.7.24/32
                           manual-update-revision: 1
Status:                    Terminating (lasts 9d)
Termination Grace Period:  3600s
IP:                        10.11.7.24
Controlled By:             StatefulSet/ci-worker
Init Containers:
  ci-worker-init-rm:
    Container ID:  docker://3bd89ed4bc3c452c977cae9a879aace70f8c06b9d15b6b19daa7bd734b5140e2
    Image:         concourse/concourse-rc:6.0.0-rc.62
    Image ID:      docker-pullable://concourse/concourse-rc@sha256:dc05e609fdcd4a59b2a34588b899664b5f85e747a8556300f5c48ca7042c7c06
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
    Args:
      -ce
      for v in $((btrfs subvolume list --sort=-ogen "/concourse-work-dir" || true) | awk '{print $9}'); do
        (btrfs subvolume show "/concourse-work-dir/$v" && btrfs subvolume delete "/concourse-work-dir/$v") || true
      done
      rm -rf "/concourse-work-dir/*"
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 24 Mar 2020 19:28:58 -0400
      Finished:     Tue, 24 Mar 2020 19:28:58 -0400
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /concourse-work-dir from concourse-work-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from ci-worker-token-k9sqg (ro)
Containers:
  ci-worker:
    Container ID:  docker://5ca54733ccbf8753e6b3e0581537d92c015d13174df3fbbdadc8e4bfaee9535b
    Image:         concourse/concourse-rc:6.0.0-rc.62
    Image ID:      docker-pullable://concourse/concourse-rc@sha256:dc05e609fdcd4a59b2a34588b899664b5f85e747a8556300f5c48ca7042c7c06
    Port:          8888/TCP
    Host Port:     0/TCP
    Args:
      worker
    State:          Running
      Started:      Tue, 24 Mar 2020 19:28:59 -0400
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     7500m
      memory:  14Gi
    Requests:
      cpu:     0
      memory:  0
    Liveness:  http-get http://:worker-hc/ delay=10s timeout=45s period=60s #success=1 #failure=10
    Environment:
      CONCOURSE_REBALANCE_INTERVAL:               2h
      CONCOURSE_SWEEP_INTERVAL:                   30s
      CONCOURSE_CONNECTION_DRAIN_TIMEOUT:         1h
      CONCOURSE_HEALTHCHECK_BIND_IP:              0.0.0.0
      CONCOURSE_HEALTHCHECK_BIND_PORT:            8888
      CONCOURSE_HEALTHCHECK_TIMEOUT:              40s
      CONCOURSE_DEBUG_BIND_IP:                    127.0.0.1
      CONCOURSE_DEBUG_BIND_PORT:                  7776
      CONCOURSE_WORK_DIR:                         /concourse-work-dir
      CONCOURSE_BIND_IP:                          127.0.0.1
      CONCOURSE_BIND_PORT:                        7777
      CONCOURSE_LOG_LEVEL:                        info
      CONCOURSE_TSA_HOST:                         ci-web:2222
      CONCOURSE_TSA_PUBLIC_KEY:                   /concourse-keys/host_key.pub
      CONCOURSE_TSA_WORKER_PRIVATE_KEY:           /concourse-keys/worker_key
      CONCOURSE_GARDEN_BIN:                       gdn
      CONCOURSE_BAGGAGECLAIM_LOG_LEVEL:           info
      CONCOURSE_BAGGAGECLAIM_BIND_IP:             127.0.0.1
      CONCOURSE_BAGGAGECLAIM_BIND_PORT:           7788
      CONCOURSE_BAGGAGECLAIM_DEBUG_BIND_IP:       127.0.0.1
      CONCOURSE_BAGGAGECLAIM_DEBUG_BIND_PORT:     7787
      CONCOURSE_BAGGAGECLAIM_DRIVER:              overlay
      CONCOURSE_BAGGAGECLAIM_BTRFS_BIN:           btrfs
      CONCOURSE_BAGGAGECLAIM_MKFS_BIN:            mkfs.btrfs
      CONCOURSE_VOLUME_SWEEPER_MAX_IN_FLIGHT:     5
      CONCOURSE_CONTAINER_SWEEPER_MAX_IN_FLIGHT:  5
      CONCOURSE_GARDEN_NETWORK_POOL:              10.254.0.0/16
      CONCOURSE_GARDEN_MAX_CONTAINERS:            500
      CONCOURSE_GARDEN_DENY_NETWORK:              169.254.169.254/32
    Mounts:
      /concourse-keys from concourse-keys (ro)
      /concourse-work-dir from concourse-work-dir (rw)
      /pre-stop-hook.sh from pre-stop-hook (rw,path="pre-stop-hook.sh")
      /var/run/secrets/kubernetes.io/serviceaccount from ci-worker-token-k9sqg (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  concourse-work-dir:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  concourse-work-dir-ci-worker-0
    ReadOnly:   false
  pre-stop-hook:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      ci-worker
    Optional:  false
  concourse-keys:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ci-worker
    Optional:    false
  ci-worker-token-k9sqg:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ci-worker-token-k9sqg
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  cloud.google.com/gke-nodepool=ci-workers
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason             Age                 From                                              Message
  ----     ------             ----                ----                                              -------
  Normal   Killing            41m (x236 over 9d)  kubelet, gke-hush-house-ci-workers-79a0ea06-2crm  Stopping container ci-worker
  Warning  FailedKillPod      41m (x235 over 9d)  kubelet, gke-hush-house-ci-workers-79a0ea06-2crm  error killing pod: failed to "KillContainer" for "ci-worker" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
  Warning  FailedPreStopHook  41m (x235 over 9d)  kubelet, gke-hush-house-ci-workers-79a0ea06-2crm  Exec lifecycle hook ([/bin/bash /pre-stop-hook.sh]) for Container "ci-worker" in Pod "ci-worker-0_ci(eda22160-fcb7-4150-96d0-93827749746e)" failed - error: command '/bin/bash /pre-stop-hook.sh' exited with 126: , message: "OCI runtime exec failed: exec failed: container_linux.go:346: starting container process caused \"process_linux.go:101: executing setns process caused \\\"exit status 1\\\"\": unknown\r\n"

It might be worth digging into that last error (the failing pre-stop hook exec) to help make worker lifecycle management better.
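The FailedKillPod / FailedPreStopHook events suggest the kubelet can no longer exec into or signal the container, so the pod never finishes terminating on its own. As a manual workaround (not a fix for the underlying lifecycle issue), something like the following should unstick things; the fly target name is a placeholder, and -w ci-worker-0 assumes the worker registered under its pod name:

```
# Skip the pre-stop hook and grace period entirely and remove the pod record.
# WARNING: any in-flight builds on this worker are lost.
kubectl delete pod ci-worker-0 -n ci --grace-period=0 --force

# Once the worker shows up as stalled, prune it from Concourse's worker list
# ("hush-house" is a placeholder fly target).
fly -t hush-house prune-worker -w ci-worker-0
```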

clarafu commented 2 years ago

Seeing this now too with our hush-house deployment. It has caused an upgrade to 7.4.0 to take 54 days and counting, because the worker rollout still isn't finished: it is currently upgrading workers-worker-8 and still needs to do workers 1 through 7.


Also seeing the same FailedPreStopHook error on the stuck pod.

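For context on why this blocks the upgrade: the workers run as a StatefulSet, which rolls pods one at a time in reverse ordinal order and will not move on until the current pod is fully terminated and recreated, so a single pod stuck in Terminating stalls the whole rollout. A rough way to see where it is stuck (the StatefulSet name workers-worker is inferred from the pod name above, and the namespace is a placeholder):

```
# Blocks until the update completes (which, with a stuck pod, it never does)
kubectl rollout status statefulset/workers-worker -n workers

# Show which pods are still on the old revision (everything below the stuck ordinal)
kubectl get pods -n workers -L controller-revision-hash
```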