aerokube / moon

Browser automation solution for Kubernetes and Openshift supporting Selenium, Playwright, Puppeteer and Cypress
http://aerokube.com/moon/latest
Apache License 2.0

Moon browsers are blocking GKE cluster scale down #308

Open mhubig opened 2 years ago

mhubig commented 2 years ago

Hi there,

Recently we have been getting warnings from GKE about Moon browser pods blocking cluster scale down:

"Pod is blocking scale down because it’s not backed by a controller"

{
  "insertId": "96beaab8-e3c3-4773-a733-f9559c041b0f@a1",
  "jsonPayload": {
    "noDecisionStatus": {
      "measureTime": "1643992124",
      "noScaleDown": {
        "nodes": [
          {
            "node": {
              "mig": {
                "zone": "europe-west4-b",
                "name": "gke-hybris-cluster-n-hybris-node-pool-702525f4-grp",
                "nodepool": "hybris-node-pool-nonprod"
              },
              "cpuRatio": 43,
              "name": "gke-hybris-cluster-n-hybris-node-pool-702525f4-cqjm",
              "memRatio": 43
            },
            "reason": {
              "parameters": [
                "chrome-95-0-7d3784d7-9ccd-4510-8006-a37f60eb7c22"
              ],
              "messageId": "no.scale.down.node.pod.not.backed.by.controller"
            }
          }
        ],
        "nodesTotalCount": 1
      }
    }
  },
  "resource": {
    "type": "k8s_cluster",
    "labels": {
      "cluster_name": "hybris-cluster-nonprod",
      "location": "europe-west4",
      "project_id": "hybris-prod-0815"
    }
  },
  "timestamp": "2022-02-04T16:28:44.919959722Z",
  "logName": "projects/hybris-prod-0815/logs/container.googleapis.com%2Fcluster-autoscaler-visibility",
  "receiveTimestamp": "2022-02-04T16:28:45.608040844Z"
}
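To list which pods are named in an entry like the one above, a minimal Python sketch could walk the `noScaleDown.nodes` array and collect the `parameters` of every reason whose `messageId` is `no.scale.down.node.pod.not.backed.by.controller`. The function name `blocking_pods` and the trimmed `SAMPLE` entry are illustrative assumptions; only the field names are taken from the log shown above.

```python
import json

# messageId taken from the log entry above.
NOT_BACKED = "no.scale.down.node.pod.not.backed.by.controller"


def blocking_pods(entry: dict) -> list[tuple[str, str]]:
    """Return (node name, pod name) pairs for pods not backed by a controller."""
    nodes = (
        entry.get("jsonPayload", {})
        .get("noDecisionStatus", {})
        .get("noScaleDown", {})
        .get("nodes", [])
    )
    pairs = []
    for item in nodes:
        reason = item.get("reason", {})
        if reason.get("messageId") != NOT_BACKED:
            continue  # some other scale-down blocker, not handled here
        node_name = item.get("node", {}).get("name", "")
        for pod in reason.get("parameters", []):
            pairs.append((node_name, pod))
    return pairs


# Trimmed copy of the entry above, keeping only the fields the function reads.
SAMPLE = json.loads("""
{
  "jsonPayload": {
    "noDecisionStatus": {
      "noScaleDown": {
        "nodes": [
          {
            "node": {"name": "gke-hybris-cluster-n-hybris-node-pool-702525f4-cqjm"},
            "reason": {
              "parameters": ["chrome-95-0-7d3784d7-9ccd-4510-8006-a37f60eb7c22"],
              "messageId": "no.scale.down.node.pod.not.backed.by.controller"
            }
          }
        ]
      }
    }
  }
}
""")

if __name__ == "__main__":
    for node, pod in blocking_pods(SAMPLE):
        print(f"{pod} blocks scale down of {node}")
```

Run against exported log entries, this gives a quick list of the browser pods worth inspecting.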

We could try to fix this by adding an annotation such as cluster-autoscaler.kubernetes.io/safe-to-evict: "true" to the browser pods, but maybe there is a better solution?
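For reference, the annotation mentioned above can be expressed as a JSON merge patch on the pod metadata. The following is only a sketch: the helper name `safe_to_evict_patch` is hypothetical, and this thread does not say whether or how Moon lets you set annotations on the browser pods it creates.

```python
import json

# Annotation key read by the cluster autoscaler (quoted in the comment above).
SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"


def safe_to_evict_patch(value: bool = True) -> str:
    """Build a JSON merge patch that sets the safe-to-evict annotation.

    Annotation values must be strings, so the boolean is rendered
    as "true" or "false".
    """
    patch = {
        "metadata": {
            "annotations": {SAFE_TO_EVICT: "true" if value else "false"}
        }
    }
    return json.dumps(patch)


if __name__ == "__main__":
    print(safe_to_evict_patch())
```

The resulting string could be passed to something like `kubectl patch pod <pod-name> --type merge -p '<patch>'`. Patching a pod that is already running only helps for that one pod, so this does not address why the pods are stuck in the first place.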

mhubig commented 2 years ago

Hmm, OK, I see: the real problem seems to be some stuck browser pods. The logs of these pods contain an endless list of "Waiting X server..." entries.

vania-pooh commented 2 years ago

@mhubig you have to check the defender container logs of such pods. This is usually caused by a DNS issue or Kubernetes API overload.