Investigate increasing the readiness probe delay

tesshuflower commented 1 year ago

Describe the bug

We may need to look into increasing the readiness probe delay to allow for a slower startup.

There was an issue where during the VolSync OLM operator startup, the following sequence happened:

pod was killed with OOM
The operator subscription was updated to give it more memory: https://docs.openshift.com/container-platform/4.12/rest_api/operatorhub_apis/subscription-operators-coreos-com-v1alpha1.html
Then the operator startup showed an error "component unhealthy": InstallCheckFailedinstall failed: deployment volsync-controller-manager not ready before timeout: deployment "volsync-controller-manager" exceeded its progress deadline

Note the system may have a lot of resources (secrets or pvcs possibly) that caused it to use a lot of memory.

Possible causes of the delayed startup:

many resources in the cluster which cause the k8s client cache startup/init to take a long time
1st startup was killed with OOM - possibly after fixing this to allow more memory, the next startup was delayed due to modifications in the leader election settings (lease timeouts are now longer to comply with openshift recommended settings) - see https://github.com/backube/volsync/pull/687

Steps to reproduce

Expected behavior

Actual results

Additional context

JohnStrunk commented 1 year ago

In this particular case, config was:

    Liveness:   http-get http://:8081/healthz delay=15s timeout=1s period=20s #success=1 #failure=3
    Readiness:  http-get http://:8081/readyz delay=5s timeout=1s period=10s #success=1 #failure=3

And from the events, the last "failure" event seems to be ~20 seconds in:

Events:
  Type     Reason          Age    From               Message
  ----     ------          ----   ----               -------
  Normal   Scheduled       5m39s  default-scheduler  Successfully assigned openshift-operators/volsync-controller-manager-676f6bfd-xz67p to ip-10-104-185-93.ap-southeast-2.compute.internal
  Normal   AddedInterface  5m38s  multus             Add eth0 [10.128.26.73/23] from openshift-sdn
  Normal   Pulled          5m38s  kubelet            Container image "registry.redhat.io/openshift4/ose-kube-rbac-proxy@sha256:6562088dcce7296d70990f52f2ee790c3df8694c937291536e974fe078fc4670" already present on machine
  Normal   Created         5m37s  kubelet            Created container kube-rbac-proxy
  Normal   Started         5m37s  kubelet            Started container kube-rbac-proxy
  Normal   Pulled          5m37s  kubelet            Container image "registry.redhat.io/rhacm2/volsync-rhel8@sha256:7207ea4de4a8bb3a2930b974c2122215cb902ab577e4ef1de6e635fd854b6d0a" already present on machine
  Normal   Created         5m37s  kubelet            Created container manager
  Normal   Started         5m37s  kubelet            Started container manager
  Warning  ProbeError      5m29s  kubelet            Readiness probe error: Get "http://10.128.26.73:8081/readyz": dial tcp 10.128.26.73:8081: connect: connection refused
  Warning  Unhealthy   5m29s  kubelet  Readiness probe failed: Get "http://10.128.26.73:8081/readyz": dial tcp 10.128.26.73:8081: connect: connection refused
  Warning  ProbeError  5m18s  kubelet  Liveness probe error: Get "http://10.128.26.73:8081/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy   5m18s  kubelet  Liveness probe failed: Get "http://10.128.26.73:8081/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  ProbeError  5m18s  kubelet  Readiness probe error: Get "http://10.128.26.73:8081/readyz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  5m18s  kubelet  Readiness probe failed: Get "http://10.128.26.73:8081/readyz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

tesshuflower commented 1 year ago

Some more info - it seems that the liveness and readiness probes are setup and should become available before the k8s client cache is done or any leader election, so those are actually likely not the cause.

Note that we do check and install an scc at startup on OpenShift before the probes are setup so if that takes a while (api access is slow) it could delay the probes from becoming available.

backube / volsync

Investigate increasing the readiness probe delay #757