backube / volsync

Asynchronous data replication for Kubernetes volumes
https://volsync.readthedocs.io
GNU Affero General Public License v3.0

Investigate increasing the readiness probe delay #757

Open tesshuflower opened 1 year ago

tesshuflower commented 1 year ago

Describe the bug

We may need to look into increasing the readiness probe delay to allow for a slower startup.

There was an issue where, during VolSync OLM operator startup, the following sequence happened:

Note: the system may have had a large number of resources (possibly Secrets or PVCs), which caused the operator to use a lot of memory.

Possible causes of the delayed startup:

Steps to reproduce

Expected behavior

Actual results

Additional context

JohnStrunk commented 1 year ago

In this particular case, config was:

    Liveness:   http-get http://:8081/healthz delay=15s timeout=1s period=20s #success=1 #failure=3
    Readiness:  http-get http://:8081/readyz delay=5s timeout=1s period=10s #success=1 #failure=3

And from the events, the last "failure" event seems to be ~20 seconds in:

Events:
  Type     Reason          Age    From               Message
  ----     ------          ----   ----               -------
  Normal   Scheduled       5m39s  default-scheduler  Successfully assigned openshift-operators/volsync-controller-manager-676f6bfd-xz67p to ip-10-104-185-93.ap-southeast-2.compute.internal
  Normal   AddedInterface  5m38s  multus             Add eth0 [10.128.26.73/23] from openshift-sdn
  Normal   Pulled          5m38s  kubelet            Container image "registry.redhat.io/openshift4/ose-kube-rbac-proxy@sha256:6562088dcce7296d70990f52f2ee790c3df8694c937291536e974fe078fc4670" already present on machine
  Normal   Created         5m37s  kubelet            Created container kube-rbac-proxy
  Normal   Started         5m37s  kubelet            Started container kube-rbac-proxy
  Normal   Pulled          5m37s  kubelet            Container image "registry.redhat.io/rhacm2/volsync-rhel8@sha256:7207ea4de4a8bb3a2930b974c2122215cb902ab577e4ef1de6e635fd854b6d0a" already present on machine
  Normal   Created         5m37s  kubelet            Created container manager
  Normal   Started         5m37s  kubelet            Started container manager
  Warning  ProbeError      5m29s  kubelet            Readiness probe error: Get "http://10.128.26.73:8081/readyz": dial tcp 10.128.26.73:8081: connect: connection refused
  Warning  Unhealthy       5m29s  kubelet            Readiness probe failed: Get "http://10.128.26.73:8081/readyz": dial tcp 10.128.26.73:8081: connect: connection refused
  Warning  ProbeError      5m18s  kubelet            Liveness probe error: Get "http://10.128.26.73:8081/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy       5m18s  kubelet            Liveness probe failed: Get "http://10.128.26.73:8081/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  ProbeError      5m18s  kubelet            Readiness probe error: Get "http://10.128.26.73:8081/readyz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy       5m18s  kubelet            Readiness probe failed: Get "http://10.128.26.73:8081/readyz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
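
For comparison, here is a minimal sketch of what a bumped-up readiness probe might look like if expressed with the standard Kubernetes API types (assuming a recent k8s.io/api where Probe embeds ProbeHandler). The numbers are illustrative guesses based on the ~20 second window above, not a decided setting; raising the 1s timeout may matter too, since the probes also failed with "context deadline exceeded":

    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/util/intstr"
    )

    func main() {
        // Illustrative values only -- not a decided setting for VolSync.
        readiness := corev1.Probe{
            ProbeHandler: corev1.ProbeHandler{
                HTTPGet: &corev1.HTTPGetAction{
                    Path: "/readyz",
                    Port: intstr.FromInt(8081),
                },
            },
            InitialDelaySeconds: 20, // was 5s in the config above
            TimeoutSeconds:      5,  // was 1s; that 1s timeout was also exceeded above
            PeriodSeconds:       10,
            FailureThreshold:    3,
        }
        fmt.Printf("readiness probe: %+v\n", readiness)
    }

The liveness probe could presumably get the same treatment if its 15s delay also turns out to be too short.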
tesshuflower commented 1 year ago

Some more info: it seems that the liveness and readiness probes are set up, and should become available, before the k8s client cache finishes syncing or any leader election happens, so those are likely not the cause.

Note that on OpenShift we do check and install an SCC at startup, before the probes are set up, so if that takes a while (e.g. API access is slow) it could delay the probes from becoming available.
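
To make the ordering concrete, here is a simplified sketch of a kubebuilder-style main(), assuming VolSync follows the standard controller-runtime pattern; the ensureSCC helper, its placement, and the leader-election ID are assumptions for illustration, not the actual VolSync code. The point it shows: the /healthz and /readyz checks are registered before Start(), only begin serving inside Start(), and do not wait for the cache sync or for leader election, so any work done in main() before Start() (such as the SCC check) directly lengthens the window in which the kubelet gets "connection refused".

    package main

    import (
        "context"
        "os"

        ctrl "sigs.k8s.io/controller-runtime"
        "sigs.k8s.io/controller-runtime/pkg/client"
        "sigs.k8s.io/controller-runtime/pkg/healthz"
        "sigs.k8s.io/controller-runtime/pkg/manager"
    )

    // ensureSCC stands in for the OpenShift SCC check/install mentioned above;
    // the name and signature are assumptions, not VolSync's actual code.
    func ensureSCC(ctx context.Context, c client.Client) error {
        // A slow API server here pushes back mgr.Start(), and with it the moment
        // the probe endpoints begin answering at all.
        return nil
    }

    func main() {
        setupLog := ctrl.Log.WithName("setup")
        cfg := ctrl.GetConfigOrDie()

        mgr, err := ctrl.NewManager(cfg, manager.Options{
            HealthProbeBindAddress: ":8081",
            LeaderElection:         true,
            LeaderElectionID:       "volsync-example-lock", // placeholder ID
        })
        if err != nil {
            setupLog.Error(err, "unable to create manager")
            os.Exit(1)
        }

        // A direct (non-cached) client is usable before Start(), unlike the
        // manager's cache-backed client.
        directClient, err := client.New(cfg, client.Options{})
        if err != nil {
            setupLog.Error(err, "unable to create client")
            os.Exit(1)
        }

        // Runs before the probe endpoints serve anything: if this is slow, the
        // kubelet sees "connection refused" on /readyz and /healthz meanwhile.
        if err := ensureSCC(context.Background(), directClient); err != nil {
            setupLog.Error(err, "SCC check/install failed")
            os.Exit(1)
        }

        // Registered before Start(); the default Ping checker answers as soon as
        // the listener is up and does not depend on the cache or the lease.
        if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
            setupLog.Error(err, "unable to set up health check")
            os.Exit(1)
        }
        if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
            setupLog.Error(err, "unable to set up ready check")
            os.Exit(1)
        }

        // The health probe listener starts serving inside Start(); cache sync and
        // leader election also happen in here but do not gate the probe endpoints.
        if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
            setupLog.Error(err, "problem running manager")
            os.Exit(1)
        }
    }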