At the time of writing, the flake had occurred every few hours in the periodic run over the weekend, but this correlated with infrastructure issues in GKE.
We couldn't find any occurrence of this flake on EKS.
We have ruled out any correlation with the Gateway API feature, since the flake is much older than that change.
CLI implementation details:
The cf CLI keeps polling the process for stats until the API reports the process as CRASHED.
Once the cf CLI sees the CRASHED state, it builds a new process summary, which involves getting the process stats again.
The two requests for the process stats happen in quick succession but may sometimes return different states. That is why, in most cases, the app shows as starting rather than crashed.
Korifi API implementation details:
Korifi reports a process as CRASHED when the container of its respective pod is in Terminated status. This is a signal for the CLI to stop polling for process status.
The check described above is potentially inaccurate, since the pod being checked is owned by a statefulset, and the statefulset may cause some status transitions. For example, we have seen the following statefulset event during a successful cf push on kind:
Warning RecreatingFailedPod StatefulSet/7234e40a-dd85-4892-b3cf-b15e1b81fb55-cf--6312669728 StatefulSet cf-space-9dd7b83a-b495-42c9-9455-9dd1462f11a2/7234e40a-dd85-4892-b3cf-b15e1b81fb55-cf--6312669728 is recreating failed Pod 7234e40a-dd85-4892-b3cf-b15e1b81fb55-cf--6312669728-0
As statefulsets restart crashed pods, the implementation above cannot provide a stable crashed state. We need to come up with a more reliable implementation. Back in the eirini-controller days we implemented some heuristics for this, which we could draw inspiration from.
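As a starting point for discussion, here is one possible heuristic, sketched under stated assumptions. It is not the eirini-controller implementation and not existing Korifi code. `containerStatus` is a simplified, hypothetical stand-in for the relevant parts of a Kubernetes container status (current state, waiting reason, restart count, previous exit code), and `isCrashed` is a made-up name. The idea is to keep reporting a crash while the container is waiting to be restarted after a failure, instead of only while it is in Terminated status.

```go
package main

import "fmt"

// containerStatus is a simplified, hypothetical stand-in for the parts of a
// Kubernetes container status that matter here; it is not the real API type.
type containerStatus struct {
	State         string // "waiting", "running" or "terminated"
	WaitingReason string // e.g. "CrashLoopBackOff"; only relevant when waiting
	RestartCount  int    // number of times the container has been restarted
	LastExitCode  int    // exit code of the previous run; only meaningful when RestartCount > 0
}

// isCrashed reports whether the container should be treated as crashed.
// Unlike a check on the Terminated state alone, it also covers the window in
// which the container is waiting to be restarted after a failed run.
func isCrashed(cs containerStatus) bool {
	switch cs.State {
	case "terminated":
		return true
	case "waiting":
		if cs.WaitingReason == "CrashLoopBackOff" {
			return true
		}
		// Waiting again after a previous run that exited with an error.
		return cs.RestartCount > 0 && cs.LastExitCode != 0
	default:
		return false
	}
}

func main() {
	cases := []containerStatus{
		{State: "terminated"},
		{State: "waiting", WaitingReason: "CrashLoopBackOff", RestartCount: 3, LastExitCode: 1},
		{State: "waiting", WaitingReason: "ContainerCreating"},
		{State: "running", RestartCount: 1, LastExitCode: 1},
	}
	for _, cs := range cases {
		fmt.Printf("%+v -> crashed=%v\n", cs, isCrashed(cs))
	}
}
```

This sketch does not cover the case from the event above, where the statefulset recreates the failed pod and the new pod starts with a fresh status. Handling that would likely require tracking the crash outside the pod status, which is left open here.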
Description
We have started seeing multiple flakes like these:
Some thoughts on these: