fiaas / fiaas-deploy-daemon

fiaas-deploy-daemon is the core component of the FIAAS platform
https://fiaas.github.io/

Problem with application status #195

herodes1991 commented 1 year ago

We just detected a problem with the application status result. We have a user who set a smaller readiness timeout than the liveness one, and the app consistently ends up in a FAILED status. What do you think about using the greater of the readiness and liveness values?

https://github.com/fiaas/fiaas-deploy-daemon/blob/727967d9543d6fc0d6f9284421461a0e131a2f1b/fiaas_deploy_daemon/deployer/kubernetes/ready_check.py#L39
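Roughly, the timeout being discussed works like this (a minimal Python sketch; the function name and exact formula are assumptions inferred from the numbers later in this thread, not the actual ready_check.py code):

```python
from datetime import datetime, timedelta

def ready_check_deadline(readiness_initial_delay_seconds, timeout_multiplier):
    # Hypothetical simplification: the ReadyCheck deadline is derived from the
    # readiness probe's initial_delay_seconds and ready-check-timeout-multiplier;
    # the liveness probe's initial delay does not factor in.
    timeout = readiness_initial_delay_seconds * timeout_multiplier
    return datetime.now() + timedelta(seconds=timeout)

# With the values reported later in this thread (3s readiness delay,
# multiplier 30), the application status is marked FAILED after 90 seconds.
deadline = ready_check_deadline(3, 30)
```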

oyvindio commented 1 year ago

We just detected a problem with the application status result. We have a user who set a smaller readiness timeout than the liveness one, and the app consistently ends up in a FAILED status.

I think some more details might be necessary to understand what the problem is in this case. Does the deployment rollout complete successfully at the Kubernetes level? How are the healthchecks for this application configured? Is it possible to create an example application configuration/application resource which can reproduce the issue?

If the deployment rollout does not complete successfully, the healthchecks/readiness probe for the application might need to be adjusted. If the rollout completes successfully but takes longer than the ReadyCheck timeout because of external factors such as image pull or pod scheduling delays, one option could be to increase the ready-check-timeout-multiplier config flag. Note that this will increase the effective ReadyCheck timeout for all applications deployed by the fiaas-deploy-daemon instance.
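To illustrate the trade-off with the sketch above (example values only): raising the multiplier extends the deadline for every application on the instance, not just the one that is slow to start.

```python
# Effective ReadyCheck timeout for a single app as the instance-wide
# multiplier grows; a readiness initial delay of 3 seconds is assumed.
readiness_initial_delay = 3
for multiplier in (30, 40, 50):
    print(multiplier, readiness_initial_delay * multiplier)  # 90, 120, 150 seconds
```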

What do you think about using the greater of the readiness and liveness values?

ReadyCheck uses initial_delay_seconds from the readiness probe because, during a deployment rollout, it is necessary to wait for the readiness probe of each pod to transition to Success to ensure that an appropriate number of pods are available at all times. My understanding is that since the default state of liveness probes is Success, the deployment controller does not wait for the liveness probe's initial_delay_seconds during rollout; liveness probes can only influence the result of a rollout by failing. ReadyCheck tries to determine whether the deployment rollout was successful within a timeout based on an estimate of how long a rollout might take to complete. If the liveness probe's initial_delay_seconds isn't a factor in how long rollouts actually take, then I'm not sure that using it to calculate the ReadyCheck timeout is an ideal solution.

herodes1991 commented 1 year ago

Yes, the user set initial_delay_seconds to 30 for the liveness probe (with some retries) and to 3 for the readiness probe. We have configured ready-check-timeout-multiplier to 30 for the whole namespace, and the application was configured with 1 replica. What happened? The app needs more than 90 seconds to become healthy (approximately 120 seconds), but the fiaas application status becomes FAILED after 90 seconds (30*3) 😔
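Spelled out with the numbers above (a worked restatement, not code from the daemon):

```python
effective_timeout = 3 * 30       # readiness initial delay * multiplier = 90 seconds
actual_startup_time = 120        # roughly what the app needs to become healthy
assert actual_startup_time > effective_timeout  # ReadyCheck gives up first -> FAILED
```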

mortenlj commented 1 year ago

The app needs more than 90 seconds to become healthy (approximately 120 minutes)

Is this a typo, or does the app need 120 minutes to become ready to serve requests?

If so, you need to tell them to go back and build better apps. That kind of thing is not something that can or should be solved in the platform; it needs to be solved in application code.

herodes1991 commented 1 year ago

Oh, yes, it was a typo; it was 120 seconds 😅

oyvindio commented 1 year ago

Yes, the user set initial_delay_seconds to 30 for the liveness probe (with some retries) and to 3 for the readiness probe. We have configured ready-check-timeout-multiplier to 30 for the whole namespace, and the application was configured with 1 replica. What happened? The app needs more than 90 seconds to become healthy (approximately 120 seconds), but the fiaas application status becomes FAILED after 90 seconds (30*3) 😔

If you mean that it takes approximately 120 seconds before the readiness probe is successful (often or always), then it sounds to me like the application would benefit from a longer initial_delay_seconds on the readiness probe to accommodate the time it actually takes to become ready. This should allow more time for the rollout to complete at the Kubernetes level, and would also increase the effective timeout in ReadyCheck. Does that seem like a reasonable solution, or is there a reason why the readiness initial delay can't be increased?
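For reference, a sketch of how the fiaas.yml v3 healthchecks section could be adjusted (field names follow the v3 config format; the paths and values are illustrative placeholders, not taken from the affected application):

```yaml
version: 3
healthchecks:
  readiness:
    http:
      path: /_/ready            # placeholder path
    initial_delay_seconds: 120  # cover the ~120s the app actually needs to start
  liveness:
    http:
      path: /_/health           # placeholder path
    initial_delay_seconds: 30
```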