cloudfoundry / korifi

Cloud Foundry on Kubernetes
Apache License 2.0
316 stars 63 forks source link

[Bug]: Rolling update feature doesn't work with expired container registry credentials #2913

Open beyhan opened 1 year ago

beyhan commented 1 year ago

What happened?

I pushed a docker app from a private container registry. I tried to restart the app with cf restrart <my-app> --strategy=rolling when the container registry secrets which failed living the app in an unhealthy state. I noticed that when I executed the restart command my container registry credentials where expired.

What you expected to happen

I would expect that the restart command fails but the app stays in healthy state.

Acceptance Criteria

GIVEN that I have a docker based app WHEN I try to execute a command which supports --strategy=rolling with expired credentials THEN I expected the command to fails but the app instance should stay in a healthy sate.

How to reproduce it (as minimally and precisely as possible)

  1. cf push <my-app> --docker-image <docker-image-path>:latest --docker-username <my-user-name>
  2. Invalidate your docker credentials/token
  3. cf restart <my-app> --strategy=rolling

Anything else we need to know?

No response

Environment

This happens with https://github.com/cloudfoundry/korifi/releases/tag/v0.9.0

danail-branekov commented 1 year ago

Rolling strategy is implemented using K8S rolling update capabilities of the scheduler. At the moment korifi uses statefulsets to run the app workloads. Rolling update for statefulsets stops the old instance before starting the new one, for ordering reasons. If the app has more than 1 instance this downtime can be avoitded.

Better yet, if we switch the implementation to use deployments instead we would not see downtime even with one instance. However doing this means we cannot supprot the CF_INSTANCE_INDEX env var. We have had many conversations historically about its importance, but no decisions have been taken.

beyhan commented 1 year ago

I see, the downtime is not good and provides a different experience for app update operations from what Diego offers and it is worth to be discussed. Could you maybe provide a link to the that discussions?

Additionally, I would expect that at the end my instance is healthy again when the restart operation finishes which isn't the case. Apps with one instance will be down after the restart operation and apps with multiple instances will have at least one instance "down". Will this be mitigated if the switch to deployments is implemented?