carvel-dev / kapp-controller

Continuous delivery and package management for Kubernetes.
https://carvel.dev/kapp-controller
Apache License 2.0

placeholder: can kapp-controller signal kapp to transiently suppress failures when success is expected to take a few cycles? #349

Open joe-kimmel-vmw opened 3 years ago

joe-kimmel-vmw commented 3 years ago

Describe the problem/challenge you have

Users of kapp-controller, including UI dashboards that rely on kapp-controller under the hood, often see a "reconcile failed" message that resolves into a "reconcile succeeded" message if they wait a few tens of seconds. This false failure undermines user confidence and presents rough edges to end users.

Describe the solution you'd like

Are there cases where kapp-controller "knows" that reconciliation is very likely to take a few retry cycles, e.g. when we first have to resolve image pull secrets? If we can recognize those cases, can we invoke kapp with a flag along the lines of "show_failures_as_retries=5", so that kapp either leaves the status as "Reconciling" or updates it to e.g. "Retrying Reconciling" until after N failures? (We might want to discuss with users whether an intermediate state would be useful in their UX.)
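To make the proposal concrete, here is a minimal Go sketch of the suppression logic, assuming "until after N failures" means the first N consecutive failures are shown as retries and the next one is shown as a failure. Nothing here is an existing kapp or kapp-controller API: `FailureSuppressor`, `Observe`, `errTransient`, and the status strings are hypothetical names chosen for illustration.

```go
package main

import "errors"

// Status is the value a dashboard would display. These strings are
// illustrative assumptions, not kapp-controller's real status conditions.
type Status string

const (
	Retrying           Status = "Retrying Reconciling"
	ReconcileFailed    Status = "Reconcile failed"
	ReconcileSucceeded Status = "Reconcile succeeded"
)

// errTransient stands in for a failure that is expected to clear on its own,
// such as image pull secrets that have not been resolved yet (hypothetical).
var errTransient = errors.New("image pull secret not yet resolved")

// FailureSuppressor reports the first maxRetries consecutive failures as
// retries instead of failures (hypothetical helper).
type FailureSuppressor struct {
	maxRetries int // plays the role of N in "show_failures_as_retries=N"
	failures   int // consecutive failures observed so far
}

// NewFailureSuppressor returns a suppressor that hides up to maxRetries
// consecutive failures.
func NewFailureSuppressor(maxRetries int) *FailureSuppressor {
	return &FailureSuppressor{maxRetries: maxRetries}
}

// Observe records the outcome of one reconcile attempt and returns the
// status that should be shown to the user.
func (s *FailureSuppressor) Observe(err error) Status {
	if err == nil {
		s.failures = 0 // a success resets the consecutive-failure count
		return ReconcileSucceeded
	}
	s.failures++
	if s.failures <= s.maxRetries {
		return Retrying // still within the suppression budget
	}
	return ReconcileFailed // budget exhausted: surface the real failure
}
```

With `NewFailureSuppressor(5)`, five consecutive calls to `Observe(errTransient)` would report "Retrying Reconciling" and only a sixth would report "Reconcile failed"; any success in between resets the count. Whether the counter should live in kapp or in kapp-controller is left open, as in the question above.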


Vote on this request

This is an invitation to the community to vote on issues, to help us prioritize our backlog. Use the "smiley face" up to the right of this comment to vote.

👍 "I would like to see this addressed as soon as possible" 👎 "There are other more important things to focus on right now"

We are also happy to receive and review Pull Requests if you want to help working on this issue.

aaronshurley commented 3 years ago

Accepting this issue and moving it to the unprioritized backlog, for now. If we receive more signal for this feature then we can bump the priority.

joe-kimmel-vmw commented 3 years ago

Just to push a little on signal: this is something that I saw previously and that 100+ people saw during a live demo of TMC, which the primary demo driver wasn't ready for but which their secondary driver was basically expecting. I respect that we may still wait to prioritize this relative to other incoming signals.

aaronshurley commented 3 years ago

Thanks @joe-kimmel-vmw for the additional info. I'd be inclined towards increasing the priority of this issue or at least more closely monitoring this issue in the near term so that we can move quickly.

@vibhas thoughts?

vibhas commented 3 years ago

I agree this is important. Let's closely monitor this issue for now and then prioritize it based on a bit more evidence and feedback.