Kuadrant / policy-machinery

Machinery for implementing Gateway API policies
Apache License 2.0

controller: reconcile func returns an error #31

Closed · guicassolato closed this 1 month ago

guicassolato commented 1 month ago

Makes the ReconcileFunc return an error.

Alters the Workflow reconciliation to abort the workflow if a precondition task or any of the concurrent tasks return an error.

Alters the Workflow reconciler to accept an optional ErrorHandler ReconcileFunc.
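
A rough sketch of the shape of this change (simplified, hypothetical signatures for illustration only; the real types in the controller package differ, and the concurrent execution of tasks is elided):

```go
package controller_sketch

import (
	"context"
	"sync"
)

// Simplified stand-in for the library's ReconcileFunc, now returning an error.
type ReconcileFunc func(ctx context.Context, err error, state *sync.Map) error

// Simplified stand-in for the Workflow type, with an optional ErrorHandler.
type Workflow struct {
	Precondition  ReconcileFunc
	Tasks         []ReconcileFunc // run concurrently in the real implementation
	Postcondition ReconcileFunc
	ErrorHandler  ReconcileFunc // optional: catches errors raised by the funcs above
}

func (w *Workflow) Run(ctx context.Context, state *sync.Map) error {
	var err error
	if w.Precondition != nil {
		err = w.Precondition(ctx, nil, state)
	}
	if err == nil {
		for _, task := range w.Tasks { // concurrency elided for brevity
			if taskErr := task(ctx, nil, state); taskErr != nil {
				err = taskErr // abort: remaining tasks and the postcondition are skipped
				break
			}
		}
	}
	if err == nil && w.Postcondition != nil {
		return w.Postcondition(ctx, nil, state)
	}
	if err != nil && w.ErrorHandler != nil {
		return w.ErrorHandler(ctx, err, state) // may suppress or re-raise the error
	}
	return err // no handler: the error propagates to the enclosing workflow/controller
}
```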


Note: I'm not sure this is a good idea. Arguably, reconciliation tasks should never fail. Rather, when an error is detected, the task should move the resources to a state that reflects the inconsistency and wait for the next event to trigger reconciliation again, when the error will possibly no longer occur.

The main reason for the change was making it possible (or easier) for a workflow reconciliation func to abort. Another option could be letting the workflow continue and carrying the error over to the subsequent tasks, postcondition, etc. – using the sync.Map state (#29) or the error arguments. These subsequent tasks, in turn, would add safe-guard clauses that check for errors carried over from previous steps.

Boomatang commented 1 month ago

I am wondering what we are calling an error here, and whether we need to give guidelines on what an error is.

For example, I don't think a merge conflict on the resource being reconciled is an error. In that case there should already be an event for the newer version of that resource in the queue. The resource status should be updated, and if that is what the error handler does then okay, but we should document that. When it comes to cluster resources we should be getting those events, and if the reconcile tries to modify those resources before it can, I don't think that can be called an error.

What I do think can be classed as an error is a failure within the user code, i.e. unmarshaling fails. For this I wonder whether the user should be forced to handle that error at the source. This idea of a generic error handler seems like a catch-all. In the case of a code failure this may require requeuing the event.

Lastly, there are errors when external resources fail. An example here would be requests to route53 being rate limited. For this I also think the user should be forced to try to handle the failure at the source with a limited number of retries. I am aware this would block the workflow from processing the next event. So again, the resource status should be updated when the limited retries fail, and the event re-queued.

I am even starting to wonder if updating the status of a resource can be used as the requeue trigger. I wonder if we are using errors in the correct fashion for what we are trying to achieve, or if we are using errors because that is the way controller-runtime does it. I will agree there needs to be a way to short-circuit the workflow to get to the status updater. But I am not sure the status updater should even be a different part of the workflow for a given resource.

One approach could be to define a status-updater function that is deferred at the top of the reconcile and passed a status pointer. Once the reconcile function gets into an unrecoverable state, the status pointer is updated and the function exits, causing the status to get updated. This would cut the workflow short, provide information to the user via the resource status, and also create an event in the queue due to the update.
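
Something along these lines, perhaps (a minimal sketch with hypothetical names, not tied to the library's actual types):

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Hypothetical status type standing in for a real resource status.
type PolicyStatus struct {
	Ready   bool
	Message string
}

// The status updater is deferred at the top of the reconcile and receives a
// status pointer; persisting the status produces a new event in the queue,
// which effectively acts as the requeue trigger.
func reconcile(ctx context.Context, updateStatus func(context.Context, *PolicyStatus) error) {
	status := &PolicyStatus{Ready: true}
	defer func() {
		if err := updateStatus(ctx, status); err != nil {
			fmt.Println("status update failed:", err) // this step can fail too
		}
	}()

	if err := doRiskyStep(); err != nil {
		// Unrecoverable state: record it in the status and exit early,
		// cutting the workflow short; the deferred update surfaces the failure.
		status.Ready = false
		status.Message = err.Error()
		return
	}
	// ... continue with the rest of the reconciliation
}

func doRiskyStep() error { return errors.New("unmarshaling failed") }

func main() {
	reconcile(context.Background(), func(_ context.Context, s *PolicyStatus) error {
		fmt.Printf("status updated: %+v\n", s)
		return nil
	})
}
```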

thomasmaas commented 1 month ago

Ignore me if I'm missing the point, but rather than the semantics of what constitutes an error and what does not, isn't it more important that we are able to log errors as specifically as we can in order to help the user find a way forward? Any reliable hint on what might be going wrong is better than silently failing (from a user's perspective).

guicassolato commented 1 month ago

@Boomatang

I am wondering what we are calling an error here

For the purpose of the library, I guess my perspective is we're not calling anything an error per se. Rather, we're just providing the user with the option to tag something as an error.

Without this option, I guess the message we send is that either errors are in general unacceptable and must be handled at the source (as you say), or – what I believe is likely going to happen in many cases – that an error to be carried over (to be handled or simply logged at another level) must be injected into the shared sync.Map.

do we need to give guidelines on what an error is.

Maybe, via documentation, as "good practices". I think the examples you mentioned are a good start.

General rule of thumb, I guess:

Examples of errors that typically suggest events in the queue or not:

should the user be forced to handle that error at the source. This idea of a generic error handler seems like a catch-all.

Although I would probably prefer handling the error at the source for most cases, I still see 2 use cases for raising the error.

The first use case is to not repeat yourself. In a workflow with multiple tasks, there might be specific error types that you do not want to handle in every task.

The second use case is to abort a workflow. Without this PR, currently IDK how one could abort a workflow other than by panicking.

I am even starting to wonder if updating the status of a resource can be used as the requeue trigger.

It can, but remember that updating the status may also fail and/or be the very error being handled in the first place.

@thomasmaas

rather than the semantics of what constitutes an error and what not, isn't it more important that we are able to log errors

IMO, it's up to the user of the library to decide what to do with an error that is raised. However, if nothing suppresses an error that is triggered by a reconciliation function, the top level of the reconciliation, which is the controller provided by the library, will log the error for the user:

https://github.com/Kuadrant/policy-machinery/blob/e445f50ddc2d1d4ffc669a840cc1d317aed5625a/controller/controller.go#L260

eguzki commented 1 month ago

My 2 cents on error handling

guicassolato commented 1 month ago

Thanks for the input, @eguzki.

Resource status should be based on the cluster state, and not on some error of any of the types that can happen during reconciliation. I would not relate resource status reconciliation in any way with error handling when reconciling anything. (Yes, I have done otherwise in the past, but that's why I suggest not to... learnings.)

TBH, maybe I did not understand this part. Maybe you have clearer semantics in mind for what an error is (or should be).

IMO, e.g., trying to update a resource during reconciliation and failing due to being rate-limited by the API server is an error that induces a new state for the cluster. In this case, the new state of the cluster is that reconciliation could not be completed, and perhaps a policy previously flagged as Accepted/Enforced no longer is. It is not the update error that goes to the status of the policy, but a reflection of the new state of the cluster regarding this policy, i.e., that it is no longer Accepted/Enforced. It all started with an error though.

The policy machinery needs to state very clearly what it means to return an error from a Task/Workflow or whatever.

Here, returning an error is mainly a means to an end (e.g. to abort the workflow, to gather handling at an aggregation level, etc.). The error has a meaning of its own (i.e. an exception happened), but the decision of whether to raise this error (instead of handling it at the source) is a prerogative of the user of the lib, to achieve some logic in the code.

For controller-runtime, returning an error basically means retrying.

In the Policy Machinery, there is zero promise of retrying.

I would not add a specific stage for error handling. I do not see any benefit from doing that.

I think the main benefits are 1) structuring the logic in a series of nested/chained reconcile funcs, and 2) the possibility to abort a workflow. Still, it's up to the user of the lib to decide when to use this mechanism, when to handle errors at the source, and when to simply let errors raise all the way.

Types of errors. There are many.

I guess we don't need an extensive list. A typology of errors should be fine.

guicassolato commented 1 month ago

Let me try to summarise the problems we want to solve here and the options I can see to tackle them.

Problem 1: There must be a way to halt a workflow or even the entire reconciliation without panicking.

Problem 2: There are common patterns for dealing with errors that users will not want to repeat across multiple tasks of a workflow but to handle at a single point in the code.

I see 2 options to solve these problems. (There may be a 3rd, 4th option that I do not see.)

We don't need this PR to solve either of the problems described. This is OPTION A that I see.

For both problems, users today can rely on the shared sync.Map included in the signature of the ReconcileFunc. When a condition occurs in a task (the "source") that is either a reason to halt a workflow or the reconciliation altogether (Problem 1), or a common error that could equally have been triggered by some other concurrent task of the workflow and that would be handled by yet another task ahead (Problem 2), the source can inject some data into the shared sync.Map.

To address Problem 1, all subsequent tasks in a possibly complex reconciliation workflow need to implement a safe-guard clause that makes the task skip in the presence of that piece of data. To address Problem 2, it's the opposite: a subsequent task actually triggers if a particular piece of data is present in the shared map.

OPTION A works. However, it requires the source and all other relevant tasks in a workflow to agree on the specific keys to read from the shared sync.Map and on their meaning. These keys will be mixed together with any other data written to the shared map; the semantics to distinguish one kind from another are entirely left to the code.

With regards to Problem 1 specifically, I find OPTION A additionally cumbersome, due to the number of safe-guard clauses that need to be spread across the tasks only for the sake of supporting halting a workflow.
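
For illustration, a minimal sketch of what OPTION A looks like in practice (hypothetical key name, simplified task signatures):

```go
package optiona

import (
	"context"
	"fmt"
	"sync"
)

// Key that the source and all other relevant tasks must agree on.
const haltKey = "halt-reconciliation"

// Source task: on error, injects a marker into the shared sync.Map.
func sourceTask(_ context.Context, state *sync.Map) {
	if err := doSomething(); err != nil {
		state.Store(haltKey, err)
	}
}

// Subsequent task: needs a safe-guard clause to skip when the marker is present (Problem 1).
func subsequentTask(_ context.Context, state *sync.Map) {
	if _, halted := state.Load(haltKey); halted {
		return
	}
	// ... actual work
}

// "Error handler" task: only triggers when the marker is present (Problem 2).
func errorHandlingTask(_ context.Context, state *sync.Map) {
	if v, ok := state.Load(haltKey); ok {
		fmt.Println("handling error carried over from a previous task:", v.(error))
	}
}

func doSomething() error { return nil }
```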

The other option I see is this PR (OPTION B). Here, raising an error that is not handled by any error handler set at any of a series of nested workflows causes the reconciliation process to halt (solves Problem 1). Setting an error handler on a workflow, on the other hand, makes it catch errors that are raised within that workflow. This task then becomes a single point in the code that handles errors for that particular set or subset of tasks wrapped within the same workflow (solves Problem 2).

Regarding OPTION B and Problem 2, it's worth noting that it's still up to each source task to decide which errors to raise and which ones to handle itself, so not all errors will necessarily end up in the error handler. It's also worth noting that an error handler belongs to a particular workflow, which can be nested inside an outer workflow. The nested (inner) error handler only catches errors raised inside the nested (inner) workflow it belongs to, and it can decide whether to suppress these errors (and thus let the outer workflow continue) or to re-raise them, or some of them (thus making them a problem of the outer workflow now).
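
To illustrate, a minimal sketch of such an inner error handler (hypothetical error and key names, using the same simplified ReconcileFunc signature as the sketch in the PR description above):

```go
package optionb

import (
	"context"
	"errors"
	"sync"
)

var errTransient = errors.New("transient failure")

// Error handler attached to the inner workflow: it suppresses the errors it
// knows how to deal with and re-raises everything else to the outer workflow.
func innerErrorHandler(_ context.Context, err error, state *sync.Map) error {
	if errors.Is(err, errTransient) {
		state.Store("inner-workflow-error", err) // record it; the outer workflow continues
		return nil
	}
	return err // re-raise: now a problem of the outer workflow (or its own handler)
}
```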

This last explanation helps visualise Problems 1 and 2 combined (i.e. partially caught by a common "error handler" task but still raised afterwards to halt the workflow). Both options described above, A and B, could be employed.