Better parse error handling

jroper commented 5 years ago

Working with custom resources in Skuber is very difficult due to parse errors being fatal. Skuber assumes that parsing will always be successful, and doesn't offer any opportunity to handle parse errors gracefully. So, if you're not using validation in your CRDs (or if you don't have the validation quite right), then an end user updating a malformed resource will result in the entire operator, for both good and the bad resource, from ceasing to work - any events that are received for that resource will cause the watch source to fail, and any list operations for that type of resource will fail.

Here's what I would suggest should be modified to make writing robust operators much simpler:

WatchEvent should have an additional property (or be wrapped in a higher level object) that can notify when a parse error is encountered. This will allow an operator to see the parse error, and update the status of the resource with an error, to allow the user to learn what they did wrong.
ResourceList should have an additional property for items that were failed to be parsed, and the Format for it should capture all parse failures, and instead of failing to parse the entire list, put the errors into this additional property. This will allow the operator, when it does its initial request for the current set of resources before starting the poll, to set an error in the status of those incorrectly formatted resources.

With the following modifications (or equivalent), I think it will be much easier to write operators using skuber.

doriordan commented 5 years ago

Re. WatchEvent - given that WatchEvent has a non-optional field for the successfully parsed resource, to keep backwards compatibility something else would need to be used to convey the parse error information to the operator, for example we could use the error sink suggested for #269. In the case of your suggestion for adding error details to ResourceList it seems feasible and useful, but I would need to look into the details of what would be involved given current list parsing in Skuber, which leans on the implicit Play readers for lists.

jroper commented 5 years ago

Yeah, Plays traversableReads definitely couldn't be used. As a work around now I'm using CustomResource[JsValue, JsValue] in the API, then I convert the resource back to a JsValue, and then decode using my defined CustomResource[MySpec, MyStatus]. Generalising that might look like this:

def listReadsCollectingErrors[Sp: Reads, St: Reads]: Reads[(List[(CustomResource[JsValue, JsValue], JsError)], List[CustomResource[Sp, St])] = {
  implicitly[Reads[List[CustomResource[JsValue, JsValue]]]].map { resources =>
    val results = resources.map { jsResource =>
      Json.toJson(jsResource).validate[CustomResource[Sp, St]] match {
        case JsSuccess(resource) => Right(resource)
        case err: JsError => Left((jsResource, err))
      }
    }
    (jsResults.collect { case Left(err) => err }, jsResults.collect { case Right(resource) => resource })
  }
}

The reason that the it's a list of (CustomResource[JsValue, JsValue], JsError), and not just the error, is so that you can know which resource has the error, and update the status accordingly.

As for WatchEvent, one problem with using a separate Sink is that if you do then go and modify it to set the error in the status, that will generate another watch event. Generally when implementing operators, you need to implement something stateful that caches both values you've seen, and values you've updated, so that you can distinguish watch events that you generate because you updated the status (and thus avoid infinite update loops) from watch events that something else generated and you need to handle. This is relatively straight forward to do in a single stream, using Flow.scan you can fold over the cache, but if you introduce a separate stream for error handling, then your cache needs to coordinate between the streams, which brings up concurrency and thread safety issues.

doriordan commented 5 years ago

I am coming around to a preference for adding an enriched event type that can represent either errors or normal events. That would require new API watch methods that return sources of these new event types and more than likely deprecating the old watch methods, I'll need to flesh that out.

doriordan / skuber

Better parse error handling #270