Feature Request: support error handling policy

Currently, the execution error of an NRI plugin seems to be handled on three layers: NRI plugins, NRI adaptation, and CRI runtime.

The NRI plugin itself determines whether to throw the error to the nri adaptation. Only the thrown errors can be handled in other layers.
The NRI adaptation catches the error returned by a plugin and checks if the error is fatal. A fatal error will cause the plugin to be closed, but neither abort the calling of other plugins nor throw the error. A non-fatal error will abort the calling of the other plugins and throw the error to the cri runtime (e.g. containerd).
The CRI runtime catches the error and sometimes fails the CRI request while sometimes does not. In containerd v1.7.0, errors from hook events, including RunPodSandbox, CreateContainer, StartContainer, and UpdateContainer are handled in CRI, while errors from the other hook events are ignored by containerd.

In our use cases, we may adjust the cgroup resources or manage devices for containers according to enhance Pod QoS and scheduling. NRI is an approach to do these works synchronically in pods/containers' lifecycles. However, the error handling described above will limit the implementation of our NRI plugin:

The CRI runtime does not handle errors from some hook events. So the plugin needs to remember the failure and retry later by itself. e.g. When a plugin wants to remove/uninstall devices in the RemoveContainer event.
There is no configurable failure policy, so the plugins should cooperate nicely to avoid accidental failure, which aborts other plugins' execution. e.g. When a plugin P0 throws errors, the plugin P1 that runs after P0 is always skipped.

Therefore, we hope that there is an error-handling policy in NRI to resolve our issues:

It is expected to be configurable in the NRI adaptation or in the plugin to determine whether the execution error of a plugin can be ignored. For example, provide an option field failureIgnoredStages in the NRI adaptation or stub which defines whose failure can be ignored otherwise the adaptation and the runtime should handle the error and decide whether to fail the CRI request.
It is expected to be perceivable to the CRI runtime (perhaps a future work). So the CRI runtime can behave to handle the error in CRI requests or ignore it.

The error handling is built on top of several design principles of hooks in the container lifecycle:

if NRI hook is part of synchronous event: e.g. StartContainer, the error returned by the NRI plugin is ending chain of plugins to be executed and propogated to upper layers (e.g. to CRI->kubelet). those are about scenarios where request from upper layers can't be completed successfully and in example above container can't be started.
if hook is done in lifecycle event that is asynchronous (e.g. container post-stop), those errors even if they are raised by plugins, we can't do anything about them anyway, container is gone.
if some of the hook is not implemented in plugin: ttrpc will return "not implemented", so inside runtime adaptaion we can ignore that, as plugin is "not interested" in the corresponding hook.

Based on that, plugins should be designed in the following in mind:

plugin can subscribe/implement hooks based on events that plugin is interested in.
if plugin reached the state which is fatal, and operation must be prevented from successful completition -> plugin should return error to synchronous hooks.
if plugin encountered some non-fatal error (e.g. it will be tyring to modify container in background) -> synchronous hook should return success.

/cc @klihub

containerd / nri

Feature Request: support error handling policy #45