Open bobbypage opened 11 months ago
/cc @klihub
We'll want to trace through and understand other situations that can cause pods to be in StateUnknown. Without looking at the code right now, my understanding is that StateUnknown is also used during containerd restarts.
We might want to publish some best practices for writing NRI plugins (and for general event-based systems), something like:
NRI plugins themselves can crash (or might need to be updated while containerd is running), so they'll need to be able to bootstrap and maintain correct state throughout their lifecycles.
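One way to read that practice, as a minimal sketch: the plugin treats the Synchronize snapshot as the source of truth on every (re)start and makes its event handlers idempotent, so a replayed or duplicated event cannot corrupt its view. The `Tracker` and `Pod` types and the method names below are hypothetical stand-ins for illustration; they are not the NRI API.

```go
package main

import "fmt"

// Pod is a hypothetical, simplified view of a pod sandbox.
type Pod struct {
	ID    string
	State string
}

// Tracker is a hypothetical plugin-side cache of known pods.
type Tracker struct {
	pods map[string]Pod
}

func NewTracker() *Tracker { return &Tracker{pods: map[string]Pod{}} }

// Synchronize discards any stale cache and rebuilds it from the snapshot
// handed over at registration, so a restarted plugin starts from known state.
func (t *Tracker) Synchronize(snapshot []Pod) {
	t.pods = make(map[string]Pod, len(snapshot))
	for _, p := range snapshot {
		t.pods[p.ID] = p
	}
}

// RunPodSandbox is an upsert: seeing the same pod twice is harmless.
func (t *Tracker) RunPodSandbox(p Pod) { t.pods[p.ID] = p }

// RemovePodSandbox tolerates IDs the plugin never saw.
func (t *Tracker) RemovePodSandbox(id string) { delete(t.pods, id) }

// Len reports how many pods the plugin currently tracks.
func (t *Tracker) Len() int { return len(t.pods) }

func main() {
	t := NewTracker()
	t.Synchronize([]Pod{{ID: "a", State: "ready"}})
	t.RunPodSandbox(Pod{ID: "a", State: "ready"}) // duplicate of a synced pod
	t.RunPodSandbox(Pod{ID: "b", State: "ready"})
	t.RemovePodSandbox("never-seen")
	fmt.Println(t.Len()) // prints 2
}
```

Idempotent handlers do not fix the missed event described in this issue, but they keep the plugin correct if the runtime starts delivering more (for example, pods in an unknown state during Synchronize).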
I need to check if it is possible to differentiate between a transient unknown state (pod being created) and non-transient ones. If it is possible then in principle we can try to be more/correctly selective about filtering pods in unknown state during plugin synchronization.
However, I think that wouldn't be enough to fully solve this problem. Even if the plugin synchronization would relay pods in such a state, the pod information relayed would be incomplete. Since there are no post-* events for pods, this would not get corrected from the plugin's point of view until the next (pod or container) event involving the same pod occurs. So we'd also need to account for this internally and take some corrective measures at the end of pod creation, once the pod exits the transient state/gets created.
I suspect that an easier/simpler alternative could be to block plugin registration while a pod is being created.
Just my $0.02: I think it is good not to filter out anything. If we have knowledge of a Pod, but it is in an inconsistent state, send it to the plugin during Sync, but clearly state its status as `Unknown` or something similar, so the plugin can decide what to do with it.
Regarding post-* events, I think we have reached the point where we have use cases for hooks in the Pod lifecycle. In the initial implementation we focused on the container lifecycle; now we can look in more detail at Pod properties. It would also help with reconciliation of pods in the `Unknown` state: in the initial sync the pod will be in the unknown state, and once it is properly started, a post-pod-create event will be delivered to the plugin with the correct state/info.
Delaying plugin registration might not be the best option: pod creation might be stuck for a significant amount of time, so the NRI plugin might start failing its liveness/readiness probes (if deployed via a DaemonSet).
We have a plugin that monitors for `RunPodSandbox` events. We observed that if a `RunPodSandbox` request is in flight while the NRI plugin starts up and registers, then the pod sandbox event will be missed and not delivered in `Synchronize` or `RunPodSandbox`.

Here's the timeline:

1. [...] `RunPodSandbox` creation event
2. [...] `RunPodSandbox`, and creates a pod sandbox in `sandboxstore.StateUnknown`
3. [...] `RunPodSandbox` NRI event (because no NRI plugin is registered just yet)
4. [...] `sandboxstore.StateUnknown`
5. `RunPodSandbox` completes
6. `RunPodSandbox` event was missed from both the `Synchronize` call and `RunPodSandbox` NRI events!

Expected behavior:

I would expect every pod sandbox event to be delivered in either `Synchronize` or `RunPodSandbox`. Maybe one approach to consider is for `Synchronize` to return pod sandbox creations that are in flight (i.e. don't exclude Unknown state pod sandboxes).