containerd / nri

Node Resource Interface
Apache License 2.0
257 stars 65 forks source link

Synchronize() can miss pod sandboxes that are in the process of being created, leading to missing PodSandbox events #63

Open bobbypage opened 11 months ago

bobbypage commented 11 months ago

We have a plugin that monitors for RunPodSandbox events. We observed that if a RunPodSandbox requests is in flight while the NRI plugin starts up and registers, then the pod sandbox event will be missed and not delivered in Synchronize or RunPodSandbox.

Here's the timeline:

  1. Kubelet issues a RunPodSandbox creation event
  2. Containerd starts to process the RunPodSandbox, and creates a pod sandbox in sandboxstore.StateUnknown
  3. Containerd doesn't send a RunPodSandbox NRI event (because no NRI plugin is registered just yet)
  4. NRI Plugin Starts up & Registers
  5. containerd registers the plugin and synchronizes it's state. As part of doing so, it list all the pod sandboxes, but note it filters out sandboxes in sandboxstore.StateUnknown
  6. The NRI plugin recevies the synchronized list of PodSandboxes, but it misses the pod in (1) because the sandbox was in Unknown state
  7. The RunPodSandbox completes
  8. The RunPodSandbox event was missed from both Synchronize call and RunPodSandbox NRI events!

Expected behavior:

I would expect that for every pod sandbox event, it will be delivered in either Synchronize or RunPodSandbox. Maybe one approach to consider is for Synchronize to return pod sandboxes creations that are in flight (i.e. don't exclude Unknown state pod sandboxes).

kad commented 11 months ago

/cc @klihub

samuelkarp commented 11 months ago

Maybe one approach to consider is for Synchronize to return pod sandboxes creations that are in flight (i.e. don't exclude Unknown state pod sandboxes).

We'll want to trace through and understand other situations that can cause pods to be in StateUnknown. Without looking at the code right now, my understanding is that StateUnknown is also used during containerd restarts.

We might want to publish some best practices for writing NRI plugins (and for general event-based systems), something like:

  1. Ensure you subscribe to the event channel before initial synchronization, and process all events (to catch new events that occur during subsequent synchronization)
  2. Perform an initial synchronization
  3. Perform a periodic reconciliation

NRI plugins themselves can crash (or might need to be updated while containerd is running), so they'll need to be able to bootstrap and maintain correct state throughout their lifecycles.

klihub commented 11 months ago

I need to check if it is possible to differentiate between a transient unknown state (pod being created) and non-transient ones. If it is possible then in principle we can try to be more/correctly selective about filtering pods in unknown state during plugin synchronization.

However, I think that wouldn't be enough to fully solve this problem. Even if the plugin synchronization would relay pods in such a state, the pod information relayed would be incomplete. Since there are no post-* events for pods this would not get corrected from the plugins point of view until the next (pod or container) event involving the same pod occurs. So then we'd also need to account for this internally and take some corrective measures at the end of pod creation once it exits the transient state/gets created.

I suspect that an easier/simpler alternative could be to block plugin registration while a pod is being created.

kad commented 11 months ago

Just my $0.02: I think it is good not to filter out anything. If we have knowledge of Pod, but it is in inconsistent state, send it to plugin during Sync, but clearly state status Unknown or something similar, so plugin can decide what to do with it. Regarding post-* events, I think we reached the point where we have use cases for hooks in Pod lifecycle. In initial implementation we focused on containers lifecycle, now we can look at more details of Pod properties. As well, it would can also help with reconcilation of those Unknown state of the pods: in initial sync it will be in unknown state, and once it is properly started, post-pod-create event will be delivered to plugin with correct state/info.

Delaying registering of the plugins might be not the best scenario: pod creation might be stuck for significant amount of time, thus might start fail liveness/readyness probe for NRI plugin (if deployed via DaemonSet).