cilium / cilium

eBPF-based Networking, Security, and Observability
https://cilium.io
Apache License 2.0

CFP: Prioritization of CES updates for quicker enforcement of Network Policy in critical namespaces #34203

Open Kaczyniec opened 2 months ago

Kaczyniec commented 2 months ago

Cilium Feature Proposal

Feature Proposal doc: https://docs.google.com/document/d/13yYuMYrmhQB1ppnM9oYVO9yq5ga6qN2GBTlU4lFaPD4/edit#heading=h.5mwz58y9syps

Problem

When Pods are updated, the Cilium Operator propagates each Pod's information (IP and security identity) across the network through CEP and CES updates. The controller in the Cilium Operator is rate limited in the number of items it can send to the Kube-APIserver per second. As the number of nodes and the pod churn rate increase, more information needs to be propagated, and the rate limit leads to significant delays.

Feature proposal

The Cilium Endpoint Slice controller in the Cilium Operator can prioritize certain Cilium Endpoint Slices (CES) and Cilium Endpoints (CEP) by sending their updates to the Kube-APIserver first. This prioritization aims to accelerate the propagation of critical changes during Network Policy updates, so that an updated network policy can be enforced sooner in more critical namespaces.

Proposed solution

To accelerate the propagation of critical updates, we propose classifying updates as "important" or "standard" based on the namespace to which the CEP and CES belong.

A watcher watches namespace events. If the modified namespace's attribute cilium.io/ces-namespace has the value "priority", the namespace is added to the map of prioritized namespaces in the controller. If the attribute has a different value or the namespace is deleted, the namespace is removed from the map of prioritized namespaces (if it was present).

The controller will use two queues instead of one: standard_queue and fast_queue. Before adding an item to a queue, the controller will check the item's namespace against the prioritized map. If it matches, the item is added to the fast_queue; otherwise, it goes to the standard_queue. When processing the next work item, the controller will first check the fast_queue. If it is not empty, the next element is retrieved from there. If the fast_queue is empty, the controller will process elements from the standard_queue. The implementation of the queues, including error handling, remains unchanged from the single-queue approach.

We recommend that users annotate as priority those namespaces whose information must be propagated to enable traffic. For example, consider pod X in namespace A, which handles essential production traffic within the customer's cluster. If a network policy allows pod X to connect to pod Y in namespace B, then namespace B should be annotated as priority. Changes in namespace B will then propagate sooner, and pod X can resume traffic sooner.
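A minimal sketch of the namespace map and the two-queue dispatch described above, to make the intended ordering explicit. This is not the operator's actual code: the type and function names (`controller`, `item`, `onNamespaceUpdate`, `onNamespaceDelete`, `enqueue`, `next`) are hypothetical, and plain slices stand in for whatever queue type the controller really uses. Only the key cilium.io/ces-namespace, the value "priority", and the names fast_queue and standard_queue come from the proposal.

```go
package main

import (
	"fmt"
	"sync"
)

const (
	priorityKey   = "cilium.io/ces-namespace" // attribute key from the proposal
	priorityValue = "priority"                // attribute value from the proposal
)

// item is a hypothetical work item: a CES name and the namespace it belongs to.
type item struct {
	Namespace string
	Name      string
}

// controller holds the prioritized-namespace map and both queues.
type controller struct {
	mu            sync.Mutex
	prioritized   map[string]struct{}
	fastQueue     []item // fast_queue in the proposal
	standardQueue []item // standard_queue in the proposal
}

func newController() *controller {
	return &controller{prioritized: make(map[string]struct{})}
}

// onNamespaceUpdate adds the namespace to the map if it carries the priority
// attribute, and removes it otherwise.
func (c *controller) onNamespaceUpdate(name string, attrs map[string]string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if attrs[priorityKey] == priorityValue {
		c.prioritized[name] = struct{}{}
	} else {
		delete(c.prioritized, name)
	}
}

// onNamespaceDelete removes a deleted namespace from the map, if present.
func (c *controller) onNamespaceDelete(name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.prioritized, name)
}

// enqueue routes the item to fast_queue or standard_queue by namespace.
func (c *controller) enqueue(it item) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.prioritized[it.Namespace]; ok {
		c.fastQueue = append(c.fastQueue, it)
		return
	}
	c.standardQueue = append(c.standardQueue, it)
}

// next returns the next work item, always draining fast_queue first.
func (c *controller) next() (item, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, q := range []*[]item{&c.fastQueue, &c.standardQueue} {
		if len(*q) > 0 {
			it := (*q)[0]
			*q = (*q)[1:]
			return it, true
		}
	}
	return item{}, false
}

func main() {
	c := newController()
	c.onNamespaceUpdate("b", map[string]string{priorityKey: priorityValue})
	c.enqueue(item{"a", "ces-1"})
	c.enqueue(item{"b", "ces-2"})
	c.enqueue(item{"a", "ces-3"})
	for it, ok := c.next(); ok; it, ok = c.next() {
		fmt.Println(it.Namespace + "/" + it.Name)
	}
	// Prints b/ces-2, then a/ces-1, then a/ces-3.
}
```

One consequence of always checking the fast_queue first is that a steady stream of priority items can delay the standard_queue indefinitely, so the set of priority namespaces probably needs to stay small.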

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.