(apologies, I've been totally slammed, but reviewing this is very high up on my to-do list)
This is a clever solution, but I have concerns about the complicated dance that is Cilium acting as both a CNI plugin and a CNI runtime -- especially the details around lifecycle and making sure we don't leak IPs. CNI GC can save us, to some extent -- but only once the runtime (containerd) implements it, which is in progress but not done yet :-).
What would it be like if health and ingress IPs were allocated by an external controller and set as annotations on the Node? Cilium could have a mode where it waits for these annotations before proceeding with startup.
Alternatively, what if we created a "placeholder" daemonset? There would be two no-op pods scheduled to each node, but the lifecycle would be crystal clear.
Thanks for reviewing this, @squeed!
> This is a clever solution, but I have concerns about the complicated dance that is Cilium acting as both a CNI plugin and a CNI runtime -- especially the details around lifecycle and making sure we don't leak IPs. CNI GC can save us, to some extent -- but only once the runtime (containerd) implements it, which is in progress but not done yet :-).
Yeah, after thinking this through, I tend to agree. Even if CNI GC is able to clean up leaked IPs, I think there's a risk of double-allocating IPs if the volume mounts are misconfigured (cilium-agent restarts, CNI IPAM state isn't preserved due to the misconfigured mounts, and the IPAM plugin returns an IP that was already assigned to a pod).
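To illustrate the failure mode: delegated IPAM plugins like host-local keep their allocation state on the host filesystem, so the cilium-agent daemonset has to mount that state correctly. A rough sketch, using host-local's default state path purely for illustration:

```yaml
# Illustration only: host-local IPAM records allocations as files under
# /var/lib/cni/networks on the host. If this mount is missing or points at
# the wrong path, a restarted cilium-agent sees empty IPAM state and the
# plugin can return an IP that's already assigned to a running pod.
spec:
  containers:
  - name: cilium-agent
    image: quay.io/cilium/cilium:v1.14.0  # version illustrative
    volumeMounts:
    - name: cni-ipam-state
      mountPath: /var/lib/cni/networks
  volumes:
  - name: cni-ipam-state
    hostPath:
      path: /var/lib/cni/networks
      type: DirectoryOrCreate
```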
> What would it be like if health and ingress IPs were allocated by an external controller and set as annotations on the Node? Cilium could have a mode where it waits for these annotations before proceeding with startup.
That seems feasible, although I'd worry about moving this responsibility to the cloud provider. The nice thing about delegated IPAM today is that Azure IPAM doesn't need to know that it's running Cilium, and Cilium doesn't need to know that it's using Azure IPAM.
In theory I guess someone could write a controller that tells Azure IPAM to allocate an IP, then annotates the nodes; however, there isn't an easy way to do this in AKS today. In most networking configurations, IPAM allocation happens by invoking an endpoint on a daemonset on the node where the IP is allocated; there isn't an API available that a controller could call to allocate an IP for any node.
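To make the idea concrete, the end state after such a controller runs might look something like this (the annotation keys are made up for illustration, not existing Cilium annotations):

```yaml
# Illustrative only: a Node after a hypothetical external controller has
# allocated the IPs from delegated IPAM and recorded them as annotations.
# cilium-agent would wait for these before finishing startup.
apiVersion: v1
kind: Node
metadata:
  name: aks-nodepool1-12345678-vmss000000
  annotations:
    example.cilium.io/ipv4-health-ip: "10.241.0.10"
    example.cilium.io/ipv4-ingress-ip: "10.241.0.11"
```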
> Alternatively, what if we created a "placeholder" daemonset? There would be two no-op pods scheduled to each node, but the lifecycle would be crystal clear.
I like this approach better. No-op daemonset pods seem like a reasonable tradeoff to reduce the risk of misconfiguration.
I think there's a cyclic dependency to figure out, though: (1) the daemonset pod needs cilium-agent running to start and get an IP, and (2) cilium-agent needs the IP from the daemonset pod before it can start. How hard would it be to let cilium-agent start without the IPs? For the ingress/healthcheck IPs I guess cilium-agent could set these up later, but I'm less sure about the router IP. (See the sketch below for roughly what I imagine the placeholder daemonset looking like.)
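A minimal sketch of the placeholder daemonset, with all names illustrative:

```yaml
# Illustrative sketch: a daemonset whose pods do nothing except hold an IP
# allocated through the normal CNI path. cilium-agent would read the pod IP
# back (e.g. from the pod status) and use it as the node's ingress IP.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cilium-ingress-ip-holder
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: cilium-ingress-ip-holder
  template:
    metadata:
      labels:
        app: cilium-ingress-ip-holder
    spec:
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
```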
> In theory I guess someone could write a controller that tells Azure IPAM to allocate an IP, then annotates the nodes; however, there isn't an easy way to do this in AKS today.
I take that back: it is possible to assign additional IPs to the VM through Azure APIs, which is what Cilium's legacy Azure IPAM implementation does: https://github.com/cilium/cilium/blob/4820ebe7b372977c351837eb1ce5ac6024f712aa/pkg/azure/ipam/instances.go#L23
So that might be an option. It's a bit sad that it's cloud-provider specific, but at least someone could implement this controller outside the Cilium codebase.
> What would it be like if health and ingress IPs were allocated by an external controller and set as annotations on the Node?
One other quick thought on this: there's a security risk in granting a controller permission to update k8s Nodes, since an attacker who compromised it could manipulate labels to control where pods get scheduled (https://kccnceu2022.sched.com/event/ytlb/trampoline-pods-node-to-admin-privesc-built-into-popular-k8s-platforms-yuval-avrahami-shaul-ben-hai-palo-alto-networks). Maybe permission to mutate CiliumNode is safer?
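For comparison, scoping the controller's write access to CiliumNode objects would look roughly like this (role name is made up):

```yaml
# Sketch: RBAC for a hypothetical controller that records allocated IPs on
# CiliumNode objects instead of mutating core Node objects (which would
# also grant the ability to manipulate scheduling-related labels).
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: ipam-ip-annotator
rules:
- apiGroups: ["cilium.io"]
  resources: ["ciliumnodes"]
  verbs: ["get", "list", "watch", "update", "patch"]
```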
@wedaly We now have status for the CFPs https://github.com/cilium/design-cfps#status. Would you consider moving this to dormant so it could be picked up again in the future?
> @wedaly We now have status for the CFPs https://github.com/cilium/design-cfps#status. Would you consider moving this to dormant so it could be picked up again in the future?
Yep, that works for me. Where do I set the status?
Something like this is good enough:
https://github.com/cilium/design-cfps/pull/7/commits/66785a1b4f05332a4beb3531b196acf23d1e0533
To be merged into the repository in dormant state, we would need two more things:
1. Each commit needs to be signed off with DCO. You can see details and what this means in the "checks" section at the bottom of the PR.
2. Mark the PR as "ready for review".
thanks, updated!
I started this draft CFP and prototype for getting delegated IPAM to work with endpoint health checking and the ingress controller back in November. Unfortunately, I'm not going to have time to work further on this, so I'm publishing it as a draft PR to share what I found.