cilium / cilium

eBPF-based Networking, Security, and Observability
https://cilium.io
Apache License 2.0

CFP: Block pod startup until extant egress gateway policies are applied #24791

Open benesch opened 1 year ago

benesch commented 1 year ago

Cilium Feature Proposal

Is your feature request related to a problem?

Pods that are subject to a CiliumEgressGatewayPolicy can make a few requests immediately after pod startup that are not subject to the egress policy.

This problem was previously reported in https://github.com/cilium/cilium/issues/22969 as excessive latency between pod creation and policy application. That issue was solved by optimizing the policy application time; however, a small window (1-2s) between pod creation and policy application still exists.

Describe the feature you'd like

We'd like Cilium to block pod startup until all egress gateway policies that were present at the time of the pod's creation have been applied to the pod.

I've previously filed this request via a support ticket (https://support.isovalent.com/hc/en-us/requests/332; internal enhancement request 520), but I wanted to re-request in public so that I could link this issue at folks.

(Optional) Describe your proposed solution

This is analogous to synchronously waiting for the CiliumEndpoint to be created when a pod is created, which Cilium added support for back in 2018:

Ideally a similar code path could be added for egress gateway policies—i.e., waiting for the first regeneration of egress policies after the new endpoint is created.

Workaround

For anyone else running into this, we've been successfully using the following workaround at Materialize. We plumb the list of known egress gateway IPs into an init container, and then block until curl https://checkip.amazonaws.com reports that the traffic is going out one of those egress IPs:

# Loop until the externally visible IP is one of the known egress IPs.
while ! [[ " ${KNOWN_EGRESS_IPS[*]} " =~ " $(curl -s https://checkip.amazonaws.com) " ]]; do
    sleep 0.5
done

Silly, but it works.
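For anyone adapting this, here is a slightly more defensive sketch of the same idea, not the exact script we run. It compares IPs by exact string equality, ignores empty responses from a failed curl, and gives up after a bounded number of attempts instead of blocking forever. The function names and the CHECK_IP_URL and POLL_INTERVAL variables are hypothetical, chosen only for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: function and variable names below are illustrative assumptions.

CHECK_IP_URL="${CHECK_IP_URL:-https://checkip.amazonaws.com}"

# Print the IP that the outside world sees for this pod (empty on failure).
fetch_public_ip() {
    curl -fsS --max-time 5 "$CHECK_IP_URL" 2>/dev/null | tr -d '[:space:]'
}

# Return 0 if $1 is non-empty and exactly equals one of the remaining arguments.
ip_in_list() {
    local needle="$1"; shift
    local ip
    [[ -n "$needle" ]] || return 1
    for ip in "$@"; do
        [[ "$ip" == "$needle" ]] && return 0
    done
    return 1
}

# Poll until fetch_public_ip reports a known egress IP.
# $1 is the maximum number of attempts; the rest are the known egress IPs.
wait_for_egress_ip() {
    local max_attempts="$1"; shift
    local attempt
    for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
        if ip_in_list "$(fetch_public_ip)" "$@"; then
            return 0
        fi
        sleep "${POLL_INTERVAL:-0.5}"
    done
    return 1
}
```

In an init container this could be invoked as `wait_for_egress_ip 120 "${KNOWN_EGRESS_IPS[@]}" || exit 1`, so that the pod fails to start rather than silently running without the egress policy if the timeout is reached.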

MrFreezeex commented 1 year ago

I have also been looking into the correctness of egress gateway lately, although probably with less need than you: my personal cluster where I use egress gateway is very small, so policy application is probably nearly instantaneous in my case anyway. From what I can see (a non-Isovalent, non-expert opinion), this is most likely much harder to implement than the solution you propose, because the BPF maps need to be updated on all the involved nodes (all the selected egress gateways plus the client node) for the egress gateway to work correctly.

So each node involved with a given egress gateway might need to report somehow whenever it updates its egress maps, and the CNI call would then wait on those reports. I'm not exactly sure what the best way to report this would be, though :thinking: (cilium endpoint status maybe?)...

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

benesch commented 1 year ago

Not stale.


github-actions[bot] commented 11 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

benesch commented 10 months ago

Not stale.

github-actions[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

benesch commented 8 months ago

Still relevant.