cilium / cilium

eBPF-based Networking, Security, and Observability
https://cilium.io
Apache License 2.0
19.24k stars 2.79k forks source link

CFP: CiliumNetworkPolicy validation #23152

Open giorio94 opened 1 year ago

giorio94 commented 1 year ago

Cilium Feature Proposal

Is your feature request related to a problem?

Malformed/invalid CNPs (e.g., L7 policies without the protocol: TCP field set) get silently discarded by the agent, emitting only the corresponding log entry about the error encountered during validation (e.g., L7 rules can only apply to TCP (not ANY) except for DNS rules). For non-experts, this makes it tricky to understand why the CNP is not being enforced as expected, as no errors are reported while creating the resource, nor they are attached to it.

Describe the feature you'd like

In my opinion, a great QoL addition would be the introduction of a validation webhook to prevent the creation of invalid network policies, making it immediately clear to the user that something is wrong with that.

Alternatively, the feedback might be returned as part of the status stanza of the resource, or emitted as a Kubernetes event, which are typically looked for before the logs.

christarazi commented 1 year ago

Relevant K8s documentation: https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#webhook-configuration

One thing that is not entirely clear to me is what should be the clientConfig configuration. Specifying the URL has localhost would be an obvious first choice, as Cilium runs as a daemonset so it will be available on every node, however we can't assume that the kube-apiserver is always running within the cluster on a node(s).

We need to somehow specify a destination API endpoint (i.e. http://host:port...) in a manner where we don't assume the location of the kube-apiserver. Specifying a service could potentially work. The Operator could create the service and receive / respond to the validation requests. This seems to be a better approach than having each Cilium Agent handle this.

squeed commented 1 year ago

Running webhooks, especially for network providers, can be kind of a pain:

as a first pass, what if we wrote CEL rules for our types?

1: this one has bitten me really hard in the past; it turns out some managed control plane offerings have incomplete connectivity between the apiserver and the underlying nodes. Connections to the pod network are fine, but not necessarily to nodes directly

christarazi commented 1 year ago

1: this one has bitten me really hard in the past; it turns out some managed control plane offerings have incomplete connectivity between the apiserver and the underlying nodes. Connections to the pod network are fine, but not necessarily to nodes directly

But then how do the kubelet healthchecks work? Aren't they apiserver-to-node traffic too?

christarazi commented 1 year ago

as a first pass, what if we wrote CEL rules for our types?

But immutability is not validation? How does this help?

squeed commented 1 year ago

Whoops, I pasted the wrong blog link. CEL is a general-purpose validation language. See https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/ — sorry

christarazi commented 1 year ago

Ah great, yes this looks very promising. I didn't realize this sort of complex validation expressions have landed. We've done a lot of work a few years ago to get our CRDs statically validated, but it hasn't been touched since then, so it seems we should be able to switch some (hopefully all) fields to CEL expressions. This might even allow us to get rid of our nasty fork of controller-tools.

christarazi commented 1 year ago

cc @aanm ^ :slightly_smiling_face:

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

github-actions[bot] commented 11 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

danehans commented 9 months ago

CRD validation rules graduated to Beta in k8s v1.25 and the CustomResourceValidationExpressions feature gate is enabled by default to validate custom resource based on validation rules.

Note that CustomResourceValidationExpressions can address the use case described in this issue but not the use case described in https://github.com/cilium/cilium/issues/28080. This is because --enable-l7-proxy=true|false is not reflected in the CiliumNetworkPolicy CRD and all validation rules are scoped to the current object: no cross-object or stateful validation rules are supported. cc @joestringer

github-actions[bot] commented 7 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

github-actions[bot] commented 5 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

github-actions[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

squeed commented 1 month ago

So, @sayboras has been looking in to this briefly, but a lot more help is required. However, there is one potential stumbling block: the CEL complexity analyzer will need us to set maxItems or equivalent on a number of lists.

This could potentially affect existing policies, as there are currently no limitations.

That means we will need to do a controlled rollout of these limitations, and warn users in the most helpful way possible.

One option may be via https://github.com/cilium/cilium/pull/32727, which is a PR that sets an informational status on objects. Additionally, we may wish to expose this via a metric and write useful Prometheus alerts. Once these limitations are rolled out and end-users have an opportunity to adjust policies, we can add hard limits.