damianavila opened 2 years ago
Push the conversation forward but do not block the existing deployment.
EKS basically doesn't offer any built-in network policy enforcement, and their examples (https://github.com/aws/amazon-vpc-cni-k8s/tree/master/charts/aws-calico) are deprecated too. In https://docs.aws.amazon.com/eks/latest/userguide/calico.html, they recommend using https://github.com/tigera/operator - which IMO is pretty complicated for what we need to do. I'm sure there are other alternatives, but I'd suggest we revisit this in Q1 of next year.
Makes sense to me given all the other priorities in place right now.
As additional context, Calico (what AWS suggests) is also what GCP uses to provide network policy enforcement, so using the same would help us avoid excess complexity. However, we currently don't run any operators in any of our clusters, preferring direct helm charts instead. So that would have to be new, unless there also exists a plain Calico helm chart (not the operator helm chart).
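For reference, if we did accept the operator route, the install via the tigera-operator helm chart looks roughly like the following. This is a minimal sketch based on the upstream Calico docs; the version pin is only an example and the repo URL should be double checked against the current docs. Note it is still the operator-based install, not a plain Calico chart.

```bash
# Add the Calico helm repo (hosts the tigera-operator chart)
helm repo add projectcalico https://docs.tigera.io/calico/charts
helm repo update

# Install the operator chart into its own namespace;
# the version shown is only an example pin
helm install calico projectcalico/tigera-operator \
  --namespace tigera-operator --create-namespace \
  --version v3.26.1
```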
Investigating why some clusters requested more memory than others, I found that the aws-node daemonset on some clusters (likely our most recently created clusters) has pods with an additional container related to enforcing network policies.
The checked clusters have network policy enforcement.
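A quick way to check this per cluster is to list the container names in the aws-node daemonset (a sketch; the exact name of the network policy agent container may differ between VPC CNI versions):

```bash
# List the container names in the aws-node daemonset; clusters with the
# VPC CNI network policy agent enabled show an extra container next to aws-node
kubectl get daemonset aws-node -n kube-system \
  -o jsonpath='{.spec.template.spec.containers[*].name}'
```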
A verified way to test network policy enforcement for cluster-local IPs is to run wget support-grafana.support.svc from a user pod; if enforcement is working, you will see Connecting to support-grafana.support.svc (support-grafana.support.svc)|10.3.250.43|:80... failed: Connection timed out.
Below is an example where I saw that first, and then, when I deleted the singleuser netpol affecting the user pod in a GKE cluster, it started working (a sketch of the delete command follows the transcript).
jovyan@jupyter-erik-402i2c-2eorg:~$ wget support-grafana.support.svc
--2024-09-10 11:26:02-- http://support-grafana.support.svc/
Resolving support-grafana.support.svc (support-grafana.support.svc)... 10.3.250.43
Connecting to support-grafana.support.svc (support-grafana.support.svc)|10.3.250.43|:80... failed: Connection timed out.
Retrying.
--2024-09-10 11:28:13-- (try: 2) http://support-grafana.support.svc/
Connecting to support-grafana.support.svc (support-grafana.support.svc)|10.3.250.43|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: /login [following]
--2024-09-10 11:28:29-- http://support-grafana.support.svc/login
Reusing existing connection to support-grafana.support.svc:80.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘index.html’
index.html [ <=> ] 37.42K --.-KB/s in 0s
2024-09-10 11:28:29 (112 MB/s) - ‘index.html’ saved [38318]
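For completeness, the deletion step in that test looks roughly like this. This is a sketch assuming the z2jh-created NetworkPolicy named singleuser; the namespace below is a placeholder.

```bash
# Delete the z2jh singleuser NetworkPolicy so the user pod's egress
# is no longer restricted (namespace is a placeholder)
kubectl delete networkpolicy singleuser -n <hub-namespace>

# Re-run the test from the user pod; with the netpol gone the request
# should connect instead of timing out
wget support-grafana.support.svc
```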
Setting up nmfs-openscapes with the modern addon setup where network policy enforcement could be enabled, I didn't get network policies to actually be enforced in practice, but I saw no indication, config-wise or otherwise, that it wasn't enabled.
So the current status is that network policy enforcement via the Amazon VPC CNI addon has been tried without success so far, also in new clusters with modern addon versions, but it hasn't been debugged at length.
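For the record, the switch I'd expect to matter here is the addon's enableNetworkPolicy configuration value. A rough sketch of setting and verifying it via the AWS CLI follows; the cluster name is taken from the comment above and may not match the actual EKS cluster name, and the configuration key should be checked against the addon's configuration schema.

```bash
# Ask EKS to turn on network policy support in the vpc-cni addon
# (enableNetworkPolicy is a documented configuration value for this addon)
aws eks update-addon \
  --cluster-name nmfs-openscapes \
  --addon-name vpc-cni \
  --configuration-values '{"enableNetworkPolicy": "true"}'

# Confirm the configuration actually stuck
aws eks describe-addon \
  --cluster-name nmfs-openscapes \
  --addon-name vpc-cni \
  --query 'addon.configurationValues'
```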
Context
This is a "soft" pre-requisite for being able to deploy a shared cluster in AWS land.
We currently enforce it in the GCP clusters, AFAIK: https://infrastructure.2i2c.org/en/latest/topic/cluster-design.html#network-policy.
And @consideRatio confirmed we are not enforcing them on the EKS clusters (copying over from notes):
Proposal
We may want to enforce it from the start on newly created clusters instead of installing it on existing clusters. There might be an easy(?) terraform way to set up Calico/Cilium on EKS-based nodes.
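If we went the Cilium route, the terraform path would presumably just wrap a helm release; the underlying install is roughly the following. This is a sketch based on Cilium's AWS ENI installation guide — the chart values and their names vary by chart version and are assumptions to verify, and ENI-mode Cilium replaces or overlaps with the aws-node CNI rather than sitting alongside it.

```bash
# Cilium's official helm repo
helm repo add cilium https://helm.cilium.io/
helm repo update

# ENI-mode install on EKS; values follow Cilium's AWS ENI guide for recent
# chart versions and should be re-checked against the version actually used
helm install cilium cilium/cilium \
  --namespace kube-system \
  --set eni.enabled=true \
  --set ipam.mode=eni \
  --set egressMasqueradeInterfaces=eth0 \
  --set routingMode=native
```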
Updates and actions