2i2c-org / infrastructure

Infrastructure for configuring and deploying our community JupyterHubs.
https://infrastructure.2i2c.org

NetworkPolicy enforcement in AWS based (eksctl) clusters #1794

Open damianavila opened 1 year ago

damianavila commented 1 year ago

Context

This is a "soft" pre-requisite for being able to deploy a shared cluster in AWS land.

We currently enforce it in the GCP clusters, AFAIK: https://infrastructure.2i2c.org/en/latest/topic/cluster-design.html#network-policy.
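For context, enforcement means that a NetworkPolicy object like the one below actually blocks traffic; without a controller, the API server accepts it but it has no effect. A minimal illustrative sketch (not one of our actual policies; namespace and name are placeholders):

# illustrative only: deny all ingress to pods in a namespace; this object is
# accepted either way, but only blocks traffic if a controller enforces it
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-ingress
  namespace: test
spec:
  podSelector: {}
  policyTypes:
    - Ingress
EOF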

And @consideRatio confirmed we are not enforcing them on the EKS clusters (copying over from notes):

Practical test by Erik on JMTE’s cluster: NetworkPolicies are not enforced; there is no NetworkPolicy controller there by default.
I tested this practically using https://github.com/jupyterhub/action-k3s-helm/tree/main/test-netpol-enforcement, where I did the following:

cd /tmp
# clone the repo root (the tree/main/... URL above is a browser path, not a git remote)
git clone https://github.com/jupyterhub/action-k3s-helm
cd action-k3s-helm/test-netpol-enforcement
helm install --namespace=test --create-namespace test .
helm test --namespace=test test

To conclude, a way to install a network policy controller will be relevant.
Maybe like this with calico: https://sbulav.github.io/aws/aws-calico-implementing-on-existing-cluster/
Maybe like this with cilium: https://gist.github.com/ruzickap/6cd45d7d4b97fb6dfc27d1a9a4af848f#file-eksctl-cilium-sh-L43-L51 (sketched below)
Overall, the ways of installing a network policy controller seem troublesome and fragile, as they replace key parts that would otherwise have been managed for us.
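For concreteness, a hedged sketch of the Cilium route from the gist above, where the AWS-managed CNI is replaced wholesale (untested here; chart values are assumptions based on Cilium's EKS documentation):

# remove the AWS-managed CNI daemonset so Cilium can take over pod networking
kubectl -n kube-system delete daemonset aws-node
# install Cilium in ENI mode so pod IPs are still allocated from the VPC
helm repo add cilium https://helm.cilium.io/
helm install cilium cilium/cilium \
  --namespace kube-system \
  --set eni.enabled=true \
  --set ipam.mode=eni \
  --set egressMasqueradeInterfaces=eth0 \
  --set tunnel=disabled

This illustrates the "replace key parts" concern: deleting aws-node hands networking over to a component we would have to manage ourselves.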

Proposal


We may want to enforce it from scratch on new clusters instead of installing it on existing ones. There might be an easy Terraform (or eksctl) way to set up calico/cilium on EKS-based clusters; one possible from-scratch sketch follows.
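For illustration, a hedged eksctl sketch of enabling enforcement at cluster creation, using the VPC CNI addon's network policy support discussed in the later comments below (cluster name and region are placeholders; the enableNetworkPolicy flag assumes a recent vpc-cni addon version):

# hedged sketch: create the cluster with the vpc-cni addon's network policy
# support enabled from the start, instead of retrofitting a controller
cat > cluster.yaml <<'EOF'
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster    # placeholder
  region: us-west-2   # placeholder
addons:
  - name: vpc-cni
    configurationValues: |-
      {"enableNetworkPolicy": "true"}
EOF
eksctl create cluster --config-file cluster.yaml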

Updates and actions

damianavila commented 1 year ago

Push the conversation forward but do not block the existing deployment.

yuvipanda commented 1 year ago

EKS basically doesn't offer any built-in networkpolicy enforcement, and their examples (https://github.com/aws/amazon-vpc-cni-k8s/tree/master/charts/aws-calico) are deprecated too. In https://docs.aws.amazon.com/eks/latest/userguide/calico.html, they recommend using https://github.com/tigera/operator - which IMO is pretty complicated for what we need to do. I'm sure there are other alternatives, but I'd suggest we revisit this in Q1 of next year.

damianavila commented 1 year ago

but I'd suggest we revisit this in Q1 of next year.

Makes sense to me given all the other priorities in place right now.

yuvipanda commented 1 year ago

As additional context, Calico (what AWS suggests) is also what GCP uses to provide networkpolicy enforcement, so using the same would help us avoid excess complexity. However, we currently don't run any operators in any of our clusters, preferring direct helm charts instead. So that would have to be new, unless there also exists a calico helm chart (not the operator helm chart).

damianavila commented 1 year ago

Additional info: https://github.com/aws/eks-charts/tree/master/stable/aws-calico. One more: https://github.com/projectcalico/calico/tree/master/charts/tigera-operator.
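For reference, a hedged sketch of installing the tigera-operator chart linked above (the chart repo URL is taken from Calico's documented helm install and has not been verified against our clusters):

# add Calico's chart repository and install the operator chart
helm repo add projectcalico https://docs.tigera.io/calico/charts
helm install calico projectcalico/tigera-operator \
  --namespace tigera-operator --create-namespace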

consideRatio commented 4 months ago

Investigating why some clusters requested more memory than others, I found that the aws-node daemonset on some clusters (likely our most recently created clusters) has pods with an additional container related to enforcing network policies.

The clusters I checked with that additional container have network policy enforcement.
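A quick hedged way to check for that additional container on any cluster:

# list container names in the aws-node daemonset; a second container alongside
# the CNI container indicates the network policy node agent is present
kubectl -n kube-system get daemonset aws-node \
  -o jsonpath='{.spec.template.spec.containers[*].name}'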

consideRatio commented 2 weeks ago

A verified way to test network policy enforcement for cluster-local IPs is to run wget support-grafana.support.svc from a user pod: if enforcement is working, you will see "Connecting to support-grafana.support.svc (support-grafana.support.svc)|10.3.250.43|:80... failed: Connection timed out."

Below is an example where I saw that failure first; then, after I deleted the singleuser netpol affecting the user pod in a GKE cluster, the retry succeeded.

jovyan@jupyter-erik-402i2c-2eorg:~$ wget support-grafana.support.svc
--2024-09-10 11:26:02--  http://support-grafana.support.svc/
Resolving support-grafana.support.svc (support-grafana.support.svc)... 10.3.250.43
Connecting to support-grafana.support.svc (support-grafana.support.svc)|10.3.250.43|:80... failed: Connection timed out.
Retrying.

--2024-09-10 11:28:13--  (try: 2)  http://support-grafana.support.svc/
Connecting to support-grafana.support.svc (support-grafana.support.svc)|10.3.250.43|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: /login [following]
--2024-09-10 11:28:29--  http://support-grafana.support.svc/login
Reusing existing connection to support-grafana.support.svc:80.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘index.html’

index.html                          [ <=> ]  37.42K  --.-KB/s    in 0s

2024-09-10 11:28:29 (112 MB/s) - ‘index.html’ saved [38318]

consideRatio commented 2 weeks ago

Setting up nmfs-openscapes with the modern addon setup, where network policy could be enabled, I didn't get it to enforce network policies in practice, but I saw no indication, config-wise, that it wasn't enabled.

So the current status is that network policy enforcement using the Amazon VPC CNI addon has been tried without success so far, also in new clusters with modern addon versions, but has not been debugged at length.
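A few hedged starting points for that debugging (cluster name is a placeholder; the node agent container name is an assumption based on the addon's docs, not something verified in this thread):

# confirm the addon's network policy setting is actually applied
aws eks describe-addon --cluster-name my-cluster --addon-name vpc-cni \
  --query 'addon.configurationValues'
# inspect the network policy node agent's logs for enforcement activity
# (container name below is an assumption)
kubectl -n kube-system logs daemonset/aws-node -c aws-eks-nodeagent --tail=50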