GSA / data.gov

Main repository for the data.gov service
https://data.gov

Limit ingress traffic to EKS-hosted workloads to cloud.gov egress IP ranges by default #3355

Closed mogul closed 2 years ago

mogul commented 3 years ago

User Story

In order to meet the intent of SC-7 (and sub-controls), EKS clusters provisioned by the SSB should restrict inbound traffic by default, and allow by exception/explicit configuration.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

Inbound:

Background

[Any helpful contextual notes or links to artifacts/evidence, if needed] We can restrict inbound traffic to just those ranges configured at provisioning (or none, if none were provided) to meet the intent of this control for the SSB's managed boundary.

Security Considerations (required)

This change enables users of the EKS broker to lock down their instances so they are accessible only from known hosts, e.g. bastion boxes.

Sketch

For the purpose of this issue, simply limiting access at the external load-balancer is sufficient to meet the ACs and NIST control requirements.

Further issues would enable finer-grained ingress control by:

mogul commented 3 years ago

We want Terraform to ignore changes that are made to the network policy after it's created and before it's destroyed. We can use Terraform's `ignore_changes` option in the `lifecycle` meta-argument to implement that.

mogul commented 3 years ago

Use a Terraform `kubernetes_network_policy` resource to create/destroy the actual policy in the provisioned namespace.
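
A rough sketch of what that could look like, combined with the `lifecycle` idea from the previous comment. The variable names, resource name, and the choice to ignore changes to `spec` are illustrative, not the broker's actual code:

variable "namespace" {
  type    = string
  default = "default"
}

variable "allowed_ingress_cidrs" {
  type    = list(string)
  default = [] # no ranges provided => deny all inbound by default
}

resource "kubernetes_network_policy" "default_ingress" {
  metadata {
    name      = "default-ingress"
    namespace = var.namespace
  }

  spec {
    pod_selector {} # empty selector = applies to every pod in the namespace
    policy_types = ["Ingress"]

    # One ingress rule per allowed CIDR; with no CIDRs, no ingress is allowed.
    dynamic "ingress" {
      for_each = var.allowed_ingress_cidrs
      content {
        from {
          ip_block {
            cidr = ingress.value
          }
        }
      }
    }
  }

  lifecycle {
    # Let operators adjust the policy out-of-band after provisioning
    # without Terraform reverting those changes on the next apply.
    ignore_changes = [spec]
  }
}

With an empty default for the allowed CIDRs, the policy denies all inbound traffic unless ranges were supplied at provisioning, which matches the ACs.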

nickumia-reisys commented 3 years ago

After a lot of research, we determined that a network policy can't be enforced without a network plugin that activates the policy by creating the corresponding iptables rules. While we were able to get a local standalone version of network policy enforcement running, this cannot yet be translated to our deployment: Fargate does not support daemon sets, which the network plugin needs in order to run.

A future solution is to add a managed node (outside of Fargate) on which to run the network plugin; it would then communicate with the Fargate-hosted workloads and facilitate network policy there as well. However, this introduces various compliance-related considerations.

For now, we will implement security groups for the VPC, which will restrict traffic within the cluster in a coarser way.

nickumia-reisys commented 3 years ago

There are two main ways to implement rules in a VPC: (1) Security Groups, (2) Network ACL Rules.

For the EKS deployment, a few Security Groups are created (the default one, ALB ones, cluster ones, node ones, ...). I edited the default Security Group to deny all egress; however, since the other security groups still allowed egress traffic, this was not enough. I did not want to reconfigure the other security groups because managing all of them would add a lot of overhead.
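
For reference, denying everything on the default Security Group can be expressed roughly like this (a sketch; the `module.vpc.vpc_id` reference is an assumption about how the VPC is wired up):

# Adopting the VPC's default security group with no ingress/egress blocks
# removes all of its rules, so anything attached to it is denied by default.
resource "aws_default_security_group" "default" {
  vpc_id = module.vpc.vpc_id # assumed reference to the VPC module's output
}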

The solution was to implement Network ACL rules in the default Network ACL allowing specified ingress/egress IP ranges. After the allowed ranges, another rule is added to deny all other traffic.

Note: there are default ACL rules that exist beforehand; these cannot be modified by Terraform, since Terraform does not know about them.
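
A sketch of that rule layout using standalone `aws_network_acl_rule` resources. The CIDR variable (a list of CIDR strings assumed to be declared elsewhere), the ACL reference, and the rule numbers are illustrative:

# Rules added to the default Network ACL: allow each approved range, then
# deny everything else. Rule numbers must sort before the pre-existing
# default allow-all rule for the deny to take effect.
resource "aws_network_acl_rule" "allow_ingress" {
  count          = length(var.allowed_ingress_cidrs)
  network_acl_id = module.vpc.default_network_acl_id # assumed output name
  rule_number    = 10 + count.index
  egress         = false
  protocol       = "-1"
  rule_action    = "allow"
  cidr_block     = var.allowed_ingress_cidrs[count.index]
}

resource "aws_network_acl_rule" "deny_all_ingress" {
  network_acl_id = module.vpc.default_network_acl_id
  rule_number    = 99
  egress         = false
  protocol       = "-1"
  rule_action    = "deny"
  cidr_block     = "0.0.0.0/0"
}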

nickumia-reisys commented 3 years ago

Shifting gears ... (again) ... at least a little bit.

The VPC module that was being used seemed to be an older (less official?) one. The new one appears to be more actively maintained and possibly better supported.

I upgraded to the new source and am testing whether it can implement security groups in a more meaningful way. Specifically, I'd like to implement some of the concepts from this article, since it seems the worker nodes can be restricted to internal traffic only, communicating with the control plane over a private network and thus having no public access. This would help with denying egress traffic from worker nodes by default.

There's quite a bit of documentation related to setting up EKS on private clusters.
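
If we go that route, the relevant knobs would presumably look something like this on the community EKS module (a sketch based on the terraform-aws-modules/eks documentation; the cluster name variable and VPC module references are assumptions, not our actual configuration):

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 17.0"

  cluster_name    = var.cluster_name
  cluster_version = "1.21"
  vpc_id          = module.vpc.vpc_id
  subnets         = module.vpc.private_subnets # workers stay on private subnets

  # Reach the control plane over the VPC-internal endpoint only,
  # so the API server is not exposed publicly.
  cluster_endpoint_private_access = true
  cluster_endpoint_public_access  = false
}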

Will update with further findings.

nickumia-reisys commented 3 years ago

Most recent findings.

mogul commented 2 years ago

Note on where we are: we talked to our compliance folks and we're going to shift toward using managed node groups in a limited capacity to support our use of the AWS-provided CNI plugin. That's a paved path compared to what we've been doing, but it does mean there will be compliance impact. If, after reviewing our SSP, that effort looks like it's going to be large, we'll split it out into a separate issue.

nickumia-reisys commented 2 years ago

I'm marking this issue as blocked for right now because I need a pair on it. The current situation:

With the addition of the managed node, the EKS CNI should be able to allow network policies. According to the documentation, the CNI does not enable network policies by default; a controller (such as Calico) is still needed to put the policies into effect. There are two examples with Calico, and neither seems to work. Kyverno was also explored, and it failed.

nickumia-reisys commented 2 years ago

Here's a possibly good example to follow. It's a Terraform module that deploys Calico from a Helm chart. The module itself only supports Terraform 0.12; however, it only creates a single Helm resource, which may be easy to replicate.

Upon initial testing, the Helm install was unsuccessful through Terraform, but it succeeded when run manually (the equivalent command is below the Terraform block). Even then, the nodes still wouldn't become available out of the box.

resource "helm_release" "calico" {
  name = "calico"
  chart = "aws-calico"
  version = "0.2.0"
  repository = "https://lablabs.github.io/eks-charts/"
  namespace  = "kube-system"

  set {
    name = "calico_version"
    value = "v3.8.1"
  }
  set {
    name = "calico_image"
    value = "quay.io/calico/node"
  }
  set {
    name = "typha_image"
    value = "quay.io/calico/typha"
  }
  set {
    name = "service_account_create"
    value = true
  }
}
helm upgrade --install calico aws-calico \
  --repo https://lablabs.github.io/eks-charts/ \
  --namespace kube-system \
  --version 0.2.0 \
  --set calico_version=v3.8.1 \
  --set calico_image=quay.io/calico/node \
  --set typha_image=quay.io/calico/typha \
  --set service_account_create=true

Optionally, use the --debug flag to diagnose issues.

mogul commented 2 years ago

Noting for future reference: It's also possible to limit client IPs at the NLB.

mogul commented 2 years ago

To make things simpler in the near-term, I'm limiting the scope of this issue back down to just restricting full-cluster ingress to CIDRs at the load-balancer, as described in the post above and tested in the ACs. I'll break the NetworkPolicy and host-based access control options into separate issues documenting those as potential finer-grained features.
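
For the record, a sketch of what that load-balancer-level restriction could look like using the kubernetes provider's `load_balancer_source_ranges` spec argument. The service name, selector, port, and variable names are illustrative, and the CIDR and namespace variables are assumed to be declared elsewhere:

resource "kubernetes_service" "ingress_lb" {
  metadata {
    name      = "ingress-lb"
    namespace = var.namespace
  }

  spec {
    type = "LoadBalancer"
    selector = {
      app = "ingress-nginx" # assumed label on the ingress controller pods
    }

    # Only these CIDRs may reach the external load balancer; everything
    # else is dropped before it ever reaches the cluster.
    load_balancer_source_ranges = var.allowed_ingress_cidrs

    port {
      name        = "https"
      port        = 443
      target_port = 443
    }
  }
}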

nickumia-reisys commented 2 years ago

Calico is already installed, so using a k8s network policy to implement this.

mogul commented 2 years ago

> Calico is already installed, so using a k8s network policy to implement this.

That totally makes sense, particularly if we just add this to the default policy we install. Then people can create additional policies or edit that one to open up to additional ranges as warranted.