GSA / data.gov

Main repository for the data.gov service
https://data.gov

Limit ingress traffic to EKS-hosted workloads to cloud.gov egress IP ranges by default #3355

Closed mogul closed 2 years ago

mogul commented 3 years ago

User Story

In order to meet the intent of SC-7 (and sub-controls), EKS clusters provisioned by the SSB should restrict inbound traffic by default, and allow by exception/explicit configuration.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

Inbound:

Background

[Any helpful contextual notes or links to artifacts/evidence, if needed] We can restrict inbound traffic to just those ranges configured at provisioning (or none, if none were provided) to meet the intent of this control for the SSB's managed boundary.

Security Considerations (required)

This change enables users of the EKS broker to lock down their instances so they are accessible only from known hosts, e.g. bastion boxes.

Sketch

For the purpose of this issue, simply limiting access at the external load-balancer is sufficient to meet the ACs and NIST control requirements.

Further issues would enable finer-grained ingress control by:

mogul commented 3 years ago

We want Terraform to ignore changes that are made to the network policy after it's created and before it's destroyed. We can use Terraform's `ignore_changes` option in the `lifecycle` meta-argument to implement that.

mogul commented 3 years ago

Use a Terraform `kubernetes_network_policy` resource to create/destroy the actual policy in the provisioned namespace.
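
A rough sketch of what that could look like, combined with the `lifecycle` idea from the previous comment. The variable names, resource name, and the choice to ignore changes to `spec` are illustrative, not the broker's actual code:

variable "namespace" {
  type    = string
  default = "default"
}

variable "allowed_ingress_cidrs" {
  type    = list(string)
  default = [] # no ranges provided => deny all inbound by default
}

resource "kubernetes_network_policy" "default_ingress" {
  metadata {
    name      = "default-ingress"
    namespace = var.namespace
  }

  spec {
    pod_selector {} # empty selector = applies to every pod in the namespace
    policy_types = ["Ingress"]

    # One ingress rule per allowed CIDR; with no CIDRs, no ingress is allowed.
    dynamic "ingress" {
      for_each = var.allowed_ingress_cidrs
      content {
        from {
          ip_block {
            cidr = ingress.value
          }
        }
      }
    }
  }

  lifecycle {
    # Let operators adjust the policy out-of-band after provisioning
    # without Terraform reverting those changes on the next apply.
    ignore_changes = [spec]
  }
}

With an empty default for the allowed CIDRs, the policy denies all inbound traffic unless ranges were supplied at provisioning, which matches the ACs.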

nickumia-reisys commented 3 years ago

After a lot of research, we determined that a network policy can't be enforced without a network plugin that activates the policy by creating the corresponding iptables rules. While we were able to get a local standalone version of network policy enforcement running, this cannot yet be translated to our deployment: Fargate does not support daemon sets, which the network plugin needs in order to run.

A future solution is to add a managed node (outside of Fargate) on which to run the network plugin; it would then communicate with the Fargate-hosted workloads and facilitate network policy there as well. However, this introduces various compliance-related considerations.

For now, we will implement security groups for the VPC, which will restrict traffic within the cluster in a coarser way.

nickumia-reisys commented 3 years ago

There are two main ways to implement rules in a VPC: (1) Security Groups, (2) Network ACL Rules.

For the EKS deployment, a few Security Groups are created (the default one, ALB ones, cluster ones, node ones, ...). I edited the default Security Group to deny all egress; however, since the other security groups still allowed egress traffic, this was not enough. I did not want to reconfigure the other security groups because managing all of them would add a lot of overhead.
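
For reference, denying everything on the default Security Group can be expressed roughly like this (a sketch; the `module.vpc.vpc_id` reference is an assumption about how the VPC is wired up):

# Adopting the VPC's default security group with no ingress/egress blocks
# removes all of its rules, so anything attached to it is denied by default.
resource "aws_default_security_group" "default" {
  vpc_id = module.vpc.vpc_id # assumed reference to the VPC module's output
}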

The solution was to implement Network ACL rules in the default Network ACL allowing specified ingress/egress IP ranges. After the allowed ranges, another rule is added to deny all other traffic.

Note: there are default ACL rules that exist beforehand; these cannot be modified by Terraform, since Terraform does not know about them.
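
A sketch of that rule layout using standalone `aws_network_acl_rule` resources. The CIDR variable (a list of CIDR strings assumed to be declared elsewhere), the ACL reference, and the rule numbers are illustrative:

# Rules added to the default Network ACL: allow each approved range, then
# deny everything else. Rule numbers must sort before the pre-existing
# default allow-all rule for the deny to take effect.
resource "aws_network_acl_rule" "allow_ingress" {
  count          = length(var.allowed_ingress_cidrs)
  network_acl_id = module.vpc.default_network_acl_id # assumed output name
  rule_number    = 10 + count.index
  egress         = false
  protocol       = "-1"
  rule_action    = "allow"
  cidr_block     = var.allowed_ingress_cidrs[count.index]
}

resource "aws_network_acl_rule" "deny_all_ingress" {
  network_acl_id = module.vpc.default_network_acl_id
  rule_number    = 99
  egress         = false
  protocol       = "-1"
  rule_action    = "deny"
  cidr_block     = "0.0.0.0/0"
}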

nickumia-reisys commented 3 years ago

Shifting gears ... (again) ... at least a little bit.

The VPC module that was being used seemed to be an older (less official?) one. The new one appears to be more actively maintained and possibly better supported.

I upgraded to the new source and am testing whether it can implement security groups in a more meaningful way. Specifically, I'd like to implement some of the concepts from this article, since it seems the worker nodes can be restricted to internal traffic only, communicating with the control plane over a private network and thus having no public access. This would help with denying egress traffic from worker nodes by default.

There's quite a bit of documentation related to setting up EKS on private clusters.
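
If we go that route, the relevant knobs would presumably look something like this on the community EKS module (a sketch based on the terraform-aws-modules/eks documentation; the cluster name variable and VPC module references are assumptions, not our actual configuration):

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 17.0"

  cluster_name    = var.cluster_name
  cluster_version = "1.21"
  vpc_id          = module.vpc.vpc_id
  subnets         = module.vpc.private_subnets # workers stay on private subnets

  # Reach the control plane over the VPC-internal endpoint only,
  # so the API server is not exposed publicly.
  cluster_endpoint_private_access = true
  cluster_endpoint_public_access  = false
}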

Will update with further findings.

nickumia-reisys commented 3 years ago

Most recent findings.

mogul commented 2 years ago

Note on where we are: we talked to our compliance folks and we're going to shift toward using managed node groups in a limited capacity to support our use of the AWS-provided CNI plugin. That's a paved path compared to what we've been doing, but it does mean there will be compliance impact. If, after reviewing our SSP, that effort looks like it's going to be large, we'll split it out into a separate issue.

nickumia-reisys commented 2 years ago

I'm marking this issue as blocked for right now because I need a pair on it. The current situation:

With the addition of the managed node, the EKS CNI should be able to allow network policies. According to the documentation, the CNI does not enable network policies by default; a controller (such as Calico) is still needed to put the policies into effect. There are two examples with Calico, and neither seems to work. Kyverno was also explored, and it failed.

nickumia-reisys commented 2 years ago

Here's a possibly good example to follow. It's a Terraform module that deploys Calico from a Helm chart. The module itself only supports Terraform 0.12; however, it only creates a single Helm resource, which may be easy to replicate.

Upon initial testing, the Helm install was unsuccessful through Terraform, but it succeeded when run manually (the equivalent command is below the Terraform block). Even then, the nodes still wouldn't become available out of the box.

resource "helm_release" "calico" {
  name = "calico"
  chart = "aws-calico"
  version = "0.2.0"
  repository = "https://lablabs.github.io/eks-charts/"
  namespace  = "kube-system"

  set {
    name = "calico_version"
    value = "v3.8.1"
  }
  set {
    name = "calico_image"
    value = "quay.io/calico/node"
  }
  set {
    name = "typha_image"
    value = "quay.io/calico/typha"
  }
  set {
    name = "service_account_create"
    value = true
  }
}
helm upgrade --install calico aws-calico \
  --repo https://lablabs.github.io/eks-charts/ \
  --namespace kube-system \
  --version 0.2.0 \
  --set calico_version=v3.8.1 \
  --set calico_image=quay.io/calico/node \
  --set typha_image=quay.io/calico/typha \
  --set service_account_create=true

Optionally, use the --debug flag to diagnose issues.

mogul commented 2 years ago

Noting for future reference: It's also possible to limit client IPs at the NLB.

mogul commented 2 years ago

To make things simpler in the near-term, I'm limiting the scope of this issue back down to just restricting full-cluster ingress to CIDRs at the load-balancer, as described in the post above and tested in the ACs. I'll break the NetworkPolicy and host-based access control options into separate issues documenting those as potential finer-grained features.
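
For the record, a sketch of what that load-balancer-level restriction could look like using the kubernetes provider's `load_balancer_source_ranges` spec argument. The service name, selector, port, and variable names are illustrative, and the CIDR and namespace variables are assumed to be declared elsewhere:

resource "kubernetes_service" "ingress_lb" {
  metadata {
    name      = "ingress-lb"
    namespace = var.namespace
  }

  spec {
    type = "LoadBalancer"
    selector = {
      app = "ingress-nginx" # assumed label on the ingress controller pods
    }

    # Only these CIDRs may reach the external load balancer; everything
    # else is dropped before it ever reaches the cluster.
    load_balancer_source_ranges = var.allowed_ingress_cidrs

    port {
      name        = "https"
      port        = 443
      target_port = 443
    }
  }
}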

nickumia-reisys commented 2 years ago

Calico is already installed, so using a k8s network policy to implement this.

mogul commented 2 years ago

> Calico is already installed, so using a k8s network policy to implement this.

That totally makes sense, particularly if we just add this to the default policy we install. Then people can create additional policies or edit that one to open up to additional ranges as warranted.