awslabs / data-on-eks

DoEKS is a tool to build, deploy and scale Data & ML Platforms on Amazon EKS
https://awslabs.github.io/data-on-eks/
Apache License 2.0
643 stars, 219 forks

spark-k8s-operator requires the awscli, which doesn't work on terraform enterprise #513

Closed dacort closed 4 months ago

dacort commented 6 months ago

Description

When running the spark-k8s-operator example on Terraform Enterprise, the apply fails with the following error.

Error: Kubernetes cluster unreachable: Get "https://<ID>.sk1.us-east-1.eks.amazonaws.com/version": getting credentials: exec: executable aws not found
It looks like you are trying to use a client-go credential plugin that is not installed. To learn more about this feature, consult the documentation available at: https://kubernetes.io/docs/reference/access-authn-authz/authentication/#client-go-credential-plugins

After hunting around for a while, I found this issue and realized it was likely a Terraform Cloud/Enterprise issue.

That said, the emr-eks-karpenter example doesn't have the same issue as it uses the aws_eks_cluster_auth data source instead of trying to use the AWS CLI. It'd be great to update the spark-k8s-operator with that.

Additional context

Changing to the same approach that the emr-eks-karpenter example uses succeeds.
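
For readers comparing the two approaches, a minimal sketch of token-based authentication with the aws_eks_cluster_auth data source might look like the following. The module.eks output names are assumptions (they follow the terraform-aws-modules/eks convention) and are not copied from either blueprint.

```hcl
# Assumption: an EKS module named "eks" exposes cluster_name,
# cluster_endpoint and cluster_certificate_authority_data outputs.
data "aws_eks_cluster_auth" "this" {
  name = module.eks.cluster_name
}

provider "kubernetes" {
  host                   = module.eks.cluster_endpoint
  cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)

  # The token is fetched by the AWS provider itself, so no aws binary
  # is needed on the machine running Terraform.
  token = data.aws_eks_cluster_auth.this.token
}
```

The trade-off is that the token is short-lived and is not refreshed during a run, which is the concern the exec plugin approach is meant to address.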

vara-bonthu commented 6 months ago

Thanks for raising the issue, @dacort!

You are correct that the emr-eks-karpenter blueprint uses aws_eks_cluster_auth: See here.

And, the Spark Operator blueprint was recently updated to use exec plugin authentication, which is designed to refresh the keys more effectively than the previous approach: See here.

For the exec plugin, you need to install the AWS CLI locally as a prerequisite, because the plugin runs a local command to fetch the token. This approach might not work in TFCloud if the AWS CLI is not installed on the TFCloud agent/server. I am happy for you to raise a PR for this, or one of us will raise a PR based on your issue.
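
As a rough illustration, exec plugin authentication in the Kubernetes provider is typically configured along these lines. The module.eks output names are assumptions (terraform-aws-modules/eks convention) and are not copied from the blueprint.

```hcl
provider "kubernetes" {
  # Assumption: an EKS module named "eks" exposes these outputs.
  host                   = module.eks.cluster_endpoint
  cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)

  # Terraform shells out to this command whenever it needs a token,
  # so the aws binary must be on the PATH of the machine running it.
  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    command     = "aws"
    args        = ["eks", "get-token", "--cluster-name", module.eks.cluster_name]
  }
}
```

Because the command is resolved on the machine running Terraform, a missing CLI produces the "executable aws not found" error reported above.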

There is ongoing debate in the community about both approaches, and both seem to frequently encounter the issue mentioned here: Authentication Issues with EKS

dacort commented 6 months ago

Ahh, interesting, thanks for the context @vara-bonthu! Unfortunately, as you noted, I'm not sure how much control I have over installing the CLI in TFCloud. Will look into that.

otterley commented 6 months ago

You can, in fact, install additional tools on the worker instance, using a null_resource resource and a local-exec provisioner. See https://developer.hashicorp.com/terraform/enterprise/run/install-software#installing-additional-tools for details.

Example:

# Downloads the AWS CLI v2 installer for Linux x86_64 to /tmp on the
# worker and runs it with sudo.
resource "null_resource" "install-aws-cli" {
  provisioner "local-exec" {
    command = "cd /tmp && curl -sSL https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip -o awscliv2.zip && unzip -q awscliv2.zip && sudo ./aws/install"
  }
}

Note: I have not tested the above, so this may not work - kindly let us know here if a different command is required.

dacort commented 6 months ago

Thanks @otterley!

github-actions[bot] commented 5 months ago

This issue has been automatically marked as stale because it has been open 30 days with no activity. Remove stale label or comment or this issue will be closed in 10 days

github-actions[bot] commented 4 months ago

Issue closed due to inactivity.