awslabs / data-on-eks

DoEKS is a tool to build, deploy and scale Data & ML Platforms on Amazon EKS
https://awslabs.github.io/data-on-eks/
Apache License 2.0

Granting S3 access to karpenter nodes #338

Open elyall opened 1 year ago

elyall commented 1 year ago

Please describe your question here

Thanks for the great examples! I altered the JupyterHub on EKS example (for a private cluster accessed via a Tailscale VPN), and I'm now adding a Ray cluster and trying to grant S3 access to the jobs running on Karpenter nodes. I was trying to reuse the same Karpenter provisioners, but how do I grant the jobs S3 access?

Provide a link to the example/module related to the question

jupyterhub ray

Additional context

I may just follow the ray example and generate karpenter resources outside of aws-ia/eks-blueprints-addons/aws.

Also here's the policy I'm attaching:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Download",
            "Effect": "Allow",
            "Action": [
                "s3:List*",
                "s3:Get*"
            ],
            "Resource": [
                "${bucket_arn}",
                "${bucket_arn}/*"
            ]
        },
        {
            "Sid": "Decrypt",
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt"
            ],
            "Resource": [
                "${kms_key_arn}"
            ]
        }
    ]
}
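
For reference, the `${bucket_arn}` and `${kms_key_arn}` placeholders are filled in before the policy is attached. Here is a minimal Python sketch of what the rendered document looks like, assuming a made-up bucket and key ARN (both hypothetical) and using `string.Template` only as a stand-in for Terraform's templating:

```python
import json
from string import Template

# Same policy as above, with the ${...} placeholders left in place.
POLICY_TEMPLATE = Template("""
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Download",
      "Effect": "Allow",
      "Action": ["s3:List*", "s3:Get*"],
      "Resource": ["${bucket_arn}", "${bucket_arn}/*"]
    },
    {
      "Sid": "Decrypt",
      "Effect": "Allow",
      "Action": ["kms:Decrypt"],
      "Resource": ["${kms_key_arn}"]
    }
  ]
}
""")


def render_policy(bucket_arn: str, kms_key_arn: str) -> dict:
    """Substitute the placeholders and parse the result to confirm it is valid JSON."""
    rendered = POLICY_TEMPLATE.substitute(bucket_arn=bucket_arn, kms_key_arn=kms_key_arn)
    return json.loads(rendered)


# Hypothetical ARNs, for illustration only.
policy = render_policy(
    "arn:aws:s3:::example-bucket",
    "arn:aws:kms:us-west-2:111122223333:key/example-key-id",
)
print(json.dumps(policy["Statement"][0]["Resource"]))
```

Listing is a bucket-level action and is evaluated against the bucket ARN, while object reads are evaluated against the `/*` ARN, which is why both resources are listed.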

And the error I'm getting is:

botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the ListObjects operation: Access Denied

askulkarni2 commented 1 year ago

> Thanks for the great examples! I altered the JupyterHub on EKS example (for a private cluster accessed via a Tailscale VPN), and I'm now adding a Ray cluster and trying to grant S3 access to the jobs running on Karpenter nodes.

Amazing! Tailscale is a great addition.

While we did show S3 access via the karpenter module's iam_role_additional_policies, IIRC that was because the RayCluster helm chart at the time did not support specifying a serviceAccountName as a value. I just looked and it seems they have now added support for it, i.e. you can now specify head.serviceAccountName and worker.serviceAccountName as helm values, which is great. This enables us to use IAM Roles for Service Accounts (IRSA), which is demonstrated in the JupyterHub example. This would be the preferred way over the karpenter node role, as it would restrict access to the S3 buckets to only the RayCluster pods.

I will update the ray blueprint soon but you can go about it on your own as it is shown for the JupyterHub example here ... https://github.com/awslabs/data-on-eks/blob/44cb0769afc752e57bfb2d11192ebcec1ce97389/ai-ml/jupyterhub/jupyterhub.tf#L4-L48

Then use this serviceAccountName as value in the RayCluster helm chart (if you are using helm) or directly in RayCluster.yaml.
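
As a rough illustration of the Helm side, here is a small Python sketch that builds the `--set` flags for those two values. The service account name `ray-single-user` is a placeholder and the helper name is made up for this example; the rest of the `helm` command line is left out on purpose:

```python
def helm_set_flags(service_account_name: str) -> list[str]:
    """Return --set flags that point both head and worker pods at one service account."""
    values = {
        "head.serviceAccountName": service_account_name,
        "worker.serviceAccountName": service_account_name,
    }
    flags: list[str] = []
    for key, value in values.items():
        flags += ["--set", f"{key}={value}"]
    return flags


# "ray-single-user" is a placeholder; use the name of the annotated service account.
print(" ".join(helm_set_flags("ray-single-user")))
```

Whatever name is passed here has to be the same service account that the IRSA role trusts.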

Optionally, you can use our aws-ia/eks-blueprints-addons/aws module, which we provide as a convenience if you want to avoid some of the boilerplate code and create the helm_release and IRSA in a single shot (this is what I will use).

HTH, and please let us know if you run into any issues.

elyall commented 1 year ago

This sounds fantastic! When I tried implementing it, I got the following error:

botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the AssumeRoleWithWebIdentity operation: Not authorized to perform sts:AssumeRoleWithWebIdentity

I created an additional policy and attached it to the ray_single_user_irsa but it results in the same error. Am I attaching it to the wrong role?

Here's my code:

```
resource "kubernetes_namespace" "ray" {
  metadata {
    name = "ray"
  }
}

module "ray_single_user_irsa" {
  source = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"

  role_name = "${data.terraform_remote_state.eks.outputs.cluster_name}-ray-single-user-sa"

  role_policy_arns = {
    bucket1_get_policy = bucket1_get_policy_arn
    bucket2_get_policy = bucket2_get_policy_arn
    sts_policy         = module.ray_policy.arn
  }

  oidc_providers = {
    main = {
      provider_arn               = data.terraform_remote_state.eks.outputs.oidc_provider_arn
      namespace_service_accounts = ["${kubernetes_namespace.ray.metadata[0].name}:ray-single-user"]
    }
  }
}

resource "kubernetes_service_account_v1" "ray_single_user_sa" {
  metadata {
    name        = "${data.terraform_remote_state.eks.outputs.cluster_name}-ray-single-user"
    namespace   = kubernetes_namespace.ray.metadata[0].name
    annotations = { "eks.amazonaws.com/role-arn" : module.ray_single_user_irsa.iam_role_arn }
  }

  automount_service_account_token = true
}

resource "kubernetes_secret_v1" "ray_single_user" {
  metadata {
    name      = "${data.terraform_remote_state.eks.outputs.cluster_name}-ray-single-user-secret"
    namespace = kubernetes_namespace.ray.metadata[0].name
    annotations = {
      "kubernetes.io/service-account.name"      = kubernetes_service_account_v1.ray_single_user_sa.metadata[0].name
      "kubernetes.io/service-account.namespace" = kubernetes_namespace.ray.metadata[0].name
    }
  }

  type = "kubernetes.io/service-account-token"
}

module "ray_policy" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-policy"
  version = "~> 5.20"

  name        = "RayPolicy"
  description = "IAM Policy to allow ray to function"

  policy = jsonencode(
    {
      Version = "2012-10-17"
      Statement = [
        {
          Sid      = "AssumeRoleWithWebIdentity"
          Effect   = "Allow"
          Action   = ["sts:AssumeRoleWithWebIdentity"]
          Resource = ["*"]
        },
      ]
    }
  )
}
```

The full error:

```shell
ray::convert_dataset() (pid=458, ip=100.64.160.5)
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/working_dir_files/_ray_pkg_20d44c85bf820c86/rb_analysis/rb/images/ngff.py", line 690, in convert_dataset
    if output_path.exists():
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/cloudpathlib/cloudpath.py", line 389, in exists
    return self.client._exists(self)
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/cloudpathlib/s3/s3client.py", line 179, in _exists
    return self._s3_file_query(cloud_path) is not None
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/cloudpathlib/s3/s3client.py", line 197, in _s3_file_query
    return next(
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/cloudpathlib/s3/s3client.py", line 198, in <genexpr>
    (
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/boto3/resources/collection.py", line 81, in __iter__
    for page in self.pages():
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/boto3/resources/collection.py", line 171, in pages
    for page in pages:
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/paginate.py", line 269, in __iter__
    response = self._make_request(current_kwargs)
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/paginate.py", line 357, in _make_request
    return self._method(**current_kwargs)
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/client.py", line 534, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/client.py", line 959, in _make_api_call
    http, parsed_response = self._make_request(
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/client.py", line 982, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/endpoint.py", line 119, in make_request
    return self._send_request(request_dict, operation_model)
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/endpoint.py", line 198, in _send_request
    request = self.create_request(request_dict, operation_model)
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/endpoint.py", line 134, in create_request
    self._event_emitter.emit(
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/hooks.py", line 412, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/hooks.py", line 256, in emit
    return self._emit(event_name, kwargs)
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/hooks.py", line 239, in _emit
    response = handler(**kwargs)
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/signers.py", line 105, in handler
    return self.sign(operation_name, request)
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/signers.py", line 180, in sign
    auth = self.get_auth_instance(**kwargs)
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/signers.py", line 284, in get_auth_instance
    frozen_credentials = self._credentials.get_frozen_credentials()
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/credentials.py", line 610, in get_frozen_credentials
    self._refresh()
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/credentials.py", line 498, in _refresh
    self._protected_refresh(is_mandatory=is_mandatory_refresh)
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/credentials.py", line 514, in _protected_refresh
    metadata = self._refresh_using()
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/credentials.py", line 661, in fetch_credentials
    return self._get_cached_credentials()
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/credentials.py", line 671, in _get_cached_credentials
    response = self._get_credentials()
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/credentials.py", line 905, in _get_credentials
    return client.assume_role_with_web_identity(**kwargs)
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/client.py", line 534, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/tmp/ray/session_2023-10-09_11-33-01_981057_8/runtime_resources/pip/b14b3dc7efa0e68a2b779b6894d9eda4b2d6d92b/virtualenv/lib/python3.10/site-packages/botocore/client.py", line 976, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the AssumeRoleWithWebIdentity operation: Not authorized to perform sts:AssumeRoleWithWebIdentity
```

I've also passed kubernetes_service_account_v1.ray_single_user_sa.metadata[0].name to both head.serviceAccountName and worker.serviceAccountName.

elyall commented 1 year ago

It looks like the role is mounted correctly:

❯ kubectl -n ray exec -it ray-cluster-cpu-kuberay-head-5ktw2 -- env | grep AWS
Defaulted container "ray-head" out of: ray-head, autoscaler
AWS_STS_REGIONAL_ENDPOINTS=regional
AWS_REGION=us-west-2
AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token
AWS_DEFAULT_REGION=us-west-2
AWS_ROLE_ARN=arn:aws:iam::XXXXXXXXXX:role/eks-stage-ray-single-user-sa

Here's JupyterHub's for reference:

```shell
❯ kubectl -n jupyterhub exec -it jupyter-evan -- env | grep AWS
Defaulted container "notebook" out of: notebook, block-cloud-metadata (init)
AWS_DEFAULT_REGION=us-west-2
AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token
AWS_STS_REGIONAL_ENDPOINTS=regional
AWS_ROLE_ARN=arn:aws:iam::XXXXXXXXXX:role/eks-stage-jupyterhub-single-user-sa
AWS_REGION=us-west-2
```
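
These variables only show that a role ARN and a token path were injected; they don't show which identity the pod presents to STS. Here is a minimal sketch for inspecting that, assuming the file at `AWS_WEB_IDENTITY_TOKEN_FILE` holds a standard JWT. The helper names are made up for this example and the signature is not verified:

```python
import base64
import json
import os


def decode_jwt_claims(token: str) -> dict:
    """Decode the payload (second) segment of a JWT without verifying the signature."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url-encoded with the padding stripped; restore it.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def claims_from_env() -> dict:
    """Read the projected token from the path in AWS_WEB_IDENTITY_TOKEN_FILE."""
    with open(os.environ["AWS_WEB_IDENTITY_TOKEN_FILE"]) as token_file:
        return decode_jwt_claims(token_file.read().strip())


# Only meaningful inside a pod where the variable is set.
if __name__ == "__main__" and "AWS_WEB_IDENTITY_TOKEN_FILE" in os.environ:
    claims = claims_from_env()
    print("sub:", claims.get("sub"))
    print("aud:", claims.get("aud"))
```

Run inside the pod, the `sub` and `aud` values it prints are the ones the role's trust relationship is compared against.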

Also here's the trust relationship for the role via the aws console:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:aws:iam::XXXXXXXXXX:oidc-provider/oidc.eks.us-west-2.amazonaws.com/id/XXXXXXXXXXXXXXXXXXXXXXXXX"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    "oidc.eks.us-west-2.amazonaws.com/id/XXXXXXXXXXXXXXXXXXXXXXXXX:sub": "system:serviceaccount:ray:ray-single-user",
                    "oidc.eks.us-west-2.amazonaws.com/id/XXXXXXXXXXXXXXXXXXXXXXXXX:aud": "sts.amazonaws.com"
                }
            }
        }
    ]
}
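
Since sts:AssumeRoleWithWebIdentity is authorized by the role's trust relationship rather than by a policy attached to the role, one thing worth checking is whether the `sub` condition above matches the service account the pods actually run as. Below is a minimal sketch of that comparison. The issuer ID `EXAMPLE` is a placeholder, the helper names are made up, only `StringEquals` conditions are handled, and the second service account name is a guess based on the `${cluster_name}-ray-single-user` pattern in the Terraform above with `eks-stage` as the cluster name:

```python
def expected_subject(namespace: str, service_account: str) -> str:
    """Build the `sub` claim carried by a Kubernetes service account token."""
    return f"system:serviceaccount:{namespace}:{service_account}"


def subject_is_trusted(trust_policy: dict, oidc_issuer: str, subject: str) -> bool:
    """Return True if a statement's StringEquals `sub` condition equals `subject`."""
    key = f"{oidc_issuer}:sub"
    for statement in trust_policy.get("Statement", []):
        allowed = statement.get("Condition", {}).get("StringEquals", {}).get(key)
        if allowed is None:
            continue
        # The condition value may be a single string or a list of strings.
        allowed_values = [allowed] if isinstance(allowed, str) else allowed
        if subject in allowed_values:
            return True
    return False


# Placeholder issuer; the real one is redacted in the thread.
ISSUER = "oidc.eks.us-west-2.amazonaws.com/id/EXAMPLE"
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    f"{ISSUER}:sub": "system:serviceaccount:ray:ray-single-user",
                    f"{ISSUER}:aud": "sts.amazonaws.com",
                }
            },
        }
    ],
}

# Name listed in namespace_service_accounts.
print(subject_is_trusted(TRUST_POLICY, ISSUER, expected_subject("ray", "ray-single-user")))
# Assumed name of the service account that was actually created.
print(subject_is_trusted(TRUST_POLICY, ISSUER, expected_subject("ray", "eks-stage-ray-single-user")))
```

If the second check reflects the real service account name, the token's `sub` would not match the condition, which would be one possible explanation for the AccessDenied response.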

It's possible the issue is with how I'm using Ray: currently my ray.remote function calls the ray_dask_get scheduler, meaning the remote job tries to create more remote jobs on the Ray cluster. That would be a strange error for that cause, though. I can adjust my script so that the parent job runs locally instead of on the cluster and see if that works. Update: the issue seems to occur regardless (i.e. without the nested remote calls).

elyall commented 1 year ago

I just validated that I get the same error when trying to read from S3 on my jupyterhub deployment, despite following the guide and attaching arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess to jupyterhub_single_user_irsa. Is there a policy I first need to attach to my AWS account to allow sts:AssumeRoleWithWebIdentity to work? I'll look through the blueprints/documentation again to see if I missed something.

elyall commented 1 year ago

I realize I've potentially gotten off topic from my original question. @askulkarni2 answered the question in theory. You're welcome to close the issue or leave it open for task planning the ray blueprint update. I will create a new issue with the bug I'm seeing and try to create a minimal code reproduction.

rishabh1815769 commented 1 day ago

Hi @vara-bonthu @askulkarni2, I was encountering the same issue and resolved it by modifying the trust relationship for the "jupyterhub-on-eks-jupyterhub-single-user-sa" IAM role. Should I open a pull request with the fix?