Open elyall opened 1 year ago
Thanks for the great examples! I altered the jupyterhub on eks example (for a private cluster accessed via a Tailscale VPN) and I'm now adding a ray cluster and trying to grant S3 access to the jobs running on karpenter nodes.
Amazing! Tailscale is a great addition.
While we did show S3 access via the karpenter module's `iam_role_additional_policies`, IIRC it was done that way because the RayCluster helm chart at the time did not support specifying a `serviceAccountName` as a value. I just looked and it seems they have now added support for it, i.e. you can now specify `head.serviceAccountName` and `worker.serviceAccountName` as helm values, which is great. This enables us to use IAM Roles for Service Accounts (IRSA), which is demonstrated in the JupyterHub example. This would be the preferred approach over the karpenter node role, as it restricts S3 bucket access to the RayCluster pods only.
I will update the ray blueprint soon but you can go about it on your own as it is shown for the JupyterHub example here ... https://github.com/awslabs/data-on-eks/blob/44cb0769afc752e57bfb2d11192ebcec1ce97389/ai-ml/jupyterhub/jupyterhub.tf#L4-L48
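Adapted for Ray, that JupyterHub IRSA pattern would look roughly like the following Terraform sketch (the role name, namespace, and the read-only policy ARN are illustrative assumptions, not copied from the blueprint):

```hcl
# Sketch: IRSA role + annotated service account for the Ray pods.
module "ray_single_user_irsa" {
  source = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"

  role_name = "${module.eks.cluster_name}-ray-single-user-sa"

  # Swap in your own S3 policy here; the managed ARN is a placeholder.
  role_policy_arns = {
    s3_read_only = "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"
  }

  oidc_providers = {
    main = {
      provider_arn               = module.eks.oidc_provider_arn
      namespace_service_accounts = ["ray:ray-single-user"]
    }
  }
}

resource "kubernetes_service_account_v1" "ray_single_user_sa" {
  metadata {
    name      = "ray-single-user"
    namespace = "ray"
    annotations = {
      "eks.amazonaws.com/role-arn" = module.ray_single_user_irsa.iam_role_arn
    }
  }
}
```

The `namespace_service_accounts` entry must exactly match the namespace and service account name the pods run with, since it becomes the `:sub` condition in the role's trust policy.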
Then use this `serviceAccountName` as a value in the RayCluster helm chart (if you are using helm) or directly in RayCluster.yaml.
Optionally, you can use our `aws-ia/eks-blueprints-addons/aws` module, which we have provided as a convenience if you want to avoid some of the boilerplate code and create the helm_release and IRSA in a single shot (this is what I will use).
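A very rough, untested sketch of that single-shot approach follows; the chart name, repository, version, and role wiring are my assumptions, so verify them against the `aws-ia/eks-blueprints-addon/aws` module's documented inputs:

```hcl
# Sketch only: chart/repository/version and all names are assumptions.
module "ray_cluster_addon" {
  source  = "aws-ia/eks-blueprints-addon/aws"
  version = "~> 1.0"

  name       = "ray-cluster"
  chart      = "ray-cluster"
  repository = "https://ray-project.github.io/kuberay-helm/"
  namespace  = "ray"

  # Point the chart at the service account carrying the IRSA annotation.
  set = [
    { name = "head.serviceAccountName", value = "ray-single-user" },
    { name = "worker.serviceAccountName", value = "ray-single-user" },
  ]

  # IRSA role created alongside the helm release.
  create_role = true
  role_name   = "ray-single-user-sa"
  role_policies = {
    s3_read_only = "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"
  }
  oidc_providers = {
    this = {
      provider_arn    = module.eks.oidc_provider_arn
      service_account = "ray-single-user"
    }
  }
}
```

As far as I can tell, the module creates the IAM role but not the Kubernetes service account itself, so the service account still needs the `eks.amazonaws.com/role-arn` annotation (e.g. via a `kubernetes_service_account_v1` resource or the chart's own values).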
HTH, and please let us know if you run into any issues.
This sounds fantastic! When I tried implementing it I get the following error:
```
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the AssumeRoleWithWebIdentity operation: Not authorized to perform sts:AssumeRoleWithWebIdentity
```
I created an additional policy and attached it to the `ray_single_user_irsa`, but it results in the same error. Am I attaching it to the wrong role? I've also passed `kubernetes_service_account_v1.ray_single_user_sa.metadata[0].name` to both `head.serviceAccountName` and `worker.serviceAccountName`.
It looks like the role is mounted correctly:
```console
❯ kubectl -n ray exec -it ray-cluster-cpu-kuberay-head-5ktw2 -- env | grep AWS
Defaulted container "ray-head" out of: ray-head, autoscaler
AWS_STS_REGIONAL_ENDPOINTS=regional
AWS_REGION=us-west-2
AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token
AWS_DEFAULT_REGION=us-west-2
AWS_ROLE_ARN=arn:aws:iam::XXXXXXXXXX:role/eks-stage-ray-single-user-sa
```
Also here's the trust relationship for the role via the aws console:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::XXXXXXXXXX:oidc-provider/oidc.eks.us-west-2.amazonaws.com/id/XXXXXXXXXXXXXXXXXXXXXXXXX"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.us-west-2.amazonaws.com/id/XXXXXXXXXXXXXXXXXXXXXXXXX:sub": "system:serviceaccount:ray:ray-single-user",
          "oidc.eks.us-west-2.amazonaws.com/id/XXXXXXXXXXXXXXXXXXXXXXXXX:aud": "sts.amazonaws.com"
        }
      }
    }
  ]
}
```
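For what it's worth, a common cause of `AccessDenied` on `AssumeRoleWithWebIdentity` is the `:sub` condition in the trust policy not exactly matching the pod's `system:serviceaccount:<namespace>:<name>`. If you manage the trust policy yourself, the Terraform equivalent of a trust policy like the one above would be roughly the following sketch (the `module.eks` outputs are assumed to come from `terraform-aws-modules/eks`):

```hcl
# Sketch: OIDC trust policy with exact-match sub/aud conditions.
data "aws_iam_policy_document" "irsa_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [module.eks.oidc_provider_arn]
    }

    # Must match the namespace and service account name exactly.
    condition {
      test     = "StringEquals"
      variable = "${module.eks.oidc_provider}:sub"
      values   = ["system:serviceaccount:ray:ray-single-user"]
    }

    condition {
      test     = "StringEquals"
      variable = "${module.eks.oidc_provider}:aud"
      values   = ["sts.amazonaws.com"]
    }
  }
}
```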
It's possible this issue is with how I'm using ray, as currently my `ray.remote` function calls the `ray_dask_get` scheduler, meaning the remote job tries to create more remote jobs on the ray cluster. Though this would be a strange error if that were indeed the issue. I can adjust my script so that the parent job is performed locally instead of on the cluster and see if that works. (Edit: the issue seems to occur regardless, i.e. without the recurrent remote calls.)
I just validated that I get the same error when trying to read from S3 on my jupyterhub deployment, despite following the guide and attaching `arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess` to `jupyterhub_single_user_irsa`. Is there a policy I first need to attach to my AWS account to allow `sts:AssumeRoleWithWebIdentity` to work? I'll look through the blueprints/documentation again to see if I missed something.
I realize I've potentially gotten off topic from my original question. @askulkarni2 answered the question in theory. You're welcome to close the issue or leave it open for task-planning the `ray` blueprint update. I will create a new issue for the bug I'm seeing and try to create a minimal code reproduction.
Hi, @vara-bonthu @askulkarni2 I was encountering the same issue and I resolved it by modifying the trust relationship for "jupyterhub-on-eks-jupyterhub-single-user-sa" IAM role. Should I make a pull request with the fix?
Please describe your question here
Thanks for the great examples! I altered the jupyterhub on eks example (for a private cluster accessed via a Tailscale VPN) and I'm now adding a ray cluster and trying to grant S3 access to the jobs running on karpenter nodes. I was trying to use the same karpenter provisioners but how do I grant the jobs S3 access?
The ray example uses the `terraform-aws-modules/eks/aws//modules/karpenter` module and attaches the relevant policies via the `iam_role_additional_policies` argument, which is pretty straightforward. The jupyterhub example instead uses `aws-ia/eks-blueprints-addons/aws`, which ultimately uses `aws-ia/eks-blueprints-addon/aws`. The two things I've tried that haven't worked are:

- the `role_policies` input
- `aws_iam_role_policy_attachment` resources with `role = module.eks_blueprints_addons.karpenter.iam_role_name`

Also, is there a preference for which module to use?
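For reference, the second approach described above would look roughly like this in Terraform (the managed policy ARN is a placeholder; substitute the ARN of the custom policy being attached):

```hcl
# Sketch: attach an S3 policy to the Karpenter node role exported by
# the eks-blueprints-addons module.
resource "aws_iam_role_policy_attachment" "karpenter_node_s3" {
  role       = module.eks_blueprints_addons.karpenter.iam_role_name
  policy_arn = "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"
}
```

Note that this grants S3 access to every pod scheduled on the Karpenter nodes, which is why the IRSA approach discussed earlier in the thread is preferred.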
Provide a link to the example/module related to the question
jupyterhub ray
Additional context
I may just follow the ray example and generate karpenter resources outside of `aws-ia/eks-blueprints-addons/aws`.

Also, here's the policy I'm attaching:
And the error I'm getting is: