Closed: tejarora closed this issue 1 year ago
This is the doc I am following: https://awslabs.github.io/kubeflow-manifests/docs/add-ons/storage/efs/guide/ using the v1.6.1 branch of kubeflow-manifests.
I noticed that a security group is used when creating the EFS file system manually.
In any case, I tried the automated setup as well, and even with that I'm unable to create a notebook server. It stays in Pending for a very long time, and the PVCs are also Pending (they never reach the Bound state).
Are the EFS instructions verified to work with v1.6.1?
I must tell you it's been a very frustrating experience trying to get Kubeflow to run on AWS EKS.
I tried the Terraform deployment (S3/RDS/Cognito). Broken. Wiped everything clean and tried Terraform (S3/RDS). Broken. Then I tried the manual deployment. All good up to the EFS setup. I can see the Dex UI, log in, and see the central dashboard. But when I create volumes, the PVCs are stuck in Pending, and when I create a notebook server, the pod stays in Pending... forever.
Hello @tejarora, I have verified that the automated setup works and that the notebook server comes online on v1.6.1. Is the notebook server still unavailable? Can you run kubectl describe on the PVC and on the notebook server pod? This will help us get more information so we can resolve these issues.
Additionally, can you check the logs of the efs-csi-driver, as well as the output of kubectl describe -n kube-system serviceaccount efs-csi-controller-sa, and share the policy that is attached to the role in your service account?
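For reference, these are the commands I would use to collect that information (the label and container names below assume the standard aws-efs-csi-driver install in kube-system; adjust if your deployment differs):

# Logs from the CSI controller's driver container
kubectl logs -n kube-system -l app=efs-csi-controller -c efs-plugin --tail=100

# Service account and its IAM role annotation (IRSA)
kubectl describe -n kube-system serviceaccount efs-csi-controller-sa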
Additionally, I have verified that the other Terraform deployment options you mention as broken are working. Do you have any insight into what issues you faced with the Terraform deployments?
@tejarora please let us know if you have any update on this issue
Hello Ryan & Suraj, I will update here in a couple of days. Got pulled into other priorities. Thanks.
Here are some observations and feedback. I may have another round of feedback, as my Kubeflow deployment on AWS is still a work in progress.
First, I tried the Terraform deployment with the Cognito, RDS, and S3 option. The problem I ran into was that my Route 53 zone was in an account different from the one in which Kubeflow was being deployed. This is my first time with Terraform, and I didn't want to venture into modifying the Terraform manifests to make cross-account zone access work, so I abandoned this approach.
Second, I tried the Terraform deployment with just RDS and S3. This repeatedly failed with EKS cluster access issues (i/o timeout). I'm aware there are occasional glitches when I administer my EKS cluster using kubectl, and I have experienced quality issues with the instances serving the EKS API server endpoint: I find one of them sick more often than I'd like. How do I know? I run nslookup on the API server endpoint and telnet to the returned IPs (usually 2), and one of them times out. The sick instance(s) do get fixed after some time. Coming back to the Terraform deployment, I tried it 12 times (over 2 days), and it failed with an i/o timeout every single time. I don't think this has to do with my network connectivity, as I'm able to administer my other clusters without problems. There is some issue on the AWS end; the API server endpoint instances aren't the most reliable. The Terraform deployment may have knobs to adjust timeouts and retries; these will certainly need to be tuned to make the deployment more resilient.
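For anyone who wants to reproduce that endpoint check, this is roughly what I do (just a sketch; $CLUSTER_NAME is your cluster's name):

# Look up the API server endpoint for the cluster
API_ENDPOINT=$(aws eks describe-cluster --name $CLUSTER_NAME --query 'cluster.endpoint' --output text)

# Resolve it; EKS usually returns two IPs
nslookup ${API_ENDPOINT#https://}

# Probe each returned IP on 443; a timeout points at an unhealthy endpoint instance
nc -vz -w 5 <returned-ip> 443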
One observation common to both Terraform deployments is that the EKS cluster they create has no cluster autoscaler. Basically, there was no node autoscaling up or down: nodes with all pods drained were not removed, and pods waiting in Pending did not trigger new nodes. This is a major deficiency and unacceptable for a production deployment.
Another observation was that nodes were over-provisioned by default: 5 m5.xlarge instances were the minimum!! This is very aggressive and should be easily configurable, but it was not. After my manual deployment efforts, I found that just TWO instances were more than sufficient to run ALL of the Kubeflow components (about 80 or so pods). I also deployed Karpenter for autoscaling, so I didn't have to worry about capacity; additional nodes get provisioned automatically when necessary.
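For reference, eksctl itself does expose these knobs if you create the cluster yourself rather than through the Terraform module; a rough sketch (the names, region, and counts below are placeholders, not values from the official docs):

# Create a small cluster and let an autoscaler add capacity as needed
eksctl create cluster \
  --name my-kubeflow \
  --region us-west-2 \
  --node-type m5.xlarge \
  --nodes 2 --nodes-min 1 --nodes-max 5

# Or shrink an existing managed nodegroup
eksctl scale nodegroup --cluster my-kubeflow --name <nodegroup-name> --nodes 2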
After no success with the Terraform deployments, I went down the path of a manual deployment with the Kustomize option. I set up my own EKS cluster (with Karpenter for autoscaling, an NGINX controller for ingress, and a publicly accessible domain for Kubeflow). I got to a point where the Kubeflow cluster was stable and I could see the Dex login UI with the default user@example.com profile on the custom domain.
Then I ventured into deploying the EFS add-on. That has been very problematic. I tried the manual steps. Didn't work. Then I tried the automated script. Didn't work. Basically, the EFS volumes show up in the UI, but they are not in the Bound state, and when I spin up a notebook server, it does not get provisioned; it times out.
I found this when I ran kubectl describe on the EFS PVC:
Warning ProvisioningFailed 4m10s (x14 over 24m) efs.csi.aws.com_efs-csi-controller-5cb8767cfc-gts52_83ea0575-16d9-4d23-9c3a-603decb87004 failed to provision volume with StorageClass "efs-sc": rpc error: code = Internal desc = Failed to fetch File System info: Describe File System failed: NoCredentialProviders: no valid providers in chain. Deprecated.
Notebook controller logs:
1.6744542220443478e+09 ERROR controllers.Notebook Could not find container with the same name as Notebook in containerStates of Pod. Will not update notebook's status.containerState {"notebook": "kubeflow-user-example-com/tejtestnb"}
Output of kubectl describe on the notebook pod:
Warning FailedScheduling 107s default-scheduler running PreBind plugin "VolumeBinding": binding volumes: timed out waiting for the condition
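In case it helps anyone hitting the same NoCredentialProviders error: it suggests the CSI controller pods have no IAM credentials, so the first thing to check is the IRSA annotation on the controller service account (a sketch, assuming the stock efs-csi-controller deployment name in kube-system):

# The IAM role must be annotated on the controller service account
kubectl get sa efs-csi-controller-sa -n kube-system \
  -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'

# If the annotation was added after the pods started, restart them so they pick up the credentials
kubectl rollout restart deployment efs-csi-controller -n kube-system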
I am going to clean up, and once again try deploying the EFS add-on. Will update here on the outcome.
Meanwhile, I think I've unearthed plenty of issues with the prescribed kubeflow deployment on AWS. Hope they are taken up for rectification.
I cleaned up and ran the automated efs script utils/auto-efs-setup.py.
To answer Ryan's questions
$kubectl describe -n kube-system serviceaccount efs-csi-controller-sa
Name: efs-csi-controller-sa
Namespace: kube-system
Labels: app.kubernetes.io/managed-by=eksctl
app.kubernetes.io/name=aws-efs-csi-driver
Annotations: eks.amazonaws.com/role-arn: arn:aws:iam::xxxxxxxxx:role/xxxxxxxxx
Image pull secrets:
The policy attached to the role in the annotation is:
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "elasticfilesystem:DescribeAccessPoints", "elasticfilesystem:DescribeFileSystems", "elasticfilesystem:DescribeMountTargets", "ec2:DescribeAvailabilityZones" ], "Resource": "*" }, { "Effect": "Allow", "Action": [ "elasticfilesystem:CreateAccessPoint" ], "Resource": "*", "Condition": { "StringLike": { "aws:RequestTag/efs.csi.aws.com/cluster": "true" } } }, { "Effect": "Allow", "Action": "elasticfilesystem:DeleteAccessPoint", "Resource": "*", "Condition": { "StringEquals": { "aws:ResourceTag/efs.csi.aws.com/cluster": "true" } } } ] }
After running the automated script, I noticed that the efs-csi-node-sa service account has no role annotated... Looks like something is missing... Let me know.
kubectl describe sa efs-csi-node-sa -n kube-system
Name: efs-csi-node-sa
Namespace: kube-system
Labels: app.kubernetes.io/name=aws-efs-csi-driver
Annotations:
The good news is that through the Dex UI I was able to create EFS volumes and a notebook instance. The notebook instance came up successfully and the volumes are in the Bound state.
I noticed one major difference between the automated install (which uses 1.4.0 of the CSI driver) and the manual install (which recommends 1.3.4): the policy attached to the efs-csi-controller-sa role in the manual setup has fewer permissions. It is missing "elasticfilesystem:DescribeMountTargets" and "ec2:DescribeAvailabilityZones"... That may have been the issue for me... Not sure.
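If those missing permissions are indeed the cause, the fix on the manual path is presumably just adding the two actions to the policy document and publishing it as the new default version; a sketch (the policy ARN and file name below are placeholders):

# Edit the policy JSON to include the two missing actions, then push it as the default version
aws iam create-policy-version \
  --policy-arn arn:aws:iam::<account-id>:policy/<efs-csi-policy-name> \
  --policy-document file://updated-efs-csi-policy.json \
  --set-as-default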
Some warnings when I ran the automated EFS script
$ python utils/auto-efs-setup.py --region $CLUSTER_REGION --cluster $CLUSTER_NAME --efs_file_system_name $CLAIM_NAME --efs_security_group_name $SECURITY_GROUP_TO_CREATE
/usr/lib/python3/dist-packages/jmespath/visitor.py:32: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if x is 0 or x is 1:
/usr/lib/python3/dist-packages/jmespath/visitor.py:34: SyntaxWarning: "is" with a literal. Did you mean "=="?
  elif y is 0 or y is 1:
/usr/lib/python3/dist-packages/jmespath/visitor.py:260: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if original_result is 0:
A minor issue in the script: it prints "CLUSTER PUBLIC SUBNETS" before creating the mount targets, but in my case it actually printed private subnets. The right thing to say would be "SUBNETS WHERE NODES ARE CREATED" (these could be public or private).
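To double-check which subnets the mount targets actually landed in, this is roughly what I ran (the file system ID is a placeholder):

# List the mount targets with their subnet, AZ, and state
aws efs describe-mount-targets \
  --file-system-id fs-0123456789abcdef0 \
  --query 'MountTargets[].{Subnet:SubnetId,AZ:AvailabilityZoneName,State:LifeCycleState}' \
  --output table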
Thank you @tejarora for taking the time to provide all this valuable feedback. I really appreciate the detailed explanation of what you feel can be improved in the AWS Kubeflow documentation. Our example eksctl create cluster commands contain several nodes and are intended for a user to quickly install Kubeflow and begin testing it out. We can perhaps call out autoscaling functionality as it pertains to prod-level workloads.
The issue you ran into with the rds-s3 deployment option seems to be related to the EKS cluster creation not being up to par, correct? Can you detail which EKS cluster version you are trying to create? That is, you did not get to steps such as installing the manifests, but were failing around cluster creation time? If you still have some of these failed EKS clusters, I can reach out to some folks for insight into why you are running into this issue.
We also noted the difference between the manual and automated deployment, and we will most likely be updating the manual deployment option's permissions. Sorry that this may have caused issues.
So as of now, everything related to EFS is working and you can use the volume within your notebook?
Noted on the minor issue in the script; we will reproduce it to verify and fix the print message.
"The issue you ran into with the rds-s3 deployment option seems to be related to the EKS cluster creation not being up to par, correct?" The cluster gets created fine. The instances serving the cluster's API server endpoint are sub-par. Customers have no control over these instances, and they are sick fairly often.
"So as of now, everything related to EFS is working and you can use the volume within your notebook?" Yes, the volumes are created and bound, and the notebook instance is up and running. The data scientists/ML engineers will start using Kubeflow now, and hopefully all will be smooth.
Also, please clarify this (possible issue?)
kubectl describe sa efs-csi-node-sa -n kube-system
Name:                efs-csi-node-sa
Namespace:           kube-system
Labels:              app.kubernetes.io/name=aws-efs-csi-driver
Annotations:
Image pull secrets:
Mountable secrets:   efs-csi-node-sa-token-n742p
Tokens:              efs-csi-node-sa-token-n742p
Events:
This service account has no permissions (i.e., no role shows up in its annotations). This doesn't look right.
Taking a look into the service account; will investigate.
@tejarora it looks like this is correct; we only expect the efs-csi-controller-sa service account to be annotated, which you can verify by running kubectl describe -n kube-system serviceaccount efs-csi-controller-sa
Correctly annotated, it will look like:
Name: efs-csi-controller-sa
Namespace: kube-system
Labels: app.kubernetes.io/managed-by=eksctl
app.kubernetes.io/name=aws-efs-csi-driver
Annotations: eks.amazonaws.com/role-arn: arn:aws:iam::123456789:role/eksctl-s3-only-irsa-addon-iamserviceaccount-Role1-JV6MCKC4FAUL
Image pull secrets: <none>
Mountable secrets: efs-csi-controller-sa-token-427f8
Tokens: efs-csi-controller-sa-token-427f8
Events: <none>
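For completeness, that annotation is normally created through IRSA; this is roughly how it is done with eksctl (the cluster name and policy ARN below are placeholders, not the exact values from our guide):

eksctl create iamserviceaccount \
  --cluster <cluster-name> \
  --namespace kube-system \
  --name efs-csi-controller-sa \
  --attach-policy-arn arn:aws:iam::<account-id>:policy/<efs-csi-driver-policy> \
  --override-existing-serviceaccounts \
  --approve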
Hi, I'm deploying Kubeflow on AWS using Terraform, following the steps here: https://awslabs.github.io/kubeflow-manifests/docs/deployment/prerequisites/
This is the code repo: https://github.com/awslabs/kubeflow-manifests/tree/main/deployments/rds-s3/terraform
But it's giving me the error below:
Error: configuring Terraform AWS Provider: no valid credential sources for Terraform AWS Provider found.
Please see https://registry.terraform.io/providers/hashicorp/aws for more information about providing credentials.
Error: failed to refresh cached credentials, no EC2 IMDS role found, operation error ec2imds:
GetMetadata, http response error StatusCode: 404, request to EC2 IMDS failed
  with provider["registry.terraform.io/hashicorp/aws"],
  on main.tf line 56, in provider "aws":
  56: provider "aws" {
Makefile:20: recipe for target 'create-vpc' failed
Can someone help with it?
@techwithshadab please run aws sts get-caller-identity to check whether your credentials are set up properly. Please open a new issue for further support, since this is unrelated to the current thread.
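A quick sketch of that check and the usual ways to provide credentials to the Terraform AWS provider (the profile name below is a placeholder):

# Should print your account ID and caller ARN; an error here means the provider has no credentials either
aws sts get-caller-identity

# Typical fixes: configure a profile or export credentials before running terraform/make
aws configure --profile kubeflow
export AWS_PROFILE=kubeflow
# or
export AWS_ACCESS_KEY_ID=... AWS_SECRET_ACCESS_KEY=... AWS_DEFAULT_REGION=us-west-2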
No further comments from the user on this issue.
In "2.0 Set up EFS": 2.1 is the automated setup, in which you talk about a security group.
2.2 is the manual setup, and there is no security group involved at all... That's odd.
I used 2.2 so I could see all the steps, and I completed them all successfully. But EFS volume creation does not work. I described the PVC and I see this:
"Warning ProvisioningFailed 4m10s (x14 over 24m) efs.csi.aws.com_efs-csi-controller-5cb8767cfc-gts52_83ea0575-16d9-4d23-9c3a-603decb87004 failed to provision volume with StorageClass "efs-sc": rpc error: code = Internal desc = Failed to fetch File System info: Describe File System failed: NoCredentialProviders: no valid providers in chain. Deprecated."
It looks like 2.2 is broken...
It's been a struggle. Can you please look into this and update soon? Thanks.