awslabs / kubeflow-manifests

KubeFlow on AWS
https://awslabs.github.io/kubeflow-manifests/
Apache License 2.0

EKS kubeflow cannot run fairing. #770

Open gabin0801 opened 12 months ago

gabin0801 commented 12 months ago

Describe the bug

I tried the "kubeflow fairing" example provided by eksworkshop. Creating the ECR repository and pushing the image succeed, but pulling the image from ECR fails with NOT FOUND, so the fairing job fails. Because the login and push to ECR both succeed and only the pull fails, this does not appear to be an authentication or permission issue.

The EKS cluster and Kubeflow were created by following the guide at the link below. https://awslabs.github.io/kubeflow-manifests/release-v1.7.0-aws-b1.0.2/docs/deployment/rds-s3/guide-terraform/

Steps To Reproduce

The steps can be found at the link below. https://archive.eksworkshop.com/advanced/420_kubeflow/fairing/

jupyter notebook image: 527798164940.dkr.ecr.us-west-2.amazonaws.com/tensorflow-1.15.2-notebook-cpu:1.0.0

Environment

kubernetes version: 1.25

ubuntu@ip-172-31-10-31:~$ kubectl version --short
Flag --short has been deprecated, and will be removed in the future. The --short output will become the default.
Client Version: v1.25.0
Kustomize Version: v4.5.7
Server Version: v1.25.11-eks-a5565ad

Using EKS: YES, 1.25 Kubeflow version: v1.7.0-aws-b1.0.2

Additional Information

$ kubectl -n kubeflow-user-example-com describe po fairing-job-2t4pl-ts4w6
Name:             fairing-job-2t4pl-ts4w6
Namespace:        kubeflow-user-example-com
Priority:         0
Service Account:  default
Node:             ip-10-0-141-160.us-east-2.compute.internal/10.0.141.160
Start Time:       Thu, 13 Jul 2023 16:34:24 +0000
Labels:           controller-uid=603ac49b-7ac5-4def-9b80-bbe95eaa4a38
                  fairing-deployer=job
                  fairing-id=1f806850-219b-11ee-b562-7a013582cc76
                  job-name=fairing-job-2t4pl
Annotations:      sidecar.istio.io/inject: false
Status:           Pending
IP:               10.0.129.216
IPs:
  IP:           10.0.129.216
Controlled By:  Job/fairing-job-2t4pl
Containers:
  fairing-job:
    Container ID:  
    Image:         205368011395.dkr.ecr.us-east-2.amazonaws.com/fairing-job:658A7AA3
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      python
      /app/function_shim.py
      --serialized_fn_file
      /app/pickled_fn.p
      --python_version
      3.6
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Environment:
      FAIRING_RUNTIME:  1
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ktb7f (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kube-api-access-ktb7f:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  19m                   default-scheduler  Successfully assigned kubeflow-user-example-com/fairing-job-2t4pl-ts4w6 to ip-10-0-141-160.us-east-2.compute.internal
  Normal   Pulling    18m (x4 over 19m)     kubelet            Pulling image "205368011395.dkr.ecr.us-east-2.amazonaws.com/fairing-job:658A7AA3"
  Warning  Failed     18m (x4 over 19m)     kubelet            Failed to pull image "205368011395.dkr.ecr.us-east-2.amazonaws.com/fairing-job:658A7AA3": rpc error: code = NotFound desc = failed to pull and unpack image "205368011395.dkr.ecr.us-east-2.amazonaws.com/fairing-job:658A7AA3": failed to copy: httpReadSeeker: failed open: content at https://205368011395.dkr.ecr.us-east-2.amazonaws.com/v2/fairing-job/manifests/sha256:72fb0cd8c76dbb0419f3a49a474ccec62e653022a7a79a82ba6e5801d5abe282 not found: not found
  Warning  Failed     18m (x4 over 19m)     kubelet            Error: ErrImagePull
  Warning  Failed     17m (x6 over 19m)     kubelet            Error: ImagePullBackOff
  Normal   BackOff    4m23s (x65 over 19m)  kubelet            Back-off pulling image "205368011395.dkr.ecr.us-east-2.amazonaws.com/fairing-job:658A7AA3"
rd-pong commented 11 months ago

This guide has been deprecated and archived, so some of the resources/links might not work. Would you mind checking the new Amazon EKS Workshop, now available at www.eksworkshop.com?

gabin0801 commented 11 months ago

Did you visit that site? There are no kubeflow examples there, and the guide is working. I think you don't understand the focus of the problem.

The point is that when using the fairing library, the push to ECR works but the pull does not, so the k8s "Job" does not work.

I know that EKS 1.25 uses containerd, not docker. So, I'm not exactly sure if the problem is the fairing library or the eks node.

What I am sure of is that it works normally with kubeflow version 1.6 on EKS 1.22, which I tested. With kubeflow version 1.7 on EKS 1.25, it does not work.
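
Whether a node actually runs containerd can be confirmed from the `containerRuntimeVersion` field that Kubernetes reports for each node. The following is a minimal sketch, assuming the JSON comes from `kubectl get nodes -o json`; the helper names `runtimes_by_node` and `uses_containerd` are hypothetical and not part of any library.

```python
import json
from typing import Dict


def runtimes_by_node(nodes_json: str) -> Dict[str, str]:
    """Map node name to its reported container runtime string.

    Expects the text produced by `kubectl get nodes -o json`.
    """
    items = json.loads(nodes_json).get("items", [])
    return {
        item["metadata"]["name"]: item["status"]["nodeInfo"]["containerRuntimeVersion"]
        for item in items
    }


def uses_containerd(runtime: str) -> bool:
    """Runtime strings look like 'containerd://<version>' or 'docker://<version>'."""
    return runtime.startswith("containerd://")
```

The JSON text could be obtained with `subprocess.run(["kubectl", "get", "nodes", "-o", "json"], capture_output=True, text=True).stdout` and passed to `runtimes_by_node`. Comparing the result on a working cluster and a failing cluster would show whether the runtime really differs between them.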

raykrueger commented 11 months ago

It looks like you have permissions to pull and are getting "not found" responses from ECR. Double check that the region you pushed to matches the region you're pulling from. Also check that the name is correct.

Can you paste the log messages where the successful ECR push is logged? I think that's all in the 02_01_fairing_introduction notebook.

gabin0801 commented 11 months ago

@raykrueger I attach the log file you requested. (EKS 1.25, kubeflow v1.7.0-aws-b1.0.2)

Remote training

# Authenticate ECR
# This command retrieves a token that is valid for a specified registry for 12 hours, 
# and then it prints a docker login command with that authorization token. 
# Then we execute this command to log in to ECR

REGION='ap-northeast-3'
!eval $(aws ecr get-login --no-include-email --region=$REGION)
WARNING! Using --password via the CLI is insecure. Use --password-stdin.
WARNING! Your password will be stored unencrypted in /home/jovyan/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
# Create an ECR repository in the same region
# If you receive "RepositoryAlreadyExistsException" error, it means the repository already
# exists. You can move to the next step
!aws ecr create-repository --repository-name fairing-job --region=$REGION
{
    "repository": {
        "repositoryArn": "arn:aws:ecr:ap-northeast-3:468063208806:repository/fairing-job",
        "registryId": "468063208806",
        "repositoryName": "fairing-job",
        "repositoryUri": "468063208806.dkr.ecr.ap-northeast-3.amazonaws.com/fairing-job",
        "createdAt": 1689651359.0,
        "imageTagMutability": "MUTABLE",
        "imageScanningConfiguration": {
            "scanOnPush": false
        }
    }
}
# Setting up AWS Elastic Container Registry (ECR) for storing output containers
# You can use any docker container registry instead of ECR
AWS_ACCOUNT_ID=fairing.cloud.aws.guess_account_id()
AWS_REGION='ap-northeast-3'
DOCKER_REGISTRY = '{}.dkr.ecr.{}.amazonaws.com'.format(AWS_ACCOUNT_ID, AWS_REGION)

fairing.config.set_builder('append', base_image='tensorflow/tensorflow:1.15.0-py3', registry=DOCKER_REGISTRY, push=True)
fairing.config.set_deployer('job')

if __name__ == '__main__':
    remote_train = fairing.config.fn(train)
    remote_train()
[I 230718 03:36:02 config:125] Using preprocessor: <kubeflow.fairing.preprocessors.function.FunctionPreProcessor object at 0x7f40fd46e198>
[I 230718 03:36:02 config:127] Using builder: <kubeflow.fairing.builders.append.append.AppendBuilder object at 0x7f40abb7eb00>
[I 230718 03:36:02 config:129] Using deployer: <kubeflow.fairing.deployers.job.job.Job object at 0x7f40abb7eb70>
[W 230718 03:36:02 append:50] Building image using Append builder...
[I 230718 03:36:02 base:107] Creating docker context: /tmp/fairing_context_n5goe_j3
[W 230718 03:36:02 base:94] /usr/local/lib/python3.6/dist-packages/kubeflow/fairing/__init__.py already exists in Fairing context, skipping...
[I 230718 03:36:02 docker_creds_:234] Loading Docker credentials for repository 'tensorflow/tensorflow:1.15.0-py3'
[W 230718 03:36:04 append:54] Image successfully built in 1.585705624995171s.
[W 230718 03:36:04 append:94] Pushing image 468063208806.dkr.ecr.ap-northeast-3.amazonaws.com/fairing-job:DD0A7D6E...
[I 230718 03:36:04 docker_creds_:234] Loading Docker credentials for repository '468063208806.dkr.ecr.ap-northeast-3.amazonaws.com/fairing-job:DD0A7D6E'
[W 230718 03:36:04 append:81] Uploading 468063208806.dkr.ecr.ap-northeast-3.amazonaws.com/fairing-job:DD0A7D6E
[I 230718 03:36:04 docker_session_:284] Layer sha256:295d41931b472fa7d61e363497149a301fea37ab5ec9f0ea9916c86791c70b9c pushed.
[I 230718 03:36:04 docker_session_:284] Layer sha256:6fdd1bedaf2e49c66538fcc4e18b1f91d6fd4ba6e09886d242f4a217299f9e7a pushed.
[I 230718 03:36:05 docker_session_:284] Layer sha256:c58094023a2e61ef9388e283026c5d6a4b6ff6d10d4f626e866d38f061e79bb9 pushed.
[I 230718 03:36:05 docker_session_:284] Layer sha256:ac66bd508effe7f728663c81ae23e8a4f34ba7f707cea469e5c242ca544fe464 pushed.
[I 230718 03:36:05 docker_session_:284] Layer sha256:079b6d2a1e53c648abc48222c63809de745146c2ee8322a1b9e93703318290d6 pushed.
[I 230718 03:36:05 docker_session_:284] Layer sha256:11048ebae90883c19c9b20f003d5dd2f5bbf5b48556dabf06c8ea5c871c8debe pushed.
[I 230718 03:36:06 docker_session_:284] Layer sha256:094a8f5dd2cbe7e1bb8e970b4cf475516e3ecdbbdf673aeb454c6db226971e10 pushed.
[I 230718 03:36:06 docker_session_:284] Layer sha256:f5de9bda32bda66c3c4e1bef463925c0f649f1f8d9b20fdce4fcc4b761c50fab pushed.
[I 230718 03:36:06 docker_session_:284] Layer sha256:138c908b7d99825147bf3df37d6bca03cf4e0a48aded6b1dc14708adaa110f35 pushed.
[I 230718 03:36:07 docker_session_:284] Layer sha256:22e816666fd6516bccd19765947232debc14a5baf2418b2202fd67b3807b6b91 pushed.
[I 230718 03:36:08 docker_session_:284] Layer sha256:fb153ade6d147fb3ecf01f9cc24b489e684885946290c54503fa9667e0b587ac pushed.
[I 230718 03:36:12 docker_session_:284] Layer sha256:0db1490606495fd4e18934a9a6a645048f0e39869c1cb1c9e3a70141cb981878 pushed.
[I 230718 03:36:46 docker_session_:284] Layer sha256:354ee6535f236e958ab05585cfb532b2c20a9f18e2b45148896b0fcf77b819b5 pushed.
[I 230718 03:36:46 docker_session_:334] Finished upload of: 468063208806.dkr.ecr.ap-northeast-3.amazonaws.com/fairing-job:DD0A7D6E
[W 230718 03:36:46 append:99] Pushed image 468063208806.dkr.ecr.ap-northeast-3.amazonaws.com/fairing-job:DD0A7D6E in 42.38909610599512s.
[W 230718 03:36:46 job:90] The job fairing-job-spqxz launched.
[W 230718 03:36:48 manager:255] Waiting for fairing-job-spqxz-swmk8 to start...
[W 230718 03:36:48 manager:255] Waiting for fairing-job-spqxz-swmk8 to start...
[W 230718 03:36:48 manager:255] Waiting for fairing-job-spqxz-swmk8 to start...
[W 230718 03:36:50 manager:255] Waiting for fairing-job-spqxz-swmk8 to start...
[W 230718 03:37:01 manager:255] Waiting for fairing-job-spqxz-swmk8 to start...
[W 230718 03:37:12 manager:255] Waiting for fairing-job-spqxz-swmk8 to start...
[W 230718 03:37:24 manager:255] Waiting for fairing-job-spqxz-swmk8 to start...
[W 230718 03:37:36 manager:255] Waiting for fairing-job-spqxz-swmk8 to start...
[W 230718 03:37:50 manager:255] Waiting for fairing-job-spqxz-swmk8 to start...
[W 230718 03:38:31 manager:255] Waiting for fairing-job-spqxz-swmk8 to start...
[W 230718 03:38:45 manager:255] Waiting for fairing-job-spqxz-swmk8 to start...

The pod description is the same as the additional information above.

raykrueger commented 11 months ago

It looks like you've successfully pushed the image to 468063208806.dkr.ecr.ap-northeast-3.amazonaws.com/fairing-job but are trying to pull from 205368011395.dkr.ecr.us-east-2.amazonaws.com/fairing-job:658A7AA3

Update your notebook to pull from 468063208806.dkr.ecr.ap-northeast-3.amazonaws.com/fairing-job and you should make some progress.
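
A comparison like the one above can be done mechanically. The following is a minimal sketch, assuming the usual ECR image URI layout `<account>.dkr.ecr.<region>.amazonaws.com/<repository>[:<tag>]`; `parse_ecr_image` and `diff_images` are hypothetical helper names, not part of fairing or the AWS SDK.

```python
import re
from typing import Dict, NamedTuple, Tuple

# Assumed layout: <account>.dkr.ecr.<region>.amazonaws.com/<repository>[:<tag>]
_ECR_IMAGE = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com"
    r"/(?P<repository>[^:@]+)(?::(?P<tag>[^@]+))?$"
)


class EcrImage(NamedTuple):
    account: str
    region: str
    repository: str
    tag: str


def parse_ecr_image(uri: str) -> EcrImage:
    """Split an ECR image URI into account, region, repository and tag."""
    match = _ECR_IMAGE.match(uri)
    if match is None:
        raise ValueError(f"not an ECR image URI: {uri!r}")
    # A missing tag defaults to 'latest', following the Docker convention.
    return EcrImage(
        match["account"], match["region"], match["repository"], match["tag"] or "latest"
    )


def diff_images(pushed: str, pulled: str) -> Dict[str, Tuple[str, str]]:
    """Return only the fields that differ, as (pushed, pulled) pairs."""
    a, b = parse_ecr_image(pushed), parse_ecr_image(pulled)
    return {
        field: (getattr(a, field), getattr(b, field))
        for field in EcrImage._fields
        if getattr(a, field) != getattr(b, field)
    }
```

Calling `diff_images` with the pushed URI from the notebook log and the image from the pod description reports which of the account, region, repository, and tag differ. An empty result means the pod is pulling exactly what was pushed, which rules out a naming mismatch.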

gabin0801 commented 11 months ago

@raykrueger

  1. There's no image with the latest tag in ECR.
  2. When I inspected the pod, the image tag was right.
  3. When I edit the pod's image tag, the issue is the same.

I'm not sure if this is what you want.

surajkota commented 11 months ago

https://github.com/kubeflow/fairing/releases has not been updated in 3 years. Is it a good use of time to try using it?

FYI: We don't test fairing as part of the release.

gabin0801 commented 11 months ago

I think there is a problem with the current eks node. When I set the eks version to 1.23 and tried it with kubeflow 1.7, fairing worked normally. However, it did not work on eks versions 1.24 and 1.25. I don't know exactly how it works inside the eks node, but eks 1.24 and 1.25 seem to use containerd instead of docker.

As mentioned above, pushing the image to ECR using fairing works normally, but the pull does not. So, after connecting directly to the node with ssh, I tried to pull the image stored in ECR directly using the containerd command. However, this also results in a NOT FOUND message.

[ec2-user@ip-99-0-2-107 ~]$ ECR_PW=$(aws ecr get-login-password --region $ECR_REGION)
[ec2-user@ip-99-0-2-107 ~]$ sudo ctr image pull --user "AWS:$ECR_PW" 468063208806.dkr.ecr.ap-southeast-1.amazonaws.com/fairing-job:D1EC59D5
468063208806.dkr.ecr.ap-southeast-1.amazonaws.com/fairing-job:D1EC59D5: resolved +++++++++++++++++++++++++++++++++++++++++|
manifest-sha256:3dc7b691e6601990e79d056bbd9391bdefa7858cef59d7fecf88dfe5e777e5bd: downloading |--------------------------------------| 0.0 B/2.1 KiB
elapsed: 0.1 s total: 0.0 B (0.0 B/s)
ctr: failed to copy: httpReadSeeker: failed open: content at https://468063208806.dkr.ecr.ap-southeast-1.amazonaws.com/v2/fairing-job/manifests/sha256:65f54d89303c79227d1c772930127d287ba6cfc11921c78b1093c90079b6ee53 not found: not found
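
To separate "the tag is missing from ECR" from "the tag exists but its manifest cannot be fetched", the repository can also be queried through the ECR API rather than through the container runtime. The following is a minimal sketch that only builds the argument list for the AWS CLI's `aws ecr describe-images` command; `describe_images_command` is a hypothetical helper, and the host layout `<account>.dkr.ecr.<region>.amazonaws.com` is an assumption.

```python
from typing import List


def describe_images_command(image: str) -> List[str]:
    """Build an `aws ecr describe-images` call for one image reference."""
    registry, _, remainder = image.partition("/")
    repository, _, tag = remainder.partition(":")
    # Assumed host layout: <account>.dkr.ecr.<region>.amazonaws.com
    host_parts = registry.split(".")
    account, region = host_parts[0], host_parts[3]
    return [
        "aws", "ecr", "describe-images",
        "--registry-id", account,
        "--region", region,
        "--repository-name", repository,
        "--image-ids", f"imageTag={tag or 'latest'}",
    ]
```

The returned list can be passed to `subprocess.run(...)` from the notebook or printed and run by hand. If the command returns the image details, the tag is present in the registry and the failure is more likely in how the manifest is resolved at pull time.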

So I don't know the exact cause. I think it might be one of three possibilities.

  1. Compatibility related to containerd in fairing.
  2. A bug in the eks node provided by aws.
  3. Problems with ECR.

In fact, the problem has not been resolved at all, so for now I am using eks 1.23 and kubeflow 1.7. However, as Kubernetes versions advance we will have to upgrade as well, so I think it's a pretty serious problem.

kansinna commented 11 months ago

Looking at a similar case:

https://github.com/kubeflow/fairing/issues/380

Using microk8s > 1.13 will hit this error since it uses microk8s.ctr and dockerd is replaced with containerd.

'Append builder' calls a Layer class method that originally comes from containerregistry; however, fairing has an older version. See the append_.py code difference:

https://github.com/google/containerregistry/blob/8a11dc8c53003ecf5b72ffaf035ba280109356ac/client/v2_2/append_.py#L68

I tried changing 'mediaType' to 'docker_http.LAYER_MIME' in the fairing code, but it still does not work. The image manifest or digest seems to be incompatible. We need to check with containerregistry whether containerd-style images are supported and can be built with the Layer class method.
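
One way to look for this kind of incompatibility is to inspect the pushed manifest and check that its `mediaType` fields are consistent with each other. The following is a minimal sketch, assuming the manifest JSON has already been fetched and parsed into a dict. It only knows the common gzip layer types of the Docker v2 schema 2 and OCI v1 formats, so other valid layer types would be flagged too. `media_type_problems` is a hypothetical helper, and nothing in this thread confirms that media types are the root cause.

```python
from typing import Dict, List

# Common media types for the two widely used manifest formats.
DOCKER_V2: Dict[str, str] = {
    "manifest": "application/vnd.docker.distribution.manifest.v2+json",
    "config": "application/vnd.docker.container.image.v1+json",
    "layer": "application/vnd.docker.image.rootfs.diff.tar.gzip",
}
OCI_V1: Dict[str, str] = {
    "manifest": "application/vnd.oci.image.manifest.v1+json",
    "config": "application/vnd.oci.image.config.v1+json",
    "layer": "application/vnd.oci.image.layer.v1.tar+gzip",
}


def media_type_problems(manifest: dict) -> List[str]:
    """List mediaType fields that do not match the manifest's own format."""
    top = manifest.get("mediaType")
    if top == DOCKER_V2["manifest"]:
        expected = DOCKER_V2
    elif top == OCI_V1["manifest"]:
        expected = OCI_V1
    else:
        return [f"unrecognised manifest mediaType: {top!r}"]

    problems = []
    config_type = manifest.get("config", {}).get("mediaType")
    if config_type != expected["config"]:
        problems.append(f"config: mediaType {config_type!r}")
    for index, layer in enumerate(manifest.get("layers", [])):
        layer_type = layer.get("mediaType")
        if layer_type != expected["layer"]:
            problems.append(f"layer {index}: mediaType {layer_type!r}")
    return problems
```

An empty list means the manifest is internally consistent under these assumptions. A non-empty list points at the specific entries worth comparing against an image built with a regular `docker build`.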

kansinna commented 11 months ago

Unfortunately, there is a limitation on AWS Kubeflow, as follows: if the issue is in open source and our AMI consumes the upstream packages, then we have to wait for the fix to be merged upstream before we can ship it with our AMIs.