awslabs / autonomous-driving-data-framework

ADDF is a collection of modules deployed using the SeedFarmer orchestration tool. ADDF modules enable users to quickly bootstrap environments for processing and analyzing autonomous driving data.
Apache License 2.0

[BUG] Module deployment failure: jupyter-hub #583

Open serge-dolgavin-dxc opened 1 week ago

serge-dolgavin-dxc commented 1 week ago

Describe the bug

The addf-demo-ide-jupyter-hub deployment fails because it uses a Lambda runtime that is no longer supported.

To Reproduce

Deploy the jupyter-hub module.

Expected behavior

The jupyter-hub module deploys without issues.

Screenshots

N/A

Additional context

...
Failed resources:
addf-demo-ide-jupyter-hub | 10:17:07 AM | CREATE_FAILED | AWS::Lambda::Function | AWSCDKCfnUtilsProviderCustomResourceProvider/Handler handler returned message: "The runtime parameter of nodejs12.x is no longer supported for creating or updating AWS Lambda functions. We recommend you use a supported runtime while creating or updating functions. (Service: Lambda, Status Code: 400,

❌ addf-demo-ide-jupyter-hub failed: Error: The stack named addf-demo-ide-jupyter-hub failed creation, it may need to be manually deleted from the AWS console: ROLLBACK_COMPLETE ...
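For anyone hitting the same error, a check like the one below could flag the offending function before deployment. It is only a sketch: it assumes you have a synthesized CloudFormation template loaded as a Python dict, the `UNSUPPORTED_RUNTIMES` set contains only the runtime named in the error above, and `example_template` is a hand-written, hypothetical fragment rather than the real jupyter-hub stack.

```python
# Assumption: only nodejs12.x is confirmed as rejected by the error message above.
# Extend this set yourself after checking the current Lambda runtime support policy.
UNSUPPORTED_RUNTIMES = {"nodejs12.x"}


def find_unsupported_lambdas(template, unsupported=UNSUPPORTED_RUNTIMES):
    """Return (logical_id, runtime) pairs for Lambda functions using a listed runtime."""
    hits = []
    for logical_id, resource in template.get("Resources", {}).items():
        if resource.get("Type") != "AWS::Lambda::Function":
            continue
        runtime = resource.get("Properties", {}).get("Runtime")
        if runtime in unsupported:
            hits.append((logical_id, runtime))
    return hits


# Hypothetical template fragment, for illustration only.
example_template = {
    "Resources": {
        "ProviderHandler": {
            "Type": "AWS::Lambda::Function",
            "Properties": {"Runtime": "nodejs12.x", "Handler": "index.handler"},
        },
        "OtherHandler": {
            "Type": "AWS::Lambda::Function",
            "Properties": {"Runtime": "python3.11", "Handler": "app.handler"},
        },
    }
}

print(find_unsupported_lambdas(example_template))
```

With a real stack, the dict would typically come from `json.load` on the template that the CDK writes during synthesis; the exact file location depends on how the module is built.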

malachi-constant commented 1 week ago

@serge-dolgavin-dxc

You can try this before the merge. I need to test from the ground up, so it may take a while before it's merged.

File: ide-modules.yaml

name: jupyter-hub
path: git::https://github.com/awslabs/autonomous-driving-data-framework.git//modules/demo-only/jupyter-hub?ref=chore/583&depth=1

serge-dolgavin-dxc commented 1 week ago

@malachi-constant,

Unfortunately, this manifest entry

name: jupyter-hub
path: git::https://github.com/awslabs/autonomous-driving-data-framework.git//modules/demo-only/jupyter-hub?ref=chore/583&depth=1

doesn't work for me:

$ seedfarmer apply ./manifests/demo/deployment.yaml --dry-run
...
[2024-09-05 07:10:17,386 | INFO | _deployment_commands.py:636 | MainThread ]  Verifying all modules in ide for deploy 
Traceback (most recent call last):
...
  cmdline: git pull -v -- origin chore/583
  stderr: 'fatal: couldn't find remote ref chore/583'

During handling of the above exception, another exception occurred:
...
.../autonomous-driving-data-framework/.venv/lib/python3.8/site-packages/seedfarmer/mgmt/git_support.py", line 79, in clone_module_repo
    raise InvalidConfigurationError(f"\n Cannot Clone Repo: {ge} {messages.git_error_support()}")
seedfarmer.errors.seedfarmer_errors.InvalidConfigurationError: 
 Cannot Clone Repo: Cmd('git') failed due to: exit code(1)
  cmdline: git pull -v -- origin chore/583
  stderr: 'fatal: couldn't find remote ref chore/583' 
    1. Make sure your path to the repo is correct and valid (check your module manifests!)
    2. The credentials used to call SeedFarmer have access to the repo
    3. The credentials used to call SeedFarmer have not expired
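Since the failure comes down to the `ref` in the module path not existing on the remote, it can help to pull the path apart and check the ref separately. The sketch below is not SeedFarmer's own parser; it only assumes the `git::<repo>//<subdir>?ref=<ref>&depth=<n>` shape seen in this thread, and both function names are made up for illustration. It builds a `git ls-remote` command but does not run it.

```python
from urllib.parse import parse_qs


def parse_git_module_path(path):
    """Split a 'git::<repo>//<subdir>?ref=<ref>' module path into its parts."""
    prefix = "git::"
    if not path.startswith(prefix):
        raise ValueError("not a git:: module path")
    location, _, query = path[len(prefix):].partition("?")
    scheme, sep, rest = location.partition("://")
    if not sep:
        raise ValueError("expected a repository URL with a scheme")
    # The first '//' after the scheme separates the repo from the module subdirectory.
    repo_part, _, subdir = rest.partition("//")
    params = parse_qs(query)
    return {
        "repo": f"{scheme}://{repo_part}",
        "subdir": subdir,
        "ref": params.get("ref", [None])[0],
    }


def ls_remote_command(parsed):
    """Build (without running) a git command that lists remote refs matching the ref."""
    if parsed["ref"] is None:
        raise ValueError("module path has no ref parameter")
    return ["git", "ls-remote", parsed["repo"], parsed["ref"]]


module_path = (
    "git::https://github.com/awslabs/autonomous-driving-data-framework.git"
    "//modules/demo-only/jupyter-hub?ref=chore/583&depth=1"
)
parsed = parse_git_module_path(module_path)
print(parsed["ref"])
print(" ".join(ls_remote_command(parsed)))
```

Running the printed command by hand should return no lines when the branch or tag is gone, which would match the `couldn't find remote ref` message above.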

serge-dolgavin-dxc commented 1 week ago

@malachi-constant,

With the local module path

name: jupyter-hub
path: modules/demo-only/jupyter-hub/

I got the following error:

... addf-demo-ide-jupyter-hub | 4/11 | 7:17:32 AM | CREATE_IN_PROGRESS | Custom::AWSCDK-EKS-KubernetesResource | addf-demo-ide-jupyter-hub-eks-cluster/manifest-jupyter-hub-namespace/Resource/Default (addfdemoidejupyterhubeksclustermanifestjupyterhubnamespaceXXXXXXXXXXXX) Resource creation Initiated

1321 | addf-demo-ide-jupyter-hub | 4/11 | 7:17:33 AM | CREATE_FAILED | Custom::AWSCDK-EKS-KubernetesResource | addf-demo-ide-jupyter-hub-eks-cluster/manifest-jupyter-hub-namespace/Resource/Default (addfdemoidejupyterhubeksclustermanifestjupyterhubnamespaceXXXXXXXX) Received response status [FAILED] from custom resource. Message returned: Error: b'\nAn error occurred (AccessDenied) when calling the AssumeRole operation: User: arn:aws:sts::XXXXXXXXX:assumed-role/addf-demo-ide-jupyter-hub-HandlerServiceRoleXXXXXXXXXXXXXXX/addf-demo-ide-jupyter-hub-addfdemo-HandlerXXXXXXXXXXXX is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::XXXXXXXXXXXXX:role/addf-demo-core-eks-clusterCreationRoleXXXXXXXXXXX\nUnable to connect to the server: getting credentials: exec: executable aws failed with exit code 255\n' ...

serge-dolgavin-dxc commented 1 week ago

@malachi-constant,

Please find attached the CodeBuild log for the jupyter-hub module: jupyter-hub_CodeBuild.log

malachi-constant commented 1 week ago

@serge-dolgavin-dxc Can you try this module from main? That branch was deleted after the merge.

serge-dolgavin-dxc commented 1 week ago

@malachi-constant,

Sorry that my messages were unclear and caused confusion.

I noticed that the branch was deleted, and I have been using main for the last 5 days.

Yesterday's CodeBuild log for the jupyter-hub module is based on the recent main branch.

malachi-constant commented 1 week ago

Gotcha, I missed that. Taking a look...

malachi-constant commented 1 week ago

@serge-dolgavin-dxc Are you able to provide the trust policy for arn:aws:iam::XXXXXXXXXXXXX:role/addf-demo-core-eks-clusterCreationRoleXXXXXXXXXXX, with account values sanitized, so I can compare it to what I have tested? I am not able to replicate the issue.

serge-dolgavin-dxc commented 6 days ago

@malachi-constant, please find attached the policy details along with the latest CodeBuild log: jupyter-hub.zip

malachi-constant commented 6 days ago

OK, so the trust is not being added for some reason. Can you also tell me which version of the eks module is deployed?
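A quick way to confirm a missing trust from a sanitized policy is to compare the caller in the AccessDenied message with the principals in the role's trust policy. The sketch below is a simplification: it only looks at explicit `AWS` principals in `Allow` statements for `sts:AssumeRole`, and it ignores conditions, wildcards, account-root principals, and role paths. The function names, account ID, and role names are hypothetical.

```python
def role_arn_from_session_arn(session_arn):
    """Map arn:aws:sts::<acct>:assumed-role/<role>/<session> to the IAM role ARN."""
    parts = session_arn.split(":")
    if len(parts) != 6 or parts[2] != "sts":
        raise ValueError("expected an STS assumed-role ARN")
    kind, role_name, _session = parts[5].split("/", 2)
    if kind != "assumed-role":
        raise ValueError("expected an assumed-role resource")
    return f"arn:aws:iam::{parts[4]}:role/{role_name}"


def principals_allowed_to_assume(trust_policy):
    """Collect AWS principals from Allow statements that grant sts:AssumeRole."""
    allowed = set()
    statements = trust_policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if "sts:AssumeRole" not in actions:
            continue
        principal = stmt.get("Principal", {})
        aws = principal.get("AWS", []) if isinstance(principal, dict) else []
        if isinstance(aws, str):
            aws = [aws]
        allowed.update(aws)
    return allowed


def is_trusted(session_arn, trust_policy):
    """True if the caller's role is listed explicitly in the trust policy."""
    return role_arn_from_session_arn(session_arn) in principals_allowed_to_assume(trust_policy)


# Hypothetical values, for illustration only.
caller = "arn:aws:sts::111122223333:assumed-role/example-handler-role/example-session"
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111122223333:role/some-other-role"},
            "Action": "sts:AssumeRole",
        }
    ],
}
print(is_trusted(caller, policy))
```

A `False` result here only means the handler role is not named explicitly; a real evaluation by IAM takes more into account than this check does.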

serge-dolgavin-dxc commented 5 days ago

I am using the latest main branch (default demo / example-dev manifests).

name: eks
path: git::https://github.com/awslabs/idf-modules.git//modules/compute/eks?ref=release/1.11.0
dataFiles:
  - filePath: git::https://github.com/awslabs/idf-modules.git//data/eks_dockerimage-replication/versions/1.29.yaml?ref=release/1.11.0
  - filePath: git::https://github.com/awslabs/idf-modules.git//data/eks_dockerimage-replication/versions/default.yaml?ref=release/1.11.0

malachi-constant commented 5 days ago

Ok thanks, was able to replicate, working on it...

malachi-constant commented 5 days ago

@serge-dolgavin-dxc

See manifest in PR

This error is resolved by updating ide-modules.yaml:

name: jupyter-hub
path: modules/demo-only/jupyter-hub/
parameters:
 - name: eks-cluster-admin-role-arn
   valueFrom:
     moduleMetadata:
       group: core
       name: eks
       key: EksClusterMasterRoleArn

serge-dolgavin-dxc commented 4 days ago

@malachi-constant, thanks a lot for your help! I was able to deploy the jupyter-hub module.

┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ Account ┃ Region    ┃ Deployment ┃ Group       ┃ Module           ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ primary │ eu-west-1 │ demo       │ optionals   │ networking       │
│ primary │ eu-west-1 │ demo       │ optionals   │ datalake-buckets │
│ primary │ eu-west-1 │ demo       │ replication │ replication      │
│ primary │ eu-west-1 │ demo       │ core        │ metadata-storage │
│ primary │ eu-west-1 │ demo       │ core        │ eks              │
│ primary │ eu-west-1 │ demo       │ core        │ batch-compute    │
│ primary │ eu-west-1 │ demo       │ core        │ efs              │
│ primary │ eu-west-1 │ demo       │ ide         │ jupyter-hub      │
└─────────┴───────────┴────────────┴─────────────┴──────────────────┘

Unfortunately, I ran into two issues after the deployment:

  1. I was not able to query the DNS name of JupyterHub:

    $ kubectl get ing jupyterhub -n jupyter-hub -o jsonpath="{.status.loadBalancer.ingress[0].hostname}"
    E0913 07:48:55.773416   20574 memcache.go:265] couldn't get current server API group list: Get "https://7AB5A3CFD6880B49EFACA781A5D20570.gr7.eu-central-1.eks.amazonaws.com/api?timeout=32s": dial tcp: lookup 7AB5A3CFD6880B49EFACA781A5D20570.gr7.eu-central-1.eks.amazonaws.com on 172.20.48.1:53: no such host
    E0913 07:48:55.778115   20574 memcache.go:265] couldn't get current server API group list: Get "https://7AB5A3CFD6880B49EFACA781A5D20570.gr7.eu-central-1.eks.amazonaws.com/api?timeout=32s": dial tcp: lookup 7AB5A3CFD6880B49EFACA781A5D20570.gr7.eu-central-1.eks.amazonaws.com on 172.20.48.1:53: no such host
    E0913 07:48:55.781898   20574 memcache.go:265] couldn't get current server API group list: Get "https://7AB5A3CFD6880B49EFACA781A5D20570.gr7.eu-central-1.eks.amazonaws.com/api?timeout=32s": dial tcp: lookup 7AB5A3CFD6880B49EFACA781A5D20570.gr7.eu-central-1.eks.amazonaws.com on 172.20.48.1:53: no such host
    E0913 07:48:55.785906   20574 memcache.go:265] couldn't get current server API group list: Get "https://7AB5A3CFD6880B49EFACA781A5D20570.gr7.eu-central-1.eks.amazonaws.com/api?timeout=32s": dial tcp: lookup 7AB5A3CFD6880B49EFACA781A5D20570.gr7.eu-central-1.eks.amazonaws.com on 172.20.48.1:53: no such host
    E0913 07:48:55.794678   20574 memcache.go:265] couldn't get current server API group list: Get "https://7AB5A3CFD6880B49EFACA781A5D20570.gr7.eu-central-1.eks.amazonaws.com/api?timeout=32s": dial tcp: lookup 7AB5A3CFD6880B49EFACA781A5D20570.gr7.eu-central-1.eks.amazonaws.com on 172.20.48.1:53: no such host
    Unable to connect to the server: dial tcp: lookup 7AB5A3CFD6880B49EFACA781A5D20570.gr7.eu-central-1.eks.amazonaws.com on 172.20.48.1:53: no such host

    Please note the regions: the ADDF demo was deployed in eu-west-1, not eu-central-1.

  2. Spawn failed after authentication on jupyter-hub:

Event log
Server requested
2024-09-13T05:51:01.159711Z [Normal] Successfully assigned jupyter-hub/jupyter-testadmin to ip-10-0-5-247.eu-west-1.compute.internal
2024-09-13T05:51:05Z [Normal] AttachVolume.Attach succeeded for volume "pvc-476aa8dd-ff44-4961-bd31-e335e243b2c2"
2024-09-13T05:51:06Z [Warning] Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "c32ec298a72142146904b12cd76eed4d0de1cb67d0bcffe61ace594ef57748f4": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
2024-09-13T05:51:20Z [Warning] Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "947b92fa3b8b39f8d5739e39c4c3fb9dd4ec4c086e9ab1c245c071f4d830ba01": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
2024-09-13T05:51:33Z [Warning] Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "812dcbfd7cc2d97c2557ff5e647fd0459b3578b5bf57266ea52a32f61e24b4be": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
2024-09-13T05:51:46Z [Warning] Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "814fcaf4f3b16a97d283b3ebec306bf2c04c8cf18f223648c954300a9ddfa72e": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
2024-09-13T05:52:00Z [Warning] Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "07333715950cac4d843c6d86f2c3cddff3ee6e9089303c427e7036dd0c255a83": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
2024-09-13T05:52:12Z [Warning] Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "4adad6b61854075ae7cc294aa9879bdebb999dc3f12e32c882e5386fa4a711f6": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
2024-09-13T05:52:25Z [Warning] Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "9f420ca72cbf547e4e5fca53640b569ed68ede236b597f1c1b9d7ba00e666aea": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
2024-09-13T05:52:39Z [Warning] Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "b8d0932f530f7347b9119f49782f83cbf9af1434a5330ad6ce3fa146187b5f31": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
2024-09-13T05:52:51Z [Warning] Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "bc3e7ba7bd1eae2c9279fc247d72dcd87355413ce1919b58f3e451c159ff39cd": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
2024-09-13T05:53:04Z [Warning] (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "66b1d8132d73dcbba43f04acc1fe2926c56e641a060d9bcb5f12833f20f7c284": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
Spawn failed: pod jupyter-hub/jupyter-testadmin did not start in 300 seconds!

Could you please advise how to address these issues?
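The event log above repeats one warning many times with a different sandbox ID each time. A small helper like the following can collapse such a log into distinct messages with counts, which is easier to paste into an issue. It is a sketch only: the regular expressions assume the `<timestamp> [<level>] <message>` shape shown above, `summarize_events` is a made-up name, and the sample lines are shortened stand-ins for the real events. It does not diagnose the cause.

```python
import re
from collections import Counter

# Assumption: sandbox IDs are lowercase hex strings in double quotes, as in the log above.
_SANDBOX_ID = re.compile(r'sandbox "[0-9a-f]+"')
_EVENT = re.compile(r"^\S+ \[(?P<level>\w+)\] (?P<message>.*)$")
_COMBINED_PREFIX = "(combined from similar events): "


def summarize_events(lines):
    """Count events by (level, message) after masking the per-attempt sandbox ID."""
    counts = Counter()
    for line in lines:
        match = _EVENT.match(line.strip())
        if not match:
            continue  # skip UI text such as "Server requested"
        message = match.group("message").replace(_COMBINED_PREFIX, "")
        message = _SANDBOX_ID.sub('sandbox "<id>"', message)
        counts[(match.group("level"), message)] += 1
    return counts


# Shortened stand-ins for the events in this comment.
sample_events = [
    "Server requested",
    "2024-09-13T05:51:01.159711Z [Normal] Successfully assigned pod",
    '2024-09-13T05:51:06Z [Warning] Failed to create pod sandbox: failed to setup network '
    'for sandbox "aaaa1111": failed to assign an IP address to container',
    '2024-09-13T05:51:20Z [Warning] (combined from similar events): Failed to create pod sandbox: '
    'failed to setup network for sandbox "bbbb2222": failed to assign an IP address to container',
]

for (level, message), count in summarize_events(sample_events).items():
    print(f"{count}x [{level}] {message}")
```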

dgraeber commented 4 days ago

@serge-dolgavin-dxc

I think your kubectl credentials are pointing to the wrong cluster (do you have multiple clusters defined in .kube?). This command:

kubectl get ing jupyterhub -n jupyter-hub -o jsonpath="{.status.loadBalancer.ingress[0].hostname}"

should be executed against the proper cluster.

REF: https://kubernetes.io/docs/tasks/access-application-cluster/configure-access-multiple-clusters/
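One way to spot this kind of mismatch quickly is to read the region out of the API endpoint that kubectl is trying to reach and compare it with the region of the deployment. The sketch below assumes the endpoint host has the `<id>.<suffix>.<region>.eks.amazonaws.com` shape seen in the error output above; `region_from_eks_endpoint` is a made-up helper, not part of any AWS or Kubernetes library.

```python
import re
from urllib.parse import urlparse

# Assumption: host looks like <id>.<suffix>.<region>.eks.amazonaws.com, as in the error above.
_EKS_HOST = re.compile(r"^[^.]+\.[^.]+\.(?P<region>[a-z0-9-]+)\.eks\.amazonaws\.com$")


def region_from_eks_endpoint(endpoint):
    """Return the region embedded in an EKS API endpoint, or None if it does not match."""
    host = urlparse(endpoint).hostname if "://" in endpoint else endpoint
    match = _EKS_HOST.match((host or "").lower())
    return match.group("region") if match else None


endpoint = "https://7AB5A3CFD6880B49EFACA781A5D20570.gr7.eu-central-1.eks.amazonaws.com/api?timeout=32s"
deployed_region = "eu-west-1"  # region shown in the deployment table earlier in the thread

endpoint_region = region_from_eks_endpoint(endpoint)
if endpoint_region != deployed_region:
    print(f"kubectl targets {endpoint_region}, but the deployment is in {deployed_region}")
```

If the regions differ, `kubectl config current-context` shows which context is active, and `aws eks update-kubeconfig --region <region> --name <cluster-name>` is one common way to add the right cluster to the kubeconfig.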

serge-dolgavin-dxc commented 1 day ago

@dgraeber, thanks a lot for your hint!

The addf-demo-core-eks-cluster configuration was missing.

The first issue is solved, but the second still remains. Is it an issue with access rights?