awslabs / data-on-eks

DoEKS is a tool to build, deploy and scale Data & ML Platforms on Amazon EKS
https://awslabs.github.io/data-on-eks/
Apache License 2.0
556 stars · 185 forks

Jupyterhub is not working as expected. #381

Closed · perrydevrekomodo closed this 6 months ago

perrydevrekomodo commented 7 months ago

Description


⚠️ Note

Before you submit an issue, please perform the following for Terraform examples:

  1. Remove the local .terraform directory (ONLY if state is stored remotely, which is hopefully the best practice you are already following): rm -rf .terraform/
  2. Re-initialize the project root to pull down modules: terraform init
  3. Re-attempt your terraform plan or apply and check if the issue still persists
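
A minimal shell sketch of those three steps, assuming they are run from the blueprint's project root and that state is stored remotely:

```bash
# 1. Remove local Terraform metadata (safe only with remote state).
rm -rf .terraform/

# 2. Re-initialize the root module to pull modules down again.
terraform init

# 3. Re-attempt the plan (or apply) and check whether the issue persists.
terraform plan
```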

Versions

Steps to reproduce the behavior: https://github.com/awslabs/data-on-eks/tree/main/ai-ml/jupyterhub

Not using workspaces

Yes, I have cleared the local cache

Port-forwarded using the commands below.

```bash
aws eks --region us-west-2 update-kubeconfig --name jupyterhub-on-eks
kubectl port-forward svc/proxy-public 8080:80 -n jupyterhub
```

JupyterHub is then reachable at http://localhost:8080/.
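A quick sanity check that the forward is serving the JupyterHub proxy (assuming the default path):

```bash
# Expect an HTTP response (typically a redirect to the hub login page).
curl -I http://localhost:8080/
```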

Expected behavior

Upon sign-in, clicking the Data Science option should trigger the Karpenter provisioner to launch a new g5.2xlarge instance, schedule the user-1 JupyterHub pod on it, and pull the Docker image.
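
For reference, a hedged way to watch that flow happen; the karpenter namespace and the app.kubernetes.io/name=karpenter label are assumptions based on the default Karpenter Helm chart:

```bash
# Follow Karpenter's controller logs while the user pod is pending.
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter -f

# In another terminal, watch the user pod get scheduled onto the new node.
kubectl get pods -n jupyterhub -w
```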

Actual behavior

I get the errors below:

```
2023-12-04T20:25:24Z [Warning] Failed to schedule pod, incompatible with provisioner "inferentia", daemonset overhead={"cpu":"680m","memory":"620Mi","pods":"6"}, did not tolerate aws.amazon.com/neuroncore=true:NoSchedule; did not tolerate aws.amazon.com/neuron=true:NoSchedule; incompatible with provisioner "gpu-ts", daemonset overhead={"cpu":"680m","memory":"620Mi","pods":"6"}, no instance type satisfied resources {"cpu":"2680m","memory":"4716Mi","nvidia.com/gpu":"1","pods":"7"} and requirements NodeGroupType In [gpu-ts], hub.jupyter.org/node-purpose In [user], karpenter.k8s.aws/instance-family In [g5], karpenter.k8s.aws/instance-size In [16xlarge 24xlarge 2xlarge 4xlarge 8xlarge and 1 others], karpenter.sh/capacity-type In [on-demand spot], karpenter.sh/provisioner-name In [gpu-ts], kubernetes.io/arch In [amd64], kubernetes.io/os In [linux], provisioner In [gpu-ts] (no instance type met the scheduling requirements or had a required offering); incompatible with provisioner "gpu", daemonset overhead={"cpu":"680m","memory":"620Mi","pods":"6"}, incompatible requirements, key karpenter.sh/provisioner-name, karpenter.sh/provisioner-name In [gpu-ts] not in karpenter.sh/provisioner-name In [gpu]; incompatible with provisioner "default", daemonset overhead={"cpu":"680m","memory":"620Mi","pods":"6"}, incompatible requirements, key karpenter.sh/provisioner-name, karpenter.sh/provisioner-name In [gpu-ts] not in karpenter.sh/provisioner-name In [default]; incompatible with provisioner "trainium", daemonset overhead={"cpu":"680m","memory":"620Mi","pods":"6"}, did not tolerate aws.amazon.com/neuroncore=true:NoSchedule; did not tolerate aws.amazon.com/neuron=true:NoSchedule
```

```
2023-12-04T20:25:28.372543Z [Warning] 0/4 nodes are available: 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling..
```
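
One hedged way to dig into a scheduling failure like this is to compare the pending pod's requirements against each provisioner's spec; the Provisioner resource name assumes the pre-v1beta1 Karpenter API that the karpenter.sh/provisioner-name keys above suggest:

```bash
# List the provisioners Karpenter evaluated in the warning above.
kubectl get provisioners

# Inspect the gpu-ts provisioner's requirements, taints, and limits,
# which the scheduler says the pod could not satisfy.
kubectl describe provisioner gpu-ts
```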

vara-bonthu commented 7 months ago

@lusoal Would you be able to check this one?

lusoal commented 7 months ago

Sure, leave this up to me; I'll do it tomorrow.

asmacdo commented 7 months ago

#384 did not fix the issue for me. I ran ./cleanup, removed .terraform, and reinstalled.

From the browser and the JupyterHub operator, I am getting similar errors about the nodes:

```
2023-12-10T23:23:49.061887Z [Warning] 0/4 nodes are available: 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling..
2023-12-10T23:23:50Z [Normal] Pod should schedule on: nodeclaim/gpu-ts-xln8j
2023-12-10T23:24:22Z [Normal] pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector
2023-12-10T23:29:10.795105Z [Warning] 0/4 nodes are available: 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling..
2023-12-10T23:39:40Z [Normal] Pod should schedule on: nodeclaim/gpu-ts-cvkdn
Spawn failed: Timeout
```
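
Since the events point at nodeclaims, describing the named nodeclaim is a reasonable next step (NodeClaim is the Karpenter v1beta1 resource referenced above):

```bash
# List nodeclaims and their readiness.
kubectl get nodeclaims

# The status conditions and events usually carry the underlying EC2 error.
kubectl describe nodeclaim gpu-ts-xln8j
```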

The nodeclaim seems to give a more helpful error:

```
creating instance, with fleet error(s), InvalidParameter: Security group sg-063f3ce9c9129f915 and subnet subnet-03c6b3adbd7c25760 belong to different networks.; InvalidParameter: Security group sg-063f3ce9c9129f915 and subnet subnet-0355df2dfc137557b belong to different networks.
```
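
A quick way to confirm the mismatch the fleet error reports, using the IDs from the message above:

```bash
# VPC of the security group Karpenter attached to the launch request.
aws ec2 describe-security-groups --group-ids sg-063f3ce9c9129f915 \
  --query 'SecurityGroups[].VpcId' --output text

# VPCs of the subnets it tried to launch into; if these differ from the
# security group's VPC, the InvalidParameter error is expected.
aws ec2 describe-subnets \
  --subnet-ids subnet-03c6b3adbd7c25760 subnet-0355df2dfc137557b \
  --query 'Subnets[].VpcId' --output text
```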

nodeclaim.yaml.json