Open asmacdo opened 1 month ago
Hi @asmacdo, just checking in to see how this report generation is going? Thanks.
I ran into some problems.
The du script ran for about 50 minutes and then the pod disappeared without logs.
Worse, it also kicked my jupyterhub pod, as well as another user's.
[I 2024-11-11 17:57:57.999 JupyterHub log:192] 200 GET /hub/error/503?url=%2Fuser%2Fasmacdo%2Fterminals%2Fwebsocket%2F1 (@100.64.247.104) 7.52ms
[W 2024-11-11 17:57:59.266 JupyterHub base:1254] User asmacdo server stopped, with exit code: 1
[I 2024-11-11 17:57:59.266 JupyterHub proxy:357] Removing user asmacdo from proxy (/user/asmacdo/)
I think this means we need to take a different approach. By setting resource limits, we should have isolated our job from the other pods, but since I have no other logs about what happened here I think we need to take a more conservative approach that is completely isolated from user pods.
I did it this way because I thought it would be simpler, but if there's any chance that we affect a running user pod, we would be better off deploying a separate EC2 instance and binding the EFS directly, avoiding Kubernetes altogether.
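For reference, here is a minimal sketch of how resource limits are declared on a Kubernetes Job. It is not the manifest actually used here: the job name, image, command, and sizes are hypothetical placeholders. It builds a `batch/v1` Job as a plain dict, with requests equal to limits, which is what puts a pod in the Guaranteed QoS class.

```python
import json


def build_report_job(name, image, command, cpu="500m", memory="1Gi"):
    """Return a batch/v1 Job manifest as a plain dict.

    Every argument value used below is a hypothetical placeholder,
    not the configuration from this issue.
    """
    # Setting requests == limits for every container gives the pod
    # the Guaranteed QoS class.
    resources = {
        "requests": {"cpu": cpu, "memory": memory},
        "limits": {"cpu": cpu, "memory": memory},
    }
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Do not retry automatically if the pod fails.
            "backoffLimit": 0,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "command": command,
                            "resources": resources,
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    job = build_report_job(
        "disk-usage-report",                      # hypothetical job name
        "python:3.12-slim",                       # hypothetical image
        ["python", "/scripts/du.py", "/home"],    # hypothetical command
    )
    print(json.dumps(job, indent=2))
```

Limits like these cover CPU and memory only. They do not cap I/O against a shared filesystem such as EFS, which may or may not be relevant to what happened here.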
Thanks @asmacdo. That makes sense.
Fixes https://github.com/dandi/dandi-hub/issues/177
Step 1: Create Skeleton
I've verified that when a user-node is available (created by running a tiny jupyterhub), the job pod schedules on that node. I then shut down my jupyterhub and all user-nodes scaled down. I reran this job, and Karpenter successfully scaled up a new spot node; the pod was scheduled on it, ran successfully, and was deleted, and the node was cleaned up. Step 1 complete!
Step 2: Generate Report
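As a rough illustration of what the report step could compute, here is a minimal du-like sketch. It is an assumption about the shape of the report, not the actual du script from this work: the function name and the per-top-level-directory grouping are hypothetical. It also sums apparent file sizes (`st_size`), whereas `du` reports allocated blocks by default, so the numbers can differ.

```python
import os
import sys
from collections import defaultdict


def usage_by_top_level(root):
    """Sum apparent file sizes (bytes) under each immediate child of root.

    Files that sit directly in root are grouped under ".".
    """
    totals = defaultdict(int)
    root = os.path.abspath(root)
    # os.walk does not follow symlinked directories by default;
    # unreadable directories are skipped rather than raising.
    for dirpath, _dirnames, filenames in os.walk(root, onerror=lambda err: None):
        rel = os.path.relpath(dirpath, root)
        top = "." if rel == "." else rel.split(os.sep)[0]
        for fname in filenames:
            try:
                # lstat so a symlink counts as the link, not its target.
                st = os.lstat(os.path.join(dirpath, fname))
            except OSError:
                continue  # file vanished or is unreadable
            totals[top] += st.st_size
    return dict(totals)


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    report = usage_by_top_level(target)
    # Largest directories first.
    for name, size in sorted(report.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{size}\t{name}")
```

A single-threaded walk like this can take a long time on a large EFS volume, so it is best treated as a starting point rather than a tuned implementation.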
Step 3: Push Report
Questions to answer: