GH action to generate report

asmacdo commented 1 month ago

Fixes https://github.com/dandi/dandi-hub/issues/177

Step 1: Create Skeleton

[X] Authenticate with AWS
[X] Connect to K8s cluster
[X] Deploy our job-runner pod onto a Karpenter NodeClaim
[X] Create a SPOT node as needed
[X] Run dummy job
[X] Delete Pod
[X] Scale Down

I've verified that when a user-node is available (created by running a tiny JupyterHub), the job pod schedules on that node. I then shut down my JupyterHub and all user-nodes scaled down. I reran this job: Karpenter successfully scaled up a new spot node, the pod was scheduled on it, ran successfully, and was deleted, and the node was cleaned up. Step 1 complete!
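
For reference, a rough sketch of what a job-runner pod like this could look like, written as a Python function that emits the manifest. This is not the manifest used in this PR: the function name, the default CPU/memory values, and the choice to pin the pod to spot capacity are assumptions for illustration. `karpenter.sh/capacity-type` is Karpenter's well-known capacity label; everything else is standard Pod spec.

```python
import json


def job_pod_manifest(name, image, command, cpu="1", memory="2Gi"):
    """Build a Pod manifest dict (hypothetical helper, values are placeholders).

    Requests are set equal to limits for both CPU and memory, which puts
    the pod in the Guaranteed QoS class.
    """
    resources = {"cpu": cpu, "memory": memory}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Run once; do not restart the container when the job exits.
            "restartPolicy": "Never",
            # Assumption: ask Karpenter for spot capacity via its capacity label.
            "nodeSelector": {"karpenter.sh/capacity-type": "spot"},
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "command": list(command),
                    "resources": {
                        "requests": dict(resources),
                        "limits": dict(resources),
                    },
                }
            ],
        },
    }


def to_json(manifest):
    """Serialize the manifest; kubectl accepts JSON as well as YAML."""
    return json.dumps(manifest, indent=2)
```

The output of `to_json(job_pod_manifest("dummy-job", "busybox", ["sleep", "10"]))` could be saved to a file and applied with `kubectl apply -f`. The image and command here are placeholders.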

Step 2: Generate Report

[ ] Connect Pod to EFS
[ ] List users
[ ] Run du on each user directory
[ ] Run du on the shared directory
[ ] Collate data into report
[ ] Double-check that nodes come up and down successfully
[ ] Run job several times in 1 day, check next day for EFS usage spike (IIUC we should be fine because EFS is in Bursting mode)
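
A minimal sketch of the list-users / du / collate steps above, assuming the EFS mount exposes one directory per user under a home root plus a separate shared directory. The function names and the directory layout are assumptions, not the script from this PR. Note that it sums apparent file sizes via `lstat`, which can differ from the block usage that `du` reports.

```python
import os
from pathlib import Path


def dir_size_bytes(root):
    """Sum apparent file sizes under root; os.walk does not follow symlinked dirs."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                # lstat so a symlink counts as the link itself, not its target.
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                # File vanished or is unreadable; skip it rather than abort.
                continue
    return total


def build_report(home_root, shared_root):
    """Return per-user, shared, and total usage in bytes."""
    users = {}
    for entry in sorted(Path(home_root).iterdir()):
        # Treat each real directory directly under home_root as one user.
        if entry.is_dir() and not entry.is_symlink():
            users[entry.name] = dir_size_bytes(entry)
    shared = dir_size_bytes(shared_root)
    return {
        "users": users,
        "shared": shared,
        "total": sum(users.values()) + shared,
    }
```

Calling `build_report("/home", "/shared")` (both paths hypothetical) returns a plain dict that can be dumped to JSON for the report.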

Step 3: Push Report

[ ] Create private GitHub repository to store reports
[ ] Configure bot permission to push to repo
[ ] Push report to repo on completion

Questions to answer:

If a SPOT node is preempted, can we redeploy again later?

kabilar commented 2 weeks ago

Hi @asmacdo, just checking in to see how this report generation is going? Thanks.

asmacdo commented 1 week ago

Ran into some problems

The du script ran for about 50 minutes and then the pod disappeared without logs.

Worse, it kicked my JupyterHub pod as well as another user's.

[I 2024-11-11 17:57:57.999 JupyterHub log:192] 200 GET /hub/error/503?url=%2Fuser%2Fasmacdo%2Fterminals%2Fwebsocket%2F1 (@100.64.247.104) 7.52ms
[W 2024-11-11 17:57:59.266 JupyterHub base:1254] User asmacdo server stopped, with exit code: 1
[I 2024-11-11 17:57:59.266 JupyterHub proxy:357] Removing user asmacdo from proxy (/user/asmacdo/)

I think this means we need to take a different approach. Setting resource limits should have isolated our job from the other pods, but since I have no other logs showing what happened here, I think we need a more conservative approach that is completely isolated from user pods.

I did it this way because I thought it would be simpler, but if there's any chance that we affect a running user pod, we would be better off directly deploying a separate EC2 instance and bind the EFS directly, avoiding Kubernetes altogether.

kabilar commented 1 week ago

we would be better off directly deploying a separate EC2 instance and bind the EFS directly, avoiding Kubernetes altogether.

Thanks @asmacdo. That makes sense.
