chiayi closed this pull request 4 months ago
/gcbrun
/gcbrun
/gcbrun
/gcbrun
Jupyter seems to be running into an issue while waiting for pods to become ready. It's currently happening when RAG is being installed, although I'm unsure whether that's related, and I can't seem to reproduce it locally.
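For context, this is roughly the readiness check I'm doing by hand while debugging; the namespace and label selector below are placeholders, not the module's actual values:

```bash
# Sketch: check whether the JupyterHub pods ever become Ready.
# The namespace and label are assumptions for illustration only.
kubectl get pods -n jupyter
kubectl wait --for=condition=Ready pod -l app=jupyterhub \
  -n jupyter --timeout=600s
```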
/gcbrun
The error mentioned in this comment: https://github.com/GoogleCloudPlatform/ai-on-gke/pull/620#issuecomment-2080202976 seems to be inconsistent. Another issue that shows up with Cloud Build is that Jupyter doesn't destroy properly.
/gcbrun
/gcbrun
/gcbrun
/gcbrun
Cloud Build was failing because the RAG port had been changed; it was changed to keep Jupyter/RAG/Ray from sharing the same port. Changing it back to see if the build succeeds.
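If the shared port is colliding during the test's local port-forwards, a rough sketch of keeping them distinct would look like this (service names, namespaces, and port numbers are all placeholders):

```bash
# Sketch: forward each app on its own local port so the test traffic
# doesn't collide. Service names, namespaces, and ports are placeholders.
kubectl port-forward -n jupyter svc/proxy-public 8081:80 &
kubectl port-forward -n rag svc/rag-frontend 8082:8080 &
kubectl port-forward -n ray svc/ray-cluster-head-svc 8083:8265 &
# the forwards run in the background while the tests execute
```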
/gcbrun
Build takes around 50 mins. Going to rerun.
/gcbrun
/gcbrun
The build time for pr-review-trigger is 52m; I was expecting a larger improvement from this change :thinking:
Looking at all the past runs, the cleanup sometimes doesn't fully succeed (always on the standard cluster). Going to run one more time to see if it happens again.
/gcbrun
I don't think cleanup ever succeeded fwiw, there are some known bugs there
Oh, I was specifically talking about the JupyterHub cleanup. Successful runs: https://pantheon.corp.google.com/cloud-build/builds;region=us-central1/6e09787a-1d3e-4509-8c42-fe4cb3d0b0d4?e=-13802955&mods=prod_coliseum&project=gke-ai-eco-dev and https://pantheon.corp.google.com/cloud-build/builds;region=us-central1/fefbdae1-a82e-4b2a-884e-f2a8a5bc5cc2?e=-13802955&mods=prod_coliseum&project=gke-ai-eco-dev.
/gcbrun
Oh my bad, yeah this seems like a new failure
I couldn't repro locally. Printing kubectl events to see if there's anything there.
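Roughly what the added debug output amounts to (the namespace names are placeholders for wherever the apps land):

```bash
# Sketch: dump recent events and pod status from the app namespaces after a
# failed wait. Namespace names are placeholders.
for ns in jupyter rag; do
  echo "=== $ns ==="
  kubectl get events -n "$ns" --sort-by=.lastTimestamp | tail -n 30
  kubectl get pods -n "$ns" -o wide
done
```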
/gcbrun
/gcbrun
/gcbrun
/gcbrun
The RAG frontend isn't coming up, with error: `Waiting for rollout to finish: 3 replicas wanted; 2 replicas Ready`
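To dig into which replica is stuck, something along these lines (the deployment name, namespace, and label are placeholders for whatever the RAG module actually creates):

```bash
# Sketch: inspect the stuck rollout and the individual frontend pods.
# Deployment name, namespace, and label selector are placeholders.
kubectl rollout status deployment/rag-frontend -n rag --timeout=5m
kubectl describe deployment rag-frontend -n rag | tail -n 20
kubectl get pods -n rag -l app=rag-frontend \
  -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName
```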
Cleaned up all the unnecessary kubectl commands. Will also try the test again.
/gcbrun
Same error: the RAG frontend isn't coming up, with `Waiting for rollout to finish: 3 replicas wanted; 2 replicas Ready`
Adding Terraform debug logging and kubectl event output.
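Roughly what the Terraform side of the debugging looks like (the log path is a placeholder for wherever the build keeps workspace files):

```bash
# Sketch: enable verbose Terraform logging for the failing apply.
# The log path is a placeholder.
export TF_LOG=DEBUG
export TF_LOG_PATH=/workspace/terraform-debug.log
terraform init
terraform apply -auto-approve
```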
/gcbrun
/gcbrun
Cleaned up the rest of the kubectl events / get pods commands that were used for debugging.
/gcbrun
Going to run again; the test failed because the cluster couldn't be created properly.
/gcbrun
Going to leak the cluster (skip teardown) if there's a failure in Ray or Jupyter.
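A sketch of what leaking means here, i.e. bailing out before the destroy step when an apply fails so the cluster stays up for inspection (directory paths are placeholders):

```bash
# Sketch: if either apply fails, exit before the destroy step so the
# cluster is intentionally leaked for debugging. Paths are placeholders.
for app in jupyter ray; do
  if ! terraform -chdir="applications/$app" apply -auto-approve; then
    echo "$app apply failed; leaking cluster for inspection" >&2
    exit 1  # never reaches terraform destroy
  fi
done
```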
/gcbrun
/gcbrun
/gcbrun
/gcbrun
/gcbrun
I'm noticing that the frontend's 3rd replica takes a bit longer to start up because its image takes a bit longer to pull. I'm extending the create timeout for the frontend to address this. As for why image pulling is slower, it will require a deeper dive into why it takes an average of ~7m more for the 3rd replica. This only occurs on standard clusters.
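To quantify the slow pulls, something like this over the frontend namespace should show per-pod pull durations (the namespace is a placeholder):

```bash
# Sketch: pull timings show up in the "Pulled" events
# ("Successfully pulled image ... in Xs"). Namespace is a placeholder.
kubectl get events -n rag --field-selector reason=Pulled \
  -o custom-columns=TIME:.lastTimestamp,POD:.involvedObject.name,MSG:.message
```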
/gcbrun
/gcbrun
I am now confused; the error is pretty inconsistent for some reason.
/gcbrun
Parallelize the application builds: run Jupyter/RAG/Ray at the same time in different namespaces.
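A rough shape for that, assuming each app can be applied independently (the directories and the namespace variable are placeholders for however the modules are actually wired):

```bash
# Sketch: apply the three applications concurrently, each in its own
# namespace, and fail the step if any of them fails. Paths and the
# namespace variable name are placeholders.
pids=()
for app in jupyter rag ray; do
  terraform -chdir="applications/$app" apply -auto-approve \
    -var="namespace=${app}-test" &
  pids+=($!)
done
rc=0
for pid in "${pids[@]}"; do
  wait "$pid" || rc=1
done
exit "$rc"
```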