GoogleCloudPlatform / ai-on-gke

AI on GKE is a collection of examples, best-practices, and prebuilt solutions to help build, deploy, and scale AI Platforms on Google Kubernetes Engine
Apache License 2.0

Parallelize all the applications within the cloudbuild test #620

Closed: chiayi closed this 4 months ago

chiayi commented 4 months ago

Parallelize the application builds: run Jupyter/Rag/Ray at the same time in different namespaces.
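For illustration, parallel steps in Cloud Build are normally expressed with `waitFor`; a minimal sketch along those lines is below, assuming a shared cluster-creation step and one Terraform step per app (the step ids, images, and args are placeholders and may not match what this PR actually does).

```yaml
# Illustrative cloudbuild.yaml excerpt, not the actual change in this PR.
steps:
  - id: create-cluster
    name: gcr.io/cloud-builders/gcloud
    args: ['container', 'clusters', 'create', 'test-cluster', '--region', 'us-central1']

  # Each app step waits only on the cluster step, so Cloud Build runs
  # Jupyter, RAG, and Ray concurrently, each in its own namespace.
  - id: test-jupyter
    name: hashicorp/terraform
    args: ['apply', '-auto-approve', '-var', 'namespace=jupyter']
    waitFor: ['create-cluster']

  - id: test-rag
    name: hashicorp/terraform
    args: ['apply', '-auto-approve', '-var', 'namespace=rag']
    waitFor: ['create-cluster']

  - id: test-ray
    name: hashicorp/terraform
    args: ['apply', '-auto-approve', '-var', 'namespace=ray']
    waitFor: ['create-cluster']
```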

chiayi commented 4 months ago

/gcbrun

chiayi commented 4 months ago

/gcbrun

chiayi commented 4 months ago

/gcbrun

chiayi commented 4 months ago

/gcbrun

chiayi commented 4 months ago

Jupyter seems to be running into an issue while waiting for pods to be ready. It's currently happening while RAG is being installed, although I'm not sure the two are related, and I can't seem to repro it locally.

chiayi commented 4 months ago

/gcbrun

chiayi commented 4 months ago

The error mentioned in this comment: https://github.com/GoogleCloudPlatform/ai-on-gke/pull/620#issuecomment-2080202976 seems to be inconsistent. Another issue that seems to occur with Cloud Build is Jupyter failing to destroy properly.

chiayi commented 4 months ago

/gcbrun

chiayi commented 4 months ago

/gcbrun

chiayi commented 4 months ago

/gcbrun

chiayi commented 4 months ago

/gcbrun

chiayi commented 4 months ago

Cloud Build was failing because the port for RAG had been changed; it was changed to prevent Jupyter/Rag/Ray from sharing the same port. Changing it back to see if the build succeeds.

chiayi commented 4 months ago

/gcbrun

chiayi commented 4 months ago

Build takes around 50 mins. Going to rerun.

chiayi commented 4 months ago

/gcbrun

chiayi commented 4 months ago

/gcbrun

andrewsykim commented 4 months ago

The build time for pr-review-trigger is 52m; I was expecting a larger improvement from this change :thinking:

chiayi commented 4 months ago

Looking at all the past runs, the cleanup would sometimes not completely succeed (all from the standard cluster). Going to run one more time to see if it happens again.

chiayi commented 4 months ago

/gcbrun

andrewsykim commented 4 months ago

> Looking at all the past runs, the cleanup would sometimes not completely succeed (all from the standard cluster). Going to run one more time to see if it happens again.

I don't think cleanup ever succeeded, FWIW; there are some known bugs there.

chiayi commented 4 months ago

> > Looking at all the past runs, the cleanup would sometimes not completely succeed (all from the standard cluster). Going to run one more time to see if it happens again.
>
> I don't think cleanup ever succeeded, FWIW; there are some known bugs there.

Oh, I was specifically talking about the JupyterHub cleanup. Successful runs: https://pantheon.corp.google.com/cloud-build/builds;region=us-central1/6e09787a-1d3e-4509-8c42-fe4cb3d0b0d4?e=-13802955&mods=prod_coliseum&project=gke-ai-eco-dev and https://pantheon.corp.google.com/cloud-build/builds;region=us-central1/fefbdae1-a82e-4b2a-884e-f2a8a5bc5cc2?e=-13802955&mods=prod_coliseum&project=gke-ai-eco-dev.

Failed run: https://pantheon.corp.google.com/cloud-build/builds;region=us-central1/a3b49edc-e6a9-459a-9792-f017cebe9d87?e=-13802955&mods=prod_coliseum&project=gke-ai-eco-dev

chiayi commented 4 months ago

And I guess the latest run as well: https://pantheon.corp.google.com/cloud-build/builds;region=us-central1/110ac62c-99c9-4a8c-b4b5-4dcca73193c5;step=8?e=-13802955&mods=prod_coliseum&project=gke-ai-eco-dev

chiayi commented 4 months ago

/gcbrun

andrewsykim commented 4 months ago

> Oh, I was specifically talking about the JupyterHub cleanup.

Oh my bad, yeah this seems like a new failure.

chiayi commented 4 months ago

I couldn't repro it locally. Printing kubectl events to see if there is anything there.
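For reference, the debugging commands in question might look something like this (the namespace here is a placeholder for whichever app is failing):

```bash
# Dump recent events from the namespace under test, newest last.
kubectl get events -n jupyter --sort-by=.lastTimestamp
# Show pod status in case something is stuck in Pending or ImagePullBackOff.
kubectl get pods -n jupyter -o wide
```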

chiayi commented 4 months ago

/gcbrun

chiayi commented 4 months ago

/gcbrun

andrewsykim commented 4 months ago

/gcbrun

chiayi commented 4 months ago

/gcbrun

chiayi commented 4 months ago

The RAG frontend isn't coming up, with error: `Waiting for rollout to finish: 3 replicas wanted; 2 replicas Ready`
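That message looks like the Terraform kubernetes provider waiting on the Deployment rollout. One way to watch the same rollout by hand, with illustrative resource names (the real names depend on the RAG module):

```bash
# Watch the frontend rollout directly; deployment/namespace/label are placeholders.
kubectl rollout status deployment/rag-frontend -n rag --timeout=15m
# Describe the pods to see why the missing replica never becomes Ready.
kubectl describe pods -n rag -l app=rag-frontend
```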

chiayi commented 4 months ago

Cleaned up all the unnecessary kubectl commands. Will also try the test again.

chiayi commented 4 months ago

/gcbrun

chiayi commented 4 months ago

Same error:

> The RAG frontend isn't coming up, with error: `Waiting for rollout to finish: 3 replicas wanted; 2 replicas Ready`

Adding Terraform debug output and kubectl events.
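Assuming the extra Terraform detail comes from the standard logging environment variables, that could look roughly like:

```bash
# Illustrative only: raise Terraform log verbosity for the apply step and
# keep the log in the Cloud Build workspace for later inspection.
export TF_LOG=DEBUG
export TF_LOG_PATH=/workspace/terraform-debug.log
terraform apply -auto-approve
```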

chiayi commented 4 months ago

/gcbrun

chiayi commented 4 months ago

/gcbrun

chiayi commented 4 months ago

Cleaned up the rest of the kubectl events / get pods commands that were used for debugging.

chiayi commented 4 months ago

/gcbrun

chiayi commented 4 months ago

Going to run again; the test failed because the cluster could not be created properly.

chiayi commented 4 months ago

/gcbrun

chiayi commented 4 months ago

Going to leak the cluster if there was a failure in Ray or Jupyter.
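A rough sketch of what such a cleanup guard can look like, using a made-up failure flag (the actual mechanism in this PR may differ):

```bash
# Hypothetical cleanup guard: keep ("leak") the cluster when Ray or Jupyter
# failed so its state can be inspected, otherwise tear everything down.
if [[ "${RAY_OR_JUPYTER_FAILED:-0}" == "1" ]]; then
  echo "Test failure detected; leaving the cluster up for inspection."
else
  terraform destroy -auto-approve
fi
```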

chiayi commented 4 months ago

/gcbrun

chiayi commented 4 months ago

/gcbrun

chiayi commented 4 months ago

/gcbrun

chiayi commented 4 months ago

/gcbrun

chiayi commented 4 months ago

/gcbrun

chiayi commented 4 months ago

I'm noticing that the frontend's third replica takes a bit longer to start up because its image takes a bit longer to pull. I'm extending the create timeout for the frontend to address this. As for why the image pull is slower, it will require a deeper dive into why it takes an average of ~7 minutes more for the third replica. This is only occurring on standard clusters.
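Assuming the frontend is a `kubernetes_deployment` managed through the Terraform kubernetes provider, the extended create timeout could be expressed like this (names, image, and the 30m value are illustrative, not the exact change here):

```hcl
# Illustrative only: give the Deployment extra time to become Ready at create,
# so a slow image pull on the third replica doesn't fail the apply.
resource "kubernetes_deployment" "frontend" {
  metadata {
    name      = "rag-frontend" # placeholder
    namespace = "rag"          # placeholder
  }

  spec {
    replicas = 3
    selector {
      match_labels = { app = "rag-frontend" }
    }
    template {
      metadata {
        labels = { app = "rag-frontend" }
      }
      spec {
        container {
          name  = "frontend"
          image = "us-docker.pkg.dev/example/rag-frontend:latest" # placeholder
        }
      }
    }
  }

  timeouts {
    create = "30m"
  }
}
```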

chiayi commented 4 months ago

/gcbrun

chiayi commented 4 months ago

/gcbrun

chiayi commented 4 months ago

I am now confused. The error is pretty inconsistent for some reason.

chiayi commented 4 months ago

/gcbrun