GoogleCloudPlatform / ai-on-gke

AI on GKE is a collection of examples, best-practices, and prebuilt solutions to help build, deploy, and scale AI Platforms on Google Kubernetes Engine
Apache License 2.0

RAG tf apply fails on AP cluster because AP does not scale up fast enough to deploy GMP #750

Open yiyinglovecoding opened 2 months ago

yiyinglovecoding commented 2 months ago

RAG terraform apply occasionally fails when it tries to deploy GMP while the AP cluster has zero nodes.

Error: Internal error occurred: failed calling webhook "default.podmonitorings.gmp-operator.gke-gmp-system.monitoring.googleapis.com": failed to call webhook: Post "https://gmp-operator.gke-gmp-system.svc:443/default/monitoring.googleapis.com/v1/podmonitorings?timeout=10s": no endpoints available for service "gmp-operator"

  with module.kuberay-monitoring.helm_release.gmp-engine,
  on ../../modules/kuberay-monitoring/main.tf line 21, in resource "helm_release" "gmp-engine":
  21: resource "helm_release" "gmp-engine" {

failed cloud build log

The cluster becomes fully functional a few minutes later, but by then the QSS deployment has already been marked as failed.
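
Since the failure looks like a timing problem (the webhook has no endpoints until nodes come up), one possible mitigation is to retry the failing step with backoff instead of failing on the first attempt. The sketch below only illustrates that idea in plain Python. It is not how the repo's Terraform modules work, and `WebhookUnavailable`, `retry_until_ready`, and `make_flaky_apply` are hypothetical names made up for this example. The simulated step fails a fixed number of times to stand in for the period before the operator pod is scheduled.

```python
import time


class WebhookUnavailable(Exception):
    """Hypothetical error standing in for 'no endpoints available for service'."""


def retry_until_ready(action, attempts=5, initial_delay=1.0, backoff=2.0,
                      sleep=time.sleep):
    """Call action() until it succeeds or the attempts run out.

    Returns (result, attempts_used). Re-raises the last error when exhausted.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return action(), attempt
        except WebhookUnavailable:
            if attempt == attempts:
                raise
            sleep(delay)       # wait before the next try
            delay *= backoff   # exponential backoff


def make_flaky_apply(failures_before_ready):
    """Simulate an apply step that fails until the cluster has scaled up."""
    state = {"calls": 0}

    def apply():
        state["calls"] += 1
        if state["calls"] <= failures_before_ready:
            raise WebhookUnavailable(
                'no endpoints available for service "gmp-operator"')
        return "applied"

    return apply


if __name__ == "__main__":
    waits = []  # record the delays instead of actually sleeping
    result, used = retry_until_ready(make_flaky_apply(2), sleep=waits.append)
    print(result, used, waits)
```

In a real pipeline the retried action would be whatever step creates the PodMonitoring resources, and the total wait would need to cover the time AP takes to provision the first node, which this issue does not specify.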