GoogleCloudPlatform / ai-on-gke

AI on GKE is a collection of examples, best-practices, and prebuilt solutions to help build, deploy, and scale AI Platforms on Google Kubernetes Engine
Apache License 2.0

TPU provisioner cloudbuild step #660

Closed danielvegamyhre closed 2 months ago

danielvegamyhre commented 2 months ago

Fixes #659

danielvegamyhre commented 2 months ago

/gcbrun

andrewsykim commented 2 months ago

/gcbrun

danielvegamyhre commented 2 months ago

/gcbrun

danielvegamyhre commented 2 months ago

@andrewsykim this is ready for another look

andrewsykim commented 2 months ago

Confirmed that the tests passed in Cloud Build:

mkdir -p /workspace/tpu-provisioner/bin
test -s /workspace/tpu-provisioner/bin/controller-gen && /workspace/tpu-provisioner/bin/controller-gen --version | grep -q v0.11.1 || \
GOBIN=/workspace/tpu-provisioner/bin go install sigs.k8s.io/controller-tools/cmd/controller-gen@v0.11.1
go: downloading sigs.k8s.io/controller-tools v0.11.1
go: downloading github.com/spf13/cobra v1.6.1
go: downloading golang.org/x/tools v0.4.0
go: downloading gopkg.in/yaml.v2 v2.4.0
go: downloading github.com/fatih/color v1.13.0
go: downloading k8s.io/api v0.26.0
go: downloading k8s.io/apimachinery v0.26.0
go: downloading gopkg.in/yaml.v3 v3.0.1
go: downloading k8s.io/apiextensions-apiserver v0.26.0
go: downloading sigs.k8s.io/yaml v1.3.0
go: downloading github.com/gobuffalo/flect v0.3.0
go: downloading github.com/mattn/go-colorable v0.1.9
go: downloading github.com/mattn/go-isatty v0.0.14
go: downloading github.com/spf13/pflag v1.0.5
go: downloading github.com/gogo/protobuf v1.3.2
go: downloading k8s.io/utils v0.0.0-20221107191617-1a15be271d1d
go: downloading github.com/google/gofuzz v1.1.0
go: downloading k8s.io/klog/v2 v2.80.1
go: downloading sigs.k8s.io/structured-merge-diff/v4 v4.2.3
go: downloading golang.org/x/sys v0.3.0
go: downloading sigs.k8s.io/json v0.0.0-20220713155537-f223a00ba0e2
go: downloading gopkg.in/inf.v0 v0.9.1
go: downloading github.com/go-logr/logr v1.2.3
go: downloading github.com/json-iterator/go v1.1.12
go: downloading golang.org/x/net v0.4.0
go: downloading golang.org/x/mod v0.7.0
go: downloading github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd
go: downloading github.com/modern-go/reflect2 v1.0.2
go: downloading golang.org/x/text v0.5.0
/workspace/tpu-provisioner/bin/controller-gen rbac:roleName=manager-role webhook paths="./..."
go fmt ./...
go vet ./...
go: downloading github.com/onsi/gomega v1.32.0
go: downloading github.com/onsi/ginkgo/v2 v2.17.1
test -s /workspace/tpu-provisioner/bin/setup-envtest || GOBIN=/workspace/tpu-provisioner/bin go install sigs.k8s.io/controller-runtime/tools/setup-envtest@latest
go: downloading sigs.k8s.io/controller-runtime/tools/setup-envtest v0.0.0-20240507051437-479b723944e3
go: downloading sigs.k8s.io/controller-runtime v0.18.2
go: downloading github.com/go-logr/logr v1.2.4
go: downloading github.com/spf13/afero v1.6.0
go: downloading github.com/go-logr/zapr v1.2.4
go: downloading go.uber.org/multierr v1.10.0
go: downloading golang.org/x/text v0.12.0
curl -L https://github.com/kubernetes-sigs/jobset/releases/download/"v0.5.0"/manifests.yaml > test/crds/jobset-"v0.5.0".yaml
KUBEBUILDER_ASSETS="/workspace/tpu-provisioner/bin/k8s/1.26.0-linux-amd64" go test ./... -v -coverprofile cover.out
    github.com/GoogleCloudPlatform/ai-on-gke/tpu-provisioner/cmd        coverage: 0.0% of statements
    github.com/GoogleCloudPlatform/ai-on-gke/tpu-provisioner/internal/controller        coverage: 0.0% of statements
=== RUN   TestHelperProcess
--- PASS: TestHelperProcess (0.00s)
=== RUN   Test_isCmdTokenSource
--- PASS: Test_isCmdTokenSource (0.00s)
=== RUN   Test_tokenSource_cmd
--- PASS: Test_tokenSource_cmd (0.00s)
=== RUN   Test_tokenSource_cmdCannotBeUsedWithScopes
--- PASS: Test_tokenSource_cmdCannotBeUsedWithScopes (0.00s)
=== RUN   Test_tokenSource_applicationDefaultCredentials_fails
--- PASS: Test_tokenSource_applicationDefaultCredentials_fails (0.00s)
=== RUN   Test_tokenSource_applicationDefaultCredentials
--- PASS: Test_tokenSource_applicationDefaultCredentials (0.00s)
=== RUN   Test_parseScopes
--- PASS: Test_parseScopes (0.00s)
=== RUN   TestCmdTokenSource
--- PASS: TestCmdTokenSource (0.03s)
=== RUN   TestCachedTokenSource
--- PASS: TestCachedTokenSource (0.00s)
=== RUN   Test_cmdTokenSource_roundTrip
--- PASS: Test_cmdTokenSource_roundTrip (0.00s)
PASS
coverage: 89.3% of statements
ok      github.com/GoogleCloudPlatform/ai-on-gke/tpu-provisioner/internal/auth/gcp  0.033s  coverage: 89.3% of statements
=== RUN   Test_tpuTopologyToNodeCount
=== RUN   Test_tpuTopologyToNodeCount/tpu-v4-podslice_2x2x1
=== RUN   Test_tpuTopologyToNodeCount/tpu-v4-podslice_2x2x2
=== RUN   Test_tpuTopologyToNodeCount/tpu-v5p-slice_2x2x2
=== RUN   Test_tpuTopologyToNodeCount/tpu-v4-podslice_2x2x4
=== RUN   Test_tpuTopologyToNodeCount/tpu-v5p-slice_2x2x4
=== RUN   Test_tpuTopologyToNodeCount/tpu-v4-podslice_2x4x4
=== RUN   Test_tpuTopologyToNodeCount/tpu-v5-lite-podslice_2x4
=== RUN   Test_tpuTopologyToNodeCount/not-an-accel_2x4
=== RUN   Test_tpuTopologyToNodeCount/tpu-v4-podslice_not-a-topo
--- PASS: Test_tpuTopologyToNodeCount (0.00s)
    --- PASS: Test_tpuTopologyToNodeCount/tpu-v4-podslice_2x2x1 (0.00s)
    --- PASS: Test_tpuTopologyToNodeCount/tpu-v4-podslice_2x2x2 (0.00s)
    --- PASS: Test_tpuTopologyToNodeCount/tpu-v5p-slice_2x2x2 (0.00s)
    --- PASS: Test_tpuTopologyToNodeCount/tpu-v4-podslice_2x2x4 (0.00s)
    --- PASS: Test_tpuTopologyToNodeCount/tpu-v5p-slice_2x2x4 (0.00s)
    --- PASS: Test_tpuTopologyToNodeCount/tpu-v4-podslice_2x4x4 (0.00s)
    --- PASS: Test_tpuTopologyToNodeCount/tpu-v5-lite-podslice_2x4 (0.00s)
    --- PASS: Test_tpuTopologyToNodeCount/not-an-accel_2x4 (0.00s)
    --- PASS: Test_tpuTopologyToNodeCount/tpu-v4-podslice_not-a-topo (0.00s)
=== RUN   Test_tpuMachineType
=== RUN   Test_tpuMachineType/tpu-v4-podslice_accel_4_tpus
=== RUN   Test_tpuMachineType/tpu-v5-lite-podslice_accel_1_tpus
=== RUN   Test_tpuMachineType/tpu-v5-lite-podslice_accel_4_tpus
=== RUN   Test_tpuMachineType/tpu-v5-lite-podslice_accel_8_tpus
=== RUN   Test_tpuMachineType/tpu-v5p-slice_accel_4_tpus
=== RUN   Test_tpuMachineType/not-an-accel_accel_4_tpus
=== RUN   Test_tpuMachineType/tpu-v5p-slice_accel_-1_tpus
--- PASS: Test_tpuMachineType (0.00s)
    --- PASS: Test_tpuMachineType/tpu-v4-podslice_accel_4_tpus (0.00s)
    --- PASS: Test_tpuMachineType/tpu-v5-lite-podslice_accel_1_tpus (0.00s)
    --- PASS: Test_tpuMachineType/tpu-v5-lite-podslice_accel_4_tpus (0.00s)
    --- PASS: Test_tpuMachineType/tpu-v5-lite-podslice_accel_8_tpus (0.00s)
    --- PASS: Test_tpuMachineType/tpu-v5p-slice_accel_4_tpus (0.00s)
    --- PASS: Test_tpuMachineType/not-an-accel_accel_4_tpus (0.00s)
    --- PASS: Test_tpuMachineType/tpu-v5p-slice_accel_-1_tpus (0.00s)
=== RUN   TestPodToNodePoolName
=== RUN   TestPodToNodePoolName/Missing_JobSetName_label
=== RUN   TestPodToNodePoolName/Missing_JobKey_label
=== RUN   TestPodToNodePoolName/jobset_name_less_than_34_chars
=== RUN   TestPodToNodePoolName/jobset_name_more_than_34_chars
--- PASS: TestPodToNodePoolName (0.00s)
    --- PASS: TestPodToNodePoolName/Missing_JobSetName_label (0.00s)
    --- PASS: TestPodToNodePoolName/Missing_JobKey_label (0.00s)
    --- PASS: TestPodToNodePoolName/jobset_name_less_than_34_chars (0.00s)
    --- PASS: TestPodToNodePoolName/jobset_name_more_than_34_chars (0.00s)
=== RUN   TestNodePoolForPod
=== RUN   TestNodePoolForPod/simple_case
=== RUN   TestNodePoolForPod/pod_with_reservation_selector
=== RUN   TestNodePoolForPod/pod_with_disabling_ICI_resiliency_selector
=== RUN   TestNodePoolForPod/pod_with_secondary_boot_disk
--- PASS: TestNodePoolForPod (0.00s)
    --- PASS: TestNodePoolForPod/simple_case (0.00s)
    --- PASS: TestNodePoolForPod/pod_with_reservation_selector (0.00s)
    --- PASS: TestNodePoolForPod/pod_with_disabling_ICI_resiliency_selector (0.00s)
    --- PASS: TestNodePoolForPod/pod_with_secondary_boot_disk (0.00s)
PASS
coverage: 42.4% of statements
ok      github.com/GoogleCloudPlatform/ai-on-gke/tpu-provisioner/internal/cloud 0.019s  coverage: 42.4% of statements
=== RUN   TestAPIs
Running Suite: Controller Suite - /workspace/tpu-provisioner/test/integration/controller
========================================================================================
Random Seed: 1715098042

Will run 9 of 9 specs
•••••••••
W0507 16:08:22.233667   18551 reflector.go:470] pkg/mod/k8s.io/client-go@v0.30.0/tools/cache/reflector.go:232: watch of *v1alpha2.JobSet ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding

Ran 9 of 9 Specs in 61.254 seconds
SUCCESS! -- 9 Passed | 0 Failed | 0 Pending | 0 Skipped
--- PASS: TestAPIs (61.25s)
PASS
coverage: 55.1% of statements
ok      github.com/GoogleCloudPlatform/ai-on-gke/tpu-provisioner/test/integration/controller    61.282s coverage: 55.1% of statements
PUSH
DONE

andrewsykim commented 2 months ago

Not sure if you want to look into this warning:

W0507 16:08:22.233667   18551 reflector.go:470] pkg/mod/k8s.io/client-go@v0.30.0/tools/cache/reflector.go:232: watch of *v1alpha2.JobSet ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding