AI-Hypercomputer / xpk

xpk (Accelerated Processing Kit, pronounced x-p-k) is a software tool that helps Cloud developers orchestrate training jobs on accelerators such as TPUs and GPUs on GKE.
Apache License 2.0

Check cluster arguments and update nodepools in existing cluster when requesting different device_type #120

Closed SurbhiJainUSC closed 6 months ago

SurbhiJainUSC commented 7 months ago

Fixes / Features

Testing / Documentation

Scenario 1: Cluster already has 4 nodepools of v4-8 in us-central2-b and now we request 2 nodepools of v4-8 in us-central2-b. The end state of the cluster will be 2 nodepools of v4-8 in us-central2-b.

Scenario 2: Cluster already has 2 nodepools of v4-8 in us-central2-b and now we request 4 nodepools of v4-8 in us-central2-b. The end state of the cluster will be 4 nodepools of v4-8 in us-central2-b.

Scenario 3: Cluster already has 2 nodepools of v4-8 in us-central2-b and now we request 2 nodepools of v4-16 in us-central2-b. The end state of the cluster will be 2 nodepools of v4-16 in us-central2-b.

Scenario 4: Cluster already has 2 nodepools of v4-8 in us-central2-b and now we request 3 nodepools of v4-16 in us-central2-b. The end state of the cluster will be 3 nodepools of v4-16 in us-central2-b.

Scenario 5: Cluster already has 2 nodepools of v4-8 in us-central2-b and now we request 2 nodepools of v4-8 in us-central2-a. XPK will fail early and will not update the cluster. The end state of the cluster will be 2 nodepools of v4-8 in us-central2-b.
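
The five scenarios above can be summarized as a small planning function. This is only an illustrative sketch, not the code in this PR: `NodePoolSpec`, `plan_update`, and the returned dictionary keys are hypothetical names. It also assumes that a cluster holds a single device type at a time, and that a change of device type is handled by deleting all existing nodepools and creating the requested ones. The scenarios only state the end state, so the delete and create counts are an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodePoolSpec:
    """Hypothetical description of a cluster's nodepools (not an xpk type)."""
    device_type: str  # e.g. "v4-8"
    zone: str         # e.g. "us-central2-b"
    count: int        # number of nodepools


def plan_update(existing: NodePoolSpec, requested: NodePoolSpec) -> dict:
    """Return how many nodepools to delete and create to reach the request.

    Raises ValueError before planning any change if the zone differs,
    mirroring the "fail early" behavior of Scenario 5.
    """
    if requested.zone != existing.zone:
        raise ValueError(
            f"Cluster is in {existing.zone}; cannot update it to {requested.zone}."
        )

    if requested.device_type != existing.device_type:
        # Assumption: a different device type replaces every existing nodepool.
        return {
            "delete": existing.count,
            "create": requested.count,
            "end_state": requested,
        }

    # Same device type and zone: only the nodepool count changes.
    delta = requested.count - existing.count
    return {
        "delete": max(-delta, 0),
        "create": max(delta, 0),
        "end_state": requested,
    }


if __name__ == "__main__":
    current = NodePoolSpec("v4-8", "us-central2-b", 2)
    # Scenario 4: request 3 nodepools of v4-16 in the same zone.
    print(plan_update(current, NodePoolSpec("v4-16", "us-central2-b", 3)))
    # Scenario 5: a different zone is rejected and the cluster is left as-is.
    try:
        plan_update(current, NodePoolSpec("v4-8", "us-central2-a", 2))
    except ValueError as err:
        print(f"Failed early: {err}")
```

Validating the zone before computing any delete or create counts keeps the sketch consistent with Scenario 5, where the cluster is left untouched.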