This tool provides a quick and simple way to provision Google Cloud Platform (GCP) compute clusters of accelerator-optimized machines.
| Feature \ Machine | A2 | A3 |
|---|---|---|
| Nvidia GPU Type | A100 -- 40GB and 80GB | H100 80GB |
| VM Shapes | Several | 8 GPUs |
| GPUDirect-TCPX | Unsupported | Supported |
| Multi-NIC | Unsupported | 5 vNICs -- 1 for CPU and 4 for GPUs (one per pair of GPUs) |
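To check which A2 or A3 shapes are actually available to you in a given zone, the standard `gcloud` machine-type listing can help (the zone below is only an example):

```bash
# list accelerator-optimized A2/A3 machine types in one zone
gcloud compute machine-types list \
    --zones us-central1-a \
    --filter="name ~ ^a2- OR name ~ ^a3-"
```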
This repository contains:

- an entrypoint script that reads a `terraform.tfvars` file and uploads all logs to the GCS backend bucket.
- a docker image -- `us-docker.pkg.dev/gce-ai-infra/cluster-provision-dev/cluster-provision-image` -- that has all necessary tools installed, which calls the entrypoint script and creates a cluster for you.

In order to provision a cluster, the following are required:

- `roles/editor` permissions on the target GCP project.
- `gcloud` authorization (explained below).

The command to authorize tools to create resources on your behalf is:
```bash
gcloud auth application-default login
```
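If you want to confirm that the login actually produced application-default credentials before running the tool, you can check for the credentials file gcloud writes (this path is the gcloud default on Linux/macOS) and the currently active project:

```bash
# ADC file written by `gcloud auth application-default login`
ls "${HOME}/.config/gcloud/application_default_credentials.json"

# show which project subsequent commands will act on
gcloud config get-value project
```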
The above command is needed so that the docker container can act on your behalf, with the credentials exposed to it through the `-v "${HOME}/.config/gcloud:/root/.config/gcloud"` flag (explained below). Without this, the tool will prompt you on every invocation to authorize itself to create GCP resources for you.

After running through the prerequisites above, there are a few ways to provision a cluster:

- docker image: the container runs the whole provisioning flow for you.
- terraform module: `terraform apply` will create this cluster along with all your other infrastructure.
- HPC Toolkit Blueprint: `ghpc deploy` will create this cluster along with all your other infrastructure.

For the docker method, all you need (in addition to the above requirements) is a `terraform.tfvars` file (user generated or copied from an example -- a3-mega) in your current directory and the ability to run `docker`. In a terminal, run:
```bash
# create/update the cluster
docker run \
  --rm \
  -v "${HOME}/.config/gcloud:/root/.config/gcloud" \
  -v "${PWD}:/root/aiinfra/input" \
  us-docker.pkg.dev/gce-ai-infra/cluster-provision-dev/cluster-provision-image:latest \
  create a3-mega mig-cos

# destroy the cluster
docker run \
  --rm \
  -v "${HOME}/.config/gcloud:/root/.config/gcloud" \
  -v "${PWD}:/root/aiinfra/input" \
  us-docker.pkg.dev/gce-ai-infra/cluster-provision-dev/cluster-provision-image:latest \
  destroy a3-mega mig-cos
```
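For reference, a `terraform.tfvars` file is plain HCL variable assignments. The fragment below is only a hypothetical sketch of the file's shape -- the variable names shown are assumptions, so copy the real a3-mega example rather than this:

```hcl
# hypothetical variable names -- the a3-mega example defines the real ones
project_id      = "my-gcp-project"
resource_prefix = "my-a3-mega-cluster"
region          = "us-central1"
target_size     = 2
```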
Quick explanation of the `docker run` flags (in the same order as above):

- `-v "${HOME}/.config/gcloud:/root/.config/gcloud"` exposes gcloud credentials to the container so that it can access your GCP project.
- `-v "${PWD}:/root/aiinfra/input"` exposes the current working directory to the container so the tool can read the `terraform.tfvars` file.
- `create`/`destroy` tells the tool whether it should create or destroy the whole cluster.
- `a3-mega` specifies which type of cluster to provision -- this mainly influences machine type, networking, and startup scripts.
- `mig-cos` tells the tool to create a Managed Instance Group and start a container at boot.

For the terraform method, you need to install terraform.
Examples of usage as a terraform module can be found in the `main.tf` files in any of the examples -- a3-mega. Cluster provisioning then happens the same as with any other terraform configuration:
```bash
# assuming the directory containing main.tf is the current working directory

# create/update the cluster
terraform init && terraform validate && terraform apply -var-file="terraform.tfvars"

# destroy the cluster
terraform init && terraform validate && terraform apply -destroy
```
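As a sketch of what consuming this repository as a module looks like, a `main.tf` might be shaped roughly as below; the module source path and input names here are assumptions for illustration, and the example `main.tf` files are the authoritative reference:

```hcl
# hypothetical module block -- see an example main.tf for the real
# source path and input variables
module "a3_mega_cluster" {
  source = "<path-or-url-to-the-mig-cos-module-in-this-repository>"

  project_id      = var.project_id
  resource_prefix = var.resource_prefix
  region          = var.region
}
```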
For the HPC Toolkit method, you need to build ghpc. Examples of usage as an HPC Toolkit Blueprint can be found in the `blueprint.yaml` files in any of the examples -- a3-mega. Cluster provisioning then happens the same as with any other blueprint:
```bash
# assuming the ghpc binary and blueprint.yaml are both in
# the current working directory

# create/update the cluster
./ghpc create -w ./blueprint.yaml && ./ghpc deploy a3-mega-mig-cos

# destroy the cluster
./ghpc create -w ./blueprint.yaml && ./ghpc destroy a3-mega-mig-cos
```