GoogleCloudPlatform / cluster-toolkit

Cluster Toolkit is open-source software from Google Cloud that makes it easy for customers to deploy AI/ML and HPC environments on Google Cloud.
Apache License 2.0

llama2-finetuning-slurm YAML blueprint: schedmd-slurm-gcp-v7-partition not found #3149

Open xibinliu opened 3 days ago

xibinliu commented 3 days ago

Describe the bug

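gcluster create hpc-slurm-llama2.yaml fails because the blueprint references the module community/modules/compute/schedmd-slurm-gcp-v7-partition, which gcluster cannot find.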

Steps to reproduce

Steps to reproduce the behavior:

  1. Install the latest cluster-toolkit

  2. Clone the scientific-computing-examples repository, then create the deployment for llama2-finetuning-slurm:

> cd scientific-computing-examples/llama2-finetuning-slurm
> gcluster create hpc-slurm-llama2.yaml --vars project_id=$(gcloud config get-value project) -w --vars bucket_model=llama2

Expected behavior

The command should complete successfully.

Actual behavior

Your active configuration is: [cloudshell-13783]
Error: failed to get info using tfconfig for terraform module at community/modules/compute/schedmd-slurm-gcp-v7-partition: source is not a terraform or packer module: community/modules/compute/schedmd-slurm-gcp-v7-partition
375:     source: community/modules/compute/schedmd-slurm-gcp-v7-partition
         ^
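
The error above indicates that the `source:` path on line 375 of the blueprint does not resolve to a module known to the installed toolkit. As a rough pre-flight check, the sketch below scans a blueprint for local module sources and reports those that have no matching directory. It assumes a local checkout of cluster-toolkit whose layout matches the installed gcluster version. The function names and the `toolkit_root` parameter are hypothetical and not part of gcluster.

```python
import re
from pathlib import Path

# Matches lines such as "    source: community/modules/compute/foo"
SOURCE_RE = re.compile(r"^\s*(?:-\s*)?source:\s*(\S+)")


def local_module_sources(blueprint_text):
    """Yield (line_number, source) for sources that look like toolkit-relative paths."""
    for number, line in enumerate(blueprint_text.splitlines(), start=1):
        match = SOURCE_RE.match(line)
        if not match:
            continue
        source = match.group(1).strip("'\"")
        # Assumption: only these two prefixes denote modules shipped with the toolkit.
        if source.startswith(("modules/", "community/modules/")):
            yield number, source


def missing_sources(blueprint_text, toolkit_root):
    """Return the (line_number, source) pairs with no directory under toolkit_root."""
    root = Path(toolkit_root)
    return [
        (number, source)
        for number, source in local_module_sources(blueprint_text)
        if not (root / source).is_dir()
    ]
```

Calling `missing_sources(Path("hpc-slurm-llama2.yaml").read_text(), "path/to/cluster-toolkit")` would list each unresolved source together with its blueprint line number, in the same spirit as the `375:` marker in the error output.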

Version (gcluster --version)

> gcluster --version
gcluster version v1.40.1
Built from 'main' branch.
Commit info: v1.40.1-0-geb002543
Terraform version: 1.5.7

Blueprint

If applicable, attach or paste the blueprint YAML used to produce the bug.

scientific-computing-examples/llama2-finetuning-slurm/hpc-slurm-llama2.yaml

Execution environment

Aryido commented 17 hours ago

https://github.com/GoogleCloudPlatform/scientific-computing-examples/issues/99

same issue

harshthakkar01 commented 2 hours ago

There was a typo in https://github.com/GoogleCloudPlatform/scientific-computing-examples/commit/edfaf529d3b130cae5e80891ecc43f151154ef9c

Cluster Toolkit doesn't have v7 modules as of now. I discussed this with the author of that commit, and it should be fixed today.
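
Since the root cause was a one-character typo in a module name, a fuzzy match against the module names that do exist can surface the likely intended one. The sketch below uses Python's standard `difflib` for this. The `available` list is illustrative only and would normally be read from the toolkit's module directories. `suggest_module` is a hypothetical helper, not a gcluster feature.

```python
import difflib


def suggest_module(missing_name, available_names, cutoff=0.8):
    """Return up to three existing module names that closely resemble missing_name."""
    return difflib.get_close_matches(missing_name, available_names, n=3, cutoff=cutoff)


# Illustrative list; in practice, collect directory names from the toolkit checkout.
available = [
    "schedmd-slurm-gcp-v6-partition",
    "schedmd-slurm-gcp-v6-nodeset",
    "htcondor-execute-point",
]

suggestions = suggest_module("schedmd-slurm-gcp-v7-partition", available)
print(suggestions[0])  # closest match: schedmd-slurm-gcp-v6-partition
```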