Unable to reuse previously provisioned cluster

Scenario: on the first run, the training cluster was provisioned and DeepSpeed executed, but training failed. The cluster remained.

When trying to reuse the cluster for a rerun, I encountered the following problems:

pipeline.py doesn't seem to be able to find the existing cluster.
The run_batch.sh script fails on "gcloud compute instances list" with: Required 'compute.instances.list' permission for 'projects/e64a316e7aac5aacfp-tp'

I was able to get the script to recognize my previous cluster after my modifications (see Investigations below), but now the ssh step fails, so the docker commands that restart the training cannot be issued. It seems to have failed at this line:

https://github.com/GoogleCloudPlatform/llm-pipeline-examples/blob/9603e557ed069f9bd338ffca77ccfe35d0f651f1/scripts/train/run_batch.sh#L160

What am I missing? Any advice is appreciated!

Cluster found! Exporting machine list...
Copying file://machines.txt [Content-Type=text/plain]...
/ [0 files][    0.0 B/   27.0 B]
/ [1 files][   27.0 B/   27.0 B]
Operation completed over 1 objects/27.0 B.                                       
Restarting training on VMs...
CommandException: No URLs matched: gs://MY_BUCKET/pipeline_runs/PROJECT_ID/llm-pipeline-20230924223600/train_-5175810662584025088/model/progress.txt
Copying gs://MY_BUCKET/pipeline_runs/PROJECT_ID/llm-pipeline-20230924223600/train_-5175810662584025088/model/machines.txt...
/ [0 files][    0.0 B/   27.0 B]
/ [1 files][   27.0 B/   27.0 B]
Operation completed over 1 objects/27.0 B.                                       
WARNING: The private SSH key file for gcloud does not exist.
WARNING: The public SSH key file for gcloud does not exist.
WARNING: You do not have an SSH key for gcloud.
WARNING: SSH keygen will be executed to generate a key.
Generating public/private rsa key pair.
Your identification has been saved in /root/.ssh/google_compute_engine
Your public key has been saved in /root/.ssh/google_compute_engine.pub
The key fingerprint is:
SHA256:xxxxxxxxxxx
The key's randomart image is:
+---[RSA 3072]----+
...
+----[SHA256]-----+
ERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255].

Investigations

1) Cluster recognition:

https://github.com/GoogleCloudPlatform/llm-pipeline-examples/blob/9603e557ed069f9bd338ffca77ccfe35d0f651f1/pipeline.py#L269

The time suffix is what causes the mismatch when the code looks for the job. Setting id="" allows matching. If there is more than one cluster and only one is intended to be found, the "name_prefix" field can be used to provide a prefix that matches just that cluster. Perhaps this should be added to the documentation, or a reuse flag should be added to pipeline.py.
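To illustrate why the time suffix breaks the lookup, here is a minimal sketch. It assumes the job name is built by concatenating name_prefix and id, and that a cluster is found when an instance name starts with that combined string. The function name, the sample names, and the matching rule itself are hypothetical and are not taken from pipeline.py.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: the matching rule below is an assumption for illustration,
# not the logic in pipeline.py.

# Succeeds when instance name $3 starts with name_prefix $1 followed by id $2.
matches_cluster() {
  local name_prefix="$1" id="$2" instance_name="$3"
  case "$instance_name" in
    "${name_prefix}${id}"*) return 0 ;;
    *) return 1 ;;
  esac
}

existing="llm-train-20230924223600-0"   # made-up instance left over from the first run

# A rerun generates a fresh time suffix, so the old cluster is not found.
matches_cluster "llm-train-" "20230925101500" "$existing" || echo "no match with new id"

# With id="" only the prefix has to match, so the old cluster is found.
matches_cluster "llm-train-" "" "$existing" && echo "match with empty id"
```

Under these assumptions, an empty id widens the match to every cluster sharing the prefix, which is why a more specific name_prefix matters when several clusters exist.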

https://github.com/GoogleCloudPlatform/llm-pipeline-examples/blob/9603e557ed069f9bd338ffca77ccfe35d0f651f1/configs/small1vm1gpu.json#L7

2) run_batch.sh: I found multiple issues.

Adding --project=${PROJECT} to the following lines solved the unknown project issue:

In fact, I use the following command, with an explicit filter and explicit column headers, to make sure the regexp will work:

gcloud compute instances list --project=${PROJECT} --filter='STATUS=RUNNING' --format='csv[no-heading,separator=" "](NAME,ZONE,MACHINE_TYPE,PREEMPTIBLE,INTERNAL_IP,EXTERNAL_IP,STATUS)' | grep ${JOB_ID} | sed 's/\(\S\+\) .* \([0-9\.]\+\)[0-9\.,]* \([0-9\.]\+\)\? RUNNING/\1 \2/' | sort | head -n ${NODE_COUNT} > machines.txt
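Since the sed expression is dense, the sketch below runs the same expression on two made-up rows to show what ends up in machines.txt. It assumes GNU sed and assumes that an empty PREEMPTIBLE or EXTERNAL_IP column shows up as two consecutive spaces; the instance names, machine type, and IP addresses are invented, and extract_name_and_ip is a hypothetical wrapper name.

```shell
#!/usr/bin/env bash
# Sketch: apply the sed expression from the command above to sample input.
# Requires GNU sed (\S, \+ and \? are GNU extensions).

# Reads "NAME ZONE MACHINE_TYPE PREEMPTIBLE INTERNAL_IP EXTERNAL_IP STATUS"
# lines on stdin and rewrites each RUNNING line to "NAME INTERNAL_IP".
extract_name_and_ip() {
  sed 's/\(\S\+\) .* \([0-9\.]\+\)[0-9\.,]* \([0-9\.]\+\)\? RUNNING/\1 \2/'
}

# Made-up rows; empty fields are assumed to leave two consecutive spaces.
printf '%s\n' \
  'job-0 us-central1-a a2-highgpu-1g  10.128.0.5 34.1.2.3 RUNNING' \
  'job-1 us-central1-a a2-highgpu-1g  10.128.0.6  RUNNING' \
  | extract_name_and_ip
# prints:
# job-0 10.128.0.5
# job-1 10.128.0.6
```

The second row has no external IP and still reduces to the name and internal IP, because the third capture group is optional.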

I also switched away from nvidia-persistenced because it failed on my VM:

export PRE_DOCKER_RUN="nvidia-smi -pm 1;"

NOTE for devs reading this far: run_batch.sh lives inside the batch container, so changing it means you need to rebuild the image and use your own rather than the supplied gcr.io one:

docker build . -t ${YOUR_IMAGE_TAG} -f docker/batch.Dockerfile
docker push ${YOUR_IMAGE_TAG}

# components/trainer.yaml - make sure it references your image before running pipeline.py

implementation:
  container:
    image: {YOUR_IMAGE_TAG}
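If you rebuild often, the image reference can be patched with a small script instead of by hand. This is only a sketch: set_trainer_image is a hypothetical helper that is not part of the repository, it assumes GNU sed, and it rewrites every line starting with "image:" in the given file. The demo runs on a temporary file with made-up image names so nothing in the checkout is modified.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the repository: rewrites every "image:" line
# in a component YAML file to point at a custom image. Assumes GNU sed (-i, -E).
set_trainer_image() {
  local yaml_file="$1" image="$2"
  sed -i -E "s|^([[:space:]]*image:).*|\1 ${image}|" "$yaml_file"
}

# Demo on a temporary file; both image names below are made up.
demo_yaml="$(mktemp)"
printf 'implementation:\n  container:\n    image: gcr.io/placeholder/batch:latest\n' > "$demo_yaml"
set_trainer_image "$demo_yaml" "gcr.io/my-project/batch:dev"
cat "$demo_yaml"
```

Against the real checkout, the call would presumably be set_trainer_image components/trainer.yaml with your own tag, run after docker push and before pipeline.py.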


Unable to reuse previously provisioned cluster #69
