GoogleCloudPlatform / llm-pipeline-examples

Apache License 2.0

Stuck at creating H100 instances #87

Open jiangwei221 opened 1 month ago

jiangwei221 commented 1 month ago

I'm trying to run the example with 2 H100x8 nodes to test the DirectGPU-TCPX speed. There are some credential issues when calling "gsutil cp", so I've created a local Docker image (h100launcher in the command below) that includes the credential JSON. This is my command to start the provisioning process:

sudo docker run -it docker.io/library/h100launcher:latest $PROJECT_ID gcr.io/llm-containers/gpt_train:release gs://$BUCKET_NAME 0 0 0 ' {"data_file_name":"wiki_data_text_document", "tensor_parallel":4, "pipeline_parallel":12, "nlayers":70, "hidden":14336, "heads":112, "seq_len":2048, "train_steps":100, "eval_steps":10, "micro_batch":1, "gradient_acc_steps":128 }' '{ "name_prefix": "megatron-gpt", "zone": "us-central1-a", "node_count": 2, "machine_type": "a3-highgpu-8g", "gpu_type": "nvidia-h100-80gb", "gpu_count": 8 }'

I can see on the firewall page that the network interfaces are created successfully, but the run gets stuck creating the instances:

...
module.compute_instance_group_manager[0].google_compute_instance_group_manager.mig: Still creating... [26m0s elapsed]
module.compute_instance_group_manager[0].google_compute_instance_group_manager.mig: Still creating... [26m10s elapsed]
module.compute_instance_group_manager[0].google_compute_instance_group_manager.mig: Still creating... [26m20s elapsed]
module.compute_instance_group_manager[0].google_compute_instance_group_manager.mig: Still creating... [26m30s elapsed]
module.compute_instance_group_manager[0].google_compute_instance_group_manager.mig: Still creating... [26m40s elapsed]
module.compute_instance_group_manager[0].google_compute_instance_group_manager.mig: Still creating... [26m50s elapsed]
module.compute_instance_group_manager[0].google_compute_instance_group_manager.mig: Still creating... [27m0s elapsed]
module.compute_instance_group_manager[0].google_compute_instance_group_manager.mig: Still creating... [27m10s elapsed]
module.compute_instance_group_manager[0].google_compute_instance_group_manager.mig: Still creating... [27m20s elapsed]
module.compute_instance_group_manager[0].google_compute_instance_group_manager.mig: Still creating... [27m30s elapsed]
module.compute_instance_group_manager[0].google_compute_instance_group_manager.mig: Still creating... [27m40s elapsed]
module.compute_instance_group_manager[0].google_compute_instance_group_manager.mig: Still creating... [27m50s elapsed]

I should have enough H100 quota. Do you have any suggestions on how to debug this? Thanks!

jiangwei221 commented 1 month ago

Error

The exact error shown on the GCP web page is the following:

Instance 'tputryout-0-8js6' creation failed: The zone 'projects/tputryout/zones/us-central1-a' does not have enough resources available to fulfill the request. '(resource type:compute)'.  
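
For anyone debugging the same hang: the per-instance creation errors of a managed instance group can usually be read from the command line as well as from the web page. The sketch below only prints the gcloud command so it can be reviewed before running it. It assumes the gcloud CLI is installed and authenticated, and `my-mig` is a hypothetical placeholder, since the actual MIG name is not shown in this thread.

```shell
# Hypothetical placeholder: the real MIG name is not shown in this thread.
MIG_NAME="${MIG_NAME:-my-mig}"
# Zone taken from the provisioning command in the first comment.
ZONE="${ZONE:-us-central1-a}"

# Print the gcloud command that lists recent instance-creation errors
# for a zonal managed instance group (arguments: MIG name, zone).
mig_errors_cmd() {
  printf 'gcloud compute instance-groups managed list-errors %s --zone=%s\n' "$1" "$2"
}

mig_errors_cmd "$MIG_NAME" "$ZONE"
```

Copy the printed command into a shell to run it. While Terraform keeps reporting "Still creating...", this should surface messages like the one above.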

Reason

After digging around, I found that the region of the instance template is the culprit. The created instance template is actually in the "global" region, and I can't create a MIG from it. But once I manually move the template from "global" to "us-central1", I can create the MIG. Do you have any thoughts on why the region of the instance template would affect the compute limits? Btw, even when I set the region to "us-central1" in Terraform, the created template is still in the "global" region.

How to fix the error

Manually move the template from "global" to a local region such as "us-central1".
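
A rough sketch of what creating the template in a region (rather than globally) might look like with gcloud. It assumes a gcloud version that supports regional instance templates through the `--instance-template-region` flag, so check `gcloud compute instance-templates create --help` first. The template name is a hypothetical placeholder, and the network, image, and GPU-related flags that an a3-highgpu-8g template needs are deliberately left out. The script only prints the command.

```shell
# Region used in the workaround above.
REGION="${REGION:-us-central1}"
# Hypothetical placeholder name for the new regional template.
TEMPLATE_NAME="${TEMPLATE_NAME:-my-regional-template}"

# Print a gcloud command that creates an instance template in a region
# (arguments: template name, region, machine type).
# Assumption: the installed gcloud supports --instance-template-region.
regional_template_cmd() {
  printf 'gcloud compute instance-templates create %s --instance-template-region=%s --machine-type=%s\n' "$1" "$2" "$3"
}

regional_template_cmd "$TEMPLATE_NAME" "$REGION" a3-highgpu-8g
```

The printed command is a starting point only. The remaining settings would need to be copied from the global template that Terraform generated.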