Open jiangwei221 opened 1 month ago
The exact error shown in the GCP console is the following:
Instance 'tputryout-0-8js6' creation failed: The zone 'projects/tputryout/zones/us-central1-a' does not have enough resources available to fulfill the request. '(resource type:compute)'.
After digging around, I found that the region of the instance template is the culprit. The created instance template is actually in the "global" region, and I can't create a MIG from it. But once I manually move the template from "global" to "us-central1", I can create the MIG. Do you have any idea why the region of the instance template would affect the compute limits? Btw, even when I set the region to "us-central1" in Terraform, the created template is still in the "global" region.
Manually move the template from "global" to a local region such as "us-central1".
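To confirm which scope a template actually landed in before and after the move, its selfLink can be inspected. The snippet below is a minimal sketch, assuming the selfLink follows the usual Compute Engine URL layout (".../projects/PROJECT/global/instanceTemplates/NAME" for global templates and ".../projects/PROJECT/regions/REGION/instanceTemplates/NAME" for regional ones). The helper name template_scope and the template name "example-template" are hypothetical and only used for illustration.

```python
import re

# Assumed selfLink layout (not taken from the original post):
#   .../projects/<project>/global/instanceTemplates/<name>
#   .../projects/<project>/regions/<region>/instanceTemplates/<name>
_TEMPLATE_RE = re.compile(
    r"projects/(?P<project>[^/]+)/"
    r"(?:global|regions/(?P<region>[^/]+))/"
    r"instanceTemplates/(?P<name>[^/]+)$"
)


def template_scope(self_link: str) -> str:
    """Return the region name for a regional template, or "global"."""
    match = _TEMPLATE_RE.search(self_link)
    if match is None:
        raise ValueError(f"not an instance template link: {self_link!r}")
    # The "region" group is None when the link uses the global scope.
    return match.group("region") or "global"


if __name__ == "__main__":
    # "example-template" is a placeholder, not the real template name.
    base = "https://www.googleapis.com/compute/v1/projects/tputryout"
    print(template_scope(f"{base}/global/instanceTemplates/example-template"))
    print(template_scope(f"{base}/regions/us-central1/instanceTemplates/example-template"))
```

If the first form shows up for the template that Terraform created, that would match the behaviour described above, where the template ends up "global" even though a region was specified.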
I'm trying to run the example with 2 H100x8 nodes to test GPUDirect-TCPX speed. There are some credential issues when calling "gsutil cp", so I've created a local Docker image (h100launcher in the command below) that includes the credential JSON. This is my command to start the provisioning process:
I can see in the firewall page that the network interfaces are created successfully, but it gets stuck at creating the instances:
I should have enough H100 quota. Do you have any suggestions on how to debug this? Thanks!