GoogleCloudPlatform / batch-samples


Requesting a Python/CUDA example #18

Open adelevie opened 2 years ago

adelevie commented 2 years ago

Hi,

The existing examples are very good. But given that the GPU/AI/ML features were highlighted in the introductory blog post ("Use accelerator-optimized resources."), it would be nice to see a full example here.

If it helps, I've tried this on my own, but got some errors:

{
  "taskGroups": [
    {
      "taskSpec": {
        "computeResource": {
          "cpuMilli": "20000",
          "memoryMib": "15000"
        },
        "runnables": [
          {
            "container": {
              "imageUri": "pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime",
              "entrypoint": "/bin/sh",
              "commands": ["-c", "python -c \"import torch;print(torch.cuda.is_available())\""]
            }
          }
        ],
        "maxRetryCount": 2,
        "maxRunDuration": "3600s"
      },
      "taskCount": 1,
      "parallelism": 1
    }
  ],
  "allocationPolicy": {
    "instances": [
      {
        "instanceTemplate": "alan-test-instance-template-3"
      }
    ]
  },
  "logsPolicy": {
    "destination": "CLOUD_LOGGING"
  }
}

The log output is:

2022-08-18 09:31:08.760 EDT
Task action/STARTUP/0/0/group0/0, STDOUT: Reading package lists...
2022-08-18 09:31:08.772 EDT
Task action/STARTUP/0/0/group0/0, STDOUT:
2022-08-18 09:31:08.777 EDT
Task action/STARTUP/0/0/group0/0, STDOUT: Building dependency tree...
2022-08-18 09:31:08.904 EDT
Task action/STARTUP/0/0/group0/0, STDOUT: Reading state information...
2022-08-18 09:31:08.905 EDT
Task action/STARTUP/0/0/group0/0, STDOUT:
2022-08-18 09:31:08.954 EDT
Task action/STARTUP/0/0/group0/0, STDOUT: Some packages could not be installed. This may mean that you have requested an impossible situation or if you are using the unstable distribution that some required packages have not yet been created or been moved out of Incoming. The following information may help to resolve the situation: The following packages have unmet dependencies:
2022-08-18 09:31:09.008 EDT
Task action/STARTUP/0/0/group0/0, STDOUT: docker.io : Depends: runc (>= 1.0.0~rc6~)
2022-08-18 09:31:09.019 EDT
Task action/STARTUP/0/0/group0/0, STDERR: E: Unable to correct problems, you have held broken packages.

And for reference, here's the info for my instance template:

{
  "creationTimestamp": "2022-08-17T14:05:29.128-07:00",
  "description": "",
  "id": "[redacted]",
  "kind": "compute#instanceTemplate",
  "name": "alan-test-instance-template-3",
  "properties": {
    "confidentialInstanceConfig": {
      "enableConfidentialCompute": false
    },
    "description": "",
    "scheduling": {
      "onHostMaintenance": "TERMINATE",
      "provisioningModel": "STANDARD",
      "automaticRestart": true,
      "preemptible": false
    },
    "tags": {},
    "disks": [
      {
        "type": "PERSISTENT",
        "deviceName": "alan-test-instance-template-3",
        "autoDelete": true,
        "index": 0,
        "boot": true,
        "kind": "compute#attachedDisk",
        "mode": "READ_WRITE",
        "initializeParams": {
          "sourceImage": "projects/ml-images/global/images/c0-deeplearning-common-cu110-v20220806-debian-10",
          "diskType": "pd-balanced",
          "diskSizeGb": "100"
        }
      },
      {
        "type": "PERSISTENT",
        "deviceName": "persistent-disk-1",
        "autoDelete": false,
        "index": 1,
        "kind": "compute#attachedDisk",
        "mode": "READ_WRITE",
        "initializeParams": {
          "description": "",
          "diskType": "pd-balanced",
          "diskSizeGb": "100"
        }
      }
    ],
    "networkInterfaces": [
      {
        "name": "nic0",
        "network": "projects/[redacted]/global/networks/default",
        "accessConfigs": [
          {
            "name": "External NAT",
            "type": "ONE_TO_ONE_NAT",
            "kind": "compute#accessConfig",
            "networkTier": "PREMIUM"
          }
        ],
        "kind": "compute#networkInterface"
      }
    ],
    "reservationAffinity": {
      "consumeReservationType": "ANY_RESERVATION"
    },
    "canIpForward": false,
    "keyRevocationActionType": "NONE",
    "machineType": "n1-standard-4",
    "metadata": {
      "fingerprint": "[redacted]",
      "kind": "compute#metadata"
    },
    "shieldedVmConfig": {
      "enableSecureBoot": false,
      "enableVtpm": true,
      "enableIntegrityMonitoring": true
    },
    "shieldedInstanceConfig": {
      "enableSecureBoot": false,
      "enableVtpm": true,
      "enableIntegrityMonitoring": true
    },
    "serviceAccounts": [
      {
        "email": "[redacted]@developer.gserviceaccount.com",
        "scopes": [
          "https://www.googleapis.com/auth/devstorage.read_only",
          "https://www.googleapis.com/auth/logging.write",
          "https://www.googleapis.com/auth/monitoring.write",
          "https://www.googleapis.com/auth/servicecontrol",
          "https://www.googleapis.com/auth/service.management.readonly",
          "https://www.googleapis.com/auth/trace.append"
        ]
      }
    ],
    "guestAccelerators": [
      {
        "acceleratorCount": 1,
        "acceleratorType": "nvidia-tesla-t4"
      }
    ],
    "displayDevice": {
      "enableDisplay": false
    }
  },
  "selfLink": "projects/[redacted]/global/instanceTemplates/alan-test-instance-template-3"
}

EDIT: Digging through the Job spec to the ComputeResource spec, I see the following:


gpuCount    
string ([int64](https://developers.google.com/discovery/v1/type-format) format)

The GPU count.

Not yet implemented.

Does this imply GPU jobs are not yet supported?

lripoche commented 1 year ago

> Does this imply GPU jobs are not yet supported?

According to the doc, yes, GPU jobs are supported. Unfortunately I can't make the container job example work: the base container image is downloaded and the NVIDIA drivers are installed, but the command is never executed and the job exits with an error.

GPU count can be set with this syntax.

aaronegolden commented 2 months ago

There is now a dogs vs. cats CNN training example here, which uses PyTorch and acceleration via CUDA.

GPU jobs (containerized or not) are supported in general, and Batch will automatically install drivers (when the installGpuDrivers flag is set in the job spec) and will automatically set the necessary docker options to give containers access to the GPU(s) for container runnables.
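
Putting those pieces together, a minimal GPU container job might look like the sketch below. The `installGpuDrivers` and `accelerators` field names match the config posted later in this thread; the machine type, accelerator type, and container image are assumptions borrowed from the first post, not values from a verified run:

```json
{
  "taskGroups": [
    {
      "taskSpec": {
        "runnables": [
          {
            "container": {
              "imageUri": "pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime",
              "entrypoint": "/bin/sh",
              "commands": ["-c", "python -c \"import torch; print(torch.cuda.is_available())\""]
            }
          }
        ]
      },
      "taskCount": 1
    }
  ],
  "allocationPolicy": {
    "instances": [
      {
        "installGpuDrivers": true,
        "policy": {
          "machineType": "n1-standard-4",
          "accelerators": [
            { "type": "nvidia-tesla-t4", "count": 1 }
          ]
        }
      }
    ]
  },
  "logsPolicy": { "destination": "CLOUD_LOGGING" }
}
```

A spec like this can then be submitted with `gcloud batch jobs submit JOB_NAME --location=REGION --config=job.json`.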

I've tested the new sample (this one) recently so it should be a reliable template for other PyTorch/CUDA jobs. Please let me know if you run into any issues.

kesitrifork commented 2 weeks ago

It seems a bit under-documented. I can't get it to work without the config below; I have tried every other combination of these settings, and everything else fails:

{
  "taskGroups": [
    {
      "taskSpec": {
        "runnables": [
          {
            "script": {
              "text": "sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml && sudo nvidia-ctk runtime configure --runtime=docker --cdi.enabled && sudo systemctl restart docker"
            }
          },
          {
            "container": {
              "imageUri": "ultralytics/ultralytics:8.2.7",
              "commands": ["/var/lib/nvidia/bin/nvidia-smi"],
              "volumes": ["/var/lib/nvidia/bin:/var/lib/nvidia/bin:ro"],
              "options": "--runtime=nvidia --network=host"
            }
          }
        ],
        "computeResource": {
          "cpuMilli": 1000,
          "memoryMib": 1000
        },
        "maxRetryCount": 2,
        "maxRunDuration": "600s"
      },
      "taskCount": 1,
      "parallelism": 1
    }
  ],
  "allocationPolicy": {
    "instances": [
      {
        "installGpuDrivers": true,
        "policy": {
          "machineType": "n1-standard-2",
          "accelerators": [
            {
              "type": "nvidia-tesla-t4",
              "count": 1
            }
          ],
          "bootDisk": {
            "type": "pd-balanced",
            "sizeGb": "30",
            "image": "projects/batch-custom-image/global/images/family/batch-cos-stable-official"
          }
        }
      }
    ]
  },
  "labels": {
    "department": "creative",
    "environment": "dev"
  },
  "logsPolicy": {
    "destination": "CLOUD_LOGGING"
  }
}
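
For anyone scripting job submission rather than hand-editing JSON, the `allocationPolicy` block above can also be built programmatically. A minimal sketch (the helper name and default parameter values are mine, copied from the config above, not from any official client library):

```python
import json

def gpu_allocation_policy(machine_type="n1-standard-2",
                          accelerator_type="nvidia-tesla-t4",
                          accelerator_count=1):
    """Build the allocationPolicy fragment of a Batch job spec for GPU VMs."""
    return {
        "instances": [
            {
                # Ask Batch to install the NVIDIA drivers on the VM.
                "installGpuDrivers": True,
                "policy": {
                    "machineType": machine_type,
                    "accelerators": [
                        {"type": accelerator_type, "count": accelerator_count}
                    ],
                },
            }
        ]
    }

# Emit the fragment, ready to be merged into a full job spec.
print(json.dumps(gpu_allocation_policy(), indent=2))
```
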