ai-cfia / howard

The Howard project, named after "The Godfather of Clouds" Luke Howard, orchestrates the Kubernetes-based cloud infrastructure for the Canadian Food Inspection Agency's AI lab, managing applications like Nachet, Finesse, and Louis. It prioritizes robustness, security and efficiency
https://ai-cfia.github.io/howard/
MIT License
3 stars 0 forks source link

Test and validate cluster performance #208

Closed ThomasCardin closed 5 months ago

ThomasCardin commented 5 months ago

TODO:

ThomasCardin commented 5 months ago

Done using this example from https://learn.microsoft.com/en-us/azure/aks/gpu-cluster?tabs=add-ubuntu-gpu-node-pool#run-a-gpu-enabled-workload

apiVersion: batch/v1
kind: Job
metadata:
  labels:
    app: samples-tf-mnist-demo
  name: samples-tf-mnist-demo
spec:
  template:
    metadata:
      labels:
        app: samples-tf-mnist-demo
    spec:
      containers:
      - name: samples-tf-mnist-demo
        image: mcr.microsoft.com/azuredocs/samples-tf-mnist-demo:gpu
        args: ["--max_steps", "500"]
        imagePullPolicy: IfNotPresent
        resources:
          limits:
           nvidia.com/gpu: 1
      restartPolicy: OnFailure
      tolerations:
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"