davidADSP / Generative_Deep_Learning_2nd_Edition

The official code repository for the second edition of the O'Reilly book Generative Deep Learning: Teaching Machines to Paint, Write, Compose and Play.
https://www.oreilly.com/library/view/generative-deep-learning/9781098134174/
Apache License 2.0

Suggestion: use Kubernetes with GKE Autopilot instead of VMs to run book examples on a cloud GPU #17

Open AlexBulankou opened 1 year ago

AlexBulankou commented 1 year ago

This repo provides instructions for setting up a GCP VM instance with a GPU to run the examples. I would recommend taking this a step further and using GKE Autopilot for GPU workloads instead of VMs. Some benefits are:

This is how I deployed the examples to GKE Autopilot:

  1. Build and push the Docker image:
IMAGE=<your_image> # you can also skip this step and use bulankou/gdl2:20230715, which I built
docker build -f ./docker/Dockerfile.gpu -t $IMAGE .
docker push $IMAGE
  2. Create a GKE Autopilot cluster with all default settings.
  3. Apply the following K8s manifest (kubectl apply -f <yaml>). Make sure to replace <IMAGE> below with your image. Also note the cloud.google.com/gke-accelerator: "nvidia-tesla-t4" node selector and the autopilot.gke.io/host-port-assignment annotation: the former ensures that the pod lands on the right node type, and the latter enables host ports on Autopilot.
apiVersion: v1
kind: Pod
metadata:
  name: app
  annotations:
    autopilot.gke.io/host-port-assignment: '{"min":6006,"max":8888}'
  labels:
    service: app
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: "nvidia-tesla-t4"
  containers:
    - command: ["/bin/sh", "-c"]
      args: ["jupyter lab --ip 0.0.0.0 --port=8888 --no-browser --allow-root"]
      image: <IMAGE>
      name: app
      ports:
        - containerPort: 8888
          hostPort: 8888
        - containerPort: 6006
          hostPort: 6006
      resources:
        limits:
          nvidia.com/gpu: 1
        requests:
          cpu: "18"
          memory: "18Gi"
      tty: true
  restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  name: app
spec:
  type: LoadBalancer
  ports:
    - name: "8888"
      port: 8888
      targetPort: 8888
    - name: "6006"
      port: 6006
      targetPort: 6006
  selector:
    service: app
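
The cluster creation and manifest steps above, plus reaching the notebook afterwards, could be scripted roughly as follows. This is only a sketch under several assumptions: the cluster name `gdl2`, the region `us-central1`, and the manifest file name `gdl2.yaml` are all hypothetical placeholders, and `gcloud` and `kubectl` are assumed to be installed and authenticated against your project. To stay safe, the script only prints the commands unless you set `DRY_RUN=0`.

```shell
#!/bin/sh
# Sketch: create the Autopilot cluster, apply the manifest, and find the endpoint.
# By default this only prints each command; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"
CLUSTER="${CLUSTER:-gdl2}"         # hypothetical cluster name
REGION="${REGION:-us-central1}"    # assumed region; choose one that offers T4 GPUs
MANIFEST="${MANIFEST:-gdl2.yaml}"  # hypothetical file containing the manifest above

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

deploy() {
  # Create the cluster with default settings and fetch credentials for kubectl.
  run gcloud container clusters create-auto "$CLUSTER" --region "$REGION"
  run gcloud container clusters get-credentials "$CLUSTER" --region "$REGION"

  # Apply the Pod and Service definitions.
  run kubectl apply -f "$MANIFEST"

  # A GPU node may need to be provisioned first, so allow the pod
  # plenty of time to leave the Pending state.
  run kubectl wait --for=condition=Ready pod/app --timeout=15m

  # External IP of the LoadBalancer Service (may be empty until it is assigned).
  run kubectl get service app -o "jsonpath={.status.loadBalancer.ingress[0].ip}"

  # With Jupyter's default token authentication, the login URL and
  # token appear in the pod logs.
  run kubectl logs app
}

deploy
```

Once the Service has an external IP, JupyterLab should be reachable on port 8888 of that address, and port 6006 (presumably TensorBoard) alongside it. Note that a LoadBalancer Service exposes both ports publicly, so treat the Jupyter token as a secret.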