Notes for GPU setup on Kubernetes

eeholmes commented 6 months ago

https://learn.microsoft.com/en-us/azure/aks/gpu-cluster?tabs=add-ubuntu-gpu-node-pool

Options with 1 GPU and 16 GB RAM:

- AWS: g4dn.xlarge, $385/mo
- GCP: n1-standard-4 with an nvidia-tesla-t4 attached (T4s attach to the n1 family)
- Azure: Standard_NC4as_T4_v3, $383/mo

https://www.earthdata.nasa.gov/esds/competitive-programs/access/pangeo-ml

https://hub.docker.com/r/pangeo/ml-notebook/tags

Instructions: https://z2jh.jupyter.org/en/latest/jupyterhub/customizing/user-resources.html#set-user-gpu-guarantees-limits

eeholmes commented 4 months ago

https://discourse.jupyter.org/t/jupyterhub-dockerspawner-podman-and-gpu/26447

eeholmes commented 4 months ago

Got GPU credits on Azure for Standard_NC4as_T4_v3.
Added a Standard_NC4as_T4_v3 node pool (min 1, max 2 nodes).

Added this profile to the config file:

  - display_name: NVIDIA Tesla T4, 28 GB, 4 CPUs
    description: "Start a container on a dedicated node with a GPU"
    slug: "gpu"
    profile_options:
      image:
        display_name: Image
        choices:
          pytorch:
            display_name: Pangeo PyTorch ML Notebook
            default: true
            slug: "pytorch"
            kubespawner_override:
              image: "quay.io/pangeo/pytorch-notebook:2023.09.19"
    kubespawner_override:
      environment:
        NVIDIA_DRIVER_CAPABILITIES: compute,utility
      mem_limit: null
      mem_guarantee: 14G
      node_selector:
        node.kubernetes.io/instance-type: Standard_NC4as_T4_v3
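
The z2jh instructions linked earlier in this thread describe requesting a GPU through `extra_resource_limits`. The profile above only selects the GPU node and does not ask for the GPU resource itself, so a variant like the following may be worth trying. This is a sketch, not a tested config: it assumes the NVIDIA device plugin is running on the cluster so that nodes advertise `nvidia.com/gpu`.

```yaml
# Sketch only: the same "gpu" profile, trimmed, with an explicit GPU limit added.
- display_name: NVIDIA Tesla T4, 28 GB, 4 CPUs
  slug: "gpu"
  kubespawner_override:
    # Assumption: the NVIDIA device plugin is installed, so the
    # nvidia.com/gpu resource exists on the GPU nodes.
    extra_resource_limits:
      nvidia.com/gpu: "1"
    node_selector:
      node.kubernetes.io/instance-type: Standard_NC4as_T4_v3
```

With the limit set, Kubernetes should schedule at most one such user pod per T4, which matches the "dedicated node" description in the profile.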

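I think the Pangeo PyTorch image already includes the CUDA libraries it needs (the NVIDIA driver itself is installed on the node, not in the image).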
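
One way to check whether the image can actually see the GPU, independent of JupyterHub, would be a throwaway pod. A minimal sketch, assuming the NVIDIA device plugin is running on the cluster; the pod name `gpu-smoke-test` and container name are made up for illustration:

```yaml
# Sketch only: one-off pod that prints whether PyTorch can see a CUDA device.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test   # hypothetical name
spec:
  restartPolicy: Never
  nodeSelector:
    node.kubernetes.io/instance-type: Standard_NC4as_T4_v3
  containers:
    - name: torch-check   # hypothetical name
      image: quay.io/pangeo/pytorch-notebook:2023.09.19
      # Assumes `python` on the image's default PATH is the environment with torch.
      command: ["python", "-c", "import torch; print(torch.cuda.is_available())"]
      env:
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: compute,utility
      resources:
        limits:
          nvidia.com/gpu: 1   # assumes the device plugin advertises this resource
```

Apply it with `kubectl apply -f`, read the result with `kubectl logs gpu-smoke-test`, then delete the pod.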

eeholmes commented 4 months ago

Notes on Pangeo Deep-Learning

https://medium.com/pangeo/deep-learning-with-gpus-on-pangeo-9466e25bfd74

Scott et al. debugging the setup on AWS: https://github.com/pangeo-data/pangeo-cloud-federation/issues/490

eeholmes commented 4 months ago

https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/994
