canonical / data-science-stack

Stack with machine learning tools needed for local development.
Apache License 2.0

feat: update create command to support Intel GPU notebooks #162

Closed DnPlas closed 3 months ago

DnPlas commented 3 months ago

This commit introduces automatic scheduling of Notebook Servers on Nodes labelled `intel.feature.node.kubernetes.io/gpu`. It also changes the `command` and `args` used by the Notebook Server containers, as these are now required for Intel GPUs to work; they must be set exactly as shown below.
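The label-driven behaviour can be sketched as a small pure function. This is an illustrative sketch only, not the actual dss internals: the function name, signature, and return shape are assumptions; only the label key and the resulting resource limit come from the PR.

```python
# Hypothetical sketch of the scheduling logic this PR describes: when a node
# carries the Intel GPU label, the notebook Deployment gets the matching
# extended-resource limit. Names here are illustrative, not real dss code.

INTEL_GPU_LABEL = "intel.feature.node.kubernetes.io/gpu"


def gpu_resource_limits(node_labels: dict) -> dict:
    """Return the resource limits to attach when a node advertises an Intel GPU."""
    if node_labels.get(INTEL_GPU_LABEL) == "true":
        # Matches the limit checked in the manual test below.
        return {"gpu.intel.com/i915": "1"}
    return {}
```

With a labelled node this yields `{"gpu.intel.com/i915": "1"}`; with no label it yields an empty dict, so no GPU limit is set on the Deployment.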

Fixes #147

Manual testing

Assuming you have a microk8s cluster with hostpath-storage enabled

  1. Clone this repository and check out the branch of this PR
  2. Build and install from source: `pip install .`
  3. Initialise: `dss initialize --kubeconfig="$(sudo microk8s config)"`
  4. Label the node to simulate the Intel device plugin having done its job: `kubectl label node <name of your node> intel.feature.node.kubernetes.io/gpu=true`
  5. Create a notebook: `dss create my-notebook --image=ubuntu`
  6. Verify the Deployment of the server has the resource limits we are interested in:
kubectl get deployment -ndss my-notebook -oyaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    ...
spec:
  ...
  template:
    ...
    spec:
      containers:
      - env:
        - name: MLFLOW_TRACKING_URI
          value: http://mlflow.dss.svc.cluster.local:5000
        image: kubeflownotebookswg/jupyter-pytorch-full:v1.8.0
        imagePullPolicy: IfNotPresent
        name: my-notebook
        ports:
        - containerPort: 8888
          name: notebook-port
          protocol: TCP
        resources:
          limits:
            gpu.intel.com/i915: "1" # <--- we are interested in this
  7. Also verify that the command and args are always set:
          command:
            - jupyter
          args:
            - lab
            - --notebook-dir=/home/jovyan
            - --ip=0.0.0.0
            - --no-browser
            - --allow-root
            - --port=8888
            - --ServerApp.token=''
            - --ServerApp.password=''
            - --ServerApp.allow_origin='*'
            - --ServerApp.allow_remote_access=True
            - --ServerApp.authenticate_prometheus=False
            - --ServerApp.base_url='/'

Alternatively, you can run the create command without labelling the Node. In that case, the Deployment should not have any of these resource limits.
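The invariants from the manual test above (command/args always set; GPU limit present only when the node is labelled) can be captured in a small assertion helper. This is a hypothetical verification sketch, not part of dss; the helper name and the dict shape (a container spec as parsed from the Deployment YAML) are assumptions.

```python
def verify_notebook_container(container: dict, expect_gpu: bool) -> None:
    """Assert the manual-test invariants on a parsed container spec (illustrative)."""
    # The command and args must be set regardless of GPU labelling.
    assert container.get("command") == ["jupyter"]
    assert container.get("args", [])[:1] == ["lab"]
    # The Intel GPU limit must appear if and only if the node was labelled.
    limits = container.get("resources", {}).get("limits", {})
    if expect_gpu:
        assert limits.get("gpu.intel.com/i915") == "1"
    else:
        assert "gpu.intel.com/i915" not in limits
```

For example, feeding it the container spec from step 6 with `expect_gpu=True` should pass, while the same spec from an unlabelled cluster should pass only with `expect_gpu=False`.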

misohu commented 3 months ago

We need to add the command and args sections to all deployments according to the spec: https://github.com/canonical/data-science-stack/compare/main...frenchwr/intel-gpu-integration#diff-e5f395a6247e35966f0a29978433e744bacd4913c9101d1ca46e364bdc249293