canonical / data-science-stack

Stack with machine learning tools needed for local development.
Apache License 2.0

Write `dss` doc about how to setup Intel GPU device plugin to microk8s #148

Closed misohu closed 2 months ago

misohu commented 2 months ago

Why it needs to get done

The Intel GPU operator is a prerequisite for running Intel workloads on DSS. In this spec we need to describe how the end user should install the operator before using DSS, because DSS does not install it on the user's cluster.

We can use the setup described in this spec. The procedure is to deploy the device plugin manifests, which we currently keep in the dss repo here. microk8s has a problem deploying manifests from a URL. Once that is fixed, we can deploy directly from upstream.

What needs to get done

When is the task considered done

syncronize-issues-to-jira[bot] commented 2 months ago

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6037.

This message was autogenerated

mvlassis commented 2 months ago

The setup of the Intel GPU plugin follows this documentation from the intel-device-plugins-for-kubernetes repo. With the snap package of kubectl, the commands run successfully. In our case, we want to replace the kubectl commands with microk8s.kubectl. However, there is a known issue with the kustomize subcommand of microk8s.kubectl, and we get the following error message:

sudo microk8s.kubectl kustomize https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/nfd?ref=${VERSION} > node_feature_discovery.yaml
error: failed to run '/snap/microk8s/7039/usr/bin/git fetch --depth=1 https://github.com/intel/intel-device-plugins-for-kubernetes v0.30.0': fatal: couldn't find remote ref v0.30.0
: exit status 128

As this comment suggests, the issue seems to be the version of git that microk8s uses (2.25.1), since the kustomize subcommand internally calls git fetch.
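To compare the git bundled with microk8s against the system git, something like the following could be used. This is a minimal sketch: `parse_git_version` is a hypothetical helper name, and the `/snap/microk8s/current` path is an assumption based on the usual snap layout (the error above shows revision `7039` in that position).

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the version number from `git --version` output.
parse_git_version() {
    # Input looks like "git version 2.25.1"; print the third field.
    awk '{ print $3 }'
}

# Assumption: "current" resolves to the active snap revision (7039 in the log).
MICROK8S_GIT="${MICROK8S_GIT:-/snap/microk8s/current/usr/bin/git}"

# Report the git that microk8s.kubectl kustomize would call, if present.
if [ -x "$MICROK8S_GIT" ]; then
    echo "microk8s git: $("$MICROK8S_GIT" --version | parse_git_version)"
fi

# Report the system git for comparison, if installed.
if command -v git >/dev/null 2>&1; then
    echo "system git:   $(git --version | parse_git_version)"
fi
```

If the two versions differ, that is consistent with the explanation above, though it does not by itself prove that the git version is the cause.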

One workaround I tried was to clone the intel-device-plugins-for-kubernetes repo, check out the correct tag (v0.30.0 in our case), and then run microk8s.kubectl kustomize on the local copy of the repo. However, I still get a similar error:

VERSION=v0.30.0
git clone https://github.com/intel/intel-device-plugins-for-kubernetes.git --branch ${VERSION} --single-branch
sudo microk8s.kubectl kustomize intel-device-plugins-for-kubernetes/deployments/nfd > node_feature_discovery.yaml

error: accumulating resources: accumulation err='accumulating resources from 'base': '/home/ubuntu/intel-device-plugins-for-kubernetes/deployments/nfd/base' must resolve to a file': recursed accumulation of path '/home/ubuntu/intel-device-plugins-for-kubernetes/deployments/nfd/base': accumulating resources: accumulation err='accumulating resources from 'https://github.com/kubernetes-sigs/node-feature-discovery/deployment/overlays/default?ref=v0.15.4': URL is a git repository': failed to run '/snap/microk8s/7039/usr/bin/git fetch --depth=1 https://github.com/kubernetes-sigs/node-feature-discovery v0.15.4': fatal: couldn't find remote ref v0.15.4
: exit status 128

This happens because the kustomization in the base directory has this line, which also specifies a remote URL (the node-feature-discovery overlay at v0.15.4). kustomize therefore calls git fetch once again, and the command fails in the same way.
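To see in advance which kustomization files still pull from a remote URL, the local clone can be searched before running kustomize. A minimal sketch, assuming the kustomization files are named `kustomization.yaml` or `kustomization.yml`; `find_remote_refs` is a hypothetical helper name, not part of kubectl or kustomize.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list every http(s) reference found in kustomization
# files under a directory, with file name and line number.
find_remote_refs() {
    local dir="$1"
    # "|| true" keeps the exit status at 0 when grep finds no matches.
    grep -rnE 'https?://' \
        --include='kustomization.yaml' \
        --include='kustomization.yml' \
        "$dir" || true
}

# Usage sketch, after cloning the repo at the desired tag:
#   find_remote_refs intel-device-plugins-for-kubernetes/deployments/nfd
```

Any line it reports is a place where kustomize would still need to fetch from git, so a local clone alone would not avoid the error.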

I propose the following 3 solutions:

mvlassis commented 2 months ago

After discussion with the team, we are proceeding with Solution A: the doc will include the installation of the kubectl snap, and all commands will use kubectl instead of microk8s.kubectl.
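For illustration, the command sequence for Solution A might look roughly like the plan printed below. This is a sketch under assumptions, not the final doc: the kubeconfig step (`microk8s config` written to a separate file, here the hypothetical `~/.kube/microk8s-config`) and the final `kubectl apply` are not stated in this thread, and `print_plan` is a hypothetical helper that only prints the commands so they can be reviewed before being run.

```shell
#!/usr/bin/env bash
# Tag used in this thread; override by exporting VERSION.
VERSION="${VERSION:-v0.30.0}"
NFD_URL="https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/nfd?ref=${VERSION}"

# Hypothetical helper: print the commands instead of executing them.
print_plan() {
    cat <<EOF
sudo snap install kubectl --classic
mkdir -p ~/.kube
sudo microk8s config > ~/.kube/microk8s-config
export KUBECONFIG=~/.kube/microk8s-config
kubectl kustomize "${NFD_URL}" > node_feature_discovery.yaml
kubectl apply -f node_feature_discovery.yaml
EOF
}

print_plan
```

Quoting the URL avoids the shell treating the `?` in `?ref=` as a glob character.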