canonical / data-science-stack

Stack with machine learning tools needed for local development.
Apache License 2.0

Familiarize with Intel DSS environment #145

Closed orfeas-k closed 1 month ago

orfeas-k commented 1 month ago

Why it needs to get done

In order to be able to tackle https://github.com/canonical/data-science-stack/issues/144, we'll first need to spend some time familiarizing ourselves with the Intel DSS environment.

What needs to get done

Interact with Intel DSS environment and document instructions for it.

When is the task considered done

We have familiarized ourselves with the Intel DSS environment and documented how to interact with it.

syncronize-issues-to-jira[bot] commented 1 month ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6002.

This message was autogenerated

misohu commented 1 month ago

To proceed with the spec for DSS Intel integration we need to answer the following questions.

I went through the older PoC guide for setting up Intel support on MicroK8s, and I also went through the new spec we received.

How to install the GPU operator for Intel hardware on MicroK8s?

In order to install the Intel GPU plugin on MicroK8s we need:

  1. Node Feature Discovery manifests and rules. These are responsible for labeling the nodes with the required labels and annotations.
  2. The GPU plugin DaemonSet, which installs the plugin on nodes that have the correct labels.

NOTE: The script to generate the YAMLs is here.
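The two pieces above can be pulled in with a single kustomization. A sketch follows; the paths and version tag are assumptions based on the upstream intel-device-plugins-for-kubernetes layout, and MicroK8s may require building the output locally rather than applying the remote URLs directly:

```yaml
# kustomization.yaml — sketch only; verify paths and ref against the
# intel-device-plugins-for-kubernetes repository before use.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # 1. Node Feature Discovery plus the Intel NFD rules (node labeling)
  - https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/nfd?ref=v0.30.0
  - https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/nfd/overlays/node-feature-rules?ref=v0.30.0
  # 2. The GPU plugin DaemonSet, scheduled only on NFD-labeled nodes
  - https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/gpu_plugin/overlays/nfd_labeled_nodes?ref=v0.30.0
```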

How to get Jupyter-backed images for PyTorch and TensorFlow?

Currently we should be using the intel/intel-extension-for-tensorflow:2.15.0-xpu-idp-jupyter and intel/intel-extension-for-pytorch:2.1.20-xpu-idp-jupyter images.

Both images come with Jupyter preinstalled. Keep in mind that there is a setting on the pod side we need to apply in order to run them correctly in DSS.

How to support multiple containers on one Intel GPU device?

There is a setting in the intel-gpu-plugin which enables sharing the GPU across multiple containers; without it, only one container can get the device. Here is the discussion about the setting.
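As a sketch, the sharing behaviour is controlled by a flag on the plugin container. The flag name and image tag below are taken from the upstream intel-device-plugins-for-kubernetes documentation as assumptions; confirm the exact values against the linked discussion:

```yaml
# Fragment of the intel-gpu-plugin DaemonSet pod spec (illustrative only)
containers:
  - name: intel-gpu-plugin
    image: intel/intel-gpu-plugin:0.30.0   # tag is an assumption
    args:
      - -shared-dev-num
      - "10"   # advertise each GPU 10 times so up to 10 containers can share it
```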

How to support Intel iGPU and dGPU devices at the same time?

This one may not be supported. Please check the discussion.

How to support Intel and NVIDIA workloads at the same time?

The initial tests in this doc show that it is possible without problems.

Where are we going to develop the DSS Intel support?

Waiting for access to machines with an iGPU and a dGPU. After that I will rerun all the tests from this doc.

How are we going to run CI for Intel support?

This might be a challenge: we need a way to access an instance with an iGPU/dGPU on demand for CI testing.

misohu commented 1 month ago

Today I got access to the Dell device lab and successfully executed the test cases from this spec.

The process to get access to the lab:

❯ cat dell-precision3470-c30322.yaml
job_queue: dell-precision-3470-c30322
provision_data:
  distro: noble
test_data:
  test_cmds: |
    ssh $DEVICE_IP sudo apt -y install git
reserve_data:
  ssh_keys:
    - lp:michalhucko
  timeout: 43200

misohu commented 1 month ago

Changes needed for DSS Intel support

  1. Add Intel status to the dss status command.

Right now the dss status command outputs this information:

[INFO] MLflow deployment: Ready
[INFO] MLflow URL: http://10.152.183.68:5000
[INFO] GPU acceleration: Disabled

We need to add one more row for the Intel status. The correct way to get the Intel device info is under discussion here. The first idea is to check for Intel GPU labels on the Kubernetes nodes.
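A minimal sketch of what that label check could look like, assuming the NFD rules mark Intel GPU nodes with an `intel.feature.node.kubernetes.io/gpu` label (the label key is an assumption to verify in the linked discussion):

```python
# Sketch: decide whether `dss status` should report Intel GPU acceleration
# as enabled, based on the labels of a Kubernetes node.
INTEL_GPU_LABELS = (
    "intel.feature.node.kubernetes.io/gpu",  # assumed NFD label key
)


def intel_gpu_enabled(node_labels: dict) -> bool:
    """Return True if any known Intel GPU label is set to "true" on the node."""
    return any(node_labels.get(key, "").lower() == "true" for key in INTEL_GPU_LABELS)
```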

  2. Add functionality to create Intel GPU instances with the create command. In order to create a Kubernetes pod with Intel GPU acceleration enabled we must:
    • Have the Intel GPU operator enabled in the Kubernetes cluster.
    • Have the Kubernetes resources section filled with the gpu.intel.com/i915 section. Example here.
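An illustrative container spec with the resources section filled in (image name from this thread; gpu.intel.com/i915 is the resource name the intel-gpu-plugin advertises):

```yaml
# Illustrative notebook container requesting one Intel GPU device
containers:
  - name: notebook
    image: intel/intel-extension-for-pytorch:2.1.20-xpu-idp-jupyter
    resources:
      limits:
        gpu.intel.com/i915: 1   # resource exposed by the intel-gpu-plugin
```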

After discussing with the team, we decided to drop the --gpu intel argument from the dss create command. If Intel acceleration is enabled (by the user manually deploying the Intel GPU operator), all notebooks will have the Intel resources section filled automatically, meaning that with the correct image, notebooks can use Intel hardware. This is not a problem for images without Intel libraries, as they will not use the resource anyway.

Because of this, dss create should check for the presence of the Intel GPU plugin. If the plugin is there, it will automatically populate the resources section.
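A sketch of how dss create could populate the resources section, treating the pod spec as a plain dict (`add_intel_resources` is a hypothetical helper, not existing DSS code):

```python
def add_intel_resources(pod_spec: dict, plugin_present: bool) -> dict:
    """If the Intel GPU plugin was detected, request one gpu.intel.com/i915
    device in every container of the pod spec; otherwise leave it untouched."""
    if plugin_present:
        for container in pod_spec.get("containers", []):
            limits = container.setdefault("resources", {}).setdefault("limits", {})
            limits["gpu.intel.com/i915"] = 1
    return pod_spec
```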

Because we are using the intel/intel-extension-for-tensorflow:2.15.0-xpu-idp-jupyter and intel/intel-extension-for-pytorch:2.1.20-xpu-idp-jupyter images for Intel ML notebooks, we also need to adjust the command and args section (check the example). We can add these settings globally to all DSS notebook deployments, as the non-Intel ones set these in their Dockerfiles anyway (this I still need to test).
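A hypothetical illustration of the shape of that override; the exact command and args must be taken from the linked example:

```yaml
# Hypothetical entrypoint override for the Intel Jupyter images — the real
# values live in the example referenced above.
containers:
  - name: notebook
    image: intel/intel-extension-for-tensorflow:2.15.0-xpu-idp-jupyter
    command: ["/bin/bash", "-c"]
    args:
      - jupyter notebook --ip=0.0.0.0 --no-browser --allow-root
```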

We also need to add the intel/intel-extension-for-tensorflow:2.15.0-xpu-idp-jupyter and intel/intel-extension-for-pytorch:2.1.20-xpu-idp-jupyter images as recommendations to dss create --help.

  3. Docs on how to set up the Intel device plugin. We can use the setup described in this spec. The procedure is to deploy the device plugin manifests, which we now keep in the DSS repo here. There is a MicroK8s problem when deploying manifests from a URL; when it is fixed, we can deploy directly from upstream.

  4. Docs on how to spin up a notebook from Intel with IPEX or ITEX. After implementing points 1 and 2, users can simply deploy Intel notebooks with the following commands (this is only possible when the device plugin is enabled; otherwise the notebooks will be deployed but the resources will not be available).

dss create my-itex-notebook --image=intel/intel-extension-for-tensorflow:2.15.0-xpu-idp-jupyter
dss create my-ipex-notebook --image=intel/intel-extension-for-pytorch:2.1.20-xpu-idp-jupyter
  5. Documentation on how to run simple calculations with Intel ML frameworks in DSS.
  6. Documentation on supported versions of Intel GPUs. Regarding this point I need to reach out to the Intel team.
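For the simple-calculations documentation, a minimal device-selection sketch with IPEX could look like the following; it falls back to CPU when the Intel extension is not installed (`select_device` is a hypothetical helper, not DSS code):

```python
def select_device() -> str:
    """Return "xpu" when an Intel GPU is usable via IPEX, otherwise "cpu"."""
    try:
        import torch
        import intel_extension_for_pytorch  # noqa: F401  (registers the xpu backend)
        if hasattr(torch, "xpu") and torch.xpu.is_available():
            return "xpu"
    except ImportError:
        # torch or IPEX not installed — run on CPU
        pass
    return "cpu"
```

A notebook would then move tensors and models with `.to(select_device())`, so the same code runs with or without Intel acceleration.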

misohu commented 1 month ago

As part of this task we have opened the following issues:

https://github.com/canonical/data-science-stack/issues/146
https://github.com/canonical/data-science-stack/issues/147
https://github.com/canonical/data-science-stack/issues/148
https://github.com/canonical/data-science-stack/issues/149
https://github.com/canonical/data-science-stack/issues/150

When designing the spec we need to align on the following open problems:

How are we going to recommend installation of the Intel device plugin?

According to this spec, we need to instruct the user to build the manifests from the upstream repository, as MicroK8s has problems with remote URLs for its customization feature. The aforementioned spec recommends keeping the built manifests in the DSS repository. This is not an ideal solution, as DSS should not be responsible for installing the device plugin.

Should we be specific about the Intel GPU versions we support with DSS?

As DSS is not responsible for setting up the plugin, it should not care about the versions of the underlying Intel GPUs. The user should handle the correct plugin installation for their GPU device.

How to support Intel iGPU and dGPU devices at the same time?

This one may not be supported. Please check the discussion.

mvlassis commented 1 month ago

@misohu Your exploration of the Intel DSS environment has been very thorough, and you have specified very clearly defined tasks in order to achieve the integration. Great job!

The only thing that I find missing is to clearly determine whether Intel iGPU and dGPU devices will be supported simultaneously, before we proceed with the spec.

misohu commented 1 month ago

Thanks @mvlassis

The thing is that devices with both Intel iGPUs and dGPUs will be supported; we just cannot specify in the resources section whether the workload should be deployed to the iGPU or the dGPU.

mvlassis commented 1 month ago

@misohu If that is the case, we should add a note/warning in the DSS documentation for that specific use case.