canonical / data-science-stack

Stack with machine learning tools needed for local development.
Apache License 2.0

Familiarize with Intel DSS environment #145

Closed orfeas-k closed 1 month ago

orfeas-k commented 1 month ago

Why it needs to get done

In order to be able to tackle https://github.com/canonical/data-science-stack/issues/144, we'll first need to spend some time familiarizing ourselves with the Intel DSS environment.

What needs to get done

Interact with Intel DSS environment and document instructions for it.

When is the task considered done

We have familiarized ourselves with the Intel DSS environment and documented how to interact with it.

syncronize-issues-to-jira[bot] commented 1 month ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6002.

This message was autogenerated

misohu commented 1 month ago

To proceed with the spec for DSS Intel integration we need to answer the following questions.

I went through the older PoC guide for setting up Intel support on MicroK8s, and I also went through the new spec we received.

How to install the GPU operator for Intel hardware on MicroK8s?

In order to install the Intel GPU plugin on MicroK8s we need:

  1. Node Feature Discovery manifests and rules. These are responsible for labeling the nodes with the required labels and annotations.
  2. The GPU plugin DaemonSet, which installs the plugin on nodes that have the correct labels.

NOTE: The script to generate the YAMLs is here.
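The two pieces above can be pulled in with a single kustomization. A sketch follows; the paths and version tag are assumptions based on the upstream intel-device-plugins-for-kubernetes layout, and MicroK8s may require building the output locally rather than applying the remote URLs directly:

```yaml
# kustomization.yaml — sketch only; verify paths and ref against the
# intel-device-plugins-for-kubernetes repository before use.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # 1. Node Feature Discovery plus the Intel NFD rules (node labeling)
  - https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/nfd?ref=v0.30.0
  - https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/nfd/overlays/node-feature-rules?ref=v0.30.0
  # 2. The GPU plugin DaemonSet, scheduled only on NFD-labeled nodes
  - https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/gpu_plugin/overlays/nfd_labeled_nodes?ref=v0.30.0
```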

How to get Jupyter-backed images for PyTorch and TensorFlow?

Currently we should be using the intel/intel-extension-for-tensorflow:2.15.0-xpu-idp-jupyter and intel/intel-extension-for-pytorch:2.1.20-xpu-idp-jupyter images.

Both images come with Jupyter preinstalled. Keep in mind that there is a setting on the pod side we need to apply in order to run them correctly in DSS.

How to support multiple containers on one Intel GPU device?

There is a setting in the intel-gpu-plugin which enables sharing the GPU across multiple containers; without it, only one container can get the device. Here is the discussion about the setting.
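As a sketch, the sharing behaviour is controlled by a flag on the plugin container. The flag name and image tag below are taken from the upstream intel-device-plugins-for-kubernetes documentation as assumptions; confirm the exact values against the linked discussion:

```yaml
# Fragment of the intel-gpu-plugin DaemonSet pod spec (illustrative only)
containers:
  - name: intel-gpu-plugin
    image: intel/intel-gpu-plugin:0.30.0   # tag is an assumption
    args:
      - -shared-dev-num
      - "10"   # advertise each GPU 10 times so up to 10 containers can share it
```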

How to support Intel iGPU and dGPU devices at the same time?

This one may not be supported. Please check the discussion.

How to support Intel and NVIDIA workloads at the same time?

The initial tests in this doc show that it is possible without problems.

Where are we going to develop the DSS Intel support?

Waiting for access to machines with an iGPU and a dGPU. After that I will rerun all the tests from this doc.

How are we going to run CI for Intel support?

This might be a challenge: we need a way to access an instance with an iGPU/dGPU on demand for CI testing.

misohu commented 1 month ago

Today I got access to the Dell device lab and successfully executed the test cases from this spec.

The process to get access to the lab:

❯ cat dell-precision3470-c30322.yaml
job_queue: dell-precision-3470-c30322
provision_data:
  distro: noble
test_data:
  test_cmds: |
    ssh $DEVICE_IP sudo apt -y install git
reserve_data:
  ssh_keys:
    - lp:michalhucko
  timeout: 43200

misohu commented 1 month ago

Changes needed for DSS Intel support

  1. Add Intel status to the dss status command.

Right now the dss status command outputs this information:

[INFO] MLflow deployment: Ready
[INFO] MLflow URL: http://10.152.183.68:5000
[INFO] GPU acceleration: Disabled

We need to add one more row for the Intel status. The correct way to get the Intel device info is under discussion here. The first idea is to check for Intel GPU labels on the Kubernetes nodes.
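A minimal sketch of what that label check could look like, assuming the NFD rules mark Intel GPU nodes with an `intel.feature.node.kubernetes.io/gpu` label (the label key is an assumption to verify in the linked discussion):

```python
# Sketch: decide whether `dss status` should report Intel GPU acceleration
# as enabled, based on the labels of a Kubernetes node.
INTEL_GPU_LABELS = (
    "intel.feature.node.kubernetes.io/gpu",  # assumed NFD label key
)


def intel_gpu_enabled(node_labels: dict) -> bool:
    """Return True if any known Intel GPU label is set to "true" on the node."""
    return any(node_labels.get(key, "").lower() == "true" for key in INTEL_GPU_LABELS)
```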

  2. Add functionality to create Intel GPU instances with the create command. In order to create a Kubernetes pod with Intel GPU acceleration enabled we must:
    • Have the Intel GPU operator enabled in the Kubernetes cluster.
    • Have the Kubernetes resources section filled with the gpu.intel.com/i915 section. Example here.
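An illustrative container spec with the resources section filled in (image name from this thread; gpu.intel.com/i915 is the resource name the intel-gpu-plugin advertises):

```yaml
# Illustrative notebook container requesting one Intel GPU device
containers:
  - name: notebook
    image: intel/intel-extension-for-pytorch:2.1.20-xpu-idp-jupyter
    resources:
      limits:
        gpu.intel.com/i915: 1   # resource exposed by the intel-gpu-plugin
```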

After discussing with the team, we decided to drop the --gpu intel argument from the dss create command. If Intel acceleration is enabled (by the user manually deploying the Intel GPU operator), all notebooks will have the Intel resources section filled automatically, meaning that with the correct image, notebooks can use Intel hardware. This is not a problem for images without Intel libraries, as they will not use the resource anyway.

Because of this, dss create should check for the presence of the Intel GPU plugin. If the plugin is there, it will automatically populate the resources section.
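A sketch of how dss create could populate the resources section, treating the pod spec as a plain dict (`add_intel_resources` is a hypothetical helper, not existing DSS code):

```python
def add_intel_resources(pod_spec: dict, plugin_present: bool) -> dict:
    """If the Intel GPU plugin was detected, request one gpu.intel.com/i915
    device in every container of the pod spec; otherwise leave it untouched."""
    if plugin_present:
        for container in pod_spec.get("containers", []):
            limits = container.setdefault("resources", {}).setdefault("limits", {})
            limits["gpu.intel.com/i915"] = 1
    return pod_spec
```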

Because we are using the intel/intel-extension-for-tensorflow:2.15.0-xpu-idp-jupyter and intel/intel-extension-for-pytorch:2.1.20-xpu-idp-jupyter images for Intel ML notebooks, we also need to adjust the command and args section (check the example). We can add these settings globally to all DSS notebook deployments, as the non-Intel ones set these in their Dockerfiles anyway (this I still need to test).
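A hypothetical illustration of the shape of that override; the exact command and args must be taken from the linked example:

```yaml
# Hypothetical entrypoint override for the Intel Jupyter images — the real
# values live in the example referenced above.
containers:
  - name: notebook
    image: intel/intel-extension-for-tensorflow:2.15.0-xpu-idp-jupyter
    command: ["/bin/bash", "-c"]
    args:
      - jupyter notebook --ip=0.0.0.0 --no-browser --allow-root
```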

We also need to add the intel/intel-extension-for-tensorflow:2.15.0-xpu-idp-jupyter and intel/intel-extension-for-pytorch:2.1.20-xpu-idp-jupyter images as recommendations to dss create --help.

  3. Docs on how to set up the Intel device plugin. We can use the setup described in this spec. The procedure is to deploy the device plugin manifests, which we now keep in the DSS repo here. There is a MicroK8s problem when deploying manifests from a URL; when it is fixed, we can deploy directly from upstream.

  4. Docs on how to spin up a notebook from Intel with IPEX or ITEX. After implementing points 1 and 2, users can simply deploy Intel notebooks with the following commands (this is only possible when the device plugin is enabled; otherwise the notebooks will be deployed but the resources will not be available).

dss create my-itex-notebook --image=intel/intel-extension-for-tensorflow:2.15.0-xpu-idp-jupyter
dss create my-ipex-notebook --image=intel/intel-extension-for-pytorch:2.1.20-xpu-idp-jupyter
  5. Documentation on how to run simple calculations with Intel ML frameworks in DSS.
  6. Documentation on supported versions of Intel GPUs. Regarding this point I need to reach out to the Intel team.
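For the simple-calculations documentation, a minimal device-selection sketch with IPEX could look like the following; it falls back to CPU when the Intel extension is not installed (`select_device` is a hypothetical helper, not DSS code):

```python
def select_device() -> str:
    """Return "xpu" when an Intel GPU is usable via IPEX, otherwise "cpu"."""
    try:
        import torch
        import intel_extension_for_pytorch  # noqa: F401  (registers the xpu backend)
        if hasattr(torch, "xpu") and torch.xpu.is_available():
            return "xpu"
    except ImportError:
        # torch or IPEX not installed — run on CPU
        pass
    return "cpu"
```

A notebook would then move tensors and models with `.to(select_device())`, so the same code runs with or without Intel acceleration.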

misohu commented 1 month ago

As part of this task we have opened the following issues:

https://github.com/canonical/data-science-stack/issues/146
https://github.com/canonical/data-science-stack/issues/147
https://github.com/canonical/data-science-stack/issues/148
https://github.com/canonical/data-science-stack/issues/149
https://github.com/canonical/data-science-stack/issues/150

When designing the spec we need to align on the following open problems:

How are we going to recommend installation of the Intel device plugin?

According to this spec, we need to instruct the user to build the manifests from the upstream repository, as MicroK8s has problems with remote URLs for its customization feature. The aforementioned spec recommends keeping the built manifests in the DSS repository. This is not an ideal solution, as DSS should not be responsible for installing the device plugin.

Should we be specific about the Intel GPU versions we support with DSS?

As DSS is not responsible for setting up the plugin, it should not care about the versions of the underlying Intel GPUs. The user should handle the correct plugin installation for their GPU device.

How to support Intel iGPU and dGPU devices at the same time?

This one may not be supported. Please check the discussion.

mvlassis commented 1 month ago

@misohu Your exploration of the Intel DSS environment has been very thorough, and you have specified very clearly defined tasks in order to achieve the integration. Great job!

The only thing that I find missing is to clearly determine whether Intel iGPU and dGPU devices will be supported simultaneously, before we proceed with the spec.

misohu commented 1 month ago

Thanks @mvlassis

The thing is that devices with both Intel iGPUs and dGPUs will be supported; we just cannot specify in the resources section whether the workload should be deployed to the iGPU or the dGPU.

mvlassis commented 1 month ago

@misohu If that is the case, we should add a note/warning in the DSS documentation for that specific use case.