Open utterances-bot opened 1 year ago
The Mamba part doesn't work for me. It installs Mamba, but with a warning: "please verify that your PYTHONPATH only points to directories of packages that are compatible with the Python interpreter in Mambaforge: /home/daaz/mambaforge"
After that, the shell does not recognize the mamba command.
Hi @Danyal-sab,
Thanks for pointing that out. I forgot to include the line after running the Mambaforge install script to initialize Mamba. I updated the post with the missing line.
You can run the following commands to initialize Mamba and relaunch the current bash shell to apply the changes:
~/mambaforge/bin/mamba init
bash
I saw you posted another comment earlier, but it got deleted before I could respond. Did you resolve your previous issue?
Hi @cj-mills,
Thanks for your speedy reply. And thanks for your help.
And yes, I posted about a minor issue with executing the commands in the "Install oneAPI Base Toolkit" section, where the second line gave a typing error. I resolved it by simply removing the \s at the ends of the lines, after which it worked correctly; that's why I deleted the post. A similar issue happened in the "Apply OneAPI Patch" section, which gave an error on the sixth line. Again, I just removed the \s and it went through.
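For anyone else who hits this: the trailing backslashes in multi-line commands are shell line continuations, and a stray character after the backslash (often picked up when copying from a web page) breaks them. A minimal illustration of the two equivalent forms:

```shell
# A backslash as the VERY LAST character of a line continues the command
# on the next line; any character after it (even a space) breaks this.
printf '%s\n' \
  "continuation ok"

# The same command collapsed onto one line (backslashes removed) is equivalent:
printf '%s\n' "continuation ok"
```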
Thanks again for your great help
Hi @cj-mills,
Is there any tool for monitoring the Arc GPU's memory usage (like nvidia-smi for NVIDIA GPUs)?
I checked a few tools, such as intel_gpu_top, Intel VTune, and Intel GPA, but they either weren't compatible with Ubuntu 23.04 or don't offer GPU memory monitoring.
Are there any other tools we could use?
Hi @Danyal-sab,
The only one I know of is the sysmon tool included with Intel's Profiling Tools Interfaces for GPU (PTI for GPU) GitHub project.
Unfortunately, you would need to compile the tool from the source code.
Also, it does not seem fully functional on my system, as it does not show any running processes:
$ sudo sysmon
=====================================================================================
GPU 0: Intel(R) Arc(TM) A770 Graphics PCI Bus: 0000:03:00.0
Vendor: Intel(R) Corporation Driver Version: 1.3.26241 Subdevices: 0
EU Count: 512 Threads Per EU: 8 EU SIMD Width: 8 Total Memory(MB): 15473.6
Core Frequency(MHz): 2000.0 of 2400.0 Core Temperature(C): unknown
=====================================================================================
Running Processes: unknown
=====================================================================================
GPU 1: Intel(R) UHD Graphics 750 PCI Bus: 0000:00:02.0
Vendor: Intel(R) Corporation Driver Version: 1.3.26241 Subdevices: 0
EU Count: 32 Threads Per EU: 7 EU SIMD Width: 8 Total Memory(MB): 25360.9
Core Frequency(MHz): 350.0 of 1300.0 Core Temperature(C): unknown
=====================================================================================
Running Processes: unknown
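As a partial workaround from inside a training script, Intel's PyTorch extension mirrors the CUDA-style memory counters under torch.xpu. A minimal sketch, assuming intel-extension-for-pytorch is installed and an XPU device is visible (it degrades to None otherwise):

```python
def xpu_memory_mb(device=0):
    """Return (allocated_mb, max_allocated_mb) for an XPU device, or None
    when intel-extension-for-pytorch or an XPU device is unavailable."""
    try:
        import torch
        import intel_extension_for_pytorch  # noqa: F401 -- registers torch.xpu
    except ImportError:
        return None
    if not hasattr(torch, "xpu") or not torch.xpu.is_available():
        return None
    return (torch.xpu.memory_allocated(device) / 2**20,
            torch.xpu.max_memory_allocated(device) / 2**20)

print(xpu_memory_mb())
```

Calling this periodically during training would at least show the allocation from the current process, though not system-wide usage the way nvidia-smi does.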
Great tutorial that helped me a lot in setting up an environment for ML/DL with an Arc GPU. It really saved my life, and I hope to read more excellent material like this. Thanks again, much appreciated!
I really appreciate your work! I am trying to set the GPU up for the scikit-learn monkey-patch (https://github.com/intel/scikit-learn-intelex), but I am struggling to go beyond CPU acceleration. I have no idea how to 1. list the device and 2. point to that device. Do you have any experience with that?
Hi @psmgeelen,
I have not tried Intel's Scikit-learn extension, so I don't know whether it even supports Arc GPUs. The DPC++ compiler runtime does support Arc GPUs, so it should work in theory.
Have you tried the example code for performing computations on the GPU in the extension's documentation?
Based on the example code, the Arc GPU should be the "gpu:0" device, assuming it is the only discrete GPU installed on the system. The integrated graphics should be the "gpu:1" device.
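In case it helps, here is a hedged sketch of listing the visible SYCL devices with dpctl and offloading a patched scikit-learn computation to one of them. It assumes dpctl and scikit-learn-intelex are installed; the device string and the example output are assumptions, not guaranteed values:

```python
def list_sycl_devices():
    """Return [(filter_string, name)] for visible SYCL devices,
    or [] if dpctl is not installed."""
    try:
        import dpctl
    except ImportError:
        return []
    return [(d.filter_string, d.name) for d in dpctl.get_devices()]

def kmeans_on_device(X, device="gpu:0"):
    """Run KMeans on the given SYCL device via scikit-learn-intelex's
    target_offload setting. Requires scikit-learn-intelex and a matching device."""
    from sklearnex import patch_sklearn, config_context
    patch_sklearn()
    from sklearn.cluster import KMeans
    with config_context(target_offload=device):
        return KMeans(n_clusters=4, n_init=10).fit(X)

# Step 1: list the devices (the Arc card would typically be the first gpu entry).
for fs, name in list_sycl_devices():
    print(fs, name)
```

With the device list in hand, passing the matching filter string (e.g. "gpu:0") to kmeans_on_device should target that device.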
Hi @cj-mills, I have an issue that may be slightly unrelated to this topic. When I first started working with my A770, it used to crash the system after ten or so epochs of a training session under medium load. (I don't know exactly how heavy the load was, as I couldn't monitor the GPU at all; however, on my old NVIDIA 1070, the same code used less than 2 GB of GPU RAM.)
Then I started running heavier code, close to the limit of the A770, and the crashes stopped for the day. After two more days, it stopped crashing entirely. Now it only occasionally crashes the computer while a deep learning training session is running. Do you have any idea what the cause could be and why it isn't consistent? I looked online, and quite a few people have experienced crashes with this card, but I didn't find anyone reporting only occasional crashes. The card also makes a noise that changes from time to time. I suppose it's coil whine; is that safe?
@cj-mills, I have, and it's not finding the device for whatever reason. I created a ticket at intelex here: https://github.com/intel/scikit-learn-intelex/issues/1357#issuecomment-1632484008
Hi @cj-mills, I see that the tutorial has been updated to use the new extension. I saw in your fastai forum thread that you concluded the extension has a bug. Is it safe to install now?
@Danyal-sab It depends on what you need to use it for. The code for my image classification tutorial works fine, but the training code for my YOLOX tutorial does not reach usable performance with the Intel extension on the Arc GPU.
I have not tested the YOLOX training code with the previous extension because the code requires torchvision 0.15+ (which requires PyTorch 2.0+).
I updated the tutorial because everything I tested that worked with the previous extension version still works with the new version, and the current Ubuntu LTS now ships with a kernel that supports Arc GPUs.
@cj-mills, Thanks for updating the tutorial. Just a minor change is needed in the "Update PyTorch Imports" section: in the sample code, one of the import lines (from torcheval.tools import get_module_summary) should be replaced with this:
from torchtnt.utils import get_module_summary
Thanks again for your great help
@Danyal-sab Thanks for catching that!
@cj-mills, after upgrading to the newer version, it worked well. Yesterday I updated the GPU drivers too (Ubuntu offers software updates when they become available). After that, the performance dropped significantly: the same code now takes almost three times as long to run as it did before updating the drivers. Have you tested that?
@Danyal-sab I don't run the Arc GPU as my daily driver, so I have not used it for nearly a month. I was not planning to install it back into my desktop until Intel's PyTorch extension gets a new update.
It sounds like a similar performance difference to not having the IPEX_XPU_ONEDNN_LAYOUT environment variable set. I don't know if that's related to your issue, but maybe try setting that environment variable to 0 and 1 to see if it impacts performance.
It might also just be a bad driver update. Can you roll back to the previous driver version?
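To A/B-test the environment variable suggestion, it is enough to set it per run; the training script name below is a placeholder, and the final command just sanity-checks that the variable reaches the Python process:

```shell
# Compare throughput with the oneDNN layout optimization off and on, e.g.:
#   IPEX_XPU_ONEDNN_LAYOUT=0 python3 train.py   # (train.py is a placeholder)
#   IPEX_XPU_ONEDNN_LAYOUT=1 python3 train.py

# Confirm the child process sees the variable:
IPEX_XPU_ONEDNN_LAYOUT=1 python3 -c 'import os; print(os.environ["IPEX_XPU_ONEDNN_LAYOUT"])'
```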
@cj-mills, Thanks again for your support. That time, I went back to the previous version. However, since there has recently been an upgrade to the extension, I decided to try it again. After a quick web search, it seems to me that this extension still does not fully support Python 3.11. Could that be the issue? What do you think?
@Danyal-sab, I've been meaning to go in-depth with the most recent release of the extension and Intel's BigDL-LLM library, but I have not had time yet.
I briefly swapped in the Arc card a couple of weeks ago, and the training notebooks that worked in the previous versions no longer produced usable models. It was the same issue I described here, but it occurred even with the baseline image classification notebook.
I think I tried with Python 3.9, 3.10, and 3.11, and I had the same issue with all of them. I did not have time to investigate, so I held off making a post about it.
@cj-mills, thanks for your response. Are you going to try with the newest version sometime soon?
@Danyal-sab,
That was with version 2.1.10+xpu. I don't currently know if the source of the issue is the extension or the oneAPI Base Toolkit (or both).
@cj-mills, you are right. 2.1.10+xpu doesn't seem to be a stable version yet, as the repository recommends 2.0.110+xpu for installation: https://github.com/intel/intel-extension-for-pytorch
@cj-mills, Alright, then. I am still using version 1.13.0a+xpu. Do you think it makes sense for me to move to 2.0.110+xpu? And if so, should I use Python 3.10 or 3.11?
Hi @cj-mills and everyone, First of all, thanks for the doc. I was able to install everything as described, but with the latest versions (of the oneAPI Base Toolkit and the Python packages), and run the notebook; it is as fast as your example (around 12 minutes). However, the accuracy plateaued around 0.18 and did not improve further, even after 3 epochs. I also tried changing the following line
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)
to
model, optimizer = ipex.optimize(model, optimizer=optimizer)
But it does not make things any better. Do you have any idea what the problem could be?
Thanks in advance!
Hi @contryboy,
Your experience matches my brief testing of the v2.1.10+xpu release. I did not have time to investigate the issue further, so I did not make a post about it. It's the same issue I described with the 2.0.110+xpu release for my YOLOX training notebook. However, with v2.1.10+xpu, it occurred even with the baseline image classification notebook.
I have not had a chance to investigate the source of the issue, but I plan to give it another shot when the next xpu release comes out.
Hi @cj-mills, Thanks for the quick reply. I also tried the sample code in their official docs [1]; it has the same problem. So I created an issue [2] in their GitHub project to see if there are any findings.
[1] https://intel.github.io/intel-extension-for-pytorch/xpu/latest/tutorials/examples.html#float32 [2] https://github.com/intel/intel-extension-for-pytorch/issues/537
@contryboy Nice! It would certainly be more convenient for me if they resolved the issue for the next release.
Hi @cj-mills, Are you planning to update the tutorial with the latest version?
@Danyal-sab, I will when I have enough time to swap my Arc card into my desktop and test the latest version. I've been too busy with work projects lately to swap out my NVIDIA card.
Christian Mills - Getting Started with Intel’s PyTorch Extension for Arc GPUs on Ubuntu
This tutorial provides a step-by-step guide to setting up Intel’s PyTorch extension on Ubuntu to train models with Arc GPUs.
https://christianjmills.com/posts/intel-pytorch-extension-tutorial/native-ubuntu/