Open utterances-bot opened 1 year ago
The Mamba part doesn't work for me. It installs Mamba, but with a warning: "please verify that your PYTHONPATH only points to directories of packages that are compatible with the Python interpreter in Mambaforge: /home/daaz/mambaforge"
After that, the shell does not recognize the mamba command.
Hi @Danyal-sab,
Thanks for pointing that out. I forgot to include the line after running the Mambaforge install script to initialize Mamba. I updated the post with the missing line.
You can run the following commands to initialize Mamba and relaunch the current bash shell to apply the changes:
~/mambaforge/bin/mamba init
bash
I saw you posted another comment earlier, but it got deleted before I could respond. Did you resolve your previous issue?
Hi @cj-mills,
Thanks for your speedy reply. And thanks for your help.
And yes, I posted about a minor issue with executing the commands in the "Install oneAPI Base Toolkit" section, where the second line gave a typing error. I resolved it by simply removing the \s at the ends of the lines, after which it worked correctly; that's why I deleted the post. A similar issue happened in the "Apply OneAPI Patch" section, which gave an error on the sixth line. Again, I just removed the \s and it went through.
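For anyone else who hits this: the trailing backslashes in multi-line commands are shell line continuations, and a stray character after the backslash (often picked up when copying from a web page) breaks them. A minimal illustration of the two equivalent forms:

```shell
# A backslash as the VERY LAST character of a line continues the command
# on the next line; any character after it (even a space) breaks this.
printf '%s\n' \
  "continuation ok"

# The same command collapsed onto one line (backslashes removed) is equivalent:
printf '%s\n' "continuation ok"
```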
Thanks again for your great help
Hi @cj-mills,
Is there any tool for monitoring the Arc GPU's memory usage (like nvidia-smi for NVIDIA GPUs)?
I checked a few tools, such as intel_gpu_top, Intel VTune, and Intel GPA, but they either weren't compatible with Ubuntu 23.04 or don't offer GPU memory monitoring.
Are there any other tools we could use?
Hi @Danyal-sab,
The only one I know of is the sysmon tool included with Intel's Profiling Tools Interfaces for GPU (PTI for GPU) GitHub project.
Unfortunately, you would need to compile the tool from the source code.
Also, it does not seem fully functional on my system, as it does not show any running processes:
$ sudo sysmon
=====================================================================================
GPU 0: Intel(R) Arc(TM) A770 Graphics PCI Bus: 0000:03:00.0
Vendor: Intel(R) Corporation Driver Version: 1.3.26241 Subdevices: 0
EU Count: 512 Threads Per EU: 8 EU SIMD Width: 8 Total Memory(MB): 15473.6
Core Frequency(MHz): 2000.0 of 2400.0 Core Temperature(C): unknown
=====================================================================================
Running Processes: unknown
=====================================================================================
GPU 1: Intel(R) UHD Graphics 750 PCI Bus: 0000:00:02.0
Vendor: Intel(R) Corporation Driver Version: 1.3.26241 Subdevices: 0
EU Count: 32 Threads Per EU: 7 EU SIMD Width: 8 Total Memory(MB): 25360.9
Core Frequency(MHz): 350.0 of 1300.0 Core Temperature(C): unknown
=====================================================================================
Running Processes: unknown
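As a partial workaround from inside a training script, Intel's PyTorch extension mirrors the CUDA-style memory counters under torch.xpu. A minimal sketch, assuming intel-extension-for-pytorch is installed and an XPU device is visible (it degrades to None otherwise):

```python
def xpu_memory_mb(device=0):
    """Return (allocated_mb, max_allocated_mb) for an XPU device, or None
    when intel-extension-for-pytorch or an XPU device is unavailable."""
    try:
        import torch
        import intel_extension_for_pytorch  # noqa: F401 -- registers torch.xpu
    except ImportError:
        return None
    if not hasattr(torch, "xpu") or not torch.xpu.is_available():
        return None
    return (torch.xpu.memory_allocated(device) / 2**20,
            torch.xpu.max_memory_allocated(device) / 2**20)

print(xpu_memory_mb())
```

Calling this periodically during training would at least show the allocation from the current process, though not system-wide usage the way nvidia-smi does.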
Great tutorial that helped me a lot in setting up an environment for ML/DL with an Arc GPU. It really saved my life, and I hope to read more excellent material like this. Thanks again, much appreciated!
I really appreciate your work! I am trying to set the GPU up for the scikit-learn monkey-patch (https://github.com/intel/scikit-learn-intelex), but I am struggling to go beyond CPU acceleration. I have no idea how to 1. list the device and 2. point to that device. Do you have any experience with that?
Hi @psmgeelen,
I have not tried Intel's Scikit-learn extension, so I don't know whether it even supports Arc GPUs. The DPC++ compiler runtime does support Arc GPUs, so it should work in theory.
Have you tried the example code for performing computations on the GPU in the extension's documentation?
Based on the example code, the Arc GPU should be the "gpu:0" device, assuming it is the only discrete GPU installed on the system. The integrated graphics should be the "gpu:1" device.
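In case it helps, here is a hedged sketch of listing the visible SYCL devices with dpctl and offloading a patched scikit-learn computation to one of them. It assumes dpctl and scikit-learn-intelex are installed; the device string and the example output are assumptions, not guaranteed values:

```python
def list_sycl_devices():
    """Return [(filter_string, name)] for visible SYCL devices,
    or [] if dpctl is not installed."""
    try:
        import dpctl
    except ImportError:
        return []
    return [(d.filter_string, d.name) for d in dpctl.get_devices()]

def kmeans_on_device(X, device="gpu:0"):
    """Run KMeans on the given SYCL device via scikit-learn-intelex's
    target_offload setting. Requires scikit-learn-intelex and a matching device."""
    from sklearnex import patch_sklearn, config_context
    patch_sklearn()
    from sklearn.cluster import KMeans
    with config_context(target_offload=device):
        return KMeans(n_clusters=4, n_init=10).fit(X)

# Step 1: list the devices (the Arc card would typically be the first gpu entry).
for fs, name in list_sycl_devices():
    print(fs, name)
```

With the device list in hand, passing the matching filter string (e.g. "gpu:0") to kmeans_on_device should target that device.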
Hi @cj-mills, I have an issue that may be slightly unrelated to this topic. When I first started working with my A770, it used to crash the system after ten or so epochs of a training session under medium load. (I don't know exactly how heavy the load was, as I couldn't monitor the GPU at all; however, on my old NVIDIA 1070, the same code used less than 2 GB of GPU RAM.)
Then I started running heavier code, close to the limit of the A770, and the crashes stopped for the day. After two more days, it stopped crashing entirely. Now it only occasionally crashes the computer while a deep learning training session is running. Do you have any idea what the cause could be and why it isn't consistent? I looked online, and quite a few people have experienced crashes with this card, but I didn't find anyone reporting only occasional crashes. The card also makes a noise that changes from time to time. I suppose it's coil whine; is that safe?
@cj-mills, I have, and it's not finding the device for whatever reason. I created a ticket at intelex here: https://github.com/intel/scikit-learn-intelex/issues/1357#issuecomment-1632484008
Hi @cj-mills, I see that the tutorial has been updated to use the new extension. I saw in your fastai forum thread that you concluded the extension has a bug. Is it safe to install now?
@Danyal-sab It depends on what you need to use it for. The code for my image classification tutorial works fine, but the training code for my YOLOX tutorial does not reach usable performance with the Intel extension on the Arc GPU.
I have not tested the YOLOX training code with the previous extension because the code requires torchvision 0.15+ (which requires PyTorch 2.0+).
I updated the tutorial because everything I tested that worked with the previous extension version still works with the new version, and the current Ubuntu LTS now ships with a kernel that supports Arc GPUs.
@cj-mills, Thanks for updating the tutorial. Just a minor change is needed in the "Update PyTorch Imports" section: in the sample code, one of the import lines (from torcheval.tools import get_module_summary) should be replaced with this:
from torchtnt.utils import get_module_summary
Thanks again for your great help
@Danyal-sab Thanks for catching that!
@cj-mills, after upgrading to the newer version, it worked well. Yesterday I updated the GPU drivers too (Ubuntu offers software updates when they become available). After that, the performance dropped significantly: the same code now takes almost three times as long to run as it did before updating the drivers. Have you tested that?
@Danyal-sab I don't run the Arc GPU as my daily driver, so I have not used it for nearly a month. I was not planning to install it back into my desktop until Intel's PyTorch extension gets a new update.
It sounds like a similar performance difference to not having the IPEX_XPU_ONEDNN_LAYOUT environment variable set. I don't know if that's related to your issue, but maybe try setting that environment variable to 0 and 1 to see if it impacts performance.
It might also just be a bad driver update. Can you roll back to the previous driver version?
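To A/B-test the environment variable suggestion, it is enough to set it per run; the training script name below is a placeholder, and the final command just sanity-checks that the variable reaches the Python process:

```shell
# Compare throughput with the oneDNN layout optimization off and on, e.g.:
#   IPEX_XPU_ONEDNN_LAYOUT=0 python3 train.py   # (train.py is a placeholder)
#   IPEX_XPU_ONEDNN_LAYOUT=1 python3 train.py

# Confirm the child process sees the variable:
IPEX_XPU_ONEDNN_LAYOUT=1 python3 -c 'import os; print(os.environ["IPEX_XPU_ONEDNN_LAYOUT"])'
```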
@cj-mills, Thanks again for your support. That time, I went back to the previous version. However, since there has recently been an upgrade to the extension, I decided to try it again. After a quick web search, it seems to me that this extension still does not fully support Python 3.11. Could that be the issue? What do you think?
@Danyal-sab, I've been meaning to go in-depth with the most recent release of the extension and Intel's BigDL-LLM library, but I have not had time yet.
I briefly swapped in the Arc card a couple of weeks ago, and the training notebooks that worked in the previous versions no longer produced usable models. It was the same issue I described here, but it occurred even with the baseline image classification notebook.
I think I tried with Python 3.9, 3.10, and 3.11, and I had the same issue with all of them. I did not have time to investigate, so I held off making a post about it.
@cj-mills, thanks for your response. Are you going to try with the newest version sometime soon?
@Danyal-sab,
That was with version 2.1.10+xpu. I don't currently know if the source of the issue is the extension or the oneAPI Base Toolkit (or both).
@cj-mills, you are right. 2.1.10+xpu doesn't seem to be a stable version yet, as the repository recommends 2.0.110+xpu for installation: https://github.com/intel/intel-extension-for-pytorch
@cj-mills, Alright, then. I am still using version 1.13.0a+xpu. Do you think it makes sense for me to move to 2.0.110+xpu? And if so, should I use Python 3.10 or 3.11?
Hi @cj-mills and everyone, First of all, thanks for the doc. I was able to install everything as described, but with the latest versions (of the oneAPI Base Toolkit and the Python packages), and run the notebook; it is as fast as your example (around 12 minutes). However, the accuracy plateaued around 0.18 and did not improve further, even after 3 epochs. I also tried changing the following line
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)
to
model, optimizer = ipex.optimize(model, optimizer=optimizer)
But it does not make things any better. Do you have any idea what the problem could be?
Thanks in advance!
Hi @contryboy,
Your experience matches my brief testing of the v2.1.10+xpu release. I did not have time to investigate the issue further, so I did not make a post about it. It's the same issue I described with the 2.0.110+xpu release for my YOLOX training notebook. However, with v2.1.10+xpu, it occurred even with the baseline image classification notebook.
I have not had a chance to investigate the source of the issue, but I plan to give it another shot when the next xpu release comes out.
Hi @cj-mills, Thanks for the quick reply. I also tried the sample code in their official docs [1]; it has the same problem. So I created an issue [2] in their GitHub project to see if there are any findings.
[1] https://intel.github.io/intel-extension-for-pytorch/xpu/latest/tutorials/examples.html#float32 [2] https://github.com/intel/intel-extension-for-pytorch/issues/537
@contryboy Nice! It would certainly be more convenient for me if they resolved the issue for the next release.
Hi @cj-mills, Are you planning to update the tutorial with the latest version?
@Danyal-sab, I will when I have enough time to swap my Arc card into my desktop and test the latest version. I've been too busy with work projects lately to swap out my NVIDIA card.
Christian Mills - Getting Started with Intel’s PyTorch Extension for Arc GPUs on Ubuntu
This tutorial provides a step-by-step guide to setting up Intel’s PyTorch extension on Ubuntu to train models with Arc GPUs.
https://christianjmills.com/posts/intel-pytorch-extension-tutorial/native-ubuntu/