Open VincyZhang opened 4 months ago
The issue you are facing seems to be related to the configuration and usage of the Intel Data Center GPU Flex 170 on your system. The system debug information you provided indicates that the GPU is not being utilized.
To resolve this, you may need to follow these steps:
Ensure that you have installed the necessary drivers and software for the Intel Data Center GPU Flex 170 on your system.
Check if the GPU is properly connected to the system and has the correct power supply.
Make sure that the GPU is enabled in the BIOS settings.
Update your system's BIOS to the latest version, as it may contain improvements for the GPU.
If you are using a Linux system, make sure that you have installed the appropriate GPU drivers and libraries.
If you are using a Windows system, you may need to install the Intel OpenCL SDK for the Flex 170 GPU.
Once the GPU is detected and enabled, you can proceed with using it for your applications and tasks.
If you are using the Intel Neural Compute Stick 2 (NCS2) with the Flex 170 GPU, you may need to follow additional steps to configure and use it.
If you are still facing issues, you can refer to the Intel documentation and community forums for further assistance. If you need help, please @NeuralChatBot
The error you are encountering seems to be related to the installation and configuration of the required libraries and dependencies for the Intel Neural Compute Stick 2 (NCS2) with the Flex 170 GPU. To resolve this, you may need to follow these steps:
Ensure that you have installed the necessary drivers and software for the Intel Data Center GPU Flex 170 on your system.
Check if the GPU is properly connected to the system and has the correct power supply.
Make sure that the GPU is enabled in the BIOS settings.
Update your system's BIOS to the latest version, as it may contain improvements for the GPU.
If you are using a Linux system, make sure that you have installed the appropriate GPU drivers and libraries.
If you are using a Windows system, you may need to install the Intel OpenCL SDK for the Flex 170 GPU.
Once the GPU is detected and enabled, you can proceed with using it for your applications and tasks.
If you are using the Intel Neural Compute Stick 2 (NCS2) with the Flex 170 GPU, you may need to follow additional steps to configure and use it.
If you are still facing issues, you can refer to the Intel documentation and community forums for further assistance.
@NeuralChatBot
The issue you are facing seems to be related to the configuration and setup of the Intel Data Center GPU Flex 170 on your system. The system debug information you provided shows that the GPU is not being utilized.
To resolve this, you should follow the installation and configuration steps in the "Installation and Configuration" section of the Intel Data Center GPU Flex 170 documentation. This includes setting up the environment variables, installing the required software, and configuring the system to use the GPU.
Once you have completed these steps, you should be able to use the GPU for tasks such as neural chat. If you continue to face issues, please provide more details about the specific errors or problems you are encountering during the setup process. If you need help, please @NeuralChatBot
Problem Summary and status of similar tests
I am having trouble getting neuralchat to work with my Intel Data Center Flex 170 GPU. Below is my procedure with the build_chatbot_on_xpu Jupyter notebook with a clean environment. I have tried this procedure multiple times and also attempted to follow different instructions from different sources but have the same outcome each time. When I get to the point of running the inference, I get either “Device does not exist” when I stick with the default device reference xpu or “Device is not supported” if I use xpu:0. I have tried this with several different Python versions, but use 3.9 below.
I have BigDL operational on this XPU and system (in a separate environment and not running during these tests below). I have also successfully used the deploy_chatbot_on_icx notebook (again in a separate environment and not running at the same time) using similar tweaks as outlined below to address missing dependencies in requirements.txt in my environment.
I also tried to get deploy_chatbot_on_xpu working (below I focus on build_chatbot_on_xpu). As long as I bring over the code from deploy_chatbot_on_cpu (to address the error relating to asyncio), I can successfully run the server, but I again get "Device does not exist" with device='xpu' and "Device is not supported" with device='xpu:0'.
I am hoping to get feedback on what I am doing wrong so that I can operate neural chat and successfully employ the OpenAI APIs.
Installation Procedure
Required step for APT or offline installed oneAPI. Configure oneAPI environment variables. Skip this step for pip-installed oneAPI since LD_LIBRARY_PATH has already been configured.
source /opt/intel/oneapi/setvars.sh
Recommended Environment Variables
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
sudo apt install -y intel-oneapi-dpcpp-cpp-2024.0 intel-oneapi-mkl-devel=2024.0.0-49656
# nothing is updated since the newest version is already installed from above
python -m pip install torch==2.1.0a0 torchvision==0.16.0a0 torchaudio==2.1.0a0 intel-extension-for-pytorch==2.1.10+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
source {DPCPPROOT}/env/vars.sh
source {MKLROOT}/env/vars.sh
source /opt/intel/oneapi/dpcpp-ct/2024.0/env/vars.sh
source /opt/intel/oneapi/mkl/2024.0/env/vars.sh
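Because the variables above are set in a shell, it may be worth confirming that they are actually visible inside the Python process that runs the notebook (a Jupyter kernel started from a different shell would not inherit them). The following is only a minimal sketch: `EXPECTED_VARS` and `report_env` are hypothetical names, and the assumption that the oneAPI `vars.sh` scripts export `MKLROOT` comes from the `{MKLROOT}` placeholder used above.

```python
import os

# Names taken from the installation steps above. MKLROOT is assumed to be
# exported by the oneAPI MKL vars.sh script.
EXPECTED_VARS = (
    "USE_XETLA",
    "SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS",
    "MKLROOT",
)


def report_env(names=EXPECTED_VARS, environ=None):
    """Return {name: value or None} for each expected variable."""
    env = os.environ if environ is None else environ
    return {name: env.get(name) for name in names}


if __name__ == "__main__":
    # Run this in a notebook cell (or the same shell) to see what is set.
    for name, value in report_env().items():
        print(f"{name} = {value if value is not None else '<not set>'}")
```

Running the same snippet both in the terminal where `setvars.sh` was sourced and in a notebook cell makes any difference between the two environments visible.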
python -c "import torch; import intel_extension_for_pytorch as ipex; print(torch.__version__); print(ipex.__version__); [print(f'[{i}]: {torch.xpu.get_device_properties(i)}') for i in range(torch.xpu.device_count())];"
2.1.0a0+cxx11.abi
2.1.10+xpu
[0]: _DeviceProperties(name='Intel(R) Data Center GPU Flex 170', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=0, total_memory=13535MB, max_compute_units=512, gpu_eu_count=512)
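The one-liner above shows that the device is visible from a plain shell. A slightly longer variant of the same check can be pasted into a notebook cell to confirm that the kernel sees the same device. This is a sketch only: `probe_xpu` is a hypothetical helper name, and it uses the same `torch.xpu` calls as the command above plus `torch.xpu.is_available()`.

```python
def probe_xpu():
    """Return human-readable lines describing what this process can see."""
    try:
        import torch
        import intel_extension_for_pytorch as ipex  # provides torch.xpu here
    except (ImportError, OSError) as exc:
        # Missing package or missing shared library (e.g. oneAPI not sourced).
        return [f"import failed: {exc}"]

    lines = [f"torch {torch.__version__}, ipex {ipex.__version__}"]
    if not torch.xpu.is_available():
        lines.append("torch.xpu.is_available() returned False")
        return lines

    # Same per-device query as in the shell one-liner.
    for i in range(torch.xpu.device_count()):
        lines.append(f"[{i}]: {torch.xpu.get_device_properties(i)}")
    return lines


if __name__ == "__main__":
    print("\n".join(probe_xpu()))
```

If the notebook prints an import failure or reports no device while the shell command succeeds, the two are probably running in different environments.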
wget https://raw.githubusercontent.com/intel/intel-extension-for-transformers/main/intel_extension_for_transformers/neural_chat/docs/notebooks/build_chatbot_on_xpu.ipynb
wget https://raw.githubusercontent.com/intel/intel-extension-for-transformers/main/intel_extension_for_transformers/neural_chat/docs/notebooks/deploy_chatbot_on_xpu.ipynb
pip install jupyter
jupyter notebook --ip 0.0.0.0
2024-02-14 09:59:28 [ERROR] neuralchat error: Device does not exist
Loading model Intel/neural-chat-7b-v3-1

AttributeError                            Traceback (most recent call last)
Cell In[19], line 5
      3 config = PipelineConfig(device='xpu')
      4 chatbot = build_chatbot(config)
----> 5 response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
      6 print(response)

AttributeError: 'NoneType' object has no attribute 'predict'
Inference: Text Chat Response – after changing device='xpu' to device='xpu:0'
2024-02-14 10:01:53 [ERROR] neuralchat error: Device is not supported
Loading model Intel/neural-chat-7b-v3-1

AttributeError                            Traceback (most recent call last)
Cell In[23], line 5
      3 config = PipelineConfig(device='xpu:0')
      4 chatbot = build_chatbot(config)
----> 5 response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
      6 print(response)

AttributeError: 'NoneType' object has no attribute 'predict'
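In both tracebacks the AttributeError is only a follow-on symptom: `build_chatbot(config)` evidently returned `None`, and the real cause is the `[ERROR] neuralchat error: ...` line logged just before it. A small wrapper can make that explicit. This is a sketch, not part of NeuralChat: `build_or_fail` is a hypothetical helper, and the builder is passed in as an argument so the snippet does not depend on any particular import path.

```python
def build_or_fail(build_fn, config):
    """Call build_fn(config) and raise a clear error if it returns None."""
    chatbot = build_fn(config)
    if chatbot is None:
        raise RuntimeError(
            "chatbot builder returned None; see the '[ERROR] neuralchat error' "
            "line logged above for the actual cause"
        )
    return chatbot


# Intended use in the notebook (names as in the cells above):
#   chatbot = build_or_fail(build_chatbot, PipelineConfig(device='xpu'))
#   response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
```

With this guard the cell stops at the build step, so the logged device error is not hidden behind the later `predict` failure.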
/home/REDACTED/miniconda3/envs/jupyter2/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
/home/REDACTED/miniconda3/envs/jupyter2/lib/python3.9/site-packages/pydub/utils.py:170: RuntimeWarning: Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work
  warn("Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work", RuntimeWarning)
Loading config settings from the environment...
2024-02-14 10:18:37.549692: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-02-14 10:18:37.553245: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-14 10:18:37.599354: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-14 10:18:37.599393: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-14 10:18:37.600837: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-14 10:18:37.609277: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-14 10:18:37.609563: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-02-14 10:18:38.533993: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-02-14 10:18:42,841 - datasets - INFO - PyTorch version 2.1.0a0+cxx11.abi available.
2024-02-14 10:18:42,841 - datasets - INFO - TensorFlow version 2.15.0.post1 available.
Loading model Intel/neural-chat-7b-v3-1
Loading checkpoint shards: 100%| 2/2 [00:01<00:00, 1.23it/s]
2024-02-14 10:19:17,805 - root - ERROR - Exception: Native API failed. Native API returns: -5 (PI_ERROR_OUT_OF_RESOURCES) -5 (PI_ERROR_OUT_OF_RESOURCES)
2024-02-14 10:19:17 [ERROR] neuralchat error: Generic error
Traceback (most recent call last):
  File "/home/REDACTED/jupyter/./cputest.py", line 7, in <module>
    response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
AttributeError: 'NoneType' object has no attribute 'predict'

/home/REDACTED/miniconda3/envs/jupyter2/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
/home/REDACTED/miniconda3/envs/jupyter2/lib/python3.9/site-packages/pydub/utils.py:170: RuntimeWarning: Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work
  warn("Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work", RuntimeWarning)
Loading config settings from the environment...
2024-02-14 10:20:00.620315: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-02-14 10:20:00.623828: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-14 10:20:00.671369: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-14 10:20:00.671411: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-14 10:20:00.672846: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-14 10:20:00.681503: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-14 10:20:00.681998: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-02-14 10:20:01.604245: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-02-14 10:20:05,906 - datasets - INFO - PyTorch version 2.1.0a0+cxx11.abi available.
2024-02-14 10:20:05,906 - datasets - INFO - TensorFlow version 2.15.0.post1 available.
Loading model Intel/neural-chat-7b-v3-1
2024-02-14 10:20:06 [ERROR] neuralchat error: Device is not supported
Traceback (most recent call last):
  File "/home/REDACTED/jupyter/./cputest.py", line 7, in <module>
    response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
AttributeError: 'NoneType' object has no attribute 'predict'

+-----------------------------+--------------------------------------------------------------------+
| Reset                       | N/A                                                                |
| Programming Errors          | N/A                                                                |
| Driver Errors               | N/A                                                                |
| Cache Errors Correctable    | N/A                                                                |
| Cache Errors Uncorrectable  | N/A                                                                |
| Mem Errors Correctable      | N/A                                                                |
| Mem Errors Uncorrectable    | N/A                                                                |
+-----------------------------+--------------------------------------------------------------------+
| GPU Power (W)               | 42                                                                 |
| GPU Frequency (MHz)         | 2050                                                               |
| Media Engine Freq (MHz)     | 1025                                                               |
| GPU Core Temperature (C)    | 58                                                                 |
| GPU Memory Temperature (C)  | N/A                                                                |
| GPU Memory Read (kB/s)      | 1452                                                               |
| GPU Memory Write (kB/s)     | 400                                                                |
| GPU Memory Bandwidth (%)    | 0                                                                  |
| GPU Memory Used (MiB)       | 31                                                                 |
| GPU Memory Util (%)         | 0                                                                  |
| Xe Link Throughput (kB/s)   | N/A                                                                |
+-----------------------------+--------------------------------------------------------------------+
$ sudo xpu-smi health -d 0
+----------------------------+---------------------------------------------------------------------+
| Device ID                  | 0                                                                   |
+----------------------------+---------------------------------------------------------------------+
| 1. GPU Core Temperature    | Status: OK                                                          |
|                            | Description: All temperature sensors are healthy.                   |
|                            | Throttle Threshold: 100 Celsius Degree                              |
|                            | Shutdown Threshold: 125 Celsius Degree                              |
+----------------------------+---------------------------------------------------------------------+
| 3. GPU Power               | Status: OK                                                          |
|                            | Description: All power domains are healthy.                         |
|                            | Throttle Threshold: 150 watts                                       |
+----------------------------+---------------------------------------------------------------------+
| 6. GPU Frequency           | Status: OK                                                          |
|                            | Description: The device frequency not throttled                     |
+----------------------------+---------------------------------------------------------------------+

$ sudo xpu-smi diag --precheck
Journal file /var/log/journal/90338a962e854ed39e4e7ece1f53d71e/user-1666601109@000610bd6beba70f-62677c8b509c641c.journal~ is truncated, ignoring file.
Journal file /var/log/journal/90338a962e854ed39e4e7ece1f53d71e/user-1666601109@000610bd6beba70f-62677c8b509c641c.journal~ is truncated, ignoring file.
+------------------+-------------------------------------------------------------------------------+
| Component        | Details                                                                       |
+------------------+-------------------------------------------------------------------------------+
| Driver           | Status: Pass                                                                  |
+------------------+-------------------------------------------------------------------------------+
| CPU              | CPU ID: 0                                                                     |
|                  | Status: Pass                                                                  |
+------------------+-------------------------------------------------------------------------------+
| CPU              | CPU ID: 1                                                                     |
|                  | Status: Pass                                                                  |
+------------------+-------------------------------------------------------------------------------+
| GPU              | BDF: 0000:b3:00.0                                                             |
|                  | Status: Pass                                                                  |
+------------------+-------------------------------------------------------------------------------+

$ sudo xpu-smi diag -d 0 -l 3
+-------------------------------+------------------------------------------------------------------+
| Device ID                     | 0                                                                |
+-------------------------------+------------------------------------------------------------------+
| Level                         | 3                                                                |
| Result                        | Pass                                                             |
| Items                         | 12                                                               |
+-------------------------------+------------------------------------------------------------------+
| Software Env Variables        | Result: Pass                                                     |
|                               | Message: Pass to check environment variables.                    |
+-------------------------------+------------------------------------------------------------------+
| Software Library              | Result: Pass                                                     |
|                               | Message: Pass to check libraries.                                |
+-------------------------------+------------------------------------------------------------------+
| Software Permission           | Result: Pass                                                     |
|                               | Message: Pass to check permission.                               |
+-------------------------------+------------------------------------------------------------------+
| Software Exclusive            | Result: Pass                                                     |
|                               | Message: Pass to check the software exclusive.                   |
+-------------------------------+------------------------------------------------------------------+
| Computation Check             | Result: Pass                                                     |
|                               | Message: Pass to check computation.                              |
+-------------------------------+------------------------------------------------------------------+
| Integration PCIe              | Result: Pass                                                     |
|                               | Message: Pass to check PCIe bandwidth. Its bandwidth is 17.908   |
|                               | GBPS.                                                            |
+-------------------------------+------------------------------------------------------------------+
| Media Codec                   | Result: Pass                                                     |
|                               | Message: Pass to check Media transcode performance.              |
|                               | 1080p H.265 : 305 FPS                                            |
|                               | 1080p H.264 : 306 FPS                                            |
|                               | 4K H.265 : 85 FPS                                                |
|                               | 4K H.264 : 84 FPS                                                |
+-------------------------------+------------------------------------------------------------------+
| Performance Computation       | Result: Pass                                                     |
|                               | Message: Pass to check computation performance. Its              |
|                               | single-precision GFLOPS is 11120.119.                            |
+-------------------------------+------------------------------------------------------------------+
| Performance Power             | Result: Pass                                                     |
|                               | Message: Pass to check stress power. Its stress power is 119 W.  |
+-------------------------------+------------------------------------------------------------------+
| Performance Memory Bandwidth  | Result: Pass                                                     |
|                               | Message: Pass to check memory bandwidth. Its memory bandwidth    |
|                               | is 361.042 GBPS.                                                 |
+-------------------------------+------------------------------------------------------------------+
| Performance Memory Allocation | Result: Pass                                                     |
|                               | Message: Pass to check memory allocation.                        |
+-------------------------------+------------------------------------------------------------------+
| Memory Error                  | Result: Pass                                                     |
|                               | Message: Pass to check memory error.                             |
+-------------------------------+------------------------------------------------------------------+
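
Since every hardware diagnostic passes, the PI_ERROR_OUT_OF_RESOURCES in the first cputest.py run may be worth looking at from a memory angle. The arithmetic below is a rough sketch under loudly stated assumptions: the parameter count is taken as 7 billion purely from the "7b" in the model name (the real count may be higher), the reported total_memory=13535MB is treated as MiB, and only the weights are counted (no activations, KV cache, or runtime overhead). It is one hypothesis to check, not a diagnosis.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}

ASSUMED_PARAMS = 7_000_000_000  # assumption: rounded from "7b" in the model name
REPORTED_TOTAL_MIB = 13535      # total_memory from the device properties output


def weight_footprint_mib(num_params, dtype):
    """Approximate size of the model weights alone, in MiB."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 2)


def headroom_mib(num_params, dtype, total_mib=REPORTED_TOTAL_MIB):
    """Memory left after loading the weights (negative means they do not fit)."""
    return total_mib - weight_footprint_mib(num_params, dtype)


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        need = weight_footprint_mib(ASSUMED_PARAMS, dtype)
        left = headroom_mib(ASSUMED_PARAMS, dtype)
        print(f"{dtype:>8}: weights ~{need:,.0f} MiB, headroom {left:,.0f} MiB")
```

Under these assumptions, 16-bit weights alone come within a couple of hundred MiB of the reported device memory and 32-bit weights exceed it, so a lower-precision or quantized load is one thing that could be tried.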