Open 2017040264 opened 3 years ago
This should probably be reported in https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver
Are there any error messages in the dmesg log?
There are some error messages in the dmesg log, and I don't know how to fix them:
[ 0.636557] pci 0000:00:00.2: AMD-Vi: Unable to read/write to IOMMU perf counter.
[ 2.300156] snd_pci_acp3x 0000:03:00.5: Invalid ACP audio mode : 1
[ 4.212601] amdgpu 0000:03:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] ERROR IB test failed on comp_1.0.1 (-110).
[ 5.236614] amdgpu 0000:03:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] ERROR IB test failed on comp_1.1.1 (-110).
[ 6.260365] amdgpu 0000:03:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] ERROR IB test failed on comp_1.2.1 (-110).
[ 7.284212] amdgpu 0000:03:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] ERROR IB test failed on comp_1.3.1 (-110).
[ 7.292712] [drm:amdgpu_device_delayed_init_work_handler [amdgpu]] ERROR ib ring test failed (-110).
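For readers following along: the (-110) in these messages is the negated Linux errno ETIMEDOUT, meaning the IB test on each compute ring timed out. A minimal sketch for pulling the failing rings out of a log is shown here. The function name `failed_ib_tests` and the small errno table are illustrative assumptions, not part of any ROCm tool, and the regular expression only targets the message format seen in this thread.

```python
import re

# A few Linux errno values (asm-generic numbering) that appear in driver logs.
# This table is a hand-written assumption for illustration, not exhaustive.
LINUX_ERRNO = {12: "ENOMEM", 22: "EINVAL", 110: "ETIMEDOUT"}

# Matches e.g. "IB test failed on comp_1.0.1 (-110)"
IB_FAIL = re.compile(r"IB test failed on (\S+) \((-?\d+)\)")


def failed_ib_tests(log_text):
    """Return a list of (ring_name, errno_name) for each failed IB test."""
    results = []
    for ring, code in IB_FAIL.findall(log_text):
        number = abs(int(code))
        # Fall back to the bare number when the code is not in the table.
        results.append((ring, LINUX_ERRNO.get(number, str(number))))
    return results
```

This only summarizes which rings failed and how; it says nothing about why the rings timed out.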
Something is clearly going wrong during driver initialization at boot time. I cannot give you a diagnosis from a few hand-picked error messages. That usually leads to incorrect conclusions. Please provide a complete kernel log, which will include a lot more context to work with: kernel version, boot parameters, PCI device list, memory map, other errors you may have missed, etc.
Can you also provide the output of "dkms status"?
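One way to gather both pieces of information into files that can be attached through the web page is sketched here. The helper name `collect` and the output file names are hypothetical; the only external commands assumed are `dmesg` and `dkms status`, both mentioned above. On some systems `dmesg` needs root privileges.

```python
import subprocess
from pathlib import Path


def collect(commands, out_dir="."):
    """Run each command and save its complete output to <out_dir>/<name>.txt.

    `commands` maps a file stem to an argv list, e.g. {"dmesg": ["dmesg"]}.
    Returns the list of files written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, argv in commands.items():
        try:
            # Capture stdout and stderr together so nothing is lost.
            result = subprocess.run(
                argv,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                text=True,
                check=False,
            )
            text = result.stdout
        except OSError as exc:
            # Command missing or not executable: record that instead of crashing.
            text = f"could not run {argv[0]}: {exc}\n"
        path = out_dir / f"{name}.txt"
        path.write_text(text)
        written.append(path)
    return written
```

Calling `collect({"dmesg": ["dmesg"], "dkms_status": ["dkms", "status"]})` would write `dmesg.txt` and `dkms_status.txt` to the current directory, keeping the whole log rather than selected lines.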
I have uninstalled Ubuntu 20.04 and installed Ubuntu 18.04.5 LTS, and ROCm installed successfully. Then I installed TensorFlow and Jupyter to test it, with this code:
import tensorflow as tf
tf.version
tf.test.is_gpu_available()
But there is an error (marked in red):
cfl@cfl-KPR-WX9:~/ts$ jupyter-notebook
[I 18:01:22.614 NotebookApp] Serving notebooks from local directory: /home/cfl/ts
[I 18:01:22.614 NotebookApp] 0 active kernels
[I 18:01:22.614 NotebookApp] The Jupyter Notebook is running at:
[I 18:01:22.614 NotebookApp] http://localhost:8888/?token=694b4d14330f2b194fa1a1c24e250f16d03b40e9ad650245
[I 18:01:22.614 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 18:01:22.615 NotebookApp] Copy/paste this URL into your browser when you connect for the first time, to login with a token: http://localhost:8888/?token=694b4d14330f2b194fa1a1c24e250f16d03b40e9ad650245
[I 18:01:23.750 NotebookApp] Accepting one-time-token-authenticated connection from 127.0.0.1
[W 18:01:24.537 NotebookApp] 404 GET /i18n/zh-CN/LC_MESSAGES/nbjs.json?v=20201217180122 (127.0.0.1) 15.10ms referer=http://localhost:8888/tree
[W 18:01:24.545 NotebookApp] 404 GET /static/components/moment/locale/zh-cn.js?v=20201217180122 (127.0.0.1) 2.06ms referer=http://localhost:8888/tree
[W 18:01:26.569 NotebookApp] 404 GET /static/components/moment/locale/zh-cn.js?v=20201217180122 (127.0.0.1) 1.58ms referer=http://localhost:8888/notebooks/Untitled.ipynb
[W 18:01:26.573 NotebookApp] 404 GET /nbextensions/widgets/notebook/js/extension.js?v=20201217180122 (127.0.0.1) 1.74ms referer=http://localhost:8888/notebooks/Untitled.ipynb
[W 18:01:26.575 NotebookApp] 404 GET /i18n/zh-CN/LC_MESSAGES/nbjs.json?v=20201217180122 (127.0.0.1) 1.59ms referer=http://localhost:8888/notebooks/Untitled.ipynb
[I 18:01:26.707 NotebookApp] Kernel started: 9eab146c-0971-422e-89db-e301e3558abb
2020-12-17 18:01:35.760935: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-12-17 18:01:35.789512: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2295605000 Hz
2020-12-17 18:01:35.790200: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4ae6580 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-12-17 18:01:35.790243: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-12-17 18:01:35.793787: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libamdhip64.so
/src/external/hip-on-vdi/rocclr/hip_code_object.cpp:120: guarantee(false && "hipErrorNoBinaryForGpu: Coudn't find binary for current devices!")
[I 18:01:38.704 NotebookApp] KernelRestarter: restarting kernel (1/5), keep random ports
WARNING:root:kernel 9eab146c-0971-422e-89db-e301e3558abb restarted
[I 18:03:26.701 NotebookApp] Saving file at /Untitled.ipynb
I suggest running lspci -vt to show information about the GPU.
BTW: does "RX Vega10" mean an RX Vega 64 or an APU? ROCm doesn't support APUs yet.
Also please attach a kernel log / dmesg output as fxkamd suggested.
The dmesg log is attached. Please check it.
lspci -vt:
-[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD] Device 15d0
           +-00.2  Advanced Micro Devices, Inc. [AMD] Device 15d1
           +-01.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
           +-01.2-[01]----00.0  Intel Corporation Wireless 8265 / 8275
           +-01.7-[02]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981
           +-08.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
           +-08.1-[03]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Picasso
           |            +-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Device 15de
           |            +-00.2  Advanced Micro Devices, Inc. [AMD] Device 15df
           |            +-00.3  Advanced Micro Devices, Inc. [AMD] Device 15e0
           |            +-00.4  Advanced Micro Devices, Inc. [AMD] Device 15e1
           |            +-00.5  Advanced Micro Devices, Inc. [AMD] Device 15e2
           |            \-00.6  Advanced Micro Devices, Inc. [AMD] Device 15e3
           +-14.0  Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller
           +-14.3  Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge
           +-18.0  Advanced Micro Devices, Inc. [AMD] Device 15e8
           +-18.1  Advanced Micro Devices, Inc. [AMD] Device 15e9
           +-18.2  Advanced Micro Devices, Inc. [AMD] Device 15ea
           +-18.3  Advanced Micro Devices, Inc. [AMD] Device 15eb
           +-18.4  Advanced Micro Devices, Inc. [AMD] Device 15ec
           +-18.5  Advanced Micro Devices, Inc. [AMD] Device 15ed
           +-18.6  Advanced Micro Devices, Inc. [AMD] Device 15ee
           \-18.7  Advanced Micro Devices, Inc. [AMD] Device 15ef
But I don't know what that means...
I think RX Vega10 is a GPU, because Windows 10 shows it as a GPU, but I don't know whether it is the same as an RX Vega 64.
My laptop is an HONOR (HUAWEI) MagicBook 2019 with an AMD 3700U.
The dmesg log is attached. Please check it.
Thanks, but I don't see an attachment. Looks like you responded by email rather than the web page - it's possible that attachments via email don't get included. The web page dialog suggests that you have to drag & drop or paste attachments.
Looks like your GPU is the integrated GPU of a Picasso (3700U) so as fxkamd mentioned it's not officially supported under HIP yet. @fxkamd I think Picasso is the first APU where we used GPUVM code paths rather than ATC/IOMMU but I don't know if that helps at all.
Picasso is the same as Raven. It uses the IOMMUv2 code path by default. But we recently added fallbacks for systems with disabled IOMMUv2 or broken/missing CRAT tables where we treat it as a dGPU. I'm not sure whether that has made it into ROCm release branches yet.
The error message "/src/external/hip-on-vdi/rocclr/hip_code_object.cpp:120: guarantee(false && "hipErrorNoBinaryForGpu: Coudn't find binary for current devices!")" comes from the HIP runtime. It looks like the GPU code was not compiled for the correct ISA for your GPU.
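A rough way to check for this kind of mismatch is to compare the gfx ISA names that rocminfo reports against the targets a binary was built for. The sketch below assumes rocminfo prints names of the form gfxNNN. The function names and the example target list are hypothetical; the thread does not say which targets this TensorFlow build actually contains.

```python
import re

# ISA names such as "gfx900" or "gfx90c": "gfx" followed by 3-4 hex digits.
GFX_NAME = re.compile(r"\bgfx[0-9a-f]{3,4}\b")


def gpu_isas(rocminfo_text):
    """Return the distinct gfx ISA names found in rocminfo output, in order."""
    seen = []
    for name in GFX_NAME.findall(rocminfo_text):
        if name not in seen:
            seen.append(name)
    return seen


def missing_isas(rocminfo_text, built_targets):
    """Return ISAs present on the machine that are absent from built_targets."""
    built = set(built_targets)
    return [isa for isa in gpu_isas(rocminfo_text) if isa not in built]
```

If `missing_isas(...)` returns a non-empty list for the output of `/opt/rocm/bin/rocminfo` and the target list of the installed build, the HIP runtime has no code object for that GPU, which would be consistent with the hipErrorNoBinaryForGpu message.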
Ubuntu 20.04 + Radeon RX Vega 10 Graphics.
/opt/rocm/bin/rocminfo reports an error:
ROCk module is loaded
Unable to open /dev/kfd read-write: Bad address
cfl is member of render group
hsa api call failure at: /src/rocminfo/rocminfo.cc:1142
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
How can I fix it?