RadeonOpenCompute / ROCm_Documentation

Legacy ROCm Software Platform Documentation
http://rocm.docs.amd.com

install ROCm has a mistake #107

Open 2017040264 opened 3 years ago

2017040264 commented 3 years ago

Ubuntu 20.04 + Radeon RX Vega 10 Graphics.

Running /opt/rocm/bin/rocminfo fails with:

ROCk module is loaded
Unable to open /dev/kfd read-write: Bad address
cfl is member of render group
hsa api call failure at: /src/rocminfo/rocminfo.cc:1142
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.

How can I fix it?

fxkamd commented 3 years ago

This should probably be reported in https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver

Are there any error messages in the dmesg log?

2017040264 commented 3 years ago

There are some error messages in the dmesg log, and I don't know how to fix them:

[    0.636557] pci 0000:00:00.2: AMD-Vi: Unable to read/write to IOMMU perf counter.
[    2.300156] snd_pci_acp3x 0000:03:00.5: Invalid ACP audio mode : 1
[    4.212601] amdgpu 0000:03:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] ERROR IB test failed on comp_1.0.1 (-110).
[    5.236614] amdgpu 0000:03:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] ERROR IB test failed on comp_1.1.1 (-110).
[    6.260365] amdgpu 0000:03:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] ERROR IB test failed on comp_1.2.1 (-110).
[    7.284212] amdgpu 0000:03:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] ERROR IB test failed on comp_1.3.1 (-110).
[    7.292712] [drm:amdgpu_device_delayed_init_work_handler [amdgpu]] ERROR ib ring test failed (-110).

fxkamd commented 3 years ago

Something is clearly going wrong during driver initialization at boot time. I cannot give you a diagnosis from a few hand-picked error messages. That usually leads to incorrect conclusions. Please provide a complete kernel log, which will include a lot more context to work with: kernel version, boot parameters, PCI device list, memory map, other errors you may have missed, etc.

Can you also provide the output of "dkms status"?
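One way to gather both items in full, rather than copying selected lines, is sketched below. It is only an illustration: the output file names and the `collect_reports` helper are assumptions, not part of any ROCm tooling, and `dmesg` may need root privileges on systems that restrict kernel log access.

```python
import subprocess
from pathlib import Path

# Commands requested in the comment above; the file names are assumptions.
COMMANDS = {
    "dmesg.log": ["dmesg"],
    "dkms_status.txt": ["dkms", "status"],
}


def run_command(argv):
    """Run a command and return its stdout followed by its stderr."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True, check=False)
    except FileNotFoundError:
        return f"command not found: {argv[0]}\n"
    return result.stdout + result.stderr


def collect_reports(out_dir, commands=COMMANDS, runner=run_command):
    """Write the complete output of each command to its own file in out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, argv in commands.items():
        path = out_dir / filename
        path.write_text(runner(argv))
        written.append(path)
    return written

# Example: collect_reports("rocm-report") writes both files into ./rocm-report
```

The resulting files can then be attached to the issue through the web page.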

2017040264 commented 3 years ago

I uninstalled Ubuntu 20.04 and installed Ubuntu 18.04.5 LTS, and ROCm installed successfully. Then I installed TensorFlow and Jupyter to test it with this code:

import tensorflow as tf
tf.__version__
tf.test.is_gpu_available()

But there is an error (marked in red in my screenshot):

cfl@cfl-KPR-WX9:~/ts$ jupyter-notebook
[I 18:01:22.614 NotebookApp] Serving notebooks from local directory: /home/cfl/ts
[I 18:01:22.614 NotebookApp] 0 active kernels
[I 18:01:22.614 NotebookApp] The Jupyter Notebook is running at:
[I 18:01:22.614 NotebookApp] http://localhost:8888/?token=694b4d14330f2b194fa1a1c24e250f16d03b40e9ad650245
[I 18:01:22.614 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 18:01:22.615 NotebookApp]
    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://localhost:8888/?token=694b4d14330f2b194fa1a1c24e250f16d03b40e9ad650245
[I 18:01:23.750 NotebookApp] Accepting one-time-token-authenticated connection from 127.0.0.1
[W 18:01:24.537 NotebookApp] 404 GET /i18n/zh-CN/LC_MESSAGES/nbjs.json?v=20201217180122 (127.0.0.1) 15.10ms referer=http://localhost:8888/tree
[W 18:01:24.545 NotebookApp] 404 GET /static/components/moment/locale/zh-cn.js?v=20201217180122 (127.0.0.1) 2.06ms referer=http://localhost:8888/tree
[W 18:01:26.569 NotebookApp] 404 GET /static/components/moment/locale/zh-cn.js?v=20201217180122 (127.0.0.1) 1.58ms referer=http://localhost:8888/notebooks/Untitled.ipynb
[W 18:01:26.573 NotebookApp] 404 GET /nbextensions/widgets/notebook/js/extension.js?v=20201217180122 (127.0.0.1) 1.74ms referer=http://localhost:8888/notebooks/Untitled.ipynb
[W 18:01:26.575 NotebookApp] 404 GET /i18n/zh-CN/LC_MESSAGES/nbjs.json?v=20201217180122 (127.0.0.1) 1.59ms referer=http://localhost:8888/notebooks/Untitled.ipynb
[I 18:01:26.707 NotebookApp] Kernel started: 9eab146c-0971-422e-89db-e301e3558abb
2020-12-17 18:01:35.760935: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-12-17 18:01:35.789512: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2295605000 Hz
2020-12-17 18:01:35.790200: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4ae6580 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-12-17 18:01:35.790243: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-12-17 18:01:35.793787: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libamdhip64.so
/src/external/hip-on-vdi/rocclr/hip_code_object.cpp:120: guarantee(false && "hipErrorNoBinaryForGpu: Coudn't find binary for current devices!")
[I 18:01:38.704 NotebookApp] KernelRestarter: restarting kernel (1/5), keep random ports
WARNING:root:kernel 9eab146c-0971-422e-89db-e301e3558abb restarted
[I 18:03:26.701 NotebookApp] Saving file at /Untitled.ipynb

xuhuisheng commented 3 years ago

I suggest running lspci -vt to show the GPU information. BTW: does RX Vega 10 mean an RX Vega 64 or an APU? ROCm doesn't support APUs yet.
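If the full tree is hard to read, a small filter can narrow it to the entries carrying the AMD/ATI vendor string, which typically covers the GPU and its audio function. This is a rough sketch: the helper names are made up for illustration, and the match is a plain substring test, not a reliable way to tell an APU from a discrete card.

```python
import subprocess


def amd_display_lines(lspci_text):
    """Return the lspci lines whose vendor string contains [AMD/ATI]."""
    return [line.strip() for line in lspci_text.splitlines() if "[AMD/ATI]" in line]


def read_lspci_tree():
    """Return the output of `lspci -vt` as text (requires lspci to be installed)."""
    result = subprocess.run(["lspci", "-vt"], capture_output=True, text=True, check=True)
    return result.stdout

# Example: print("\n".join(amd_display_lines(read_lspci_tree())))
```

The device name on the matching line (for example a code name) is what helps identify the GPU family.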

johnbridgman commented 3 years ago

Also please attach a kernel log / dmesg output as fxkamd suggested.

2017040264 commented 3 years ago

The dmesg log is attached. Please check it.

2017040264 commented 3 years ago

lspci -vt:

-[0000:00]-+-00.0 Advanced Micro Devices, Inc. [AMD] Device 15d0
           +-00.2 Advanced Micro Devices, Inc. [AMD] Device 15d1
           +-01.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
           +-01.2-[01]----00.0 Intel Corporation Wireless 8265 / 8275
           +-01.7-[02]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981
           +-08.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
           +-08.1-[03]--+-00.0 Advanced Micro Devices, Inc. [AMD/ATI] Picasso
           |            +-00.1 Advanced Micro Devices, Inc. [AMD/ATI] Device 15de
           |            +-00.2 Advanced Micro Devices, Inc. [AMD] Device 15df
           |            +-00.3 Advanced Micro Devices, Inc. [AMD] Device 15e0
           |            +-00.4 Advanced Micro Devices, Inc. [AMD] Device 15e1
           |            +-00.5 Advanced Micro Devices, Inc. [AMD] Device 15e2
           |            \-00.6 Advanced Micro Devices, Inc. [AMD] Device 15e3
           +-14.0 Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller
           +-14.3 Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge
           +-18.0 Advanced Micro Devices, Inc. [AMD] Device 15e8
           +-18.1 Advanced Micro Devices, Inc. [AMD] Device 15e9
           +-18.2 Advanced Micro Devices, Inc. [AMD] Device 15ea
           +-18.3 Advanced Micro Devices, Inc. [AMD] Device 15eb
           +-18.4 Advanced Micro Devices, Inc. [AMD] Device 15ec
           +-18.5 Advanced Micro Devices, Inc. [AMD] Device 15ed
           +-18.6 Advanced Micro Devices, Inc. [AMD] Device 15ee
           \-18.7 Advanced Micro Devices, Inc. [AMD] Device 15ef

But I don't know what that means...

I think the RX Vega 10 is a GPU, because Windows 10 shows it as a GPU, but I don't know whether it is the same as an RX Vega 64.

My laptop is an HONOR (HUAWEI) MagicBook 2019 with an AMD 3700U.

johnbridgman commented 3 years ago

The dmesg log is attached. Please check it.

Thanks, but I don't see an attachment. Looks like you responded by email rather than the web page - it's possible that attachments via email don't get included. The web page dialog suggests that you have to drag & drop or paste attachments.

Looks like your GPU is the integrated GPU of a Picasso (3700U) so as fxkamd mentioned it's not officially supported under HIP yet. @fxkamd I think Picasso is the first APU where we used GPUVM code paths rather than ATC/IOMMU but I don't know if that helps at all.

fxkamd commented 3 years ago

Picasso is the same as Raven. It uses the IOMMUv2 code path by default. But we recently added fallbacks for systems with disabled IOMMUv2 or broken/missing CRAT tables where we treat it as a dGPU. I'm not sure whether that has made it into ROCm release branches yet.

The error message "/src/external/hip-on-vdi/rocclr/hip_code_object.cpp:120: guarantee(false && "hipErrorNoBinaryForGpu: Coudn't find binary for current devices!")" comes from the HIP runtime. It looks like the GPU code was not compiled for the correct ISA for your GPU.
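To check for that kind of mismatch, one can compare the ISA names the runtime reports with the targets a binary was built for. The sketch below assumes that rocminfo-style output names each GPU ISA with a `gfx` identifier; the sample text and the target list in the usage comment are hypothetical and not taken from this thread or from any particular TensorFlow build.

```python
import re

# Matches ISA identifiers such as gfx900 or gfx1030 (assumed naming scheme).
GFX_PATTERN = re.compile(r"\bgfx[0-9a-f]{3,4}\b")


def detected_isas(rocminfo_text):
    """Return the distinct gfx ISA names found in the text, in first-seen order."""
    seen = []
    for name in GFX_PATTERN.findall(rocminfo_text):
        if name not in seen:
            seen.append(name)
    return seen


def missing_isas(rocminfo_text, built_targets):
    """Return detected ISA names that are absent from the build target list."""
    built = set(built_targets)
    return [isa for isa in detected_isas(rocminfo_text) if isa not in built]

# Hypothetical usage:
#   missing_isas(open("rocminfo.txt").read(), ["gfx803", "gfx900", "gfx906"])
```

A non-empty result would be consistent with the hipErrorNoBinaryForGpu failure above, though it does not prove that is the cause.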

2017040264 commented 3 years ago

dmesg.log

Here is the dmesg log.

Will my GPU be supported in the future?

2017040264 commented 3 years ago

[Attached image: capture_20201218071612289]