ROCm / rocminfo

ROCm Application for Reporting System Info

Unable to open /dev/kfd read-write: Cannot allocate memory #41

Closed arjones85 closed 1 month ago

arjones85 commented 3 years ago

I am trying to get ROCm to work on a CentOS 8.3 node and am getting the following error:

[root@host log]# /opt/rocm/bin/rocminfo
ROCk module is loaded
Unable to open /dev/kfd read-write: Cannot allocate memory
root is member of render group

Any assistance with where to look further on what the issue could be?

Thanks!

skeelyamd commented 3 years ago

The first step would be to make sure that you can access /dev/kfd. It looks like you should be able to but sometimes udev and other settings fight. You should also have access to any /dev/dri/renderD* device files. If you have access to all of these then you can try setting HSAKMT_DEBUG_LEVEL=7 which will emit some additional debugging information.
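The access checks described above can be sketched as a small shell script. This is only an illustration: check_node is a hypothetical helper name, and the -r/-w tests look at permissions only, so a node can pass here while open() on it still fails (as root, these tests pass for almost any existing node).

```shell
#!/bin/sh
# check_node PATH (hypothetical helper): report whether PATH exists and
# whether the current user has read and write permission on it.
check_node() {
    node=$1
    if [ ! -e "$node" ]; then
        echo "$node: missing"
    elif [ -r "$node" ] && [ -w "$node" ]; then
        echo "$node: read-write ok"
    else
        echo "$node: no read-write access"
    fi
}

# The compute device node, then every DRM render node.
check_node /dev/kfd
# If the glob matches nothing, the literal pattern is reported as missing,
# which is itself a useful hint that no render nodes exist.
for render_node in /dev/dri/renderD*; do
    check_node "$render_node"
done
```

If every node reports read-write ok, the next step from the comment above is to rerun with the debug variable set for that one command: HSAKMT_DEBUG_LEVEL=7 /opt/rocm/bin/rocminfo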

arjones85 commented 3 years ago

Thank you! Here's the output:

[root@host ~]# ll /dev/kfd
crw-rw-rw- 1 root render 235, 0 Apr 8 16:46 /dev/kfd
[root@host ~]#
[root@host ~]# ll /dev/dri/card0
crw-rw---- 1 root video 226, 0 Apr 8 16:46 /dev/dri/card0
[root@host ~]#
[root@host dri]# export HSAKMT_DEBUG_LEVEL=7
[root@host dri]# /opt/rocm/bin/rocminfo
ROCk module is loaded
Unable to open /dev/kfd read-write: Cannot allocate memory
root is member of render group
[root@host dri]#

[root@host dri]# groups
root video render

This node has four MI100 GPUs in it.

I checked dmesg and /var/log/messages and do not see any debug output. Is there somewhere else it is recorded?

For what it's worth this is a PXE booted node running a diskless image. I installed the ROCm drivers and then modprobe'd amdgpu.

skeelyamd commented 3 years ago

I don't see any /dev/dri/renderD* files listed. The card0 node alone is not enough; you need the renderD* files to allocate device memory. If those are missing, it suggests that the driver, dri, or libdrm is not installed or not running correctly.
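One way to see whether any render nodes were created, and which kernel driver owns them, is to walk sysfs. The sketch below assumes the usual layout in which /sys/class/drm/renderD*/device/driver is a symlink to the bound driver; list_render_nodes is a hypothetical helper name, not part of ROCm.

```shell
#!/bin/sh
# list_render_nodes [DRM_ROOT] (hypothetical helper): print each render node
# under DRM_ROOT together with the kernel driver bound to its device.
list_render_nodes() {
    drm_root=${1:-/sys/class/drm}
    found=0
    for entry in "$drm_root"/renderD*; do
        [ -e "$entry" ] || continue        # glob matched nothing
        found=1
        # The driver symlink ends in the driver name, e.g. .../drivers/amdgpu
        link=$(readlink "$entry/device/driver" 2>/dev/null) || link=
        driver=${link##*/}
        echo "${entry##*/}: driver=${driver:-unknown}"
    done
    [ "$found" -eq 1 ] || echo "no render nodes found"
}

list_render_nodes
```

On a node with working MI100s, one would expect one renderD entry per GPU with driver=amdgpu; "no render nodes found" matches the situation described in this comment.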

skeelyamd commented 3 years ago

Are you by chance running rocminfo a large number of times in parallel or running it in parallel with a large number of other processes?

skeelyamd commented 3 years ago

Can you check dmesg for lines like this:

[132328.363259] amdgpu: Failed to alloc doorbell for pdd
[132328.363266] amdgpu: Failed to create process device data

This indicates that you have exhausted the entire doorbell BAR. There is a hard limit of 254 concurrent processes with an open handle to /dev/kfd.
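To check how close a node is to that limit, the processes holding /dev/kfd open can be counted from /proc. This is a rough sketch: count_holders is a hypothetical helper, and it only sees other users' processes when run as root.

```shell
#!/bin/sh
# count_holders TARGET [PROC_ROOT] (hypothetical helper): count processes
# that have TARGET open by reading the symlinks in PROC_ROOT/<pid>/fd.
count_holders() {
    target=$1
    proc_root=${2:-/proc}
    count=0
    for pid_dir in "$proc_root"/[0-9]*; do
        [ -d "$pid_dir/fd" ] || continue
        for fd in "$pid_dir"/fd/*; do
            if [ "$(readlink "$fd" 2>/dev/null)" = "$target" ]; then
                count=$((count + 1))
                break                      # count each process only once
            fi
        done
    done
    echo "$count"
}

echo "processes with /dev/kfd open: $(count_holders /dev/kfd) (limit: 254)"
```

A count near 254 would be consistent with the doorbell exhaustion described above; a count near zero points elsewhere.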

arjones85 commented 3 years ago

Hello,

Thanks for the help! No, I am not running it a large number of times, just once to generate the above output.

No dmesg lines aside from:

[root@host rocm]# dmesg | grep -i gpu
[ 77.251525] [drm] amdgpu kernel modesetting enabled.
[root@host rocm]#

Here are the installed dri/drm packages:

[root@host rocm]# rpm -qa | grep -i drm
libdrm-devel-2.4.101-1.el8.x86_64
libdrm-2.4.101-1.el8.x86_64

[root@host rocm]# rpm -qa | grep -i dri
mesa-dri-drivers-20.1.4-1.el8.x86_64

skeelyamd commented 3 years ago

Do you have /dev/dri/renderD* files? These control the GPU's on-board memory.

The message "Unable to open /dev/kfd read-write: Cannot allocate memory" is generated from here: https://github.com/RadeonOpenCompute/rocminfo/blob/master/rocminfo.cc#L1060-L1067

Since this isn't opening a pipe, the error code indicates a lack of kernel memory. Assuming that your system is not entirely out of memory, this implies a lack of access or space in the GPU control space. Typically the limiting factor is one of the small PCIe BARs. However, the driver cannot access these BARs at all if the VRAM control interfaces are missing. That is not because the driver uses the user-space interfaces, but because their absence implies that some component of the system isn't running.

fxkamd commented 3 years ago

Let's take a step back. First make sure the driver is installed correctly. What does "dkms status" report?

Then let's see a complete kernel log.

Finally, if the driver is installed correctly and you are still having problems, we can enable some extra debug messages to maybe get more information about the point of failure:

echo -n 'module amdgpu +pfl' > /sys/kernel/debug/dynamic_debug/control
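
The three steps above could be wrapped as follows. This is a sketch, not an official procedure: run_if_present and enable_amdgpu_debug are hypothetical helper names, and the control-file path is the one from the command above (it requires root and a mounted debugfs).

```shell
#!/bin/sh
# run_if_present CMD [ARGS...] (hypothetical helper): run CMD when it is
# installed, otherwise report that it is missing.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "$1: not installed"
    fi
}

# enable_amdgpu_debug [CONTROL_FILE] (hypothetical helper): write the
# dynamic-debug query for the amdgpu module to the control file.
enable_amdgpu_debug() {
    control=${1:-/sys/kernel/debug/dynamic_debug/control}
    if [ -w "$control" ]; then
        printf '%s' 'module amdgpu +pfl' > "$control"
    else
        echo "$control: not writable (is debugfs mounted? are you root?)" >&2
        return 1
    fi
}

# Typical use, as root:
#   run_if_present dkms status     # step 1: is the driver built and installed?
#   dmesg > kernel.log             # step 2: capture the complete kernel log
#   enable_amdgpu_debug            # step 3: extra amdgpu debug messages
```

After enabling the debug messages, rerunning rocminfo and capturing dmesg again should show the additional amdgpu output near the point of failure.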

ppanchad-amd commented 1 month ago

Closing the ticket as there has been no response from the user. Thanks.