Xilinx / mlir-aie

An MLIR-based toolchain for AMD AI Engine-enabled devices.

"0 devices found" after trying to upgrade xrt #1618

Open ngdxzy opened 3 months ago

ngdxzy commented 3 months ago

I was trying to use the new xdna-driver, so I followed its installation instructions for Linux. In the end, however, the NPU device is not found when running xbutil examine. The output is shown below:

System Configuration
  OS Name              : Linux
  Release              : 6.8.8+
  Version              : #2 SMP PREEMPT_DYNAMIC Fri Jul 12 16:44:38 EDT 2024
  Machine              : x86_64
  CPU Cores            : 16
  Memory               : 28884 MB
  Distribution         : Ubuntu 22.04.4 LTS
  GLIBC                : 2.35
  Model                : NucBox K8
  BIOS vendor          : American Megatrends International, LLC.
  BIOS version         : NucBox K8 1.07

XRT
  Version              : 2.18.0
  Branch               : HEAD
  Hash                 : 73fe5440974fc51ccaba6366094e4bfa8151f79a
  Hash Date            : 2024-07-12 18:42:09
  XOCL                 : unknown, unknown
  XCLMGMT              : unknown, unknown
WARNING: xclmgmt version is unknown. Is xclmgmt driver loaded? Or is MSD/MPD running?
  AMDXDNA              : 2.18.0_20240712, b6db49f792a48123a016ba052d0c2103862547ee

Devices present
  0 devices found

I checked dmesg and found that the amdxdna driver failed to load:

[    1.982433] kernel: amdxdna: loading out-of-tree module taints kernel.
[    1.982439] kernel: amdxdna: module verification failed: signature and/or required key missing - tainting kernel
[    1.986184] kernel: amdxdna 0000:67:00.1: loading /lib/firmware/amdnpu/1502_00/npu.sbin failed with error -22
[    1.986188] kernel: amdxdna 0000:67:00.1: Direct firmware load for amdnpu/1502_00/npu.sbin failed with error -22
[    1.986190] kernel: amdxdna 0000:67:00.1: aie2_init: failed to request_firmware amdnpu/1502_00/npu.sbin, ret -22
[    1.986223] kernel: amdxdna 0000:67:00.1: amdxdna_probe: Hardware init failed, ret -22
[    1.986245] kernel: amdxdna: probe of 0000:67:00.1 failed with error -22

I also checked the PCI information:

67:00.1 Signal processing controller: Advanced Micro Devices, Inc. [AMD] Device 1502
    Subsystem: Advanced Micro Devices, Inc. [AMD] Device 1502
    Flags: fast devsel, IRQ 255, IOMMU group 27
    Memory at dc900000 (32-bit, non-prefetchable) [disabled] [size=512K]
    Memory at dc9c0000 (32-bit, non-prefetchable) [disabled] [size=8K]
    Memory at 7c10000000 (64-bit, prefetchable) [disabled] [size=256K]
    Memory at dc980000 (32-bit, non-prefetchable) [disabled] [size=256K]
    Capabilities: [48] Vendor Specific Information: Len=08 <?>
    Capabilities: [50] Power Management version 3
    Capabilities: [64] Express Endpoint, MSI 00
    Capabilities: [a0] MSI: Enable- Count=1/16 Maskable- 64bit+
    Capabilities: [c0] MSI-X: Enable- Count=16 Masked-
    Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
    Capabilities: [150] Advanced Error Reporting
    Capabilities: [2a0] Access Control Services
    Capabilities: [2d0] Process Address Space ID (PASID)
    Kernel modules: amdxdna

May I ask how to solve this?

stephenneuendorffer commented 3 months ago

I'm not sure if this is your problem, but I found I had to explicitly remove and reinstall the xrt plugin when upgrading.

ngdxzy commented 3 months ago

> I'm not sure if this is your problem, but I found I had to explicitly remove and reinstall the xrt plugin when upgrading.

I did remove the old xrt. The only thing I am not sure about is that I compiled a new Linux kernel based on the old one. Does that mean I need to reinstall the operating system?

stephenneuendorffer commented 3 months ago

That shouldn't be necessary. Are you sure there were no errors when compiling/installing the xrt_plugin module? The fact that it can't find the firmware is very suspicious.

ngdxzy commented 3 months ago

I found the problem, and I think it is a bug that needs to be fixed. The tutorial (https://github.com/Xilinx/mlir-aie/blob/main/docs/buildHostLin.md) asks us to switch to a specific commit with:

git reset --hard b6db49f792a48123a016ba052d0c2103862547ee

At this commit, when running:

cd $XDNA_SRC_DIR/build
./build.sh -release
./build.sh -package

I found that ./build.sh -package tries to download npu.sbin from a URL defined in <tools/info.json>, and at this commit the URL is: https://gitlab.freedesktop.org/drm/firmware/-/raw/amd-ipu-staging/amdnpu/1502_00/npu.sbin.1.4.1.309

However, this URL no longer works, so the download fails with a 404 error. An empty file is still generated, though, so the rest of the build runs smoothly and no other errors show up. To solve this, I had to manually change the URL in info.json to the new one: https://gitlab.freedesktop.org/drm/firmware/-/raw/amd-ipu-staging/amdnpu/1502_00/npu.sbin.1.4.2.323
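
For anyone who prefers not to edit the file by hand, the URL change above could be scripted roughly like this. It is only a sketch: the function name update_firmware_url is made up for illustration, and it assumes the old file name appears literally in info.json, exactly as in the URL quoted above. A backup copy with a .bak suffix is kept next to the file.

```shell
# Hypothetical helper (not part of xdna-driver): replace the stale firmware
# file name in info.json with the newer one mentioned in this comment.
# Assumption: the old name appears literally somewhere in the file.
update_firmware_url() {
    info_json="$1"
    old="npu.sbin.1.4.1.309"
    new="npu.sbin.1.4.2.323"
    # Escape the dots so sed matches them literally instead of as wildcards.
    old_re=$(printf '%s' "$old" | sed 's/\./\\./g')
    # -i.bak edits in place and leaves the original as info.json.bak.
    sed -i.bak "s/${old_re}/${new}/g" "$info_json"
}
```

It would be called as update_firmware_url "$XDNA_SRC_DIR/tools/info.json" before running ./build.sh -package again, assuming info.json sits under tools/ in the source tree as the path above suggests.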

Now it works. I believe either the mlir-aie repository or the xdna-driver repository needs to be changed to fix this problem.
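
Since the failed download leaves an empty file without any error, a quick size check on the installed firmware may save others some time. The sketch below is an assumption-laden helper, not something the driver package provides; check_firmware is a hypothetical name, and the path to pass in is the one from the dmesg output above.

```shell
# Hypothetical check (not shipped with xrt or xdna-driver): report whether a
# firmware file is missing, empty, or present with a non-zero size.
check_firmware() {
    fw="$1"
    if [ ! -e "$fw" ]; then
        echo "missing: $fw"
        return 1
    fi
    # -s is true only when the file exists and is larger than zero bytes.
    if [ ! -s "$fw" ]; then
        echo "empty: $fw"
        return 2
    fi
    echo "ok: $fw ($(wc -c < "$fw") bytes)"
}
```

For example, check_firmware /lib/firmware/amdnpu/1502_00/npu.sbin should report "empty" in the situation described in this issue, which points at the download step rather than at the kernel or the driver build.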