Closed Sfinx closed 5 months ago
Additional info: laptop BIOS has three settings for AMD GPU - Hybrid/SAG 1.5/dGPU. When set to dGPU the 'xbutil validate' completes without error.
Disregard the BIOS setting comment. It turns out the bug can't be reproduced right after a fresh reboot.
To reproduce the bug after a fresh reboot, just run the stock xdna-driver example once:
xdna-driver/build/example_build$ ./example_noop_test /opt/xilinx/xrt/amdxdna/bins/1502_00/validate.xclbin
The next 'xbutil validate' will then crash.
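The reproduction steps above can be sketched as a small Python script. This is only a sketch: the build directory and the xclbin path are copied from the report, while the helper names `repro_commands` and `run_repro` are made up for illustration.

```python
import subprocess
from pathlib import Path

# Paths taken from the report above; adjust them to your own checkout.
EXAMPLE_DIR = Path("xdna-driver/build/example_build")
XCLBIN = "/opt/xilinx/xrt/amdxdna/bins/1502_00/validate.xclbin"


def repro_commands(xclbin: str = XCLBIN) -> list[list[str]]:
    """Return the two commands from the report, in the order that triggers the crash."""
    return [
        ["./example_noop_test", xclbin],  # stock xdna-driver example, run once
        ["xbutil", "validate"],           # the run that follows is the one that oopses
    ]


def run_repro(example_dir: Path = EXAMPLE_DIR) -> list[int]:
    """Run both commands and collect exit codes without stopping at the first failure."""
    codes = []
    for cmd in repro_commands():
        # The example binary is started from its build directory, as in the report.
        cwd = example_dir if cmd[0].startswith("./") else None
        codes.append(subprocess.run(cmd, cwd=cwd).returncode)
    return codes
```

On a machine with the NPU, calling `run_repro()` runs both commands in order. Whether the second one fails depends on the driver version installed.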
BTW: sometimes I see this while running the stock example:
amdxdna 0000:6a:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0001 address=0x7c79c0c61100 flags=0x0027]
@Sfinx The "general protection fault" issue is fixed by #47.
I cannot reproduce the IO_PAGE_FAULT issue with xdna-driver example on my board yet. Please let me know if you still see it with #47.
This issue is not about the IO_PAGE_FAULT in the xdna-driver example but about the oops in amdxdna_flush() while running 'xbutil validate'. Will check #47.
Can't reproduce the oops with #47 applied. Thanks !
Closed too early ;) Oopsed again :
xbutil examine
System Configuration
OS Name : Linux
Release : 6.8.7-060807-generic
Version : #202404170934 SMP PREEMPT_DYNAMIC Thu Apr 18 13:01:01 EEST 2024
Machine : x86_64
CPU Cores : 16
Memory : 95777 MB
Distribution : Ubuntu 22.04.4 LTS
GLIBC : 2.35
Model : TUXEDO Sirius 16 Gen1
BIOS vendor : American Megatrends International, LLC.
BIOS version : V1.00A00_20240108
XRT
Version : 2.17.0
Branch : HEAD
Hash : baf88820fb3fc24dda4dc08c91ecbca2c76c7b0f
Hash Date : 2024-04-20 09:27:29
XOCL : 2.17.0, baf88820fb3fc24dda4dc08c91ecbca2c76c7b0f
XCLMGMT : 2.17.0, baf88820fb3fc24dda4dc08c91ecbca2c76c7b0f
AMDXDNA : 2.17.0_20240417, 35351e4bbbc65568669c36255825425030be721f
Devices present
BDF : Name
---------------------------------
[0000:6a:00.1] : RyzenAI-npu1
Userspace:
------------------------------------------------------------
EARLY ACCESS
This release of xbutil contains early access
experimental features which may have bugs.
------------------------------------------------------------
Validate Device : [0000:6a:00.1]
Platform : RyzenAI-npu1
-------------------------------------------------------------------------------
Test 1 [0000:6a:00.1] : verify
Description : Run 'Hello World' test on IPU
Xclbin : /opt/xilinx/xrt/amdxdna/bins/1502_00/validate.xclbin
Details : Kernel name is 'DPU_PDI_0'
Total duration: '1.1's
Average throughput: '9497.7' ops/s
Average latency: '105.3' us
Test Status : [PASSED]
-------------------------------------------------------------------------------
[ <-> ]: Running Test... < 2s >
[ <-> ]: Running Test... < 12s >
terminate called recursively
terminate called after throwing an instance of 'boost::wrapexcept<boost::io::too_many_args>'
Kernel:
All xdna-driver APIs froze after the oops, so a reboot is needed.
Hmm, I noticed that the hash for the amdxdna driver is still the old one, though I've rebuilt and reinstalled it as stated in the doc. Seems like 'build -clean' had to be issued. Starting over..
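A stale driver like this can be spotted by comparing the AMDXDNA line of 'xbutil examine' against the commit that was just built. Below is a minimal sketch of that check. It assumes the output format shown in this thread, and the function names are hypothetical.

```python
import re

# Matches e.g. "AMDXDNA : 2.17.0_20240417, 35351e4b..." as printed by `xbutil examine`.
AMDXDNA_RE = re.compile(r"^\s*AMDXDNA\s*:\s*(\S+),\s*([0-9a-f]+)\s*$", re.MULTILINE)


def loaded_amdxdna(examine_output: str):
    """Return (version, git_hash) of the amdxdna driver reported by xbutil, or None."""
    m = AMDXDNA_RE.search(examine_output)
    return (m.group(1), m.group(2)) if m else None


def driver_is_stale(examine_output: str, expected_hash: str) -> bool:
    """True when the reported hash does not start with the commit you just built."""
    found = loaded_amdxdna(examine_output)
    return found is None or not found[1].startswith(expected_hash.lower())
```

The expected hash can be a short or full commit id from the xdna-driver checkout. A `True` result suggests the old module is still installed.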
Okay, validate still crashes but with another message:
System Configuration
OS Name : Linux
Release : 6.8.7-060807-generic
Version : #202404170934 SMP PREEMPT_DYNAMIC Thu Apr 18 13:01:01 EEST 2024
Machine : x86_64
CPU Cores : 16
Memory : 95777 MB
Distribution : Ubuntu 22.04.4 LTS
GLIBC : 2.35
Model : TUXEDO Sirius 16 Gen1
BIOS vendor : American Megatrends International, LLC.
BIOS version : V1.00A00_20240108
XRT
Version : 2.17.0
Branch : HEAD
Hash : baf88820fb3fc24dda4dc08c91ecbca2c76c7b0f
Hash Date : 2024-04-20 12:05:46
XOCL : 2.17.0, baf88820fb3fc24dda4dc08c91ecbca2c76c7b0f
XCLMGMT : 2.17.0, baf88820fb3fc24dda4dc08c91ecbca2c76c7b0f
AMDXDNA : 2.17.0_20240420, e9eca7b2714afa83949608a8418a08a6a070973c
Devices present
BDF : Name
---------------------------------
[0000:6a:00.1] : RyzenAI-npu1
Userspace:
Verbose: Enabling Verbosity
------------------------------------------------------------
EARLY ACCESS
This release of xbutil contains early access
experimental features which may have bugs.
------------------------------------------------------------
Validate Device : [0000:6a:00.1]
Platform : RyzenAI-npu1
-------------------------------------------------------------------------------
Test 1 [0000:6a:00.1] : verify
Description : Run 'Hello World' test on IPU
Xclbin : /opt/xilinx/xrt/amdxdna/bins/1502_00/validate.xclbin
Details : Kernel name is 'DPU_PDI_0'
Total duration: '1.0's
Average throughput: '9599.0' ops/s
Average latency: '104.2' us
Test Status : [PASSED]
-------------------------------------------------------------------------------
[ <-> ]: Running Test... < 18s >
terminate called after throwing an instance of 'boost::wrapexcept<boost::io::too_many_args>'
what(): boost::too_many_args: format-string referred to fewer arguments than were passed
/opt/xilinx/xrt/bin/unwrapped/loader: line 61: 19690 Aborted (core dumped) "${XRT_PROG_UNWRAPPED}" "${XRT_LOADER_ARGS[@]}"
Kernel:
Hi @Sfinx, the kernel oops issue is fixed by #51. But when following your steps, there is still a chance that xbutil validate prints:
terminate called after throwing an instance of 'boost::wrapexcept<boost::io::too_many_args>'
what(): boost::too_many_args: format-string referred to fewer arguments than were passed
/opt/xilinx/xrt/bin/unwrapped/loader: line 61: 19690 Aborted (core dumped) "${XRT_PROG_UNWRAPPED}" "${XRT_LOADER_ARGS[@]}"
Below is my gdb backtrace. It looks like this is a bug in the XRT library. I will forward this issue to XRT.
terminate called recursively
terminate called after throwing an instance of 'boost::wrapexcept<boost::io::too_many_args>'
what(): boost::too_many_args: format-string referred to fewer arguments than were passed
Thread 4 "xbutil2" received signal SIGABRT, Aborted.
[Switching to Thread 0x15554d600640 (LWP 147774)]
__pthread_kill_implementation (no_tid=0, signo=6, threadid=23456114542144) at ./nptl/pthread_kill.c:44
44 ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=23456114542144) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=6, threadid=23456114542144) at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=23456114542144, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3 0x0000155554a42476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4 0x0000155554a287f3 in __GI_abort () at ./stdlib/abort.c:79
#5 0x0000155554eb042a in __gnu_cxx::__verbose_terminate_handler() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x0000155554eae20c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7 0x0000155554eae277 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8 0x0000155554eae4d8 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9 0x00005555555d5fb6 in void boost::io::detail::distribute<char, std::char_traits<char>, std::allocator<char>, boost::io::detail::put_holder<char, std::char_traits<char> > const&>(boost::basic_format<char, std::char_traits<char>, std::allocator<char> >&, boost::io::detail::put_holder<char, std::char_traits<char> > const&) ()
#10 0x00005555555d5fff in boost::basic_format<char, std::char_traits<char>, std::allocator<char> >& boost::io::detail::feed_impl<char, std::char_traits<char>, std::allocator<char>, boost::io::detail::put_holder<char, std::char_traits<char> > const&>(boost::basic_format<char, std::char_traits<char>, std::allocator<char> >&, boost::io::detail::put_holder<char, std::char_traits<char> > const&) ()
#11 0x00005555555cecfb in XBUtilities::BusyBar::update() ()
#12 0x0000155554edc253 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#13 0x0000155554a94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#14 0x0000155554b26850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
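The backtrace above ends in boost::format throwing from XBUtilities::BusyBar::update() because more arguments were fed than the format string has placeholders. The XRT source is not shown in this thread, so the following is only a Python analogy of that class of error, using the `%` operator and a hypothetical template string. It shows the strict behaviour and one defensive alternative.

```python
def render(template: str, *args) -> str:
    """Strict formatting: raises TypeError if args do not match the placeholders."""
    return template % args


def safe_render(template: str, *args, fallback: str = "Running Test...") -> str:
    """Same formatting, but a mismatch degrades to a fixed string instead of raising."""
    try:
        return render(template, *args)
    except (TypeError, ValueError):
        return fallback
```

With one placeholder and two arguments, `render` raises while `safe_render` returns the fallback. In the C++ case the exception escapes a worker thread, which is why the process aborts instead of printing an error.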
Thanks for such a fast turnaround, testing #51 now..
You are right - this simple xdna-driver example does not catch the default exception; not sure about the XRT bug.
BTW: Where should I report kernel iommu patch issues? I've started seeing 'amdgpu: [gfxhub] page fault' messages after the patch was applied, while using normal desktop apps (no xdna tasks involved).
Still glitching:
System Configuration
OS Name : Linux
Release : 6.8.7-060807-generic
Version : #202404170934 SMP PREEMPT_DYNAMIC Sat Apr 20 14:33:18 EEST 2024
Machine : x86_64
CPU Cores : 16
Memory : 95777 MB
Distribution : Ubuntu 22.04.4 LTS
GLIBC : 2.35
Model : TUXEDO Sirius 16 Gen1
BIOS vendor : American Megatrends International, LLC.
BIOS version : V1.00A00_20240108
XRT
Version : 2.17.0
Branch : HEAD
Hash : baf88820fb3fc24dda4dc08c91ecbca2c76c7b0f
Hash Date : 2024-04-20 12:05:46
XOCL : 2.17.0, baf88820fb3fc24dda4dc08c91ecbca2c76c7b0f
XCLMGMT : 2.17.0, baf88820fb3fc24dda4dc08c91ecbca2c76c7b0f
AMDXDNA : 2.17.0_20240423, 70f709eda4363af0a9a5824786e2747a7fadf345
Devices present
BDF : Name
---------------------------------
[0000:6a:00.1] : RyzenAI-npu1
------------------------------------------------------------
EARLY ACCESS
This release of xbutil contains early access
experimental features which may have bugs.
------------------------------------------------------------
Validate Device : [0000:6a:00.1]
Platform : RyzenAI-npu1
-------------------------------------------------------------------------------
Test 1 [0000:6a:00.1] : verify
Description : Run 'Hello World' test on IPU
Xclbin : /opt/xilinx/xrt/amdxdna/bins/1502_00/validate.xclbin
Details : Kernel name is 'DPU_PDI_0'
Total duration: '1.6's
Average throughput: '6305.2' ops/s
Average latency: '158.6' us
Test Status : [PASSED]
-------------------------------------------------------------------------------
[ <-> ]: Running Test... < 2s >
[ <-> ]: Running Test... < 21s >
terminate called recursively
terminate called after throwing an instance of 'boost::wrapexcept<boost::io::too_many_args>'
/opt/xilinx/xrt/bin/unwrapped/loader: line 61: 94101 Aborted (core dumped) "${XRT_PROG_UNWRAPPED}" "${XRT_LOADER_ARGS[@]}"
Kernel:
[Tue Apr 23 15:05:54 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x7c73effc2b00 flags=0x0010]
[Tue Apr 23 15:05:54 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x7c734ffc3a00 flags=0x0030]
[Tue Apr 23 15:05:54 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x7c73cffc4700 flags=0x0010]
[Tue Apr 23 15:05:54 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x7c736ffc1f00 flags=0x0030]
[Tue Apr 23 15:05:54 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x7c736ffc2000 flags=0x0030]
[Tue Apr 23 15:05:54 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x7c73effc3000 flags=0x0010]
[Tue Apr 23 15:05:54 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x7c734ffc4000 flags=0x0030]
[Tue Apr 23 15:05:54 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x7c73cffc5000 flags=0x0010]
[Tue Apr 23 15:05:54 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x7c73cffc5800 flags=0x0010]
[Tue Apr 23 15:05:54 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x7c736ffc3000 flags=0x0030]
[Tue Apr 23 15:05:59 2024] amdxdna 0000:6a:00.1: npu_send_mgmt_msg_wait: command opcode 0x3 failed, status 0x2000006
[Tue Apr 23 15:05:59 2024] amdxdna 0000:6a:00.1: npu1_destroy_context: hwctx.94536.0 destroy context failed, ret -22
[Tue Apr 23 15:05:59 2024] amdxdna 0000:6a:00.1: npu1_xrs_unload: destroy context failed, ret -22
Good news: the above error is recoverable and a reboot is no longer needed.
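The "ret -22" in the driver messages above is a negated errno value. A small sketch for decoding such values follows; the helper name is hypothetical. The firmware status 0x2000006 is a different kind of code and is not covered here.

```python
import errno
import os


def describe_kernel_ret(ret: int) -> str:
    """Turn a negative kernel return value such as -22 into 'EINVAL (<description>)'."""
    code = abs(ret)
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({os.strerror(code)})"
```

For -22 this reports EINVAL, i.e. the destroy-context request was rejected as invalid.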
After a restart it gave nearly the same kernel messages after some time:
[Tue Apr 23 15:50:39 2024] amd_iommu_report_page_fault: 344 callbacks suppressed
[Tue Apr 23 15:50:39 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x79d06af8c800 flags=0x0030]
[Tue Apr 23 15:50:39 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x79d04af8e600 flags=0x0030]
[Tue Apr 23 15:50:39 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x79d0eaf8d500 flags=0x0010]
[Tue Apr 23 15:50:39 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x79d0caf8f300 flags=0x0010]
[Tue Apr 23 15:50:39 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x79d06af8d000 flags=0x0030]
[Tue Apr 23 15:50:39 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x79d04af8f000 flags=0x0030]
[Tue Apr 23 15:50:39 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x79d0eaf8e000 flags=0x0010]
[Tue Apr 23 15:50:39 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x79d0caf90000 flags=0x0010]
[Tue Apr 23 15:50:39 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x79d0caf90800 flags=0x0010]
[Tue Apr 23 15:50:39 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x79d06af8de00 flags=0x0030]
[Tue Apr 23 15:50:39 2024] AMD-Vi: IOMMU Event log restarting
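When there are hundreds of these events, it helps to tally them per PCI device, since the faults in the log above are attributed to 0000:6a:00.0 while the NPU sits at 0000:6a:00.1. A minimal parsing sketch, assuming the dmesg line format seen in this thread (the function names are made up):

```python
import re
from collections import Counter

# One AMD-Vi IO_PAGE_FAULT event as it appears in the kernel log lines above.
FAULT_RE = re.compile(
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): AMD-Vi: Event logged "
    r"\[IO_PAGE_FAULT domain=(?P<domain>0x[0-9a-f]+) "
    r"address=(?P<address>0x[0-9a-f]+) flags=(?P<flags>0x[0-9a-f]+)\]"
)


def parse_faults(log_text: str):
    """Yield one dict (bdf, domain, address, flags) per IO_PAGE_FAULT line."""
    for m in FAULT_RE.finditer(log_text):
        yield m.groupdict()


def faults_per_device(log_text: str) -> Counter:
    """Count faults by PCI address, e.g. to tell 6a:00.0 apart from 6a:00.1."""
    return Counter(f["bdf"] for f in parse_faults(log_text))
```

Feeding it the output of `dmesg` gives a per-device count that can be pasted into the issue instead of the full log.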
@Sfinx, I still cannot reproduce the IO_PAGE_FAULT issue. I have created a ticket for the test case owner. Hope we can address the root cause soon. Until then, let's keep this issue open.
Good to know that you don't need a reboot to recover. That is what #51 fixed.
To reproduce, just run 'xbutil validate' and './example_build/example_noop_test ./tools/bins/1502_00/validate.xclbin' in parallel for some time. BTW, xbutil will segfault every ~10 minutes. I guess this does not count as a stress test.
The xdna device is "0000:6a:00.1", but the IO_PAGE_FAULT happens on 6a:00.0. Can you run the following and post the output here? sudo lspci -vvs 6a:00.0
6a:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Device 14ec
Subsystem: Advanced Micro Devices, Inc. [AMD] Device 14ec
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
IOMMU group: 31
Capabilities: [48] Vendor Specific Information: Len=08 <?>
Capabilities: [50] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [64] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 16GT/s (ok), Width x16 (ok)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR-
10BitTagComp+ 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- TPHComp- ExtTPHComp-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
AtomicOpsCtl: ReqEn-
LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [a0] MSI: Enable- Count=1/4 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [d0] SATA HBA v1.0 InCfgSpace
Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
Capabilities: [270 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
LaneErrStat: 0
Capabilities: [2a0 v1] Access Control Services
ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
Capabilities: [410 v1] Physical Layer 16.0 GT/s <?>
Capabilities: [450 v1] Lane Margining at the Receiver <?>
PCI tree:
-[0000:00]-+-00.0 Advanced Micro Devices, Inc. [AMD] Device 14e8
+-00.2 Advanced Micro Devices, Inc. [AMD] Device 14e9
+-01.0 Advanced Micro Devices, Inc. [AMD] Device 14ea
+-01.1-[01-03]----00.0-[02-03]----00.0-[03]--+-00.0 Advanced Micro Devices, Inc. [AMD/ATI] Navi 33 [Radeon RX 7700S/7600/7600S/7600M XT/PRO W7600]
| \-00.1 Advanced Micro Devices, Inc. [AMD/ATI] Navi 31 HDMI/DP Audio
+-01.2-[04]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
+-02.0 Advanced Micro Devices, Inc. [AMD] Device 14ea
+-02.1-[05]----00.0 Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller
+-02.2-[06]----00.0 Intel Corporation Wi-Fi 6E(802.11ax) AX210/AX1675* 2x2 [Typhoon Peak]
+-02.3-[07]--
+-02.4-[08]----00.0 Realtek Semiconductor Co., Ltd. RTS5762 NVMe SSD Controller
+-03.0 Advanced Micro Devices, Inc. [AMD] Device 14ea
+-03.1-[09-68]--
+-04.0 Advanced Micro Devices, Inc. [AMD] Device 14ea
+-08.0 Advanced Micro Devices, Inc. [AMD] Device 14ea
+-08.1-[69]--+-00.0 Advanced Micro Devices, Inc. [AMD/ATI] Phoenix1
| +-00.1 Advanced Micro Devices, Inc. [AMD/ATI] Rembrandt Radeon High Definition Audio Controller
| +-00.2 Advanced Micro Devices, Inc. [AMD] Family 19h (Model 74h) CCP/PSP 3.0 Device
| +-00.3 Advanced Micro Devices, Inc. [AMD] Device 15b9
| +-00.4 Advanced Micro Devices, Inc. [AMD] Device 15ba
| +-00.5 Advanced Micro Devices, Inc. [AMD] ACP/ACP3X/ACP6x Audio Coprocessor
| \-00.6 Advanced Micro Devices, Inc. [AMD] Family 17h/19h HD Audio Controller
+-08.2-[6a]--+-00.0 Advanced Micro Devices, Inc. [AMD] Device 14ec
| \-00.1 Advanced Micro Devices, Inc. [AMD] AMD IPU Device
+-08.3-[6b]--+-00.0 Advanced Micro Devices, Inc. [AMD] Device 14ec
| +-00.3 Advanced Micro Devices, Inc. [AMD] Device 15c0
| +-00.4 Advanced Micro Devices, Inc. [AMD] Device 15c1
| \-00.5 Advanced Micro Devices, Inc. [AMD] Pink Sardine USB4/Thunderbolt NHI controller #1
+-14.0 Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller
+-14.3 Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge
+-18.0 Advanced Micro Devices, Inc. [AMD] Device 14f0
+-18.1 Advanced Micro Devices, Inc. [AMD] Device 14f1
+-18.2 Advanced Micro Devices, Inc. [AMD] Device 14f2
+-18.3 Advanced Micro Devices, Inc. [AMD] Device 14f3
+-18.4 Advanced Micro Devices, Inc. [AMD] Device 14f4
+-18.5 Advanced Micro Devices, Inc. [AMD] Device 14f5
+-18.6 Advanced Micro Devices, Inc. [AMD] Device 14f6
\-18.7 Advanced Micro Devices, Inc. [AMD] Device 14f7
Just tried with the latest XDNA and XRT drivers and it is still working.
Interestingly, I did not know about xbutil validate for the NPU. ;-) @mamin506 where is it documented?
I run the validation of https://github.com/Xilinx/mlir-aie with the check-aie target on my laptop on a daily basis, and I have been lucky for a few weeks now.
@Sfinx Is your kernel far from https://github.com/AMD-SW/linux/tree/v6.8.7-iommu-sva-part4-v7 ?
rkeryell@rk-xsj:~/Xilinx/Projects/AIE/xdna-driver/build (main)$ xbutil examine
System Configuration
OS Name : Linux
Release : 6.8.7+iommu-sva-part4-v7+
Version : #1 SMP PREEMPT_DYNAMIC Fri Apr 19 09:35:16 PDT 2024
Machine : x86_64
CPU Cores : 16
Memory : 63575 MB
Distribution : Ubuntu 23.10
GLIBC : 2.38
Model : HP ZBook Power 15.6 inch G10 A Mobile Workstation PC
BIOS vendor : HP
BIOS version : V85 Ver. 01.04.00
XRT
Version : 2.17.0
Branch : master
Hash : daf9d07d92ccb8f004f5d8d677e6c855b03514c1
Hash Date : 2024-04-24 09:19:33
XOCL : 2.17.0, daf9d07d92ccb8f004f5d8d677e6c855b03514c1
XCLMGMT : 2.17.0, daf9d07d92ccb8f004f5d8d677e6c855b03514c1
AMDXDNA : 2.17.0_20240424, a2e2ad3c0ea096bf035ae0f439b0350d997bfe7b
Firmware Version : N/A
Devices present
BDF : Name
---------------------------------
[0000:66:00.1] : RyzenAI-npu1
rkeryell@rk-xsj:~/Xilinx/Projects/AIE/xdna-driver/build (main)$ xbutil validate --device 0000:66:00.1
------------------------------------------------------------
EARLY ACCESS
This release of xbutil contains early access
experimental features which may have bugs.
------------------------------------------------------------
Validate Device : [0000:66:00.1]
Platform : RyzenAI-npu1
Performance Mode : Default
-------------------------------------------------------------------------------
Test 1 [0000:66:00.1] : verify
Details : Kernel name is 'DPU_PDI_0'
Instruction size: '20' bytes
No. of iterations: '10000'
Average throughput: '21173.6' ops/s
Average latency: '94.5' us
Test Status : [PASSED]
-------------------------------------------------------------------------------
Test 2 [0000:66:00.1] : df-bw
Details : Kernel name is 'DPU_PDI_0'
Details : Buffer size: '1'GB
No. of iterations: '600'
Total duration: '85.4's
Average bandwidth per shim DMA: '14.1' GB/s
Test Status : [PASSED]
-------------------------------------------------------------------------------
Test 3 [0000:66:00.1] : tct-one-col
Details : Kernel name is 'DPU_PDI_0'
Details : Buffer size: '4'bytes
No. of iterations: '10000'
Average time for TCT: '4.0' us
Average TCT throughput: '247076.2' TCT/s
Test Status : [PASSED]
-------------------------------------------------------------------------------
Test 4 [0000:66:00.1] : tct-all-col
Details : Kernel name is 'DPU_PDI_0'
Details : Buffer size: '4' bytes
No. of iterations: '20000'
Average time for TCT: '2.0' us
Average TCT throughput: '498471.8' TCT/s
Test Status : [PASSED]
-------------------------------------------------------------------------------
Validation completed. Please run the command '--verbose' option for more details
rkeryell@rk-xsj:~/Xilinx/Projects/AIE/xdna-driver/build (main)$ xbutil validate --device 0000:66:00.1
------------------------------------------------------------
EARLY ACCESS
This release of xbutil contains early access
experimental features which may have bugs.
------------------------------------------------------------
Validate Device : [0000:66:00.1]
Platform : RyzenAI-npu1
Performance Mode : Default
-------------------------------------------------------------------------------
Test 1 [0000:66:00.1] : verify
Details : Kernel name is 'DPU_PDI_0'
Instruction size: '20' bytes
No. of iterations: '10000'
Average throughput: '21260.4' ops/s
Average latency: '94.4' us
Test Status : [PASSED]
-------------------------------------------------------------------------------
Test 2 [0000:66:00.1] : df-bw
Details : Kernel name is 'DPU_PDI_0'
Details : Buffer size: '1'GB
No. of iterations: '600'
Total duration: '86.6's
Average bandwidth per shim DMA: '13.9' GB/s
Test Status : [PASSED]
-------------------------------------------------------------------------------
Test 3 [0000:66:00.1] : tct-one-col
Details : Kernel name is 'DPU_PDI_0'
Details : Buffer size: '4'bytes
No. of iterations: '10000'
Average time for TCT: '4.0' us
Average TCT throughput: '248398.0' TCT/s
Test Status : [PASSED]
-------------------------------------------------------------------------------
Test 4 [0000:66:00.1] : tct-all-col
Details : Kernel name is 'DPU_PDI_0'
Details : Buffer size: '4' bytes
No. of iterations: '20000'
Average time for TCT: '2.0' us
Average TCT throughput: '499935.2' TCT/s
Test Status : [PASSED]
-------------------------------------------------------------------------------
Validation completed. Please run the command '--verbose' option for more details
rkeryell@rk-xsj:~/Xilinx/Projects/AIE/xdna-driver/build (main)$ xbutil validate --device 0000:66:00.1 --verbose
Verbose: Enabling Verbosity
------------------------------------------------------------
EARLY ACCESS
This release of xbutil contains early access
experimental features which may have bugs.
------------------------------------------------------------
Validate Device : [0000:66:00.1]
Platform : RyzenAI-npu1
Performance Mode : Default
-------------------------------------------------------------------------------
Test 1 [0000:66:00.1] : verify
Description : Run end-to-end latency and throughput test on NPU
Xclbin : /opt/xilinx/xrt/amdxdna/bins/1502_00/validate.xclbin
Details : Kernel name is 'DPU_PDI_0'
Instruction size: '20' bytes
No. of iterations: '10000'
Average throughput: '21319.9' ops/s
Average latency: '94.8' us
Test Status : [PASSED]
-------------------------------------------------------------------------------
Test 2 [0000:66:00.1] : df-bw
Description : Run bandwidth test on data fabric
Xclbin : /opt/xilinx/xrt/amdxdna/bins/1502_00/validate.xclbin
Details : Kernel name is 'DPU_PDI_0'
DPU-Sequence : /opt/xilinx/xrt/amdxdna/bins/dpu_sequence/df_bw.txt
Details : Buffer size: '1'GB
No. of iterations: '600'
Total duration: '81.9's
Average bandwidth per shim DMA: '14.7' GB/s
Test Status : [PASSED]
-------------------------------------------------------------------------------
Test 3 [0000:66:00.1] : tct-one-col
Description : Measure average TCT processing time for one column
Xclbin : /opt/xilinx/xrt/amdxdna/bins/1502_00/validate.xclbin
Details : Kernel name is 'DPU_PDI_0'
DPU-Sequence : /opt/xilinx/xrt/amdxdna/bins/dpu_sequence/tct_1col.txt
Details : Buffer size: '4'bytes
No. of iterations: '10000'
Average time for TCT: '3.9' us
Average TCT throughput: '258836.9' TCT/s
Test Status : [PASSED]
-------------------------------------------------------------------------------
Test 4 [0000:66:00.1] : tct-all-col
Description : Measure average TCT processing time for all columns
Xclbin : /opt/xilinx/xrt/amdxdna/bins/1502_00/validate.xclbin
Details : Kernel name is 'DPU_PDI_0'
DPU-Sequence : /opt/xilinx/xrt/amdxdna/bins/dpu_sequence/tct_1col.txt
Details : Buffer size: '4' bytes
No. of iterations: '20000'
Average time for TCT: '1.9' us
Average TCT throughput: '524821.0' TCT/s
Test Status : [PASSED]
-------------------------------------------------------------------------------
Test 5 [0000:66:00.1] : gemm
Description : Measure the TOPS value of GEMM operations
Details : bins/1502_00/ not available. Skipping validation.
Test Status : [SKIPPED]
A few weeks ago the amdxdna driver oopsed on every 'xbutil validate' run, so yes, you are really lucky ;) My kernel is exactly the same as v6.8.7-iommu-sva-part4-v7. BTW: I've started seeing rare 'amdgpu: [gfxhub] page fault' messages after the iommu patch was applied, while using normal desktop apps (no xdna tasks involved), but I still don't know where to report such cases.
Hi @hegdevasant, @Sfinx is observing 'amdgpu: [gfxhub] page fault' issue with kernel v6.8.7-iommu-sva-part4-v7. Any idea where to report such cases?
@keryell, the xbutil tool usage is documented in the XRT documentation. See https://xilinx.github.io/XRT/master/html/xbutil.html
The Ryzen support in xbutil is still early access.
It would be nice to add a use case in the README of this repository.
On the other hand v6.9 is coming soon and it might solve a lot of issues including this one.
Is there any ongoing work to have xdna-driver on top of v6.9?
@mamin506 @maxzhen Please feel free to invite me to all the AMD internal meetings on this driver.
@Sfinx Also moving to 24.04 might help, with some new firmware and new X11 server. I am still on 23.10 but plan to move next month.
@keryell, I'm using good old X.org so no big news here ;) I will wait a month until things around 24.04 settle down and only then move. Maybe the iommu patches will already be in the upstream kernel by then.
Not before v6.10 in the most optimistic case, so you will need an Ubuntu HWE kernel, or to move to a non-LTS version.
I have pushed the 6.8.8 kernel branch for this project to https://github.com/AMD-SW/linux. It includes a few AMD-related patches from upstream, so it might help.
The page faults disappeared with iommu=pt.
BTW: 6.9 is out
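For anyone trying the same workaround, here is a small sketch that checks whether the running kernel was actually booted with iommu=pt. It only parses a command-line string; the helper names are hypothetical and the sample strings used to try it are illustrative.

```python
def kernel_params(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to ''."""
    params = {}
    for token in cmdline.split():
        name, _, value = token.partition("=")
        params[name] = value
    return params


def iommu_passthrough(cmdline: str) -> bool:
    """True when the command line contains iommu=pt."""
    return kernel_params(cmdline).get("iommu") == "pt"
```

On Linux the string to check can be read from /proc/cmdline, for example `iommu_passthrough(open("/proc/cmdline").read())`.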
I will double check with IOMMU team for the support of 6.9.
BTW, the "xbutil validate" crash issue has been fixed by #86. Please build a new XRT package and try.
Can't reproduce the kernel crash anymore. And the mlir_aie examples work like a charm. Thanks!
Waiting for 6.9 commit to https://github.com/AMD-SW/linux at least
AMD Ryzen 7 7840HS, Ubuntu 22.04:
'xbutil validate' freezes at:
Oops: