amd / xdna-driver

kernel oops while running 'xbutil validate' #46

Closed by Sfinx 5 months ago

Sfinx commented 6 months ago

AMD Ryzen 7 7840HS, Ubuntu 22.04:

System Configuration
  OS Name              : Linux
  Release              : 6.8.7-060807-generic
  Version              : #202404170934 SMP PREEMPT_DYNAMIC Thu Apr 18 13:01:01 EEST 2024
  Machine              : x86_64
  CPU Cores            : 16
  Memory               : 95777 MB
  Distribution         : Ubuntu 22.04.4 LTS
  GLIBC                : 2.35
  Model                : TUXEDO Sirius 16 Gen1
  BIOS vendor          : American Megatrends International, LLC.
  BIOS version         : V1.00A00_20240108

XRT
  Version              : 2.17.0
  Branch               : HEAD
  Hash                 : baf88820fb3fc24dda4dc08c91ecbca2c76c7b0f
  Hash Date            : 2024-04-17 13:03:42
  XOCL                 : 2.17.0, baf88820fb3fc24dda4dc08c91ecbca2c76c7b0f
  XCLMGMT              : 2.17.0, baf88820fb3fc24dda4dc08c91ecbca2c76c7b0f
  AMDXDNA              : 2.17.0_20240417, 35351e4bbbc65568669c36255825425030be721f

Devices present
BDF             :  Name          
---------------------------------
[0000:6a:00.1]  :  RyzenAI-npu1  

'xbutil validate' freezes at:

------------------------------------------------------------
                        EARLY ACCESS                        
        This release of xbutil contains early access        
         experimental features which may have bugs.         
------------------------------------------------------------
Validate Device           : [0000:6a:00.1]
    Platform              : RyzenAI-npu1
-------------------------------------------------------------------------------
Test 1 [0000:6a:00.1]     : verify                                              
    Details               : Kernel name is 'DPU_PDI_0'
                            Total duration: '1.1's
                            Average throughput: '9490.3' ops/s
                            Average latency: '105.4' us
    Test Status           : [PASSED]
-------------------------------------------------------------------------------
[          <->    <->Running Test>    <->]: Running Test... < 59s >
terminate called after throwing an instance of 'boost::wrapexcept<boost::io::too_many_args>'
  what():  boost::too_many_args: format-string referred to fewer arguments than were passed

Oops:

[Fri Apr 19 07:03:53 2024] general protection fault, probably for non-canonical address 0xdead000000000108: 0000 [#1] PREEMPT SMP NOPTI
[Fri Apr 19 07:03:53 2024] CPU: 6 PID: 21072 Comm: xbutil2 Tainted: G        W  O       6.8.7-060807-generic #202404170934
[Fri Apr 19 07:03:53 2024] Hardware name: TUXEDO TUXEDO Sirius 16 Gen1/APX958, BIOS V1.00A00_20240108 01/08/2024
[Fri Apr 19 07:03:53 2024] RIP: 0010:amdxdna_flush+0x39/0xa0 [amdxdna]
[Fri Apr 19 07:03:53 2024] Code: c8 00 00 00 48 8b 98 98 00 00 00 4c 8b 63 68 66 90 49 81 c4 28 06 00 00 4c 89 e7 e8 d1 fe 91 eb 48 8b 13 48 8b 43 08 48 89 df <48> 89 42 08 48 89 10 48 b8 00 01 00 00 00 00 ad de 48 89 03 48 83
[Fri Apr 19 07:03:53 2024] RSP: 0018:ffff9fd28b9cfc40 EFLAGS: 00010246
[Fri Apr 19 07:03:53 2024] RAX: dead000000000122 RBX: ffff8ea8a814cfc0 RCX: ffff8ea7763e0800
[Fri Apr 19 07:03:53 2024] RDX: dead000000000100 RSI: ffff8ea7503f1b80 RDI: ffff8ea8a814cfc0
[Fri Apr 19 07:03:53 2024] RBP: ffff9fd28b9cfc50 R08: 0000000000000000 R09: 0000000000000000
[Fri Apr 19 07:03:53 2024] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8ea7cc212628
[Fri Apr 19 07:03:53 2024] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8eb3b04cb340
[Fri Apr 19 07:03:53 2024] FS:  0000000000000000(0000) GS:ffff8ebc1e700000(0000) knlGS:0000000000000000
[Fri Apr 19 07:03:53 2024] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Fri Apr 19 07:03:53 2024] CR2: 00007dfa3c600000 CR3: 0000000f2623c000 CR4: 0000000000f50ef0
[Fri Apr 19 07:03:53 2024] PKRU: 55555554
[Fri Apr 19 07:03:53 2024] Call Trace:
[Fri Apr 19 07:03:53 2024]  <TASK>
[Fri Apr 19 07:03:53 2024]  ? show_regs+0x6d/0x80
[Fri Apr 19 07:03:53 2024]  ? die_addr+0x37/0xa0
[Fri Apr 19 07:03:53 2024]  ? exc_general_protection+0x1db/0x480
[Fri Apr 19 07:03:53 2024]  ? asm_exc_general_protection+0x27/0x30
[Fri Apr 19 07:03:53 2024]  ? amdxdna_flush+0x39/0xa0 [amdxdna]
[Fri Apr 19 07:03:53 2024]  filp_flush+0x35/0x90
[Fri Apr 19 07:03:53 2024]  filp_close+0x14/0x30
[Fri Apr 19 07:03:53 2024]  put_files_struct+0x85/0xf0
[Fri Apr 19 07:03:53 2024]  exit_files+0x47/0x60
[Fri Apr 19 07:03:53 2024]  do_exit+0x295/0x530
[Fri Apr 19 07:03:53 2024]  do_group_exit+0x35/0x90
[Fri Apr 19 07:03:53 2024]  get_signal+0x954/0x990
[Fri Apr 19 07:03:53 2024]  ? srso_alias_return_thunk+0x5/0xfbef5
[Fri Apr 19 07:03:53 2024]  ? hrtimer_nanosleep+0xbf/0x1a0
[Fri Apr 19 07:03:53 2024]  arch_do_signal_or_restart+0x39/0x120
[Fri Apr 19 07:03:53 2024]  syscall_exit_to_user_mode+0x209/0x260
[Fri Apr 19 07:03:53 2024]  do_syscall_64+0x8c/0x180
[Fri Apr 19 07:03:53 2024]  ? syscall_exit_to_user_mode+0x89/0x260
[Fri Apr 19 07:03:53 2024]  ? srso_alias_return_thunk+0x5/0xfbef5
[Fri Apr 19 07:03:53 2024]  ? do_syscall_64+0x8c/0x180
[Fri Apr 19 07:03:53 2024]  ? srso_alias_return_thunk+0x5/0xfbef5
[Fri Apr 19 07:03:53 2024]  ? irqentry_exit_to_user_mode+0x7e/0x260
[Fri Apr 19 07:03:53 2024]  ? srso_alias_return_thunk+0x5/0xfbef5
[Fri Apr 19 07:03:53 2024]  ? irqentry_exit+0x43/0x50
[Fri Apr 19 07:03:53 2024]  ? srso_alias_return_thunk+0x5/0xfbef5
[Fri Apr 19 07:03:53 2024]  entry_SYSCALL_64_after_hwframe+0x78/0x80
[Fri Apr 19 07:03:53 2024] RIP: 0033:0x79d1b94e57f8
[Fri Apr 19 07:03:53 2024] Code: Unable to access opcode bytes at 0x79d1b94e57ce.
[Fri Apr 19 07:03:53 2024] RSP: 002b:00007fff79b86300 EFLAGS: 00000293 ORIG_RAX: 00000000000000e6
[Fri Apr 19 07:03:53 2024] RAX: fffffffffffffdfc RBX: 00007fff79b86301 RCX: 000079d1b94e57f8
[Fri Apr 19 07:03:53 2024] RDX: 00007fff79b863a0 RSI: 0000000000000000 RDI: 0000000000000000
[Fri Apr 19 07:03:53 2024] RBP: 00007fff79b863d0 R08: 0000000000000000 R09: 0000000000000000
[Fri Apr 19 07:03:53 2024] R10: 00007fff79b863a0 R11: 0000000000000293 R12: 00007fff79b863a0
[Fri Apr 19 07:03:53 2024] R13: 0000000000000000 R14: 00007fff79b863a0 R15: 00007fff79b86c50
[Fri Apr 19 07:03:53 2024]  </TASK>
[Fri Apr 19 07:03:53 2024] Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink xt_addrtype br_netfilter xfrm_user xfrm_algo rfcomm xt_CHECKSUM ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink xclmgmt(O) xocl(O) snd_seq_dummy snd_hrtimer bridge stp llc cmac algif_hash algif_skcipher af_alg bnep nvme_fabrics overlay zram binfmt_misc nls_iso8859_1 btusb btrtl btintel btbcm btmtk bluetooth ecdh_generic ecc snd_ctl_led ledtrig_audio snd_soc_dmic snd_ps_pdm_dma snd_soc_ps_mach snd_sof_amd_acp63 snd_sof_amd_vangogh snd_sof_amd_rembrandt snd_sof_amd_renoir snd_sof_amd_acp snd_sof_pci snd_sof_xtensa_dsp intel_rapl_common snd_sof snd_usb_audio uvcvideo edac_mce_amd snd_sof_utils videobuf2_vmalloc uvc snd_usbmidi_lib videobuf2_memops snd_soc_core snd_ump kvm_amd videobuf2_v4l2 ch341 snd_hda_codec_conexant snd_hda_codec_generic usbserial videodev snd_hda_codec_hdmi snd_rawmidi snd_compress kvm videobuf2_common ac97_bus mc snd_pcm_dmaengine irqbypass
[Fri Apr 19 07:03:53 2024]  snd_hda_intel iwlmvm rapl snd_pci_ps snd_intel_dspcfg input_leds mac80211 libarc4 snd_intel_sdw_acpi snd_hda_codec serio_raw snd_rpl_pci_acp6x hid_multitouch snd_acp_pci snd_hda_core snd_acp_legacy_common wmi_bmof snd_pci_acp6x snd_hwdep snd_pcm k10temp iwlwifi snd_pci_acp5x snd_rn_pci_acp3x amdxdna(O) snd_acp_config snd_seq snd_soc_acpi snd_pci_acp3x amd_pmf sp5100_tco snd_seq_device snd_timer cfg80211 snd soundcore amdtee soc_button_array mac_hid ccp amd_sfh tee amd_pmc platform_profile sch_fq_codel nfsd auth_rpcgss msr nfs_acl parport_pc lockd grace ppdev parport bfq efi_pstore sunrpc ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 uas usbhid usb_storage amdgpu amdxcp drm_exec gpu_sched drm_buddy i2c_algo_bit drm_suballoc_helper drm_ttm_helper crct10dif_pclmul ttm crc32_pclmul drm_display_helper polyval_clmulni polyval_generic cec nvme hid_generic sha256_ssse3 sha1_ssse3 xhci_pci r8169 nvme_core thunderbolt video
[Fri Apr 19 07:03:53 2024]  rc_core xhci_pci_renesas realtek nvme_auth wmi hid aesni_intel crypto_simd cryptd [last unloaded: i2c_hid]
[Fri Apr 19 07:03:53 2024] ---[ end trace 0000000000000000 ]---
[Fri Apr 19 07:03:53 2024] RIP: 0010:amdxdna_flush+0x39/0xa0 [amdxdna]
[Fri Apr 19 07:03:53 2024] Code: c8 00 00 00 48 8b 98 98 00 00 00 4c 8b 63 68 66 90 49 81 c4 28 06 00 00 4c 89 e7 e8 d1 fe 91 eb 48 8b 13 48 8b 43 08 48 89 df <48> 89 42 08 48 89 10 48 b8 00 01 00 00 00 00 ad de 48 89 03 48 83
[Fri Apr 19 07:03:53 2024] RSP: 0018:ffff9fd28b9cfc40 EFLAGS: 00010246
[Fri Apr 19 07:03:53 2024] RAX: dead000000000122 RBX: ffff8ea8a814cfc0 RCX: ffff8ea7763e0800
[Fri Apr 19 07:03:53 2024] RDX: dead000000000100 RSI: ffff8ea7503f1b80 RDI: ffff8ea8a814cfc0
[Fri Apr 19 07:03:53 2024] RBP: ffff9fd28b9cfc50 R08: 0000000000000000 R09: 0000000000000000
[Fri Apr 19 07:03:53 2024] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8ea7cc212628
[Fri Apr 19 07:03:53 2024] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8eb3b04cb340
[Fri Apr 19 07:03:53 2024] FS:  0000000000000000(0000) GS:ffff8ebc1e700000(0000) knlGS:0000000000000000
[Fri Apr 19 07:03:53 2024] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Fri Apr 19 07:03:53 2024] CR2: 00007dfa3c600000 CR3: 0000000ca84a0000 CR4: 0000000000f50ef0
[Fri Apr 19 07:03:53 2024] PKRU: 55555554
[Fri Apr 19 07:03:53 2024] Fixing recursive fault but reboot is needed!
[Fri Apr 19 07:03:53 2024] BUG: scheduling while atomic: xbutil2/21072/0x00000000
[Fri Apr 19 07:03:53 2024] Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink xt_addrtype br_netfilter xfrm_user xfrm_algo rfcomm xt_CHECKSUM ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink xclmgmt(O) xocl(O) snd_seq_dummy snd_hrtimer bridge stp llc cmac algif_hash algif_skcipher af_alg bnep nvme_fabrics overlay zram binfmt_misc nls_iso8859_1 btusb btrtl btintel btbcm btmtk bluetooth ecdh_generic ecc snd_ctl_led ledtrig_audio snd_soc_dmic snd_ps_pdm_dma snd_soc_ps_mach snd_sof_amd_acp63 snd_sof_amd_vangogh snd_sof_amd_rembrandt snd_sof_amd_renoir snd_sof_amd_acp snd_sof_pci snd_sof_xtensa_dsp intel_rapl_common snd_sof snd_usb_audio uvcvideo edac_mce_amd snd_sof_utils videobuf2_vmalloc uvc snd_usbmidi_lib videobuf2_memops snd_soc_core snd_ump kvm_amd videobuf2_v4l2 ch341 snd_hda_codec_conexant snd_hda_codec_generic usbserial videodev snd_hda_codec_hdmi snd_rawmidi snd_compress kvm videobuf2_common ac97_bus mc snd_pcm_dmaengine irqbypass
[Fri Apr 19 07:03:53 2024]  snd_hda_intel iwlmvm rapl snd_pci_ps snd_intel_dspcfg input_leds mac80211 libarc4 snd_intel_sdw_acpi snd_hda_codec serio_raw snd_rpl_pci_acp6x hid_multitouch snd_acp_pci snd_hda_core snd_acp_legacy_common wmi_bmof snd_pci_acp6x snd_hwdep snd_pcm k10temp iwlwifi snd_pci_acp5x snd_rn_pci_acp3x amdxdna(O) snd_acp_config snd_seq snd_soc_acpi snd_pci_acp3x amd_pmf sp5100_tco snd_seq_device snd_timer cfg80211 snd soundcore amdtee soc_button_array mac_hid ccp amd_sfh tee amd_pmc platform_profile sch_fq_codel nfsd auth_rpcgss msr nfs_acl parport_pc lockd grace ppdev parport bfq efi_pstore sunrpc ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 uas usbhid usb_storage amdgpu amdxcp drm_exec gpu_sched drm_buddy i2c_algo_bit drm_suballoc_helper drm_ttm_helper crct10dif_pclmul ttm crc32_pclmul drm_display_helper polyval_clmulni polyval_generic cec nvme hid_generic sha256_ssse3 sha1_ssse3 xhci_pci r8169 nvme_core thunderbolt video
[Fri Apr 19 07:03:53 2024]  rc_core xhci_pci_renesas realtek nvme_auth wmi hid aesni_intel crypto_simd cryptd [last unloaded: i2c_hid]
[Fri Apr 19 07:03:53 2024] CPU: 6 PID: 21072 Comm: xbutil2 Tainted: G      D W  O       6.8.7-060807-generic #202404170934
[Fri Apr 19 07:03:53 2024] Hardware name: TUXEDO TUXEDO Sirius 16 Gen1/APX958, BIOS V1.00A00_20240108 01/08/2024
[Fri Apr 19 07:03:53 2024] Call Trace:
[Fri Apr 19 07:03:53 2024]  <TASK>
[Fri Apr 19 07:03:53 2024]  dump_stack_lvl+0x76/0xa0
[Fri Apr 19 07:03:53 2024]  dump_stack+0x10/0x20
[Fri Apr 19 07:03:53 2024]  __schedule_bug+0x64/0x80
[Fri Apr 19 07:03:53 2024]  schedule_debug.isra.0+0xdb/0x130
[Fri Apr 19 07:03:53 2024]  __schedule+0x69/0x6b0
[Fri Apr 19 07:03:53 2024]  ? srso_alias_return_thunk+0x5/0xfbef5
[Fri Apr 19 07:03:53 2024]  ? vprintk+0x42/0x80
[Fri Apr 19 07:03:53 2024]  ? srso_alias_return_thunk+0x5/0xfbef5
[Fri Apr 19 07:03:53 2024]  ? _printk+0x60/0x90
[Fri Apr 19 07:03:53 2024]  do_task_dead+0x44/0x50
[Fri Apr 19 07:03:53 2024]  make_task_dead+0x13e/0x140
[Fri Apr 19 07:03:53 2024]  rewind_stack_and_make_dead+0x17/0x20
[Fri Apr 19 07:03:53 2024] RIP: 0033:0x79d1b94e57f8
[Fri Apr 19 07:03:53 2024] Code: Unable to access opcode bytes at 0x79d1b94e57ce.
[Fri Apr 19 07:03:53 2024] RSP: 002b:00007fff79b86300 EFLAGS: 00000293 ORIG_RAX: 00000000000000e6
[Fri Apr 19 07:03:53 2024] RAX: fffffffffffffdfc RBX: 00007fff79b86301 RCX: 000079d1b94e57f8
[Fri Apr 19 07:03:53 2024] RDX: 00007fff79b863a0 RSI: 0000000000000000 RDI: 0000000000000000
[Fri Apr 19 07:03:53 2024] RBP: 00007fff79b863d0 R08: 0000000000000000 R09: 0000000000000000
[Fri Apr 19 07:03:53 2024] R10: 00007fff79b863a0 R11: 0000000000000293 R12: 00007fff79b863a0
[Fri Apr 19 07:03:53 2024] R13: 0000000000000000 R14: 00007fff79b863a0 R15: 00007fff79b86c50
[Fri Apr 19 07:03:53 2024]  </TASK>
Sfinx commented 6 months ago

Additional info: laptop BIOS has three settings for AMD GPU - Hybrid/SAG 1.5/dGPU. When set to dGPU the 'xbutil validate' completes without error.

Disregard the BIOS setting comment. It turns out the bug simply can't be reproduced right after a fresh reboot.

Sfinx commented 6 months ago

Steps to reproduce the bug after a fresh reboot: run the stock xdna-driver example once:

xdna-driver/build/example_build$ ./example_noop_test /opt/xilinx/xrt/amdxdna/bins/1502_00/validate.xclbin

The next 'xbutil validate' will then trigger the oops.

BTW: sometimes I see this while running the stock example:

amdxdna 0000:6a:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0001 address=0x7c79c0c61100 flags=0x0027]
mamin506 commented 6 months ago

@Sfinx The "general protection fault" issue is fixed by #47.

I cannot reproduce the IO_PAGE_FAULT issue with xdna-driver example on my board yet. Please let me know if you still see it with #47.

Sfinx commented 6 months ago

This issue is not about the IO_PAGE_FAULT in the xdna-driver example but about the oops in amdxdna_flush() while running 'xbutil validate'. Will check #47.

Sfinx commented 6 months ago

Can't reproduce the oops with #47 applied. Thanks!

Sfinx commented 6 months ago

Closed too early ;) It oopsed again:

xbutil examine
System Configuration
  OS Name              : Linux
  Release              : 6.8.7-060807-generic
  Version              : #202404170934 SMP PREEMPT_DYNAMIC Thu Apr 18 13:01:01 EEST 2024
  Machine              : x86_64
  CPU Cores            : 16
  Memory               : 95777 MB
  Distribution         : Ubuntu 22.04.4 LTS
  GLIBC                : 2.35
  Model                : TUXEDO Sirius 16 Gen1
  BIOS vendor          : American Megatrends International, LLC.
  BIOS version         : V1.00A00_20240108

XRT
  Version              : 2.17.0
  Branch               : HEAD
  Hash                 : baf88820fb3fc24dda4dc08c91ecbca2c76c7b0f
  Hash Date            : 2024-04-20 09:27:29
  XOCL                 : 2.17.0, baf88820fb3fc24dda4dc08c91ecbca2c76c7b0f
  XCLMGMT              : 2.17.0, baf88820fb3fc24dda4dc08c91ecbca2c76c7b0f
  AMDXDNA              : 2.17.0_20240417, 35351e4bbbc65568669c36255825425030be721f

Devices present
BDF             :  Name          
---------------------------------
[0000:6a:00.1]  :  RyzenAI-npu1  

Userspace:

------------------------------------------------------------
                        EARLY ACCESS                        
        This release of xbutil contains early access        
         experimental features which may have bugs.         
------------------------------------------------------------
Validate Device           : [0000:6a:00.1]
    Platform              : RyzenAI-npu1
-------------------------------------------------------------------------------
Test 1 [0000:6a:00.1]     : verify                                              
    Description           : Run 'Hello World' test on IPU
    Xclbin                : /opt/xilinx/xrt/amdxdna/bins/1502_00/validate.xclbin
    Details               : Kernel name is 'DPU_PDI_0'
                            Total duration: '1.1's
                            Average throughput: '9497.7' ops/s
                            Average latency: '105.3' us
    Test Status           : [PASSED]
-------------------------------------------------------------------------------
[         <->        ]: Running Test... < 2s >
[              <->   ]: Running Test... < 12s >
terminate called recursively
terminate called after throwing an instance of 'boost::wrapexcept<boost::io::too_many_args>'

Kernel:

oops.txt

All xdna-driver API calls freeze after the oops, so a reboot is needed.

Sfinx commented 6 months ago

Hmm, I noticed that the hash for the amdxdna driver is still the old one, even though I rebuilt and reinstalled it as described in the docs. It seems 'build -clean' had to be issued. Starting over.

Sfinx commented 6 months ago

Okay, validate still crashes but with another message:

System Configuration
  OS Name              : Linux
  Release              : 6.8.7-060807-generic
  Version              : #202404170934 SMP PREEMPT_DYNAMIC Thu Apr 18 13:01:01 EEST 2024
  Machine              : x86_64
  CPU Cores            : 16
  Memory               : 95777 MB
  Distribution         : Ubuntu 22.04.4 LTS
  GLIBC                : 2.35
  Model                : TUXEDO Sirius 16 Gen1
  BIOS vendor          : American Megatrends International, LLC.
  BIOS version         : V1.00A00_20240108

XRT
  Version              : 2.17.0
  Branch               : HEAD
  Hash                 : baf88820fb3fc24dda4dc08c91ecbca2c76c7b0f
  Hash Date            : 2024-04-20 12:05:46
  XOCL                 : 2.17.0, baf88820fb3fc24dda4dc08c91ecbca2c76c7b0f
  XCLMGMT              : 2.17.0, baf88820fb3fc24dda4dc08c91ecbca2c76c7b0f
  AMDXDNA              : 2.17.0_20240420, e9eca7b2714afa83949608a8418a08a6a070973c

Devices present
BDF             :  Name          
---------------------------------
[0000:6a:00.1]  :  RyzenAI-npu1  

Userspace:

Verbose: Enabling Verbosity
------------------------------------------------------------
                        EARLY ACCESS                        
        This release of xbutil contains early access        
         experimental features which may have bugs.         
------------------------------------------------------------
Validate Device           : [0000:6a:00.1]
    Platform              : RyzenAI-npu1
-------------------------------------------------------------------------------
Test 1 [0000:6a:00.1]     : verify                                              
    Description           : Run 'Hello World' test on IPU
    Xclbin                : /opt/xilinx/xrt/amdxdna/bins/1502_00/validate.xclbin
    Details               : Kernel name is 'DPU_PDI_0'
                            Total duration: '1.0's
                            Average throughput: '9599.0' ops/s
                            Average latency: '104.2' us
    Test Status           : [PASSED]
-------------------------------------------------------------------------------
[               <->  ]: Running Test... < 18s >
terminate called after throwing an instance of 'boost::wrapexcept<boost::io::too_many_args>'
  what():  boost::too_many_args: format-string referred to fewer arguments than were passed
/opt/xilinx/xrt/bin/unwrapped/loader: line 61: 19690 Aborted                 (core dumped) "${XRT_PROG_UNWRAPPED}" "${XRT_LOADER_ARGS[@]}"

Kernel:

oops.txt

mamin506 commented 6 months ago

Hi @Sfinx, the kernel oops issue is fixed by #51. But when following your steps, I can still occasionally see xbutil validate print:

terminate called after throwing an instance of 'boost::wrapexcept<boost::io::too_many_args>'
  what():  boost::too_many_args: format-string referred to fewer arguments than were passed
/opt/xilinx/xrt/bin/unwrapped/loader: line 61: 19690 Aborted                 (core dumped) "${XRT_PROG_UNWRAPPED}" "${XRT_LOADER_ARGS[@]}"

Below is my gdb backtrace. It looks like this is a bug in the XRT library. I will forward this issue to XRT.

terminate called recursively
terminate called after throwing an instance of 'boost::wrapexcept<boost::io::too_many_args>'
  what():  boost::too_many_args: format-string referred to fewer arguments than were passed

Thread 4 "xbutil2" received signal SIGABRT, Aborted.
[Switching to Thread 0x15554d600640 (LWP 147774)]
__pthread_kill_implementation (no_tid=0, signo=6, threadid=23456114542144) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=23456114542144) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=23456114542144) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=23456114542144, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x0000155554a42476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x0000155554a287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x0000155554eb042a in __gnu_cxx::__verbose_terminate_handler() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x0000155554eae20c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x0000155554eae277 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x0000155554eae4d8 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00005555555d5fb6 in void boost::io::detail::distribute<char, std::char_traits<char>, std::allocator<char>, boost::io::detail::put_holder<char, std::char_traits<char> > const&>(boost::basic_format<char, std::char_traits<char>, std::allocator<char> >&, boost::io::detail::put_holder<char, std::char_traits<char> > const&) ()
#10 0x00005555555d5fff in boost::basic_format<char, std::char_traits<char>, std::allocator<char> >& boost::io::detail::feed_impl<char, std::char_traits<char>, std::allocator<char>, boost::io::detail::put_holder<char, std::char_traits<char> > const&>(boost::basic_format<char, std::char_traits<char>, std::allocator<char> >&, boost::io::detail::put_holder<char, std::char_traits<char> > const&) ()
#11 0x00005555555cecfb in XBUtilities::BusyBar::update() ()
#12 0x0000155554edc253 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#13 0x0000155554a94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#14 0x0000155554b26850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Sfinx commented 6 months ago

Thanks for such a fast turnaround, testing #51 now.

You are right: this simple xdna-driver example does not catch the default exception. Not sure about the XRT bug.

BTW: Where should I report kernel IOMMU patch issues? I've started seeing 'amdgpu: [gfxhub] page fault' messages since the patch was applied, while using normal desktop apps (no xdna tasks involved).

Sfinx commented 6 months ago

Still glitching:

System Configuration
  OS Name              : Linux
  Release              : 6.8.7-060807-generic
  Version              : #202404170934 SMP PREEMPT_DYNAMIC Sat Apr 20 14:33:18 EEST 2024
  Machine              : x86_64
  CPU Cores            : 16
  Memory               : 95777 MB
  Distribution         : Ubuntu 22.04.4 LTS
  GLIBC                : 2.35
  Model                : TUXEDO Sirius 16 Gen1
  BIOS vendor          : American Megatrends International, LLC.
  BIOS version         : V1.00A00_20240108

XRT
  Version              : 2.17.0
  Branch               : HEAD
  Hash                 : baf88820fb3fc24dda4dc08c91ecbca2c76c7b0f
  Hash Date            : 2024-04-20 12:05:46
  XOCL                 : 2.17.0, baf88820fb3fc24dda4dc08c91ecbca2c76c7b0f
  XCLMGMT              : 2.17.0, baf88820fb3fc24dda4dc08c91ecbca2c76c7b0f
  AMDXDNA              : 2.17.0_20240423, 70f709eda4363af0a9a5824786e2747a7fadf345

Devices present
BDF             :  Name          
---------------------------------
[0000:6a:00.1]  :  RyzenAI-npu1  
------------------------------------------------------------
                        EARLY ACCESS                        
        This release of xbutil contains early access        
         experimental features which may have bugs.         
------------------------------------------------------------
Validate Device           : [0000:6a:00.1]
    Platform              : RyzenAI-npu1
-------------------------------------------------------------------------------
Test 1 [0000:6a:00.1]     : verify                                              
    Description           : Run 'Hello World' test on IPU
    Xclbin                : /opt/xilinx/xrt/amdxdna/bins/1502_00/validate.xclbin
    Details               : Kernel name is 'DPU_PDI_0'
                            Total duration: '1.6's
                            Average throughput: '6305.2' ops/s
                            Average latency: '158.6' us
    Test Status           : [PASSED]
-------------------------------------------------------------------------------
[       <->          ]: Running Test... < 2s >
[      <->           ]: Running Test... < 21s >
terminate called recursively
terminate called after throwing an instance of 'boost::wrapexcept<boost::io::too_many_args>'
/opt/xilinx/xrt/bin/unwrapped/loader: line 61: 94101 Aborted                 (core dumped) "${XRT_PROG_UNWRAPPED}" "${XRT_LOADER_ARGS[@]}"

Kernel:

[Tue Apr 23 15:05:54 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x7c73effc2b00 flags=0x0010]
[Tue Apr 23 15:05:54 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x7c734ffc3a00 flags=0x0030]
[Tue Apr 23 15:05:54 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x7c73cffc4700 flags=0x0010]
[Tue Apr 23 15:05:54 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x7c736ffc1f00 flags=0x0030]
[Tue Apr 23 15:05:54 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x7c736ffc2000 flags=0x0030]
[Tue Apr 23 15:05:54 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x7c73effc3000 flags=0x0010]
[Tue Apr 23 15:05:54 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x7c734ffc4000 flags=0x0030]
[Tue Apr 23 15:05:54 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x7c73cffc5000 flags=0x0010]
[Tue Apr 23 15:05:54 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x7c73cffc5800 flags=0x0010]
[Tue Apr 23 15:05:54 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x7c736ffc3000 flags=0x0030]
[Tue Apr 23 15:05:59 2024] amdxdna 0000:6a:00.1: npu_send_mgmt_msg_wait: command opcode 0x3 failed, status 0x2000006
[Tue Apr 23 15:05:59 2024] amdxdna 0000:6a:00.1: npu1_destroy_context: hwctx.94536.0 destroy context failed, ret -22
[Tue Apr 23 15:05:59 2024] amdxdna 0000:6a:00.1: npu1_xrs_unload: destroy context failed, ret -22

Good news: the above error is recoverable and no reboot is needed anymore.

Sfinx commented 6 months ago

After a restart it produced nearly the same kernel messages after some time:

[Tue Apr 23 15:50:39 2024] amd_iommu_report_page_fault: 344 callbacks suppressed
[Tue Apr 23 15:50:39 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x79d06af8c800 flags=0x0030]
[Tue Apr 23 15:50:39 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x79d04af8e600 flags=0x0030]
[Tue Apr 23 15:50:39 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x79d0eaf8d500 flags=0x0010]
[Tue Apr 23 15:50:39 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x79d0caf8f300 flags=0x0010]
[Tue Apr 23 15:50:39 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x79d06af8d000 flags=0x0030]
[Tue Apr 23 15:50:39 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x79d04af8f000 flags=0x0030]
[Tue Apr 23 15:50:39 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x79d0eaf8e000 flags=0x0010]
[Tue Apr 23 15:50:39 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x79d0caf90000 flags=0x0010]
[Tue Apr 23 15:50:39 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x79d0caf90800 flags=0x0010]
[Tue Apr 23 15:50:39 2024] pci 0000:6a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0022 address=0x79d06af8de00 flags=0x0030]
[Tue Apr 23 15:50:39 2024] AMD-Vi: IOMMU Event log restarting
mamin506 commented 6 months ago

@Sfinx, I still cannot reproduce the IO_PAGE_FAULT issue. I have created a ticket for the test case owner. Hopefully we can find the root cause soon. Until then, let's keep this issue open.

Good to know that you don't need a reboot to recover. That is what #51 fixed.

Sfinx commented 6 months ago

To reproduce, just run 'xbutil validate' and './example_build/example_noop_test ./tools/bins/1502_00/validate.xclbin' in parallel for a while. BTW, xbutil will segfault roughly every 10 minutes. I guess this doesn't count as a stress test.

mamin506 commented 6 months ago

The xdna device is "0000:6a:00.1", but the IO_PAGE_FAULT happens on 6a:00.0. Can you run the following and post the output here? sudo lspci -vvs 6a:00.0

Sfinx commented 6 months ago
6a:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Device 14ec
    Subsystem: Advanced Micro Devices, Inc. [AMD] Device 14ec
    Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    IOMMU group: 31
    Capabilities: [48] Vendor Specific Information: Len=08 <?>
    Capabilities: [50] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [64] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
        DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 16GT/s (ok), Width x16 (ok)
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR-
             10BitTagComp+ 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
             EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
             FRS- TPHComp- ExtTPHComp-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
             AtomicOpsCtl: ReqEn-
        LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
        LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
             EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
             Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [a0] MSI: Enable- Count=1/4 Maskable- 64bit+
        Address: 0000000000000000  Data: 0000
    Capabilities: [d0] SATA HBA v1.0 InCfgSpace
    Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
    Capabilities: [270 v1] Secondary PCI Express
        LnkCtl3: LnkEquIntrruptEn- PerformEqu-
        LaneErrStat: 0
    Capabilities: [2a0 v1] Access Control Services
        ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
    Capabilities: [410 v1] Physical Layer 16.0 GT/s <?>
    Capabilities: [450 v1] Lane Margining at the Receiver <?>

PCI tree:

-[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD] Device 14e8
           +-00.2  Advanced Micro Devices, Inc. [AMD] Device 14e9
           +-01.0  Advanced Micro Devices, Inc. [AMD] Device 14ea
           +-01.1-[01-03]----00.0-[02-03]----00.0-[03]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Navi 33 [Radeon RX 7700S/7600/7600S/7600M XT/PRO W7600]
           |                                            \-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Navi 31 HDMI/DP Audio
           +-01.2-[04]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
           +-02.0  Advanced Micro Devices, Inc. [AMD] Device 14ea
           +-02.1-[05]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller
           +-02.2-[06]----00.0  Intel Corporation Wi-Fi 6E(802.11ax) AX210/AX1675* 2x2 [Typhoon Peak]
           +-02.3-[07]--
           +-02.4-[08]----00.0  Realtek Semiconductor Co., Ltd. RTS5762 NVMe SSD Controller
           +-03.0  Advanced Micro Devices, Inc. [AMD] Device 14ea
           +-03.1-[09-68]--
           +-04.0  Advanced Micro Devices, Inc. [AMD] Device 14ea
           +-08.0  Advanced Micro Devices, Inc. [AMD] Device 14ea
           +-08.1-[69]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Phoenix1
           |            +-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Rembrandt Radeon High Definition Audio Controller
           |            +-00.2  Advanced Micro Devices, Inc. [AMD] Family 19h (Model 74h) CCP/PSP 3.0 Device
           |            +-00.3  Advanced Micro Devices, Inc. [AMD] Device 15b9
           |            +-00.4  Advanced Micro Devices, Inc. [AMD] Device 15ba
           |            +-00.5  Advanced Micro Devices, Inc. [AMD] ACP/ACP3X/ACP6x Audio Coprocessor
           |            \-00.6  Advanced Micro Devices, Inc. [AMD] Family 17h/19h HD Audio Controller
           +-08.2-[6a]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 14ec
           |            \-00.1  Advanced Micro Devices, Inc. [AMD] AMD IPU Device
           +-08.3-[6b]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 14ec
           |            +-00.3  Advanced Micro Devices, Inc. [AMD] Device 15c0
           |            +-00.4  Advanced Micro Devices, Inc. [AMD] Device 15c1
           |            \-00.5  Advanced Micro Devices, Inc. [AMD] Pink Sardine USB4/Thunderbolt NHI controller #1
           +-14.0  Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller
           +-14.3  Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge
           +-18.0  Advanced Micro Devices, Inc. [AMD] Device 14f0
           +-18.1  Advanced Micro Devices, Inc. [AMD] Device 14f1
           +-18.2  Advanced Micro Devices, Inc. [AMD] Device 14f2
           +-18.3  Advanced Micro Devices, Inc. [AMD] Device 14f3
           +-18.4  Advanced Micro Devices, Inc. [AMD] Device 14f4
           +-18.5  Advanced Micro Devices, Inc. [AMD] Device 14f5
           +-18.6  Advanced Micro Devices, Inc. [AMD] Device 14f6
           \-18.7  Advanced Micro Devices, Inc. [AMD] Device 14f7
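The lspci output above reports IOMMU group 31 for 6a:00.0. To see which other devices share a group with a given function, the standard sysfs layout can be read directly. This is a minimal sketch under stated assumptions: the function name `iommu_group_members` and its `sysfs_root` parameter are hypothetical, and it relies only on the `iommu_group` symlink under `/sys/bus/pci/devices/<BDF>/` and the `devices` directory inside each group.

```python
import os


def iommu_group_members(bdf, sysfs_root="/sys"):
    """Return (group name, sorted member addresses) for a PCI device.

    Returns None when the device has no iommu_group link, for example
    when the device does not exist or the IOMMU is disabled.
    """
    link = os.path.join(sysfs_root, "bus", "pci", "devices", bdf, "iommu_group")
    if not os.path.exists(link):
        return None
    # The link resolves to .../kernel/iommu_groups/<N>.
    group_dir = os.path.realpath(link)
    members = sorted(os.listdir(os.path.join(group_dir, "devices")))
    return os.path.basename(group_dir), members
```

Calling `iommu_group_members("0000:6a:00.0")` and `iommu_group_members("0000:6a:00.1")` on the affected machine would show whether the two functions sit in the same group or in separate ones.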

keryell commented 6 months ago

I just tried with the latest XDNA driver and XRT, and it is still working. Interestingly, I did not know about xbutil validate for the NPU. ;-) @mamin506 where is it documented? I run the validation of https://github.com/Xilinx/mlir-aie with the check-aie target on my laptop on a daily basis, and I have been lucky for a few weeks now. @Sfinx Is your kernel far from https://github.com/AMD-SW/linux/tree/v6.8.7-iommu-sva-part4-v7 ?

rkeryell@rk-xsj:~/Xilinx/Projects/AIE/xdna-driver/build (main)$ xbutil examine
System Configuration
  OS Name              : Linux
  Release              : 6.8.7+iommu-sva-part4-v7+
  Version              : #1 SMP PREEMPT_DYNAMIC Fri Apr 19 09:35:16 PDT 2024
  Machine              : x86_64
  CPU Cores            : 16
  Memory               : 63575 MB
  Distribution         : Ubuntu 23.10
  GLIBC                : 2.38
  Model                : HP ZBook Power 15.6 inch G10 A Mobile Workstation PC
  BIOS vendor          : HP
  BIOS version         : V85 Ver. 01.04.00

XRT
  Version              : 2.17.0
  Branch               : master
  Hash                 : daf9d07d92ccb8f004f5d8d677e6c855b03514c1
  Hash Date            : 2024-04-24 09:19:33
  XOCL                 : 2.17.0, daf9d07d92ccb8f004f5d8d677e6c855b03514c1
  XCLMGMT              : 2.17.0, daf9d07d92ccb8f004f5d8d677e6c855b03514c1
  AMDXDNA              : 2.17.0_20240424, a2e2ad3c0ea096bf035ae0f439b0350d997bfe7b
  Firmware Version     : N/A

Devices present
BDF             :  Name
---------------------------------
[0000:66:00.1]  :  RyzenAI-npu1

rkeryell@rk-xsj:~/Xilinx/Projects/AIE/xdna-driver/build (main)$ xbutil validate --device 0000:66:00.1
------------------------------------------------------------
                        EARLY ACCESS
        This release of xbutil contains early access
         experimental features which may have bugs.
------------------------------------------------------------
Validate Device           : [0000:66:00.1]
    Platform              : RyzenAI-npu1
    Performance Mode      : Default
-------------------------------------------------------------------------------
Test 1 [0000:66:00.1]     : verify
    Details               : Kernel name is 'DPU_PDI_0'
                            Instruction size: '20' bytes
                            No. of iterations: '10000'
                            Average throughput: '21173.6' ops/s
                            Average latency: '94.5' us
    Test Status           : [PASSED]
-------------------------------------------------------------------------------
Test 2 [0000:66:00.1]     : df-bw
    Details               : Kernel name is 'DPU_PDI_0'
    Details               : Buffer size: '1'GB
                            No. of iterations: '600'
                            Total duration: '85.4's
                            Average bandwidth per shim DMA: '14.1' GB/s
    Test Status           : [PASSED]
-------------------------------------------------------------------------------
Test 3 [0000:66:00.1]     : tct-one-col
    Details               : Kernel name is 'DPU_PDI_0'
    Details               : Buffer size: '4'bytes
                            No. of iterations: '10000'
                            Average time for TCT: '4.0' us
                            Average TCT throughput: '247076.2' TCT/s
    Test Status           : [PASSED]
-------------------------------------------------------------------------------
Test 4 [0000:66:00.1]     : tct-all-col
    Details               : Kernel name is 'DPU_PDI_0'
    Details               : Buffer size: '4' bytes
                            No. of iterations: '20000'
                            Average time for TCT: '2.0' us
                            Average TCT throughput: '498471.8' TCT/s
    Test Status           : [PASSED]
-------------------------------------------------------------------------------
Validation completed. Please run the command '--verbose' option for more details
rkeryell@rk-xsj:~/Xilinx/Projects/AIE/xdna-driver/build (main)$ xbutil validate --device 0000:66:00.1
------------------------------------------------------------
                        EARLY ACCESS
        This release of xbutil contains early access
         experimental features which may have bugs.
------------------------------------------------------------
Validate Device           : [0000:66:00.1]
    Platform              : RyzenAI-npu1
    Performance Mode      : Default
-------------------------------------------------------------------------------
Test 1 [0000:66:00.1]     : verify
    Details               : Kernel name is 'DPU_PDI_0'
                            Instruction size: '20' bytes
                            No. of iterations: '10000'
                            Average throughput: '21260.4' ops/s
                            Average latency: '94.4' us
    Test Status           : [PASSED]
-------------------------------------------------------------------------------
Test 2 [0000:66:00.1]     : df-bw
    Details               : Kernel name is 'DPU_PDI_0'
    Details               : Buffer size: '1'GB
                            No. of iterations: '600'
                            Total duration: '86.6's
                            Average bandwidth per shim DMA: '13.9' GB/s
    Test Status           : [PASSED]
-------------------------------------------------------------------------------
Test 3 [0000:66:00.1]     : tct-one-col
    Details               : Kernel name is 'DPU_PDI_0'
    Details               : Buffer size: '4'bytes
                            No. of iterations: '10000'
                            Average time for TCT: '4.0' us
                            Average TCT throughput: '248398.0' TCT/s
    Test Status           : [PASSED]
-------------------------------------------------------------------------------
Test 4 [0000:66:00.1]     : tct-all-col
    Details               : Kernel name is 'DPU_PDI_0'
    Details               : Buffer size: '4' bytes
                            No. of iterations: '20000'
                            Average time for TCT: '2.0' us
                            Average TCT throughput: '499935.2' TCT/s
    Test Status           : [PASSED]
-------------------------------------------------------------------------------
Validation completed. Please run the command '--verbose' option for more details
rkeryell@rk-xsj:~/Xilinx/Projects/AIE/xdna-driver/build (main)$ xbutil validate --device 0000:66:00.1 --verbose
Verbose: Enabling Verbosity
------------------------------------------------------------
                        EARLY ACCESS                        
        This release of xbutil contains early access        
         experimental features which may have bugs.         
------------------------------------------------------------
Validate Device           : [0000:66:00.1]
    Platform              : RyzenAI-npu1
    Performance Mode      : Default
-------------------------------------------------------------------------------
Test 1 [0000:66:00.1]     : verify                                              
    Description           : Run end-to-end latency and throughput test on NPU
    Xclbin                : /opt/xilinx/xrt/amdxdna/bins/1502_00/validate.xclbin
    Details               : Kernel name is 'DPU_PDI_0'
                            Instruction size: '20' bytes
                            No. of iterations: '10000'
                            Average throughput: '21319.9' ops/s
                            Average latency: '94.8' us
    Test Status           : [PASSED]
-------------------------------------------------------------------------------
Test 2 [0000:66:00.1]     : df-bw                                               
    Description           : Run bandwidth test on data fabric                   
    Xclbin                : /opt/xilinx/xrt/amdxdna/bins/1502_00/validate.xclbin
    Details               : Kernel name is 'DPU_PDI_0'
    DPU-Sequence          : /opt/xilinx/xrt/amdxdna/bins/dpu_sequence/df_bw.txt
    Details               : Buffer size: '1'GB
                            No. of iterations: '600'
                            Total duration: '81.9's
                            Average bandwidth per shim DMA: '14.7' GB/s
    Test Status           : [PASSED]
-------------------------------------------------------------------------------
Test 3 [0000:66:00.1]     : tct-one-col                                         
    Description           : Measure average TCT processing time for one column  
    Xclbin                : /opt/xilinx/xrt/amdxdna/bins/1502_00/validate.xclbin
    Details               : Kernel name is 'DPU_PDI_0'
    DPU-Sequence          : /opt/xilinx/xrt/amdxdna/bins/dpu_sequence/tct_1col.txt
    Details               : Buffer size: '4'bytes
                            No. of iterations: '10000'
                            Average time for TCT: '3.9' us
                            Average TCT throughput: '258836.9' TCT/s
    Test Status           : [PASSED]
-------------------------------------------------------------------------------
Test 4 [0000:66:00.1]     : tct-all-col                                         
    Description           : Measure average TCT processing time for all columns 
    Xclbin                : /opt/xilinx/xrt/amdxdna/bins/1502_00/validate.xclbin
    Details               : Kernel name is 'DPU_PDI_0'
    DPU-Sequence          : /opt/xilinx/xrt/amdxdna/bins/dpu_sequence/tct_1col.txt
    Details               : Buffer size: '4' bytes
                            No. of iterations: '20000'
                            Average time for TCT: '1.9' us
                            Average TCT throughput: '524821.0' TCT/s
    Test Status           : [PASSED]
-------------------------------------------------------------------------------
Test 5 [0000:66:00.1]     : gemm                                                

    Description           : Measure the TOPS value of GEMM operations
    Details               : bins/1502_00/ not available. Skipping validation.
    Test Status           : [SKIPPED]

Sfinx commented 6 months ago

A few weeks ago the amdxdna driver oopsed on every 'xbutil validate' run, so yes, you are really lucky ;) My kernel is exactly the same as v6.8.7-iommu-sva-part4-v7. BTW: after the iommu patch was applied, I started seeing rare 'amdgpu: [gfxhub] page fault' messages while using normal desktop apps (no xdna tasks involved), but I still do not know where to report such cases.

mamin506 commented 6 months ago

Hi @hegdevasant, @Sfinx is observing 'amdgpu: [gfxhub] page fault' issue with kernel v6.8.7-iommu-sva-part4-v7. Any idea where to report such cases?

mamin506 commented 6 months ago

@keryell, the xbutil tool usage is documented in the XRT documentation. See https://xilinx.github.io/XRT/master/html/xbutil.html

The Ryzen support of xbutil is still early access.

keryell commented 6 months ago

> the xbutil tool usage is documented in XRT document. See xilinx.github.io/XRT/master/html/xbutil.html
>
> The Ryzen support of xbutil is still early access.

It would be nice to add a use case in the README of this repository.

keryell commented 6 months ago

> Hi @hegdevasant, @Sfinx is observing 'amdgpu: [gfxhub] page fault' issue with kernel v6.8.7-iommu-sva-part4-v7. Any idea where to report such cases?

On the other hand v6.9 is coming soon and it might solve a lot of issues including this one. Any on-going work to have xdna-driver on top of v6.9? @mamin506 @maxzhen Please feel free to invite me to all the AMD internal meetings on this driver. @Sfinx Also moving to 24.04 might help, with some new firmware and new X11 server. I am still on 23.10 but plan to move next month.

Sfinx commented 6 months ago

@keryell, I'm using good old X.org, so no big news here ;) I will wait a month until things around 24.04 settle down, and only then move. Maybe the iommu patches will already be in the upstream kernel by then.

keryell commented 6 months ago

> @keryell, I'm using good old X.org, so no big news here ;) I will wait a month until things around 24.04 settle down, and only then move. Maybe the iommu patches will already be in the upstream kernel by then.

Not before v6.10 in the most optimistic case, so you will need an Ubuntu HWE kernel, or to move to a non-LTS version.
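Checking a running kernel against a minimum version such as the v6.10 mentioned above is easy to get wrong with plain string comparison, because "6.10" sorts before "6.8" as text. The sketch below compares numeric tuples instead. The helper names `kernel_version` and `at_least` are hypothetical, and v6.10 is only the optimistic estimate given in this thread, not a confirmed release for the patches.

```python
import re


def kernel_version(release):
    """Extract (major, minor) from a release string such as '6.8.7-060807-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def at_least(release, minimum):
    """True when the release is at or above the (major, minor) minimum."""
    return kernel_version(release) >= minimum
```

On a live system the release string can be taken from `platform.release()` in the Python standard library, for example `at_least(platform.release(), (6, 10))`.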

keryell commented 6 months ago

I have pushed the 6.8.8 kernel branch for this project to https://github.com/AMD-SW/linux. It includes a few AMD-related patches from upstream, which might help.

Sfinx commented 5 months ago

The page faults disappeared with iommu=pt.

BTW: 6.9 is out
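The iommu=pt kernel parameter mentioned above puts host devices in passthrough (identity-mapped) mode for DMA, so it works around the page faults rather than fixing their cause. To confirm that a machine actually booted with it, the kernel command line can be inspected. This is a minimal sketch; the helper name `iommu_option` is hypothetical, and it assumes the usual space-separated format of /proc/cmdline.

```python
def iommu_option(cmdline):
    """Return the value of the last 'iommu=' kernel parameter, or None if absent.

    Parameters such as 'amd_iommu=on' are deliberately not matched,
    because they are different options.
    """
    value = None
    for token in cmdline.split():
        if token.startswith("iommu="):
            value = token.split("=", 1)[1]
    return value


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as handle:
        return handle.read()
```

On the affected system, `iommu_option(read_cmdline())` would be expected to return the string `pt` once the parameter has been added to the boot loader configuration and the machine rebooted.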

mamin506 commented 5 months ago

> The page faults disappeared with iommu=pt.
>
> BTW: 6.9 is out

I will double check with IOMMU team for the support of 6.9.

BTW, the "xbutil validate" crash issue has been fixed by #86. Please build a new XRT package and try again.

Sfinx commented 5 months ago

I can't reproduce the kernel crash anymore, and the mlir_aie examples work like a charm. Thanks!

Sfinx commented 5 months ago

Still waiting for at least a 6.9 commit to https://github.com/AMD-SW/linux