ROCm / HIP

HIP: C++ Heterogeneous-Compute Interface for Portability
https://rocmdocs.amd.com/projects/HIP/
MIT License

HIP 5.7 Makes blender freeze #3353

Closed materusPL closed 3 months ago

materusPL commented 11 months ago

After updating to HIP 5.7, Blender freezes when the Preferences window is opened if HIP was already selected. If HIP was not selected, Blender freezes after HIP is selected and the Preferences window is closed. Everything works fine on HIP 5.6 and older.

OS: NixOS; also checked in an Ubuntu container with 5.7 and older versions. GPUs: RX 7900 XTX + iGPU from R9 7950X

Dmesg log after freeze: dmesg.log

Edit: I also tried disabling the iGPU in the BIOS and stubbing the 7900 XTX, but it made no difference.

cjatin commented 10 months ago

I tried with a 6900 XT, building Blender from the latest sources. It seems to be working fine.

I will try to get hold of a 7900 XTX with 5.7 to see if it is reproducible.

Denkisen commented 10 months ago

Same issue. OS: Manjaro GPUs: RX 7900 XTX + iGPU from R9 7950X

yxsamliu commented 10 months ago

It would help if you could set the environment variables AMD_LOG_LEVEL=3 AMD_SERIALIZE_KERNEL=1 AMD_SERIALIZE_COPY=1, run Blender from the command line, and collect the output. Thanks.
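One way to collect that output reproducibly is to launch the program from a small wrapper that sets the three variables and writes stdout and stderr to a file. This is only a minimal sketch: the helper names `debug_env` and `run_and_capture` and the log file name are hypothetical, and only the variable names and values come from the request above.

```python
import os
import subprocess

# Variables and values copied from the request above.
AMD_DEBUG_VARS = {
    "AMD_LOG_LEVEL": "3",
    "AMD_SERIALIZE_KERNEL": "1",
    "AMD_SERIALIZE_COPY": "1",
}


def debug_env(base=None):
    """Return a copy of the environment with the debug variables added."""
    env = dict(os.environ if base is None else base)
    env.update(AMD_DEBUG_VARS)
    return env


def run_and_capture(cmd, log_path):
    """Run cmd with the debug variables set; write stdout and stderr to log_path."""
    with open(log_path, "w") as log:
        proc = subprocess.run(
            cmd, env=debug_env(), stdout=log, stderr=subprocess.STDOUT
        )
    return proc.returncode


# Hypothetical usage (requires "blender" on PATH):
#   run_and_capture(["blender"], "blender_hip.log")
```

If the process hangs, as described in this issue, the call blocks until the process is killed; whatever the process had already flushed stays in the log file.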

materusPL commented 10 months ago
Log ```bash materus@materusPC in ~ bash ❯ AMD_LOG_LEVEL=3 AMD_SERIALIZE_KERNEL=1 AMD_SERIALIZE_COPY=1 blender Read prefs: "/home/materus/.config/blender/3.6/config/userpref.blend" :3:rocdevice.cpp :442 : 189001661543 us: [pid:1834699 tid:0x7fa342c1c880] Initializing HSA stack. :3:rocdevice.cpp :210 : 189002415299 us: [pid:1834699 tid:0x7fa342c1c880] Numa selects cpu agent[0]=0x7fa31aa4e000(fine=0x7fa3055e3300,coarse=0x7fa3055e36c0) for gpu agent=0x7fa3055d6000 CPU<->GPU XGMI=0 :3:rocdevice.cpp :1681: 189002416442 us: [pid:1834699 tid:0x7fa342c1c880] Gfx Major/Minor/Stepping: 11/0/0 :3:rocdevice.cpp :1683: 189002416446 us: [pid:1834699 tid:0x7fa342c1c880] HMM support: 0, XNACK: 0, Direct host access: 0 :3:rocdevice.cpp :1685: 189002416449 us: [pid:1834699 tid:0x7fa342c1c880] Max SDMA Read Mask: 0x0, Max SDMA Write Mask: 0x0 :3:rocdevice.cpp :210 : 189002417007 us: [pid:1834699 tid:0x7fa342c1c880] Numa selects cpu agent[0]=0x7fa31aa4e000(fine=0x7fa3055e3300,coarse=0x7fa3055e36c0) for gpu agent=0x7fa2e93b1400 CPU<->GPU XGMI=0 :3:rocdevice.cpp :1681: 189002417076 us: [pid:1834699 tid:0x7fa342c1c880] Gfx Major/Minor/Stepping: 10/3/6 :3:rocdevice.cpp :1683: 189002417078 us: [pid:1834699 tid:0x7fa342c1c880] HMM support: 0, XNACK: 0, Direct host access: 0 :3:rocdevice.cpp :1685: 189002417081 us: [pid:1834699 tid:0x7fa342c1c880] Max SDMA Read Mask: 0xffd2d5da, Max SDMA Write Mask: 0xffd2d5da :3:hip_context.cpp :48 : 189002417605 us: [pid:1834699 tid:0x7fa342c1c880] Direct Dispatch: 1 :3:hip_context.cpp :153 : 189002417621 us: [pid:1834699 tid:0x7fa342c1c880] hipInit ( 0 ) :3:hip_context.cpp :159 : 189002417623 us: [pid:1834699 tid:0x7fa342c1c880] hipInit: Returned hipSuccess : :3:hip_device_runtime.cpp :546 : 189002417626 us: [pid:1834699 tid:0x7fa342c1c880] hipGetDeviceCount ( 0x7ffd7d820668 ) :3:hip_device_runtime.cpp :548 : 189002417629 us: [pid:1834699 tid:0x7fa342c1c880] hipGetDeviceCount: Returned hipSuccess : :3:hip_device.cpp :237 : 189002417632 us: [pid:1834699 
tid:0x7fa342c1c880] hipDeviceGetName ( 0x7ffd7d820770, 256, 0 ) :3:hip_device.cpp :257 : 189002417634 us: [pid:1834699 tid:0x7fa342c1c880] hipDeviceGetName: Returned hipSuccess : :3:hip_device_runtime.cpp :141 : 189002417636 us: [pid:1834699 tid:0x7fa342c1c880] hipDeviceGetAttribute ( 0x7ffd7d82066c, 23, 0 ) :3:hip_device_runtime.cpp :351 : 189002417639 us: [pid:1834699 tid:0x7fa342c1c880] hipDeviceGetAttribute: Returned hipSuccess : :3:hip_device_runtime.cpp :141 : 189002417642 us: [pid:1834699 tid:0x7fa342c1c880] hipDeviceGetAttribute ( 0x7ffd7d820670, 61, 0 ) :3:hip_device_runtime.cpp :351 : 189002417644 us: [pid:1834699 tid:0x7fa342c1c880] hipDeviceGetAttribute: Returned hipSuccess : :3:hip_peer.cpp :176 : 189002417648 us: [pid:1834699 tid:0x7fa342c1c880] hipDeviceCanAccessPeer ( 0x7ffd7d820670, 0, 1 ) :3:hip_peer.cpp :177 : 189002418011 us: [pid:1834699 tid:0x7fa342c1c880] hipDeviceCanAccessPeer: Returned hipSuccess : :3:hip_device_runtime.cpp :141 : 189002418015 us: [pid:1834699 tid:0x7fa342c1c880] hipDeviceGetAttribute ( 0x7ffd7d8206a0, 69, 0 ) :3:hip_device_runtime.cpp :348 : 189002418017 us: [pid:1834699 tid:0x7fa342c1c880] hipDeviceGetAttribute: Returned hipErrorInvalidValue : :3:hip_device_runtime.cpp :141 : 189002418020 us: [pid:1834699 tid:0x7fa342c1c880] hipDeviceGetAttribute ( 0x7ffd7d8206a4, 67, 0 ) :3:hip_device_runtime.cpp :351 : 189002418022 us: [pid:1834699 tid:0x7fa342c1c880] hipDeviceGetAttribute: Returned hipSuccess : :3:hip_device_runtime.cpp :141 : 189002418024 us: [pid:1834699 tid:0x7fa342c1c880] hipDeviceGetAttribute ( 0x7ffd7d8206a8, 68, 0 ) :3:hip_device_runtime.cpp :351 : 189002418027 us: [pid:1834699 tid:0x7fa342c1c880] hipDeviceGetAttribute: Returned hipSuccess : :3:hip_device_runtime.cpp :141 : 189002418031 us: [pid:1834699 tid:0x7fa342c1c880] hipDeviceGetAttribute ( 0x7ffd7d82066c, 18, 0 ) :3:hip_device_runtime.cpp :351 : 189002418034 us: [pid:1834699 tid:0x7fa342c1c880] hipDeviceGetAttribute: Returned hipSuccess : 
:3:hip_device.cpp :237 : 189002418044 us: [pid:1834699 tid:0x7fa342c1c880] hipDeviceGetName ( 0x7ffd7d820770, 256, 1 ) :3:hip_device.cpp :257 : 189002418046 us: [pid:1834699 tid:0x7fa342c1c880] hipDeviceGetName: Returned hipSuccess : :3:hip_device_runtime.cpp :141 : 189002418048 us: [pid:1834699 tid:0x7fa342c1c880] hipDeviceGetAttribute ( 0x7ffd7d82066c, 23, 1 ) :3:hip_device_runtime.cpp :351 : 189002418051 us: [pid:1834699 tid:0x7fa342c1c880] hipDeviceGetAttribute: Returned hipSuccess : :3:hip_device_runtime.cpp :141 : 189002418053 us: [pid:1834699 tid:0x7fa342c1c880] hipDeviceGetAttribute ( 0x7ffd7d820670, 61, 1 ) :3:hip_device_runtime.cpp :351 : 189002418055 us: [pid:1834699 tid:0x7fa342c1c880] hipDeviceGetAttribute: Returned hipSuccess : :3:hip_peer.cpp :176 : 189002418058 us: [pid:1834699 tid:0x7fa342c1c880] hipDeviceCanAccessPeer ( 0x7ffd7d820670, 1, 0 ) :3:hip_peer.cpp :177 : 189002418060 us: [pid:1834699 tid:0x7fa342c1c880] hipDeviceCanAccessPeer: Returned hipSuccess : :3:hip_device_runtime.cpp :141 : 189002418062 us: [pid:1834699 tid:0x7fa342c1c880] hipDeviceGetAttribute ( 0x7ffd7d8206a0, 69, 1 ) :3:hip_device_runtime.cpp :348 : 189002418064 us: [pid:1834699 tid:0x7fa342c1c880] hipDeviceGetAttribute: Returned hipErrorInvalidValue : :3:hip_device_runtime.cpp :141 : 189002418067 us: [pid:1834699 tid:0x7fa342c1c880] hipDeviceGetAttribute ( 0x7ffd7d8206a4, 67, 1 ) :3:hip_device_runtime.cpp :351 : 189002418069 us: [pid:1834699 tid:0x7fa342c1c880] hipDeviceGetAttribute: Returned hipSuccess : :3:hip_device_runtime.cpp :141 : 189002418071 us: [pid:1834699 tid:0x7fa342c1c880] hipDeviceGetAttribute ( 0x7ffd7d8206a8, 68, 1 ) :3:hip_device_runtime.cpp :351 : 189002418073 us: [pid:1834699 tid:0x7fa342c1c880] hipDeviceGetAttribute: Returned hipSuccess : :3:hip_device_runtime.cpp :141 : 189002418076 us: [pid:1834699 tid:0x7fa342c1c880] hipDeviceGetAttribute ( 0x7ffd7d82066c, 18, 1 ) :3:hip_device_runtime.cpp :351 : 189002418078 us: [pid:1834699 tid:0x7fa342c1c880] 
hipDeviceGetAttribute: Returned hipSuccess : Terminated ```
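For long logs like the one above, it can help to filter for API calls that did not return hipSuccess. The sketch below assumes the `<api>: Returned <status>` shape seen in these log lines; `failed_calls` is a hypothetical helper, not part of HIP. In this log the only hits are the hipDeviceGetAttribute calls for attribute 69, and the working 5.6.1 log further down shows the same result, so that error is probably unrelated to the freeze.

```python
import re

# Matches the "<api>: Returned <status>" part of a HIP log line.
RETURN_RE = re.compile(r"(hip\w+): Returned (hip\w+)")


def failed_calls(log_text):
    """Return (api, status) pairs for every call that did not return hipSuccess."""
    return [
        (api, status)
        for api, status in RETURN_RE.findall(log_text)
        if status != "hipSuccess"
    ]


# Two entries copied from the log above, flattened onto one line as posted.
sample = (
    ":3:hip_device_runtime.cpp :348 : 189002418017 us: "
    "[pid:1834699 tid:0x7fa342c1c880] "
    "hipDeviceGetAttribute: Returned hipErrorInvalidValue : "
    ":3:hip_device_runtime.cpp :351 : 189002418022 us: "
    "[pid:1834699 tid:0x7fa342c1c880] "
    "hipDeviceGetAttribute: Returned hipSuccess : "
)
print(failed_calls(sample))
```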
juuR4ati commented 10 months ago

Hello,

Same issue with Arch Linux: in both Blender 3.6 and 4.0, switching Cycles from "CPU" to "GPU Compute" freezes Blender (Zl+ status in ps), and going to "Preferences -> System" freezes the entire OS (it is not even possible to log in remotely over SSH, so this is more than a graphical interface issue). Reverting the distribution to packages from 16 November 2023 (notably including HIP 5.6.1-1, build date 2 September 2023) solves the issue for both Blender 3.6 and 4.0.

cjatin commented 10 months ago

@materusPL can you try running this with HIP_VISIBLE_DEVICES=0?

There are two GPUs on your system, Navi 31 + iGPU. ROCm does not officially support iGPUs; this variable should hide the iGPU from the HIP runtime.

P.S. This is not an ideal way to debug, but I cannot reproduce it on my local system.
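To decide which index to pass, the "Gfx Major/Minor/Stepping" lines in the earlier log can be turned into gfx target names. This is a rough sketch with two loudly stated assumptions: that the order of those log lines matches HIP's device numbering (the thread does not confirm this), and that the target name is the major version in decimal followed by minor and stepping as hex digits. `gfx_targets` and `visible_devices` are hypothetical helper names.

```python
import re

GFX_RE = re.compile(r"Gfx Major/Minor/Stepping: (\d+)/(\d+)/(\d+)")


def gfx_targets(log_text):
    """Return gfx target names in the order the runtime logged them."""
    return [
        f"gfx{int(major)}{int(minor):x}{int(step):x}"
        for major, minor, step in GFX_RE.findall(log_text)
    ]


def visible_devices(log_text, wanted):
    """Build a HIP_VISIBLE_DEVICES value that keeps only the wanted target.

    Assumes log order equals HIP device index (not confirmed in the thread).
    """
    return ",".join(
        str(i) for i, name in enumerate(gfx_targets(log_text)) if name == wanted
    )


# Two entries copied from the first log in this thread.
sample = (
    ":3:rocdevice.cpp :1681: 189002416442 us: [pid:1834699 tid:0x7fa342c1c880] "
    "Gfx Major/Minor/Stepping: 11/0/0 "
    ":3:rocdevice.cpp :1681: 189002417076 us: [pid:1834699 tid:0x7fa342c1c880] "
    "Gfx Major/Minor/Stepping: 10/3/6 "
)
print(gfx_targets(sample))
print(visible_devices(sample, "gfx1100"))
```

Under those assumptions the discrete card is logged first, which would be consistent with HIP_VISIBLE_DEVICES=0 selecting it.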

materusPL commented 10 months ago

After a system update, if the iGPU is enabled, Blender just fails with something like hipInvalidDevice and no devices are listed in Blender's settings. HIP_VISIBLE_DEVICES=0 didn't change that.

I disabled the iGPU in the BIOS so only the 7900 XTX should be visible. After that, the behaviour is the same as before: Blender freezes, but the log is much shorter.

Log with igpu disabled

```bash
materus@materusPC in ~ bash ❯ AMD_LOG_LEVEL=3 AMD_SERIALIZE_KERNEL=1 AMD_SERIALIZE_COPY=1 blender
Read prefs: "/home/materus/.config/blender/4.0/config/userpref.blend"
:3:rocdevice.cpp :442 : 0190739209 us: [pid:12319 tid:0x7f4f4ba9b880] Initializing HSA stack.
Terminated
```
Dmesg log with igpu disabled ```bash materus@materusPC in ~ bash ❯ dmesg [ 190.753161] amdgpu 0000:03:00.0: amdgpu: bo 000000007aa38eba va 0x0800000000-0x0800000001 conflict with 0x0800000000-0x0800000840 [ 190.753166] amdgpu: Failed to map VA 0x800000000000 in vm. ret -22 [ 190.753168] amdgpu: Failed to map bo to gpuvm [ 190.760271] amdgpu 0000:03:00.0: amdgpu: bo 000000007aa38eba va 0x0800000000-0x0800000001 conflict with 0x0800000000-0x0800000840 [ 190.760274] amdgpu: Failed to map VA 0x800000000000 in vm. ret -22 [ 190.760276] amdgpu: Failed to map bo to gpuvm [ 190.760281] ------------[ cut here ]------------ [ 190.760282] kernel BUG at mm/slub.c:448! [ 190.760287] invalid opcode: 0000 [#5] PREEMPT SMP NOPTI [ 190.760289] CPU: 2 PID: 12319 Comm: .blender-wrappe Tainted: G D W O 6.6.1-zen1 #1-NixOS [ 190.760291] Hardware name: ASUS System Product Name/TUF GAMING X670E-PLUS WIFI, BIOS 1636 07/28/2023 [ 190.760293] RIP: 0010:__kmem_cache_free+0x2c7/0x2e0 [ 190.760297] Code: 5d e9 7d e7 ff ff 49 8b 47 08 f0 48 83 28 01 0f 85 87 fe ff ff 49 8b 47 08 4c 89 ff 48 8b 40 08 ff d0 0f 1f 00 e9 72 fe ff ff <0f> 0b 48 8b 15 60 e5 4a 01 e9 6a fd ff ff 66 66 2e 0f 1f 84 00 00 [ 190.760298] RSP: 0018:ffffc9001677fd60 EFLAGS: 00010246 [ 190.760300] RAX: ffff8881031f4980 RBX: ffff8881031f4900 RCX: 0000000000000000 [ 190.760302] RDX: 1088cc4e81b0b6c5 RSI: ffffea0000000000 RDI: ffff8881031f4900 [ 190.760303] RBP: ffffc9001677fd90 R08: 000000000000b602 R09: ffff88846f921c98 [ 190.760304] R10: 0000000000000002 R11: 0000000000000000 R12: ffff88810004ff00 [ 190.760305] R13: ffffea00040c7d00 R14: ffffffffc0df2deb R15: ffff88810606c900 [ 190.760306] FS: 00007f4f4ba9b880(0000) GS:ffff88846f900000(0000) knlGS:0000000000000000 [ 190.760307] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 190.760308] CR2: 00007f4ef36c6140 CR3: 0000000561df6000 CR4: 0000000000f50ee0 [ 190.760310] PKRU: 55555554 [ 190.760311] Call Trace: [ 190.760312] [ 190.760314] ? die+0x36/0x90 [ 190.760317] ? 
do_trap+0xda/0x100 [ 190.760319] ? __kmem_cache_free+0x2c7/0x2e0 [ 190.760321] ? do_error_trap+0x6a/0x90 [ 190.760323] ? __kmem_cache_free+0x2c7/0x2e0 [ 190.760325] ? exc_invalid_op+0x50/0x70 [ 190.760327] ? __kmem_cache_free+0x2c7/0x2e0 [ 190.760329] ? asm_exc_invalid_op+0x1a/0x20 [ 190.760332] ? kfd_process_device_init_vm+0x21b/0x320 [amdgpu] [ 190.760460] ? __kmem_cache_free+0x2c7/0x2e0 [ 190.760464] kfd_process_device_init_vm+0x21b/0x320 [amdgpu] [ 190.760577] kfd_ioctl_acquire_vm+0x86/0xb0 [amdgpu] [ 190.760684] kfd_ioctl+0x3b2/0x4c0 [amdgpu] [ 190.760787] ? __pfx_kfd_ioctl_acquire_vm+0x10/0x10 [amdgpu] [ 190.760888] ? srso_alias_return_thunk+0x5/0x7f [ 190.760893] __x64_sys_ioctl+0x94/0xd0 [ 190.760896] do_syscall_64+0x3b/0x90 [ 190.760899] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 190.760901] RIP: 0033:0x7f4f4f52133f [ 190.760919] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00 [ 190.760920] RSP: 002b:00007fff72d047c0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 190.760922] RAX: ffffffffffffffda RBX: 00007fff72d04850 RCX: 00007f4f4f52133f [ 190.760923] RDX: 00007fff72d04850 RSI: 0000000040084b15 RDI: 0000000000000013 [ 190.760924] RBP: 0000000000000013 R08: 0000000000000000 R09: 0000000000000001 [ 190.760925] R10: 0000000000000001 R11: 0000000000000246 R12: 00007fff72d04870 [ 190.760926] R13: 0000000000000001 R14: 0000000040084b15 R15: 0000000000000001 [ 190.760930] [ 190.760930] Modules linked in: rfcomm qrtr snd_seq_dummy snd_hrtimer snd_seq nft_masq af_packet ipt_REJECT nf_reject_ipv4 nft_chain_nat wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel ccm algif_aead crypto_null cbc des_generic libdes ecb md4 cmac algif_hash algif_skcipher xt_conntrack af_alg bnep ip6t_rpfilter ipt_rpfilter xt_pkttype xt_LOG nf_log_syslog 
xt_tcpudp nft_compat nbd nf_tables nfnetlink sch_fq_codel uinput ctr atkbd libps2 vivaldi_fmap loop iptable_mangle vhost_vsock vmw_vsock_virtio_transport_common vhost vhost_iotlb vsock xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_CHECKSUM xt_comment veth tun tap macvlan bridge stp llc v4l2loopback(O) videodev vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd i2c_dev snd_usb_audio input_leds joydev snd_usbmidi_lib snd_rawmidi hid_steam snd_seq_device xpad mc ff_memless mousedev hid_generic edac_mce_amd mt7921e snd_hda_codec_realtek nls_iso8859_1 edac_core [ 190.760981] mt7921_common intel_rapl_msr nls_cp437 intel_rapl_common snd_hda_codec_generic mt792x_lib vfat fat mt76_connac_lib snd_hda_codec_hdmi mt76 kvm_amd snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi kvm btusb snd_hda_codec mac80211 btrtl irqbypass crc32_pclmul btintel snd_hda_core polyval_clmulni polyval_generic gf128mul btbcm ghash_clmulni_intel btmtk snd_hwdep sha512_ssse3 snd_pcm r8169 sha512_generic bluetooth aesni_intel realtek ecdh_generic snd_timer sp5100_tco crypto_simd mdio_devres usbhid ecc cfg80211 watchdog cryptd crc16 snd libaes hid ccp rapl k10temp libphy i2c_piix4 soundcore libarc4 tpm_crb tiny_power_button tpm_tis gpio_amdpt tpm_tis_core gpio_generic eeepc_wmi asus_nb_wmi button asus_wmi i8042 battery ledtrig_audio sparse_keymap serio platform_profile rfkill evdev led_class mac_hid wmi_bmof amdgpu drm_exec amdxcp drm_buddy gpu_sched drm_suballoc_helper drm_ttm_helper ttm drm_display_helper drm_kms_helper agpgart i2c_algo_bit drm video wmi pci_stub fuse backlight efi_pstore configfs zstd zram [ 190.761034] zsmalloc efivarfs tpm rng_core dmi_sysfs ip_tables x_tables autofs4 xhci_pci xhci_pci_renesas firmware_class xhci_hcd ahci libahci libata nvme usbcore nvme_core scsi_mod t10_pi crc64_rocksoft crc64 crc_t10dif crct10dif_generic crct10dif_pclmul usb_common scsi_common crct10dif_common rtc_cmos dm_mod dax btrfs blake2b_generic libcrc32c crc32c_generic crc32c_intel 
xor raid6_pq [ 190.761060] ---[ end trace 0000000000000000 ]--- [ 190.837784] RIP: 0010:__kmem_cache_free+0x2c7/0x2e0 [ 190.837788] Code: 5d e9 7d e7 ff ff 49 8b 47 08 f0 48 83 28 01 0f 85 87 fe ff ff 49 8b 47 08 4c 89 ff 48 8b 40 08 ff d0 0f 1f 00 e9 72 fe ff ff <0f> 0b 48 8b 15 60 e5 4a 01 e9 6a fd ff ff 66 66 2e 0f 1f 84 00 00 [ 190.837790] RSP: 0018:ffffc900163f7d60 EFLAGS: 00010246 [ 190.837792] RAX: ffff88853ad3da80 RBX: ffff88853ad3da00 RCX: 0000000000000000 [ 190.837793] RDX: 101b007785b0b6c5 RSI: ffffea0000000000 RDI: ffff88853ad3da00 [ 190.837794] RBP: ffffc900163f7d90 R08: 000000000000041f R09: ffff88886dda1c98 [ 190.837796] R10: 0000000000000002 R11: 0000000000000000 R12: ffff88810004ff00 [ 190.837797] R13: ffffea0014eb4f00 R14: ffffffffc0df2deb R15: ffff88810606c900 [ 190.837799] FS: 00007f4f4ba9b880(0000) GS:ffff88846fd00000(0000) knlGS:0000000000000000 [ 190.837800] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 190.837802] CR2: 00007f4ef3476568 CR3: 0000000561df6000 CR4: 0000000000f50ee0 [ 190.837803] PKRU: 55555554 ```

Edit: Here's the error with the iGPU enabled. It does not crash, but there are no HIP devices to select.

Log with iGPU enabled in BIOS

```bash
materus@materusPC in ~ bash ❯ AMD_LOG_LEVEL=3 AMD_SERIALIZE_KERNEL=1 AMD_SERIALIZE_COPY=1 blender
Read prefs: "/home/materus/.config/blender/4.0/config/userpref.blend"
:3:rocdevice.cpp :442 : 0130990360 us: [pid:12426 tid:0x7f7474755880] Initializing HSA stack.
:1:rocdevice.cpp :450 : 0131015231 us: [pid:12426 tid:0x7f7474755880] hsa_init failed.
:1:runtime.cpp :78 : 0131015238 us: [pid:12426 tid:0x7f7474755880] Runtime initialization failed
:3:hip_context.cpp :153 : 0131015255 us: [pid:12426 tid:0x7f7474755880] hipInit: Returned hipErrorInvalidDevice :
HIP hipInit: Invalid device
Saved session recovery to "/tmp/quit.blend"
Blender quit
materus@materusPC in ~ bash ❯ HIP_VISIBLE_DEVICES=0 AMD_LOG_LEVEL=3 AMD_SERIALIZE_KERNEL=1 AMD_SERIALIZE_COPY=1 blender
Read prefs: "/home/materus/.config/blender/4.0/config/userpref.blend"
:3:rocdevice.cpp :442 : 0218168752 us: [pid:15532 tid:0x7ff6cc955880] Initializing HSA stack.
:1:rocdevice.cpp :450 : 0218647998 us: [pid:15532 tid:0x7ff6cc955880] hsa_init failed.
:1:runtime.cpp :78 : 0218648012 us: [pid:15532 tid:0x7ff6cc955880] Runtime initialization failed
:3:hip_context.cpp :153 : 0218648034 us: [pid:15532 tid:0x7ff6cc955880] hipInit: Returned hipErrorInvalidDevice :
HIP hipInit: Invalid device
Saved session recovery to "/tmp/quit.blend"
Blender quit
```
dmesg log with iGPU enabled in BIOS ```bash materus@materusPC in ~ bash ❯ dmesg [ 1105.873752] [drm] PCIE GART of 512M enabled (table at 0x00000085FEB00000). [ 1105.873771] [drm] PSP is resuming... [ 1105.909371] [drm] reserve 0x1300000 from 0x85fc000000 for PSP TMR [ 1106.097161] amdgpu 0000:03:00.0: amdgpu: RAP: optional rap ta ucode is not available [ 1106.097163] amdgpu 0000:03:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available [ 1106.097165] amdgpu 0000:03:00.0: amdgpu: SMU is resuming... [ 1106.097168] amdgpu 0000:03:00.0: amdgpu: smu driver if version = 0x0000003d, smu fw if version = 0x0000003f, smu fw program = 0, smu fw version = 0x004e7300 (78.115.0) [ 1106.097170] amdgpu 0000:03:00.0: amdgpu: SMU driver if version not matched [ 1106.269339] amdgpu 0000:03:00.0: amdgpu: SMU is resumed successfully! [ 1106.271698] [drm] DMUB hardware initialized: version=0x07002100 [ 1106.279951] [drm] kiq ring mec 3 pipe 1 q 0 [ 1106.286794] [drm] VCN decode and encode initialized successfully(under DPG Mode). [ 1106.286945] amdgpu 0000:03:00.0: [drm:jpeg_v4_0_hw_init [amdgpu]] JPEG decode initialized successfully. 
[ 1106.287234] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0 [ 1106.287236] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0 [ 1106.287237] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0 [ 1106.287238] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0 [ 1106.287238] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0 [ 1106.287239] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0 [ 1106.287240] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0 [ 1106.287241] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0 [ 1106.287242] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0 [ 1106.287243] amdgpu 0000:03:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0 [ 1106.287244] amdgpu 0000:03:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0 [ 1106.287245] amdgpu 0000:03:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8 [ 1106.287246] amdgpu 0000:03:00.0: amdgpu: ring vcn_unified_1 uses VM inv eng 1 on hub 8 [ 1106.287247] amdgpu 0000:03:00.0: amdgpu: ring jpeg_dec uses VM inv eng 4 on hub 8 [ 1106.287248] amdgpu 0000:03:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 14 on hub 0 [ 1106.290694] amdgpu 0000:03:00.0: [drm] Cannot find any crtc or sizes [ 1106.291195] [drm] ring gfx_32792.1.1 was added [ 1106.291538] [drm] ring compute_32792.2.2 was added [ 1106.291796] [drm] ring sdma_32792.3.3 was added [ 1106.291822] [drm] ring gfx_32792.1.1 ib test pass [ 1106.291845] [drm] ring compute_32792.2.2 ib test pass [ 1106.291918] [drm] ring sdma_32792.3.3 ib test pass [ 1106.301018] amdgpu 0000:53:00.0: amdgpu: bo 00000000135b2fc8 va 0x0800000000-0x0800000001 conflict with 0x0800000000-0x0800000200 [ 1106.301022] amdgpu: Failed to map VA 0x800000000000 in vm. 
ret -22 [ 1106.301023] amdgpu: Failed to map bo to gpuvm [ 1106.309236] amdgpu 0000:53:00.0: amdgpu: bo 00000000135b2fc8 va 0x0800000000-0x0800000001 conflict with 0x0800000000-0x0800000200 [ 1106.309239] amdgpu: Failed to map VA 0x800000000000 in vm. ret -22 [ 1106.309240] amdgpu: Failed to map bo to gpuvm ```
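The "conflict with" messages in both dmesg logs can be pulled apart to see which ranges collide. A minimal sketch follows; `parse_conflicts` and `overlap` are hypothetical helpers. It assumes the ranges are in 4 KiB GPU pages, an inference from the log itself (0x0800000000 × 0x1000 equals the VA 0x800000000000 in the next message), and it treats both range ends as inclusive, which is also an assumption.

```python
import re

CONFLICT_RE = re.compile(
    r"va (0x[0-9a-f]+)-(0x[0-9a-f]+) conflict with (0x[0-9a-f]+)-(0x[0-9a-f]+)"
)
PAGE_SIZE = 0x1000  # assumption: ranges are printed in 4 KiB GPU pages


def parse_conflicts(dmesg_text):
    """Return ((new_start, new_end), (old_start, old_end)) per conflict line."""
    conflicts = []
    for match in CONFLICT_RE.finditer(dmesg_text):
        a, b, c, d = (int(group, 16) for group in match.groups())
        conflicts.append(((a, b), (c, d)))
    return conflicts


def overlap(new, old):
    """Return the overlapping range (ends inclusive), or None if disjoint."""
    start, end = max(new[0], old[0]), min(new[1], old[1])
    return (start, end) if start <= end else None


# Line copied from the dmesg output with the iGPU disabled.
sample = (
    "[ 190.753161] amdgpu 0000:03:00.0: amdgpu: bo 000000007aa38eba "
    "va 0x0800000000-0x0800000001 conflict with 0x0800000000-0x0800000840"
)
new_range, old_range = parse_conflicts(sample)[0]
print(overlap(new_range, old_range))
print(hex(new_range[0] * PAGE_SIZE))
```

Read this way, the new mapping starts at exactly the same page as an existing one, on both the 7900 XTX (0000:03:00.0) and the iGPU (0000:53:00.0).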

With HIP 5.6 everything still works fine.

yxsamliu commented 10 months ago

Does disabling IOMMU in the BIOS help? You can boot into the BIOS, search for IOMMU, and disable it.

materusPL commented 10 months ago

Disabling IOMMU in the BIOS doesn't seem to change anything. After disabling IOMMU I checked with the iGPU both disabled and enabled; the behaviour is the same as in my previous comment.

TheBeardOfTruth commented 9 months ago

I can confirm that this happens after installing a 7900 XTX. There is only one graphics device installed on this system (no iGPU, no CrossFire).

This reproduces consistently when hip-runtime-amd is installed; opening Blender's preferences is sufficient to trigger the crash (but I'm sure attempting a render would as well).

$ AMD_LOG_LEVEL=3 AMD_SERIALIZE_KERNEL=1 AMD_SERIALIZE_COPY=1 blender
Read prefs: "/home/cocaine/.config/blender/4.0/config/userpref.blend"
:3:rocdevice.cpp            :442 : 1641939715 us: [pid:28657 tid:0x7f5236686000] Initializing HSA stack.
[1]    28657 terminated  AMD_LOG_LEVEL=3 AMD_SERIALIZE_KERNEL=1 AMD_SERIALIZE_COPY=1 blender

dmesg output for posterity:

[Nov30 01:49] amdgpu 0000:28:00.0: amdgpu: bo 00000000414d2147 va 0x0800000000-0x0800000001 conflict with 0x0800000000-0x0800000840
[  +0.000011] amdgpu: Failed to map VA 0x800000000000 in vm. ret -22
[  +0.000002] amdgpu: Failed to map bo to gpuvm
[  +0.008763] BUG: kernel NULL pointer dereference, address: 0000000000000008
[  +0.000006] #PF: supervisor read access in kernel mode
[  +0.000003] #PF: error_code(0x0000) - not-present page
[  +0.000002] PGD 0 P4D 0 
[  +0.000003] Oops: 0000 [#5] PREEMPT SMP NOPTI
[  +0.000004] CPU: 13 PID: 28657 Comm: blender Tainted: G      D            6.6.2-zen1-1-zen #1 6d6e8f60a278275566e6df0e1eb1d11309a4619b
[  +0.000004] Hardware name: Micro-Star International Co., Ltd. MS-7B78/X470 GAMING PRO CARBON (MS-7B78), BIOS 2.I0 07/27/2022
[  +0.000002] RIP: 0010:dma_resv_add_fence+0x47/0x1e0
[  +0.000006] Code: 89 54 24 04 48 85 f6 74 21 48 8d 7e 38 b8 01 00 00 00 f0 0f c1 46 38 85 c0 0f 84 49 01 00 00 8d 50 01 09 c2 0f 88 4d 01 00 00 <49> 8b 45 08 48 3d 40 37 5a 8b 0f 84 c9 00 00 00 48 3d e0 36 5a 8b
[  +0.000003] RSP: 0018:ffffc90015c37c78 EFLAGS: 00010246
[  +0.000002] RAX: ffff888498308000 RBX: ffff888498308158 RCX: 000000000a75060d
[  +0.000003] RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffff888498308158
[  +0.000002] RBP: ffff8882a7163000 R08: 0000000000000000 R09: 000000000003a5f0
[  +0.000002] R10: ffff8883838c79c0 R11: 0000000000000000 R12: ffff888246fbbb38
[  +0.000002] R13: 0000000000000000 R14: ffff8882a7163730 R15: ffff888498308000
[  +0.000002] FS:  00007f5236686000(0000) GS:ffff888ffed40000(0000) knlGS:0000000000000000
[  +0.000002] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000002] CR2: 0000000000000008 CR3: 0000000457960000 CR4: 00000000003506e0
[  +0.000003] Call Trace:
[  +0.000002]  <TASK>
[  +0.000004]  ? __die+0x10f/0x120
[  +0.000004]  ? page_fault_oops+0x171/0x4e0
[  +0.000004]  ? srso_return_thunk+0x5/0x10
[  +0.000003]  ? dma_fence_default_wait+0x8e/0x270
[  +0.000006]  ? exc_page_fault+0x7f/0x180
[  +0.000003]  ? asm_exc_page_fault+0x26/0x30
[  +0.000006]  ? dma_resv_add_fence+0x47/0x1e0
[  +0.000006]  amdgpu_amdkfd_gpuvm_acquire_process_vm+0x212/0x530 [amdgpu 0dd8c010605dbf7d42b1189a1f26ca6df6a6333a]
[  +0.000350]  kfd_process_device_init_vm+0xb0/0x390 [amdgpu 0dd8c010605dbf7d42b1189a1f26ca6df6a6333a]
[  +0.000339]  ? srso_return_thunk+0x5/0x10
[  +0.000004]  kfd_ioctl_acquire_vm+0x89/0xc0 [amdgpu 0dd8c010605dbf7d42b1189a1f26ca6df6a6333a]
[  +0.000333]  kfd_ioctl+0x3cb/0x4e0 [amdgpu 0dd8c010605dbf7d42b1189a1f26ca6df6a6333a]
[  +0.000309]  ? __pfx_kfd_ioctl_acquire_vm+0x10/0x10 [amdgpu 0dd8c010605dbf7d42b1189a1f26ca6df6a6333a]
[  +0.000304]  __x64_sys_ioctl+0x97/0xd0
[  +0.000004]  do_syscall_64+0x60/0x90
[  +0.000005]  ? srso_return_thunk+0x5/0x10
[  +0.000003]  ? __x64_sys_ioctl+0xaf/0xd0
[  +0.000003]  ? srso_return_thunk+0x5/0x10
[  +0.000002]  ? syscall_exit_to_user_mode+0x2b/0x40
[  +0.000002]  ? srso_return_thunk+0x5/0x10
[  +0.000002]  ? do_syscall_64+0x6c/0x90
[  +0.000003]  ? srso_return_thunk+0x5/0x10
[  +0.000002]  ? exc_page_fault+0x7f/0x180
[  +0.000003]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  +0.000004] RIP: 0033:0x7f524b03d3af
[  +0.000016] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[  +0.000002] RSP: 002b:00007ffcadef39d0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  +0.000003] RAX: ffffffffffffffda RBX: 00007ffcadef3ac0 RCX: 00007f524b03d3af
[  +0.000002] RDX: 00007ffcadef3b40 RSI: 0000000040084b15 RDI: 000000000000000d
[  +0.000002] RBP: 00007ffcadef3b40 R08: 000000000000000c R09: 00007f5236685698
[  +0.000002] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000040084b15
[  +0.000002] R13: 000000000000000d R14: 00007f51e6713140 R15: 00007f520ca42180
[  +0.000005]  </TASK>
[  +0.000001] Modules linked in: snd_seq_dummy snd_hrtimer snd_seq snd_seq_device cfg80211 rfkill 8021q garp mrp stp llc snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi intel_rapl_msr intel_rapl_common snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec edac_mce_amd snd_hda_core vfat snd_hwdep fat snd_pcm mousedev joydev ccp snd_timer amdgpu snd soundcore kvm irqbypass crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic gf128mul ghash_clmulni_intel sha512_ssse3 drm_exec amdxcp drm_buddy gpu_sched drm_suballoc_helper drm_ttm_helper usbhid ttm drm_display_helper igb aesni_intel cec crypto_simd k10temp cryptd i2c_algo_bit video dca sp5100_tco acpi_cpufreq pcspkr rapl wmi_bmof mac_hid gpio_amdpt gpio_generic uinput i2c_piix4 i2c_dev crypto_user fuse loop dm_mod ip_tables x_tables btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq nvme crc32c_intel mxm_wmi nvme_core xhci_pci xhci_pci_renesas nvme_common wmi
[  +0.000076] CR2: 0000000000000008
[  +0.000003] ---[ end trace 0000000000000000 ]---
[  +0.000002] RIP: 0010:dma_resv_add_fence+0x47/0x1e0
[  +0.000003] Code: 89 54 24 04 48 85 f6 74 21 48 8d 7e 38 b8 01 00 00 00 f0 0f c1 46 38 85 c0 0f 84 49 01 00 00 8d 50 01 09 c2 0f 88 4d 01 00 00 <49> 8b 45 08 48 3d 40 37 5a 8b 0f 84 c9 00 00 00 48 3d e0 36 5a 8b
[  +0.000002] RSP: 0018:ffffc90005c2bc98 EFLAGS: 00010246
[  +0.000002] RAX: ffff888109320000 RBX: ffff888109320158 RCX: 0000000001fb6c0b
[  +0.000002] RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffff888109320158
[  +0.000002] RBP: ffff8882a142d000 R08: 0000000000000000 R09: 000000000003a5f0
[  +0.000002] R10: ffff888335527c20 R11: 0000000000000000 R12: ffff8882a1813338
[  +0.000001] R13: 0000000000000000 R14: ffff8882a142d730 R15: ffff888109320000
[  +0.000002] FS:  00007f5236686000(0000) GS:ffff888ffed40000(0000) knlGS:0000000000000000
[  +0.000002] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000002] CR2: 0000000000000008 CR3: 0000000457960000 CR4: 00000000003506e0
[  +0.000002] note: blender[28657] exited with irqs disabled
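In a kernel call trace, entries prefixed with `?` are addresses found on the stack that the unwinder could not confirm as part of the call chain, so the unmarked entries are the ones to read first. A small sketch that keeps only those; `reliable_frames` is a hypothetical helper, and the pattern is tuned to the dmesg format pasted above.

```python
import re

# "[timestamp]  [? ]symbol+0xoff/0xsize"; group 1 captures the "?" marker.
FRAME_RE = re.compile(
    r"^\[[^\]]*\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+",
    re.MULTILINE,
)


def reliable_frames(dmesg_text):
    """Return symbols of call-trace entries that are not marked with '?'."""
    return [
        symbol
        for marker, symbol in FRAME_RE.findall(dmesg_text)
        if not marker
    ]


# Lines copied from the trace above.
sample = """\
[  +0.000002] RIP: 0010:dma_resv_add_fence+0x47/0x1e0
[  +0.000006]  ? dma_resv_add_fence+0x47/0x1e0
[  +0.000006]  amdgpu_amdkfd_gpuvm_acquire_process_vm+0x212/0x530 [amdgpu 0dd8c010605dbf7d42b1189a1f26ca6df6a6333a]
[  +0.000350]  kfd_process_device_init_vm+0xb0/0x390 [amdgpu 0dd8c010605dbf7d42b1189a1f26ca6df6a6333a]
[  +0.000339]  ? srso_return_thunk+0x5/0x10
[  +0.000004]  kfd_ioctl_acquire_vm+0x89/0xc0 [amdgpu 0dd8c010605dbf7d42b1189a1f26ca6df6a6333a]
"""
print(reliable_frames(sample))
```

Both this trace and the earlier one from NixOS go through kfd_ioctl_acquire_vm and kfd_process_device_init_vm, which suggests the two reports hit the same code path even though the final fault differs.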

I can, and will, gladly cooperate by providing whatever information or logs you want, just say the word.

materusPL commented 9 months ago

The problem still exists with HIP 6.0:

AMD_LOG_LEVEL=3 AMD_SERIALIZE_KERNEL=1 AMD_SERIALIZE_COPY=1  LD_LIBRARY_PATH=/opt/rocm-6.0.0/lib ./blender
Read prefs: "/home/materus/.config/blender/4.0/config/userpref.blend"
:3:rocdevice.cpp            :442 : 0365231276 us: [pid:19635 tid:0x7f1dcaebe580] Initializing HSA stack.
:1:rocdevice.cpp            :450 : 0365257328 us: [pid:19635 tid:0x7f1dcaebe580] hsa_init failed.
:1:runtime.cpp              :78  : 0365257335 us: [pid:19635 tid:0x7f1dcaebe580] Runtime initialization failed
:3:hip_context.cpp          :153 : 0365257343 us: [pid:19635 tid:0x7f1dcaebe580] hipInit: Returned hipErrorInvalidDevice : 
HIP hipInit: Invalid device

Also, here's the log with version 5.6.1, which works fine:

AMD_LOG_LEVEL=3 AMD_SERIALIZE_KERNEL=1 AMD_SERIALIZE_COPY=1  LD_LIBRARY_PATH=/opt/rocm-5.6.1/lib ./blender
Read prefs: "/home/materus/.config/blender/4.0/config/userpref.blend"
:3:rocdevice.cpp            :434 : 0385852298 us: 19822: [tid:0x7f80cb078580] Initializing HSA stack.
:3:comgrctx.cpp             :33  : 0385867501 us: 19822: [tid:0x7f80cb078580] Loading COMGR library.
:3:rocdevice.cpp            :200 : 0385867561 us: 19822: [tid:0x7f80cb078580] Numa selects cpu agent[0]=0x7f806dce1000(fine=0x7f8085030b40,coarse=0x7f8085030dc0) for gpu agent=0x7f806d7ff700
:3:rocdevice.cpp            :1634: 0385867858 us: 19822: [tid:0x7f80cb078580] HMM support: 0, xnack: 0, direct host access: 0

:3:rocdevice.cpp            :200 : 0385868722 us: 19822: [tid:0x7f80cb078580] Numa selects cpu agent[0]=0x7f806dce1000(fine=0x7f8085030b40,coarse=0x7f8085030dc0) for gpu agent=0x7f806d7ffc00
:3:rocdevice.cpp            :1634: 0385868779 us: 19822: [tid:0x7f80cb078580] HMM support: 0, xnack: 0, direct host access: 0

:3:hip_context.cpp          :48  : 0385869307 us: 19822: [tid:0x7f80cb078580] Direct Dispatch: 1
:3:hip_context.cpp          :159 : 0385869323 us: 19822: [tid:0x7f80cb078580] hipInit: Returned hipSuccess : 
:3:hip_device_runtime.cpp   :546 : 0385869341 us: 19822: [tid:0x7f80cb078580]  hipGetDeviceCount ( 0x7ffc183dcc5c ) 
:3:hip_device_runtime.cpp   :548 : 0385869344 us: 19822: [tid:0x7f80cb078580] hipGetDeviceCount: Returned hipSuccess : 
:3:hip_device.cpp           :235 : 0385869350 us: 19822: [tid:0x7f80cb078580]  hipDeviceGetName ( 0x7ffc183dcd70, 256, 0 ) 
:3:hip_device.cpp           :255 : 0385869353 us: 19822: [tid:0x7f80cb078580] hipDeviceGetName: Returned hipSuccess : 
:3:hip_device_runtime.cpp   :141 : 0385869358 us: 19822: [tid:0x7f80cb078580]  hipDeviceGetAttribute ( 0x7ffc183dcc64, 23, 0 ) 
:3:hip_device_runtime.cpp   :351 : 0385869362 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute: Returned hipSuccess : 
:3:hip_device_runtime.cpp   :141 : 0385869365 us: 19822: [tid:0x7f80cb078580]  hipDeviceGetAttribute ( 0x7ffc183dccd0, 61, 0 ) 
:3:hip_device_runtime.cpp   :351 : 0385869368 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute: Returned hipSuccess : 
:3:hip_peer.cpp             :176 : 0385869376 us: 19822: [tid:0x7f80cb078580]  hipDeviceCanAccessPeer ( 0x7ffc183dccb0, 0, 1 ) 
:3:hip_peer.cpp             :177 : 0385869379 us: 19822: [tid:0x7f80cb078580] hipDeviceCanAccessPeer: Returned hipSuccess : 
:3:hip_device_runtime.cpp   :141 : 0385869383 us: 19822: [tid:0x7f80cb078580]  hipDeviceGetAttribute ( 0x7ffc183dcc64, 69, 0 ) 
:3:hip_device_runtime.cpp   :348 : 0385869386 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute: Returned hipErrorInvalidValue : 
:3:hip_device_runtime.cpp   :141 : 0385869390 us: 19822: [tid:0x7f80cb078580]  hipDeviceGetAttribute ( 0x7ffc183dcc68, 67, 0 ) 
:3:hip_device_runtime.cpp   :351 : 0385869393 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute: Returned hipSuccess : 
:3:hip_device_runtime.cpp   :141 : 0385869397 us: 19822: [tid:0x7f80cb078580]  hipDeviceGetAttribute ( 0x7ffc183dcc6c, 68, 0 ) 
:3:hip_device_runtime.cpp   :351 : 0385869400 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute: Returned hipSuccess : 
:3:hip_device_runtime.cpp   :141 : 0385869406 us: 19822: [tid:0x7f80cb078580]  hipDeviceGetAttribute ( 0x7ffc183dcc60, 18, 0 ) 
:3:hip_device_runtime.cpp   :351 : 0385869409 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute: Returned hipSuccess : 
:3:hip_device.cpp           :235 : 0385869419 us: 19822: [tid:0x7f80cb078580]  hipDeviceGetName ( 0x7ffc183dcd70, 256, 1 ) 
:3:hip_device.cpp           :255 : 0385869422 us: 19822: [tid:0x7f80cb078580] hipDeviceGetName: Returned hipSuccess : 
:3:hip_device_runtime.cpp   :141 : 0385869426 us: 19822: [tid:0x7f80cb078580]  hipDeviceGetAttribute ( 0x7ffc183dcc64, 23, 1 ) 
:3:hip_device_runtime.cpp   :351 : 0385869429 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute: Returned hipSuccess : 
:3:hip_device_runtime.cpp   :141 : 0385869433 us: 19822: [tid:0x7f80cb078580]  hipDeviceGetAttribute ( 0x7ffc183dccd0, 61, 1 ) 
:3:hip_device_runtime.cpp   :351 : 0385869436 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute: Returned hipSuccess : 
:3:hip_peer.cpp             :176 : 0385869440 us: 19822: [tid:0x7f80cb078580]  hipDeviceCanAccessPeer ( 0x7ffc183dccb0, 1, 0 ) 
:3:hip_peer.cpp             :177 : 0385869443 us: 19822: [tid:0x7f80cb078580] hipDeviceCanAccessPeer: Returned hipSuccess : 
:3:hip_device_runtime.cpp   :141 : 0385869447 us: 19822: [tid:0x7f80cb078580]  hipDeviceGetAttribute ( 0x7ffc183dcc64, 69, 1 ) 
:3:hip_device_runtime.cpp   :348 : 0385869450 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute: Returned hipErrorInvalidValue : 
:3:hip_device_runtime.cpp   :141 : 0385869454 us: 19822: [tid:0x7f80cb078580]  hipDeviceGetAttribute ( 0x7ffc183dcc68, 67, 1 ) 
:3:hip_device_runtime.cpp   :351 : 0385869457 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute: Returned hipSuccess : 
:3:hip_device_runtime.cpp   :141 : 0385869460 us: 19822: [tid:0x7f80cb078580]  hipDeviceGetAttribute ( 0x7ffc183dcc6c, 68, 1 ) 
:3:hip_device_runtime.cpp   :351 : 0385869464 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute: Returned hipSuccess : 
:3:hip_device_runtime.cpp   :141 : 0385869468 us: 19822: [tid:0x7f80cb078580]  hipDeviceGetAttribute ( 0x7ffc183dcc60, 18, 1 ) 
:3:hip_device_runtime.cpp   :351 : 0385869471 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute: Returned hipSuccess : 
cjatin commented 9 months ago

@materusPL It looks like hsa_init failed. With 6.0, can you run rocminfo and check whether it reports correct results?
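For anyone skimming an `AMD_LOG_LEVEL=3` trace like the one above, a small script can list the calls that did not return `hipSuccess`. This is only a sketch: the regular expression is inferred from the log lines pasted in this thread, not from a documented log format, and `failed_calls` is a hypothetical helper name.

```python
import re

# Matches the "<api>: Returned <status>" lines seen in the trace above.
# The pattern is an assumption based on this thread's log excerpt.
RETURN_RE = re.compile(r"\]\s*(?P<api>hip\w+): Returned (?P<status>hip\w+)")

def failed_calls(log_text):
    """Return (api, status) pairs for every call that did not return hipSuccess."""
    failures = []
    for line in log_text.splitlines():
        match = RETURN_RE.search(line)
        if match and match.group("status") != "hipSuccess":
            failures.append((match.group("api"), match.group("status")))
    return failures

# Three lines copied from the trace in this thread.
sample = """\
:3:hip_device_runtime.cpp   :141 : 0385869383 us: 19822: [tid:0x7f80cb078580]  hipDeviceGetAttribute ( 0x7ffc183dcc64, 69, 0 )
:3:hip_device_runtime.cpp   :348 : 0385869386 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute: Returned hipErrorInvalidValue :
:3:hip_device_runtime.cpp   :351 : 0385869393 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute: Returned hipSuccess :
"""
print(failed_calls(sample))
```

In the excerpt above, the only non-success returns are the `hipDeviceGetAttribute` queries for attribute 69; whether those are related to the freeze is not established in this thread.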

materusPL commented 9 months ago

rocminfo returns slightly different results on 5.6 and 6.0, but other than that it seems to work fine.

rocminfo 5.6 ``` ROCk module is loaded ===================== HSA System Attributes ===================== Runtime Version: 1.1 System Timestamp Freq.: 1000.000000MHz Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count) Machine Model: LARGE System Endianness: LITTLE Mwaitx: DISABLED DMAbuf Support: YES ========== HSA Agents ========== ******* Agent 1 ******* Name: AMD Ryzen 9 7950X 16-Core Processor Uuid: CPU-XX Marketing Name: AMD Ryzen 9 7950X 16-Core Processor Vendor Name: CPU Feature: None specified Profile: FULL_PROFILE Float Round Mode: NEAR Max Queue Number: 0(0x0) Queue Min Size: 0(0x0) Queue Max Size: 0(0x0) Queue Type: MULTI Node: 0 Device Type: CPU Cache Info: L1: 32768(0x8000) KB Chip ID: 0(0x0) ASIC Revision: 0(0x0) Cacheline Size: 64(0x40) Max Clock Freq. (MHz): 5881 BDFID: 0 Internal Node ID: 0 Compute Unit: 16 SIMDs per CU: 0 Shader Engines: 0 Shader Arrs. per Eng.: 0 WatchPts on Addr. Ranges:1 Features: None Pool Info: Pool 1 Segment: GLOBAL; FLAGS: FINE GRAINED Size: 7940376(0x792918) KB Allocatable: TRUE Alloc Granule: 4KB Alloc Alignment: 4KB Accessible by all: TRUE Pool 2 Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED Size: 7940376(0x792918) KB Allocatable: TRUE Alloc Granule: 4KB Alloc Alignment: 4KB Accessible by all: TRUE Pool 3 Segment: GLOBAL; FLAGS: COARSE GRAINED Size: 7940376(0x792918) KB Allocatable: TRUE Alloc Granule: 4KB Alloc Alignment: 4KB Accessible by all: TRUE ISA Info: ******* Agent 2 ******* Name: AMD Ryzen 9 7950X 16-Core Processor Uuid: CPU-XX Marketing Name: AMD Ryzen 9 7950X 16-Core Processor Vendor Name: CPU Feature: None specified Profile: FULL_PROFILE Float Round Mode: NEAR Max Queue Number: 0(0x0) Queue Min Size: 0(0x0) Queue Max Size: 0(0x0) Queue Type: MULTI Node: 1 Device Type: CPU Cache Info: L1: 32768(0x8000) KB Chip ID: 0(0x0) ASIC Revision: 0(0x0) Cacheline Size: 64(0x40) Max Clock Freq. 
(MHz): 5881 BDFID: 0 Internal Node ID: 1 Compute Unit: 16 SIMDs per CU: 0 Shader Engines: 0 Shader Arrs. per Eng.: 0 WatchPts on Addr. Ranges:1 Features: None Pool Info: Pool 1 Segment: GLOBAL; FLAGS: FINE GRAINED Size: 8091384(0x7b76f8) KB Allocatable: TRUE Alloc Granule: 4KB Alloc Alignment: 4KB Accessible by all: TRUE Pool 2 Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED Size: 8091384(0x7b76f8) KB Allocatable: TRUE Alloc Granule: 4KB Alloc Alignment: 4KB Accessible by all: TRUE Pool 3 Segment: GLOBAL; FLAGS: COARSE GRAINED Size: 8091384(0x7b76f8) KB Allocatable: TRUE Alloc Granule: 4KB Alloc Alignment: 4KB Accessible by all: TRUE ISA Info: ******* Agent 3 ******* Name: gfx1100 Uuid: GPU-c458f816e7264448 Marketing Name: Vendor Name: AMD Feature: KERNEL_DISPATCH Profile: BASE_PROFILE Float Round Mode: NEAR Max Queue Number: 128(0x80) Queue Min Size: 64(0x40) Queue Max Size: 131072(0x20000) Queue Type: MULTI Node: 2 Device Type: GPU Cache Info: L1: 32(0x20) KB L2: 6144(0x1800) KB L3: 98304(0x18000) KB Chip ID: 29772(0x744c) ASIC Revision: 0(0x0) Cacheline Size: 64(0x40) Max Clock Freq. (MHz): 2371 BDFID: 768 Internal Node ID: 2 Compute Unit: 96 SIMDs per CU: 2 Shader Engines: 6 Shader Arrs. per Eng.: 2 WatchPts on Addr. 
Ranges:4 Coherent Host Access: FALSE Features: KERNEL_DISPATCH Fast F16 Operation: TRUE Wavefront Size: 32(0x20) Workgroup Max Size: 1024(0x400) Workgroup Max Size per Dimension: x 1024(0x400) y 1024(0x400) z 1024(0x400) Max Waves Per CU: 32(0x20) Max Work-item Per CU: 1024(0x400) Grid Max Size: 4294967295(0xffffffff) Grid Max Size per Dimension: x 4294967295(0xffffffff) y 4294967295(0xffffffff) z 4294967295(0xffffffff) Max fbarriers/Workgrp: 32 Packet Processor uCode:: 528 SDMA engine uCode:: 19 IOMMU Support:: None Pool Info: Pool 1 Segment: GLOBAL; FLAGS: COARSE GRAINED Size: 25149440(0x17fc000) KB Allocatable: TRUE Alloc Granule: 4KB Alloc Alignment: 4KB Accessible by all: FALSE Pool 2 Segment: GROUP Size: 64(0x40) KB Allocatable: FALSE Alloc Granule: 0KB Alloc Alignment: 0KB Accessible by all: FALSE ISA Info: ISA 1 Name: amdgcn-amd-amdhsa--gfx1100 Machine Models: HSA_MACHINE_MODEL_LARGE Profiles: HSA_PROFILE_BASE Default Rounding Mode: NEAR Default Rounding Mode: NEAR Fast f16: TRUE Workgroup Max Size: 1024(0x400) Workgroup Max Size per Dimension: x 1024(0x400) y 1024(0x400) z 1024(0x400) Grid Max Size: 4294967295(0xffffffff) Grid Max Size per Dimension: x 4294967295(0xffffffff) y 4294967295(0xffffffff) z 4294967295(0xffffffff) FBarrier Max Size: 32 ******* Agent 4 ******* Name: gfx1036 Uuid: GPU-XX Marketing Name: Vendor Name: AMD Feature: KERNEL_DISPATCH Profile: BASE_PROFILE Float Round Mode: NEAR Max Queue Number: 128(0x80) Queue Min Size: 64(0x40) Queue Max Size: 131072(0x20000) Queue Type: MULTI Node: 3 Device Type: GPU Cache Info: L1: 16(0x10) KB L2: 256(0x100) KB Chip ID: 5710(0x164e) ASIC Revision: 1(0x1) Cacheline Size: 64(0x40) Max Clock Freq. (MHz): 2200 BDFID: 21248 Internal Node ID: 3 Compute Unit: 2 SIMDs per CU: 2 Shader Engines: 1 Shader Arrs. per Eng.: 1 WatchPts on Addr. 
Ranges:4 Coherent Host Access: FALSE Features: KERNEL_DISPATCH Fast F16 Operation: TRUE Wavefront Size: 32(0x20) Workgroup Max Size: 1024(0x400) Workgroup Max Size per Dimension: x 1024(0x400) y 1024(0x400) z 1024(0x400) Max Waves Per CU: 32(0x20) Max Work-item Per CU: 1024(0x400) Grid Max Size: 4294967295(0xffffffff) Grid Max Size per Dimension: x 4294967295(0xffffffff) y 4294967295(0xffffffff) z 4294967295(0xffffffff) Max fbarriers/Workgrp: 32 Packet Processor uCode:: 20 SDMA engine uCode:: 8 IOMMU Support:: None Pool Info: Pool 1 Segment: GLOBAL; FLAGS: COARSE GRAINED Size: 81920(0x14000) KB Allocatable: TRUE Alloc Granule: 4KB Alloc Alignment: 4KB Accessible by all: FALSE Pool 2 Segment: GROUP Size: 64(0x40) KB Allocatable: FALSE Alloc Granule: 0KB Alloc Alignment: 0KB Accessible by all: FALSE ISA Info: ISA 1 Name: amdgcn-amd-amdhsa--gfx1036 Machine Models: HSA_MACHINE_MODEL_LARGE Profiles: HSA_PROFILE_BASE Default Rounding Mode: NEAR Default Rounding Mode: NEAR Fast f16: TRUE Workgroup Max Size: 1024(0x400) Workgroup Max Size per Dimension: x 1024(0x400) y 1024(0x400) z 1024(0x400) Grid Max Size: 4294967295(0xffffffff) Grid Max Size per Dimension: x 4294967295(0xffffffff) y 4294967295(0xffffffff) z 4294967295(0xffffffff) FBarrier Max Size: 32 *** Done *** ```
rocminfo 6.0 ``` ROCk module is loaded ===================== HSA System Attributes ===================== Runtime Version: 1.1 System Timestamp Freq.: 1000.000000MHz Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count) Machine Model: LARGE System Endianness: LITTLE Mwaitx: DISABLED DMAbuf Support: YES ========== HSA Agents ========== ******* Agent 1 ******* Name: AMD Ryzen 9 7950X 16-Core Processor Uuid: CPU-XX Marketing Name: AMD Ryzen 9 7950X 16-Core Processor Vendor Name: CPU Feature: None specified Profile: FULL_PROFILE Float Round Mode: NEAR Max Queue Number: 0(0x0) Queue Min Size: 0(0x0) Queue Max Size: 0(0x0) Queue Type: MULTI Node: 0 Device Type: CPU Cache Info: L1: 32768(0x8000) KB Chip ID: 0(0x0) ASIC Revision: 0(0x0) Cacheline Size: 64(0x40) Max Clock Freq. (MHz): 5881 BDFID: 0 Internal Node ID: 0 Compute Unit: 16 SIMDs per CU: 0 Shader Engines: 0 Shader Arrs. per Eng.: 0 WatchPts on Addr. Ranges:1 Features: None Pool Info: Pool 1 Segment: GLOBAL; FLAGS: FINE GRAINED Size: 7940376(0x792918) KB Allocatable: TRUE Alloc Granule: 4KB Alloc Alignment: 4KB Accessible by all: TRUE Pool 2 Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED Size: 7940376(0x792918) KB Allocatable: TRUE Alloc Granule: 4KB Alloc Alignment: 4KB Accessible by all: TRUE Pool 3 Segment: GLOBAL; FLAGS: COARSE GRAINED Size: 7940376(0x792918) KB Allocatable: TRUE Alloc Granule: 4KB Alloc Alignment: 4KB Accessible by all: TRUE ISA Info: ******* Agent 2 ******* Name: AMD Ryzen 9 7950X 16-Core Processor Uuid: CPU-XX Marketing Name: AMD Ryzen 9 7950X 16-Core Processor Vendor Name: CPU Feature: None specified Profile: FULL_PROFILE Float Round Mode: NEAR Max Queue Number: 0(0x0) Queue Min Size: 0(0x0) Queue Max Size: 0(0x0) Queue Type: MULTI Node: 1 Device Type: CPU Cache Info: L1: 32768(0x8000) KB Chip ID: 0(0x0) ASIC Revision: 0(0x0) Cacheline Size: 64(0x40) Max Clock Freq. 
(MHz): 5881 BDFID: 0 Internal Node ID: 1 Compute Unit: 16 SIMDs per CU: 0 Shader Engines: 0 Shader Arrs. per Eng.: 0 WatchPts on Addr. Ranges:1 Features: None Pool Info: Pool 1 Segment: GLOBAL; FLAGS: FINE GRAINED Size: 8091384(0x7b76f8) KB Allocatable: TRUE Alloc Granule: 4KB Alloc Alignment: 4KB Accessible by all: TRUE Pool 2 Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED Size: 8091384(0x7b76f8) KB Allocatable: TRUE Alloc Granule: 4KB Alloc Alignment: 4KB Accessible by all: TRUE Pool 3 Segment: GLOBAL; FLAGS: COARSE GRAINED Size: 8091384(0x7b76f8) KB Allocatable: TRUE Alloc Granule: 4KB Alloc Alignment: 4KB Accessible by all: TRUE ISA Info: ******* Agent 3 ******* Name: gfx1100 Uuid: GPU-c458f816e7264448 Marketing Name: Vendor Name: AMD Feature: KERNEL_DISPATCH Profile: BASE_PROFILE Float Round Mode: NEAR Max Queue Number: 128(0x80) Queue Min Size: 64(0x40) Queue Max Size: 131072(0x20000) Queue Type: MULTI Node: 2 Device Type: GPU Cache Info: L1: 32(0x20) KB L2: 6144(0x1800) KB L3: 98304(0x18000) KB Chip ID: 29772(0x744c) ASIC Revision: 0(0x0) Cacheline Size: 64(0x40) Max Clock Freq. (MHz): 2371 BDFID: 768 Internal Node ID: 2 Compute Unit: 96 SIMDs per CU: 2 Shader Engines: 6 Shader Arrs. per Eng.: 2 WatchPts on Addr. 
Ranges:4 Coherent Host Access: FALSE Features: KERNEL_DISPATCH Fast F16 Operation: TRUE Wavefront Size: 32(0x20) Workgroup Max Size: 1024(0x400) Workgroup Max Size per Dimension: x 1024(0x400) y 1024(0x400) z 1024(0x400) Max Waves Per CU: 32(0x20) Max Work-item Per CU: 1024(0x400) Grid Max Size: 4294967295(0xffffffff) Grid Max Size per Dimension: x 4294967295(0xffffffff) y 4294967295(0xffffffff) z 4294967295(0xffffffff) Max fbarriers/Workgrp: 32 Packet Processor uCode:: 528 SDMA engine uCode:: 19 IOMMU Support:: None Pool Info: Pool 1 Segment: GLOBAL; FLAGS: COARSE GRAINED Size: 25149440(0x17fc000) KB Allocatable: TRUE Alloc Granule: 4KB Alloc Alignment: 4KB Accessible by all: FALSE Pool 2 Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED Size: 25149440(0x17fc000) KB Allocatable: TRUE Alloc Granule: 4KB Alloc Alignment: 4KB Accessible by all: FALSE Pool 3 Segment: GROUP Size: 64(0x40) KB Allocatable: FALSE Alloc Granule: 0KB Alloc Alignment: 0KB Accessible by all: FALSE ISA Info: ISA 1 Name: amdgcn-amd-amdhsa--gfx1100 Machine Models: HSA_MACHINE_MODEL_LARGE Profiles: HSA_PROFILE_BASE Default Rounding Mode: NEAR Default Rounding Mode: NEAR Fast f16: TRUE Workgroup Max Size: 1024(0x400) Workgroup Max Size per Dimension: x 1024(0x400) y 1024(0x400) z 1024(0x400) Grid Max Size: 4294967295(0xffffffff) Grid Max Size per Dimension: x 4294967295(0xffffffff) y 4294967295(0xffffffff) z 4294967295(0xffffffff) FBarrier Max Size: 32 ******* Agent 4 ******* Name: gfx1036 Uuid: GPU-XX Marketing Name: Vendor Name: AMD Feature: KERNEL_DISPATCH Profile: BASE_PROFILE Float Round Mode: NEAR Max Queue Number: 128(0x80) Queue Min Size: 64(0x40) Queue Max Size: 131072(0x20000) Queue Type: MULTI Node: 3 Device Type: GPU Cache Info: L1: 16(0x10) KB L2: 256(0x100) KB Chip ID: 5710(0x164e) ASIC Revision: 1(0x1) Cacheline Size: 64(0x40) Max Clock Freq. (MHz): 2200 BDFID: 21248 Internal Node ID: 3 Compute Unit: 2 SIMDs per CU: 2 Shader Engines: 1 Shader Arrs. per Eng.: 1 WatchPts on Addr. 
Ranges:4 Coherent Host Access: FALSE Features: KERNEL_DISPATCH Fast F16 Operation: TRUE Wavefront Size: 32(0x20) Workgroup Max Size: 1024(0x400) Workgroup Max Size per Dimension: x 1024(0x400) y 1024(0x400) z 1024(0x400) Max Waves Per CU: 32(0x20) Max Work-item Per CU: 1024(0x400) Grid Max Size: 4294967295(0xffffffff) Grid Max Size per Dimension: x 4294967295(0xffffffff) y 4294967295(0xffffffff) z 4294967295(0xffffffff) Max fbarriers/Workgrp: 32 Packet Processor uCode:: 20 SDMA engine uCode:: 8 IOMMU Support:: None Pool Info: Pool 1 Segment: GLOBAL; FLAGS: COARSE GRAINED Size: 81920(0x14000) KB Allocatable: TRUE Alloc Granule: 4KB Alloc Alignment: 4KB Accessible by all: FALSE Pool 2 Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED Size: 81920(0x14000) KB Allocatable: TRUE Alloc Granule: 4KB Alloc Alignment: 4KB Accessible by all: FALSE Pool 3 Segment: GROUP Size: 64(0x40) KB Allocatable: FALSE Alloc Granule: 0KB Alloc Alignment: 0KB Accessible by all: FALSE ISA Info: ISA 1 Name: amdgcn-amd-amdhsa--gfx1036 Machine Models: HSA_MACHINE_MODEL_LARGE Profiles: HSA_PROFILE_BASE Default Rounding Mode: NEAR Default Rounding Mode: NEAR Fast f16: TRUE Workgroup Max Size: 1024(0x400) Workgroup Max Size per Dimension: x 1024(0x400) y 1024(0x400) z 1024(0x400) Grid Max Size: 4294967295(0xffffffff) Grid Max Size per Dimension: x 4294967295(0xffffffff) y 4294967295(0xffffffff) z 4294967295(0xffffffff) FBarrier Max Size: 32 *** Done *** ```
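Comparing the two dumps by eye is tedious, so here is a rough sketch that extracts the memory-pool segments per agent and reports what the newer dump adds. It was written against the flattened text pasted above; `pool_segments` and `added_pools` are hypothetical helper names, and the regular expressions are assumptions about the rocminfo layout rather than a stable format.

```python
import re

def pool_segments(rocminfo_text):
    """Map each agent label to the list of its memory-pool segment descriptions."""
    flat = " ".join(rocminfo_text.split())  # collapse newlines and column padding
    parts = re.split(r"\*+ Agent (\d+) \*+", flat)
    pools = {}
    # parts looks like [preamble, "1", body1, "2", body2, ...]
    for number, body in zip(parts[1::2], parts[2::2]):
        name = re.search(r"Name: (.*?) Uuid:", body)
        label = f"Agent {number} ({name.group(1) if name else 'unknown'})"
        pools[label] = re.findall(r"Segment: (.*?) Size:", body)
    return pools

def added_pools(old_text, new_text):
    """Return, per agent, the segments present in new_text but not in old_text."""
    old, new = pool_segments(old_text), pool_segments(new_text)
    result = {}
    for agent, segments in new.items():
        extra = [seg for seg in segments if seg not in old.get(agent, [])]
        if extra:
            result[agent] = extra
    return result

# Trimmed excerpts of the gfx1100 agent from the 5.6 and 6.0 dumps above.
old_dump = ("******* Agent 3 ******* Name: gfx1100 Uuid: GPU-c458f816e7264448 "
            "Pool Info: Pool 1 Segment: GLOBAL; FLAGS: COARSE GRAINED Size: 25149440(0x17fc000) KB "
            "Pool 2 Segment: GROUP Size: 64(0x40) KB")
new_dump = ("******* Agent 3 ******* Name: gfx1100 Uuid: GPU-c458f816e7264448 "
            "Pool Info: Pool 1 Segment: GLOBAL; FLAGS: COARSE GRAINED Size: 25149440(0x17fc000) KB "
            "Pool 2 Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED Size: 25149440(0x17fc000) KB "
            "Pool 3 Segment: GROUP Size: 64(0x40) KB")
print(added_pools(old_dump, new_dump))
```

On the dumps above, the visible difference appears to be an extra `GLOBAL; FLAGS: EXTENDED FINE GRAINED` pool on both GPU agents under 6.0.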
swarminglogic commented 9 months ago

Can confirm the same issue as described here with gfx1100 (7900 XTX) on Arch (6.6.7-arch1-1). The system also has a gfx1036 from the 7950X3D. Using ROCm 5.7.1. Overriding HIP_VISIBLE_DEVICES had no effect on the issue with blender-4.1, though it resolved PyTorch issues with Stable Diffusion.

Update: After disabling the iGPU in the BIOS, as also suggested in this thread, overriding HIP_VISIBLE_DEVICES is no longer necessary for PyTorch. However, the issue with HIP in Blender still persists.
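For reference, overriding `HIP_VISIBLE_DEVICES` just means setting that environment variable before the process starts. A minimal sketch follows; `env_with_visible_devices` is a hypothetical helper, and which index maps to the discrete GPU is an assumption that depends on the system.

```python
import os

def env_with_visible_devices(device_indices, base_env=None):
    """Return a copy of the environment that restricts HIP to the given device indices."""
    env = dict(os.environ if base_env is None else base_env)
    env["HIP_VISIBLE_DEVICES"] = ",".join(str(i) for i in device_indices)
    return env

# Index 0 is assumed here to be the discrete GPU; check the device order on your machine.
env = env_with_visible_devices([0], base_env={"PATH": "/usr/bin"})
print(env["HIP_VISIBLE_DEVICES"])

# To launch a program with the restricted environment, something like:
#   subprocess.run(["blender"], env=env_with_visible_devices([0]))
```

As noted above, this helped with PyTorch but did not change the Blender behavior.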

leucome commented 9 months ago

Yep, same with 5.7 and 6.0. I actually noticed the Blender freeze back when 5.7 came out, so I downgraded to 5.6.1 and thought something like that would be fixed quickly. When I tried ROCm 6.0, I was surprised that it is still there.

My PC is a Ryzen 5600X with a 7900 XT. The OS is Manjaro. I don't have any iGPU.

I would be curious whether anybody here with a 7900 XT/XTX has Blender on Linux working with ROCm 5.7 or 6.0. If so, what is your OS?

habernir commented 8 months ago

Same for me. Has anyone solved it on Ubuntu, or tried ROCm 5.7.3?

materusPL commented 7 months ago

After updating to kernel 6.7.5, it seems to work fine on HIP 5.7.3. I guess it was related to the amdgpu driver then.
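If you want to check whether a machine is on a kernel at least as new as the one reported working here, a sketch like the following compares release strings. Note that 6.7.5 is only the version one user reported as working, not a confirmed threshold for the fix; `kernel_tuple` and `at_least_reported` are hypothetical helper names.

```python
import platform
import re

# Kernel version reported as working in this thread; not a confirmed fix boundary.
REPORTED_WORKING = (6, 7, 5)

def kernel_tuple(release):
    """Extract (major, minor, patch) from a release string such as '6.6.7-arch1-1'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not match:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part or 0) for part in match.groups())

def at_least_reported(release):
    """True if the release is at or above the version reported as working."""
    return kernel_tuple(release) >= REPORTED_WORKING

print(at_least_reported("6.6.7-arch1-1"))  # kernel from an affected report above
print(at_least_reported("6.7.5"))
print(platform.release())                  # release string of the running system
```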

ppanchad-amd commented 3 months ago

@materusPL Please advise whether we can close this ticket now. Thanks!

materusPL commented 3 months ago

@ppanchad-amd Yes, I think it can be closed.