Closed: materusPL closed this issue 3 months ago
I tried with a 6900 XT, building Blender from the latest sources, and it seems to work fine.
I will try to get hold of a 7900 XTX with 5.7 to see if it is reproducible.
Same issue. OS: Manjaro. GPUs: RX 7900 XTX + iGPU from R9 7950X.
It would help if you could set the environment variables AMD_LOG_LEVEL=3 AMD_SERIALIZE_KERNEL=1 AMD_SERIALIZE_COPY=1, run Blender from the command line, and collect the output. Thanks.
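For anyone collecting the requested output, a minimal sketch of a wrapper that sets the three variables for a single command and saves everything it prints. The function name `run_with_hip_logging` and the log file name are illustrative assumptions, not part of ROCm or Blender.

```shell
# Run a command with the HIP debug variables requested above and keep a copy
# of its combined stdout/stderr. The function name is hypothetical.
run_with_hip_logging() {
    logfile="$1"
    shift
    # env applies the variables only to this one command.
    env AMD_LOG_LEVEL=3 AMD_SERIALIZE_KERNEL=1 AMD_SERIALIZE_COPY=1 "$@" 2>&1 | tee "$logfile"
}

# Example (not executed here):
#   run_with_hip_logging hip.log blender
```

The resulting file can then be attached to a comment instead of pasting the terminal output by hand.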
Hello,
Same issue with Arch Linux: in both Blender 3.6 and 4.0, switching Cycles from "CPU" to "GPU Compute" freezes Blender (Zl+ status in ps), and going to "Preferences -> System" freezes the entire OS (it is not even possible to log in remotely over SSH, so this is more than a graphical interface issue). Reverting the distribution to packages from 16 November 2023 (notably including HIP 5.6.1-1, build date 2 September 2023) solves the issue for both Blender 3.6 and 4.0.
@materusPL can you try running this with HIP_VISIBLE_DEVICES=0?
There are two GPUs on your system: Navi 31 + iGPU. ROCm does not officially support iGPUs, and this variable should hide the iGPU from the HIP runtime.
P.S. This is not an ideal way to debug, but I cannot seem to reproduce it on my local system.
After a system update, if the iGPU is enabled it just fails with something like hipInvalidDevice and no devices are listed in Blender's settings. HIP_VISIBLE_DEVICES=0 didn't change that.
I disabled the iGPU in the BIOS so that only the 7900 XTX should be visible. After doing that, the behaviour is the same as before: Blender freezes, but the log is much shorter.
Edit: Here's the error with the iGPU enabled: no crash, but no HIP devices to select.
With HIP 5.6 everything still works fine.
Does disabling IOMMU in the BIOS help? You can boot into the BIOS, search for IOMMU, and disable it.
Disabling IOMMU in the BIOS doesn't seem to change anything. After disabling IOMMU I checked with the iGPU both disabled and enabled; the behaviour is the same as in my previous comment.
Can confirm that this happens after installing a 7900 XTX, there is only one graphics device installed on this system (no iGPU, no crossfire).
This is consistently reproducible when hip-runtime-amd is installed; opening Blender's preferences is sufficient to trigger the crash (but I'm sure attempting a render would as well).
$ AMD_LOG_LEVEL=3 AMD_SERIALIZE_KERNEL=1 AMD_SERIALIZE_COPY=1 blender
Read prefs: "/home/cocaine/.config/blender/4.0/config/userpref.blend"
:3:rocdevice.cpp :442 : 1641939715 us: [pid:28657 tid:0x7f5236686000] Initializing HSA stack.
[1] 28657 terminated AMD_LOG_LEVEL=3 AMD_SERIALIZE_KERNEL=1 AMD_SERIALIZE_COPY=1 blender
dmesg output for posterity:
[Nov30 01:49] amdgpu 0000:28:00.0: amdgpu: bo 00000000414d2147 va 0x0800000000-0x0800000001 conflict with 0x0800000000-0x0800000840
[ +0.000011] amdgpu: Failed to map VA 0x800000000000 in vm. ret -22
[ +0.000002] amdgpu: Failed to map bo to gpuvm
[ +0.008763] BUG: kernel NULL pointer dereference, address: 0000000000000008
[ +0.000006] #PF: supervisor read access in kernel mode
[ +0.000003] #PF: error_code(0x0000) - not-present page
[ +0.000002] PGD 0 P4D 0
[ +0.000003] Oops: 0000 [#5] PREEMPT SMP NOPTI
[ +0.000004] CPU: 13 PID: 28657 Comm: blender Tainted: G D 6.6.2-zen1-1-zen #1 6d6e8f60a278275566e6df0e1eb1d11309a4619b
[ +0.000004] Hardware name: Micro-Star International Co., Ltd. MS-7B78/X470 GAMING PRO CARBON (MS-7B78), BIOS 2.I0 07/27/2022
[ +0.000002] RIP: 0010:dma_resv_add_fence+0x47/0x1e0
[ +0.000006] Code: 89 54 24 04 48 85 f6 74 21 48 8d 7e 38 b8 01 00 00 00 f0 0f c1 46 38 85 c0 0f 84 49 01 00 00 8d 50 01 09 c2 0f 88 4d 01 00 00 <49> 8b 45 08 48 3d 40 37 5a 8b 0f 84 c9 00 00 00 48 3d e0 36 5a 8b
[ +0.000003] RSP: 0018:ffffc90015c37c78 EFLAGS: 00010246
[ +0.000002] RAX: ffff888498308000 RBX: ffff888498308158 RCX: 000000000a75060d
[ +0.000003] RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffff888498308158
[ +0.000002] RBP: ffff8882a7163000 R08: 0000000000000000 R09: 000000000003a5f0
[ +0.000002] R10: ffff8883838c79c0 R11: 0000000000000000 R12: ffff888246fbbb38
[ +0.000002] R13: 0000000000000000 R14: ffff8882a7163730 R15: ffff888498308000
[ +0.000002] FS: 00007f5236686000(0000) GS:ffff888ffed40000(0000) knlGS:0000000000000000
[ +0.000002] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ +0.000002] CR2: 0000000000000008 CR3: 0000000457960000 CR4: 00000000003506e0
[ +0.000003] Call Trace:
[ +0.000002] <TASK>
[ +0.000004] ? __die+0x10f/0x120
[ +0.000004] ? page_fault_oops+0x171/0x4e0
[ +0.000004] ? srso_return_thunk+0x5/0x10
[ +0.000003] ? dma_fence_default_wait+0x8e/0x270
[ +0.000006] ? exc_page_fault+0x7f/0x180
[ +0.000003] ? asm_exc_page_fault+0x26/0x30
[ +0.000006] ? dma_resv_add_fence+0x47/0x1e0
[ +0.000006] amdgpu_amdkfd_gpuvm_acquire_process_vm+0x212/0x530 [amdgpu 0dd8c010605dbf7d42b1189a1f26ca6df6a6333a]
[ +0.000350] kfd_process_device_init_vm+0xb0/0x390 [amdgpu 0dd8c010605dbf7d42b1189a1f26ca6df6a6333a]
[ +0.000339] ? srso_return_thunk+0x5/0x10
[ +0.000004] kfd_ioctl_acquire_vm+0x89/0xc0 [amdgpu 0dd8c010605dbf7d42b1189a1f26ca6df6a6333a]
[ +0.000333] kfd_ioctl+0x3cb/0x4e0 [amdgpu 0dd8c010605dbf7d42b1189a1f26ca6df6a6333a]
[ +0.000309] ? __pfx_kfd_ioctl_acquire_vm+0x10/0x10 [amdgpu 0dd8c010605dbf7d42b1189a1f26ca6df6a6333a]
[ +0.000304] __x64_sys_ioctl+0x97/0xd0
[ +0.000004] do_syscall_64+0x60/0x90
[ +0.000005] ? srso_return_thunk+0x5/0x10
[ +0.000003] ? __x64_sys_ioctl+0xaf/0xd0
[ +0.000003] ? srso_return_thunk+0x5/0x10
[ +0.000002] ? syscall_exit_to_user_mode+0x2b/0x40
[ +0.000002] ? srso_return_thunk+0x5/0x10
[ +0.000002] ? do_syscall_64+0x6c/0x90
[ +0.000003] ? srso_return_thunk+0x5/0x10
[ +0.000002] ? exc_page_fault+0x7f/0x180
[ +0.000003] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ +0.000004] RIP: 0033:0x7f524b03d3af
[ +0.000016] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[ +0.000002] RSP: 002b:00007ffcadef39d0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ +0.000003] RAX: ffffffffffffffda RBX: 00007ffcadef3ac0 RCX: 00007f524b03d3af
[ +0.000002] RDX: 00007ffcadef3b40 RSI: 0000000040084b15 RDI: 000000000000000d
[ +0.000002] RBP: 00007ffcadef3b40 R08: 000000000000000c R09: 00007f5236685698
[ +0.000002] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000040084b15
[ +0.000002] R13: 000000000000000d R14: 00007f51e6713140 R15: 00007f520ca42180
[ +0.000005] </TASK>
[ +0.000001] Modules linked in: snd_seq_dummy snd_hrtimer snd_seq snd_seq_device cfg80211 rfkill 8021q garp mrp stp llc snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi intel_rapl_msr intel_rapl_common snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec edac_mce_amd snd_hda_core vfat snd_hwdep fat snd_pcm mousedev joydev ccp snd_timer amdgpu snd soundcore kvm irqbypass crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic gf128mul ghash_clmulni_intel sha512_ssse3 drm_exec amdxcp drm_buddy gpu_sched drm_suballoc_helper drm_ttm_helper usbhid ttm drm_display_helper igb aesni_intel cec crypto_simd k10temp cryptd i2c_algo_bit video dca sp5100_tco acpi_cpufreq pcspkr rapl wmi_bmof mac_hid gpio_amdpt gpio_generic uinput i2c_piix4 i2c_dev crypto_user fuse loop dm_mod ip_tables x_tables btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq nvme crc32c_intel mxm_wmi nvme_core xhci_pci xhci_pci_renesas nvme_common wmi
[ +0.000076] CR2: 0000000000000008
[ +0.000003] ---[ end trace 0000000000000000 ]---
[ +0.000002] RIP: 0010:dma_resv_add_fence+0x47/0x1e0
[ +0.000003] Code: 89 54 24 04 48 85 f6 74 21 48 8d 7e 38 b8 01 00 00 00 f0 0f c1 46 38 85 c0 0f 84 49 01 00 00 8d 50 01 09 c2 0f 88 4d 01 00 00 <49> 8b 45 08 48 3d 40 37 5a 8b 0f 84 c9 00 00 00 48 3d e0 36 5a 8b
[ +0.000002] RSP: 0018:ffffc90005c2bc98 EFLAGS: 00010246
[ +0.000002] RAX: ffff888109320000 RBX: ffff888109320158 RCX: 0000000001fb6c0b
[ +0.000002] RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffff888109320158
[ +0.000002] RBP: ffff8882a142d000 R08: 0000000000000000 R09: 000000000003a5f0
[ +0.000002] R10: ffff888335527c20 R11: 0000000000000000 R12: ffff8882a1813338
[ +0.000001] R13: 0000000000000000 R14: ffff8882a142d730 R15: ffff888109320000
[ +0.000002] FS: 00007f5236686000(0000) GS:ffff888ffed40000(0000) knlGS:0000000000000000
[ +0.000002] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ +0.000002] CR2: 0000000000000008 CR3: 0000000457960000 CR4: 00000000003506e0
[ +0.000002] note: blender[28657] exited with irqs disabled
I can, and will, gladly cooperate by providing whatever information or logs you want, just say the word.
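To make a trace like the one above easier to compare between machines, here is a small sketch that reduces dmesg output to the BUG line, the kernel-mode RIP, and the frames attributed to the amdgpu module. The function name `summarize_oops` is an illustrative assumption, and the patterns are tuned only to the format shown in this thread.

```shell
# Reduce a kernel oops to its key lines and strip the timestamp column.
# Reads dmesg text on stdin; the function name is hypothetical.
summarize_oops() {
    grep -E 'BUG:|RIP: 0010:|\[amdgpu' | sed -E 's/^\[[^]]*\] *//'
}

# Example (not executed here; dmesg may require root):
#   dmesg | summarize_oops
```

Applied to the trace above, this keeps the NULL pointer dereference line, the two `dma_resv_add_fence` RIP lines, and the `kfd_*` / `amdgpu_amdkfd_*` frames.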
With HIP 6.0 the problem still exists:
AMD_LOG_LEVEL=3 AMD_SERIALIZE_KERNEL=1 AMD_SERIALIZE_COPY=1 LD_LIBRARY_PATH=/opt/rocm-6.0.0/lib ./blender
Read prefs: "/home/materus/.config/blender/4.0/config/userpref.blend"
:3:rocdevice.cpp :442 : 0365231276 us: [pid:19635 tid:0x7f1dcaebe580] Initializing HSA stack.
:1:rocdevice.cpp :450 : 0365257328 us: [pid:19635 tid:0x7f1dcaebe580] hsa_init failed.
:1:runtime.cpp :78 : 0365257335 us: [pid:19635 tid:0x7f1dcaebe580] Runtime initialization failed
:3:hip_context.cpp :153 : 0365257343 us: [pid:19635 tid:0x7f1dcaebe580] hipInit: Returned hipErrorInvalidDevice :
HIP hipInit: Invalid device
Also, here's the log with version 5.6.1, which works fine:
AMD_LOG_LEVEL=3 AMD_SERIALIZE_KERNEL=1 AMD_SERIALIZE_COPY=1 LD_LIBRARY_PATH=/opt/rocm-5.6.1/lib ./blender
Read prefs: "/home/materus/.config/blender/4.0/config/userpref.blend"
:3:rocdevice.cpp :434 : 0385852298 us: 19822: [tid:0x7f80cb078580] Initializing HSA stack.
:3:comgrctx.cpp :33 : 0385867501 us: 19822: [tid:0x7f80cb078580] Loading COMGR library.
:3:rocdevice.cpp :200 : 0385867561 us: 19822: [tid:0x7f80cb078580] Numa selects cpu agent[0]=0x7f806dce1000(fine=0x7f8085030b40,coarse=0x7f8085030dc0) for gpu agent=0x7f806d7ff700
:3:rocdevice.cpp :1634: 0385867858 us: 19822: [tid:0x7f80cb078580] HMM support: 0, xnack: 0, direct host access: 0
:3:rocdevice.cpp :200 : 0385868722 us: 19822: [tid:0x7f80cb078580] Numa selects cpu agent[0]=0x7f806dce1000(fine=0x7f8085030b40,coarse=0x7f8085030dc0) for gpu agent=0x7f806d7ffc00
:3:rocdevice.cpp :1634: 0385868779 us: 19822: [tid:0x7f80cb078580] HMM support: 0, xnack: 0, direct host access: 0
:3:hip_context.cpp :48 : 0385869307 us: 19822: [tid:0x7f80cb078580] Direct Dispatch: 1
:3:hip_context.cpp :159 : 0385869323 us: 19822: [tid:0x7f80cb078580] hipInit: Returned hipSuccess :
:3:hip_device_runtime.cpp :546 : 0385869341 us: 19822: [tid:0x7f80cb078580] hipGetDeviceCount ( 0x7ffc183dcc5c )
:3:hip_device_runtime.cpp :548 : 0385869344 us: 19822: [tid:0x7f80cb078580] hipGetDeviceCount: Returned hipSuccess :
:3:hip_device.cpp :235 : 0385869350 us: 19822: [tid:0x7f80cb078580] hipDeviceGetName ( 0x7ffc183dcd70, 256, 0 )
:3:hip_device.cpp :255 : 0385869353 us: 19822: [tid:0x7f80cb078580] hipDeviceGetName: Returned hipSuccess :
:3:hip_device_runtime.cpp :141 : 0385869358 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute ( 0x7ffc183dcc64, 23, 0 )
:3:hip_device_runtime.cpp :351 : 0385869362 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute: Returned hipSuccess :
:3:hip_device_runtime.cpp :141 : 0385869365 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute ( 0x7ffc183dccd0, 61, 0 )
:3:hip_device_runtime.cpp :351 : 0385869368 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute: Returned hipSuccess :
:3:hip_peer.cpp :176 : 0385869376 us: 19822: [tid:0x7f80cb078580] hipDeviceCanAccessPeer ( 0x7ffc183dccb0, 0, 1 )
:3:hip_peer.cpp :177 : 0385869379 us: 19822: [tid:0x7f80cb078580] hipDeviceCanAccessPeer: Returned hipSuccess :
:3:hip_device_runtime.cpp :141 : 0385869383 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute ( 0x7ffc183dcc64, 69, 0 )
:3:hip_device_runtime.cpp :348 : 0385869386 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute: Returned hipErrorInvalidValue :
:3:hip_device_runtime.cpp :141 : 0385869390 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute ( 0x7ffc183dcc68, 67, 0 )
:3:hip_device_runtime.cpp :351 : 0385869393 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute: Returned hipSuccess :
:3:hip_device_runtime.cpp :141 : 0385869397 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute ( 0x7ffc183dcc6c, 68, 0 )
:3:hip_device_runtime.cpp :351 : 0385869400 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute: Returned hipSuccess :
:3:hip_device_runtime.cpp :141 : 0385869406 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute ( 0x7ffc183dcc60, 18, 0 )
:3:hip_device_runtime.cpp :351 : 0385869409 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute: Returned hipSuccess :
:3:hip_device.cpp :235 : 0385869419 us: 19822: [tid:0x7f80cb078580] hipDeviceGetName ( 0x7ffc183dcd70, 256, 1 )
:3:hip_device.cpp :255 : 0385869422 us: 19822: [tid:0x7f80cb078580] hipDeviceGetName: Returned hipSuccess :
:3:hip_device_runtime.cpp :141 : 0385869426 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute ( 0x7ffc183dcc64, 23, 1 )
:3:hip_device_runtime.cpp :351 : 0385869429 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute: Returned hipSuccess :
:3:hip_device_runtime.cpp :141 : 0385869433 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute ( 0x7ffc183dccd0, 61, 1 )
:3:hip_device_runtime.cpp :351 : 0385869436 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute: Returned hipSuccess :
:3:hip_peer.cpp :176 : 0385869440 us: 19822: [tid:0x7f80cb078580] hipDeviceCanAccessPeer ( 0x7ffc183dccb0, 1, 0 )
:3:hip_peer.cpp :177 : 0385869443 us: 19822: [tid:0x7f80cb078580] hipDeviceCanAccessPeer: Returned hipSuccess :
:3:hip_device_runtime.cpp :141 : 0385869447 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute ( 0x7ffc183dcc64, 69, 1 )
:3:hip_device_runtime.cpp :348 : 0385869450 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute: Returned hipErrorInvalidValue :
:3:hip_device_runtime.cpp :141 : 0385869454 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute ( 0x7ffc183dcc68, 67, 1 )
:3:hip_device_runtime.cpp :351 : 0385869457 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute: Returned hipSuccess :
:3:hip_device_runtime.cpp :141 : 0385869460 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute ( 0x7ffc183dcc6c, 68, 1 )
:3:hip_device_runtime.cpp :351 : 0385869464 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute: Returned hipSuccess :
:3:hip_device_runtime.cpp :141 : 0385869468 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute ( 0x7ffc183dcc60, 18, 1 )
:3:hip_device_runtime.cpp :351 : 0385869471 us: 19822: [tid:0x7f80cb078580] hipDeviceGetAttribute: Returned hipSuccess :
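When comparing two long logs like these, it can help to keep only the lines that report a problem. The sketch below assumes that a leading `:1:` marks an error-level message and that failing API calls are logged as `Returned hipError...`, as in the logs above; the function name `hip_failures` is hypothetical.

```shell
# Keep error-level lines and API calls that did not return hipSuccess.
# Reads an AMD_LOG_LEVEL=3 log on stdin; the function name is hypothetical.
hip_failures() {
    grep -E '^:1:|Returned hipError'
}

# Example (not executed here):
#   hip_failures < hip.log
```

Not every match is fatal: the working 5.6.1 log above also contains `hipErrorInvalidValue` for attribute 69, so the output is a starting point for comparison rather than a verdict.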
@materusPL the hsa_init seems to have failed. With 6.0, can you check rocminfo to see if you are getting correct results?
rocminfo seems to return slightly different results on 5.6 and 6.0, but other than that it seems to work fine.
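One quick way to compare rocminfo between two ROCm installs is to reduce each output to the set of gfx targets it reports. This is a sketch only: the function name `list_gfx_targets` is hypothetical, and the `/opt/rocm-<version>/bin/rocminfo` paths in the example are an assumption based on the `/opt/rocm-<version>/lib` paths used earlier in this thread.

```shell
# Print the distinct gfx targets mentioned in rocminfo output read from stdin.
# The function name is hypothetical.
list_gfx_targets() {
    grep -oE 'gfx[0-9a-f]+' | sort -u
}

# Example (not executed here; the install paths are an assumption):
#   /opt/rocm-5.6.1/bin/rocminfo | list_gfx_targets
#   /opt/rocm-6.0.0/bin/rocminfo | list_gfx_targets
```

If both versions list the same targets, the difference is more likely in runtime initialization than in device enumeration.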
Can confirm the same issue as described here with gfx1100 (7900 XTX) on Arch (6.6.7-arch1-1). The system also has a gfx1036 from the 7950X3D. Using ROCm 5.7.1. Overriding HIP_VISIBLE_DEVICES had no effect on the issue with blender-4.1, though it resolved PyTorch issues with Stable Diffusion.
Update: After disabling the iGPU in the BIOS, as also suggested in this thread, overriding HIP_VISIBLE_DEVICES is no longer necessary for PyTorch. However, the issue with HIP in Blender still persists.
Yep, same with 5.7 and 6.0. I actually noticed the Blender freeze back when 5.7 came out, so I downgraded to 5.6.1 and thought something like that would be fixed quickly. When I tried ROCm 6.0 I was surprised that it is still there.
My PC is a Ryzen 5600X with a 7900 XT. The OS is Manjaro. I don't have any iGPU.
I would be curious whether there is somebody here with a 7900 XT/XTX who has Blender on Linux working with ROCm 5.7 or 6.0. If so, what is your OS?
Same for me. Has anyone solved it on Ubuntu, or tried ROCm 5.7.3?
After updating to kernel 6.7.5, it seems to work fine on HIP 5.7.3. I guess it was related to the amdgpu driver, then.
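For readers who want to check their own kernel against the version reported as working here, a minimal sketch follows. It assumes GNU `sort -V` is available; the function name `kernel_at_least` is hypothetical, and 6.7.5 is only the version one user reported success with, not a confirmed fix boundary.

```shell
# Succeed if the kernel release is at least the wanted version.
# Second argument defaults to the running kernel; the function name is hypothetical.
kernel_at_least() {
    wanted="$1"
    current="${2:-$(uname -r)}"
    current="${current%%-*}"   # drop distro suffixes such as -zen1-1-zen
    # The lowest of the two versions must be the wanted one.
    [ "$(printf '%s\n' "$wanted" "$current" | sort -V | head -n 1)" = "$wanted" ]
}

# Example (not executed here):
#   kernel_at_least 6.7.5 && echo "kernel is 6.7.5 or newer"
```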
@materusPL Please advise if we can close this ticket now? Thanks!
@ppanchad-amd Yes, I think it can be closed.
After updating to HIP 5.7, Blender freezes when the preferences window is opened if HIP was already selected. If HIP was not selected, it freezes after selecting HIP and closing the preferences window. Everything works fine on HIP 5.6 or older.
OS: NixOS; also checked in an Ubuntu container with 5.7 and older versions. GPUs: RX 7900 XTX + iGPU from R9 7950X.
Dmesg log after freeze: dmesg.log
Edit: I also tried disabling the iGPU in the BIOS and stubbing the 7900 XTX, but it made no difference.