Closed lucasew closed 1 year ago
Can you use the hip from nixos-unstable and tell me if it still gives you that error? Looking through my PRs concerning the ROCm packages, I can't find anything that could cause this aside from possibly #206421, and that's not in master yet. Also try using hip from staging (#206421) if you can, and see if that works. From what I can see, you're using a Docker container, and that should have its own HIP, which may be the problem instead of nixpkgs' hip.
I should also mention that I am working on native ROCm support for pytorch and tensorflow in nixpkgs so you don't need to use those docker containers, but that's going to take some time.
Also try `export HSA_OVERRIDE_GFX_VERSION=9.0.0` instead.
As far as I can see, a Ryzen 5600G has a Vega GPU (gfx9), so I'm not surprised that everything crashes when you force gfx10.3 behavior (two generations later) with `HSA_OVERRIDE_GFX_VERSION=10.3.0` :)
It seems to be a gfx90c card, so `HSA_OVERRIDE_GFX_VERSION=9.0.12` should be more correct.
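For context, the override string encodes the gfx target's major version, minor version, and stepping, with the stepping read as hex (so gfx90c maps to 9.0.12). This helper is my own illustration of that naming convention, not an official ROCm API:

```python
def gfx_to_override(gfx_name: str) -> str:
    """Convert an LLVM gfx target name into an HSA_OVERRIDE_GFX_VERSION string.

    The last two characters of the target are the minor version and the
    stepping (a hex digit); everything before them is the major version.
    """
    ver = gfx_name.removeprefix("gfx")
    major, minor, stepping = ver[:-2], ver[-2], ver[-1]
    return f"{int(major)}.{int(minor)}.{int(stepping, 16)}"

print(gfx_to_override("gfx90c"))   # -> 9.0.12
print(gfx_to_override("gfx1030"))  # -> 10.3.0
```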
> As far as I see, a Ryzen 5600G has a Vega GPU (gfx9), so I'm not surprised that everything crashes when you force gfx10.3 behavior (two generations later) with `HSA_OVERRIDE_GFX_VERSION=10.3.0` :)
> It seems to be a gfx90c card, so `HSA_OVERRIDE_GFX_VERSION=9.0.12` should be more correct.
About this generation thing, I have no idea what I'm doing xD. I just saw people mentioning it on the Internet and decided to try.
> Can you use the hip from nixos-unstable and tell me if it still gives you that error? Looking through my PRs concerning the ROCm packages I can't find anything that could cause this aside from possibly #206421, and that's not in master yet. Also try using hip from staging (#206421) if you can, see if that works. From what I can see, you're using a docker container and that should have its own hip, which may be the problem instead of nixpkgs' hip.
Switched to latest unstable rn.

With `HSA_OVERRIDE_GFX_VERSION=9.0.12` and `HSA_OVERRIDE_GFX_VERSION=9.0.0`:

```
>>> import torch
"hipErrorNoBinaryForGpu: Unable to find code object for all current devices!"
Aborted (core dumped)
```

With `HSA_OVERRIDE_GFX_VERSION=10.3.0`:
[ 306.174866] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:221 vmid:8 pasid:32771, for process python pid 2315 thread python pid 2315)
[ 306.174872] amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[ 306.174879] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x008012B1
[ 306.174881] amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: SQC (inst) (0x9)
[ 306.174882] amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x1
[ 306.174883] amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
[ 306.174884] amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0xb
[ 306.174885] amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 306.174886] amdgpu 0000:07:00.0: amdgpu: RW: 0x0
[ 306.174889] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:221 vmid:8 pasid:32771, for process python pid 2315 thread python pid 2315)
[ 306.174891] amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[ 306.174898] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 306.174899] amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 306.174900] amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x0
[ 306.174901] amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
[ 306.174902] amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 306.174903] amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 306.174904] amdgpu 0000:07:00.0: amdgpu: RW: 0x0
[ 306.174906] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:221 vmid:8 pasid:32771, for process python pid 2315 thread python pid 2315)
[ 306.174907] amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[ 306.174914] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 306.174915] amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 306.174916] amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x0
[ 306.174917] amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
[ 306.174918] amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 306.174918] amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 306.174919] amdgpu 0000:07:00.0: amdgpu: RW: 0x0
[ 306.174922] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 2315 thread python pid 2315)
[ 306.174924] amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[ 306.174931] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 306.174931] amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 306.174932] amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x0
[ 306.174933] amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
[ 306.174934] amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 306.174935] amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 306.174936] amdgpu 0000:07:00.0: amdgpu: RW: 0x0
[ 306.174937] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 2315 thread python pid 2315)
[ 306.174939] amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[ 306.174945] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 306.174946] amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 306.174947] amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x0
[ 306.174948] amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
[ 306.174949] amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 306.174950] amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 306.174951] amdgpu 0000:07:00.0: amdgpu: RW: 0x0
[ 306.174952] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 2315 thread python pid 2315)
[ 306.174954] amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[ 306.174960] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 306.174961] amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 306.174962] amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x0
[ 306.174963] amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
[ 306.174964] amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 306.174965] amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 306.174965] amdgpu 0000:07:00.0: amdgpu: RW: 0x0
[ 306.174967] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 2315 thread python pid 2315)
[ 306.174968] amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[ 306.174975] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 306.174976] amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 306.174977] amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x0
[ 306.174977] amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
[ 306.174978] amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 306.174979] amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 306.174980] amdgpu 0000:07:00.0: amdgpu: RW: 0x0
[ 306.174981] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 2315 thread python pid 2315)
[ 306.174983] amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[ 306.174989] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 306.174990] amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 306.174991] amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x0
[ 306.174992] amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
[ 306.174993] amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 306.174994] amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 306.174995] amdgpu 0000:07:00.0: amdgpu: RW: 0x0
[ 310.174910] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[ 310.174915] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 1, cu_id 4, err_type 2
[ 310.174918] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 2, cu_id 4, err_type 2
[ 310.174919] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 3, cu_id 4, err_type 2
[ 310.174920] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 0, cu_id 4, err_type 2
[ 310.174921] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 1, cu_id 3, err_type 2
[ 310.174922] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 2, cu_id 3, err_type 2
[ 310.174923] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 1, cu_id 6, err_type 2
[ 310.174923] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 2, cu_id 6, err_type 2
[ 310.174924] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 3, cu_id 6, err_type 2
[ 310.174925] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 0, cu_id 6, err_type 2
[ 310.174926] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 1, cu_id 5, err_type 2
[ 310.174927] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 2, cu_id 5, err_type 2
[ 310.174927] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 3, cu_id 5, err_type 2
[ 310.174928] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 0, cu_id 5, err_type 2
[ 310.174929] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 3, cu_id 3, err_type 2
[ 310.174930] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 0, cu_id 3, err_type 2
[ 310.174931] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 1, cu_id 2, err_type 2
[ 310.174931] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 2, cu_id 2, err_type 2
[ 310.174932] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 0, cu_id 2, err_type 2
[ 310.174933] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 3, cu_id 2, err_type 2
[ 310.174934] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 1, cu_id 1, err_type 2
[ 310.174935] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 2, cu_id 1, err_type 2
[ 310.174936] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 3, cu_id 1, err_type 2
[ 310.174936] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 0, cu_id 1, err_type 2
[ 310.174937] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 1, cu_id 0, err_type 2
[ 310.174938] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 2, cu_id 0, err_type 2
[ 310.174939] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 3, cu_id 0, err_type 2
[ 310.174940] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 0, cu_id 0, err_type 2
[ 312.816528] ------------[ cut here ]------------
[ 312.816531] WARNING: CPU: 2 PID: 2329 at kernel/workqueue.c:3083 flush_work.isra.0+0x21f/0x230
[ 312.816537] Modules linked in: af_packet nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter overlay xt_CHECKSUM xt_MASQUERADE ipt_REJECT nf_reject_ipv4 nft_chain_nat nf_nat rfkill nls_iso8859_1 nls_cp437 vfat fat ip6_tables xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 snd_hda_codec_realtek ip6t_rpfilter ipt_rpfilter snd_hda_codec_generic ledtrig_audio led_class snd_hda_codec_hdmi xt_pkttype xt_LOG nf_log_syslog xt_tcpudp nft_compat snd_hda_intel snd_intel_dspcfg nft_counter intel_rapl_msr snd_intel_sdw_acpi snd_hda_codec edac_mce_amd evdev wmi_bmof mac_hid edac_core intel_rapl_common snd_hda_core crc32_pclmul ghash_clmulni_intel aesni_intel snd_hwdep snd_pcm libaes crypto_simd r8169 cryptd nf_tables rapl realtek snd_timer libcrc32c mdio_devres sp5100_tco watchdog snd sch_fq_codel nfnetlink libphy soundcore k10temp i2c_piix4 video gpio_amdpt gpio_generic pinctrl_amd tiny_power_button wmi acpi_cpufreq button ctr atkbd libps2 serio loop veth bridge stp llc tun
[ 312.816570] vboxnetflt(O) vboxnetadp(O) vboxdrv(O) kvm_amd ccp rng_core kvm irqbypass fuse deflate efi_pstore pstore configfs efivarfs dmi_sysfs ip_tables x_tables autofs4 ext4 crc32c_generic crc16 mbcache jbd2 sd_mod xhci_pci xhci_pci_renesas xhci_hcd ahci libahci libata nvme usbcore crc32c_intel scsi_mod nvme_core t10_pi crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common usb_common scsi_common rtc_cmos dm_mod amdgpu drm_ttm_helper ttm agpgart iommu_v2 gpu_sched i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm i2c_core backlight
[ 312.816595] CPU: 2 PID: 2329 Comm: python Tainted: G W O 5.15.83 #1-NixOS
[ 312.816597] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450M Steel Legend, BIOS P4.00 05/06/2021
[ 312.816598] RIP: 0010:flush_work.isra.0+0x21f/0x230
[ 312.816600] Code: 8b 4d 00 4c 8b 45 08 89 ca 48 c1 e9 04 83 e2 08 83 e1 0f 83 ca 02 89 c8 48 0f ba 6d 00 03 e9 13 ff ff ff 0f 0b e9 45 ff ff ff <0f> 0b 45 31 ed e9 3b ff ff ff e8 e2 31 81 00 66 90 0f 1f 44 00 00
[ 312.816601] RSP: 0018:ffffb14001cb7b28 EFLAGS: 00010246
[ 312.816602] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 312.816603] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff92872a69ab18
[ 312.816604] RBP: ffff92872a69ab18 R08: 0000000000000000 R09: ffffffff96450b50
[ 312.816604] R10: 0000000000000000 R11: 0000000000000000 R12: ffff92872a69ab18
[ 312.816605] R13: 0000000000000001 R14: 0000000000000003 R15: ffff928705e5272c
[ 312.816606] FS: 0000000000000000(0000) GS:ffff928a0e280000(0000) knlGS:0000000000000000
[ 312.816606] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 312.816607] CR2: 0000000000d133a0 CR3: 000000012388a000 CR4: 0000000000750ee0
[ 312.816608] PKRU: 55555554
[ 312.816609] Call Trace:
[ 312.816611] <TASK>
Edit 1: I am now switching it to staging. The build hasn't started screaming (yet).
> About this generation thing I have no idea what I am doing xD just saw people mentioning this on the Internet and decided to try.
I'm in the same boat, it's how #197885 started lol.
Anyway, I think I gave you bad advice. While you should still try staging and the other things, please try Flakebi's suggestion first, as it's likely the actual problem.
Nevermind, there it is, my bad reading comprehension again.
> rocm/pytorch:latest

Try without the `latest` tag; again, this should just be an issue with the Docker container.
Same problem on staging
I haven't gotten tensorflow working yet, but you should be able to use pytorch once the next staging-next cycle and #206995 are merged.
If you wanna test now, see: https://github.com/Madouura/nixpkgs/commit/df71e711026a37178f9a258f236db0e1a66e2f0b
You may need to add `roctracer` and `rccl` to `LD_LIBRARY_PATH`.
I think I found a bug in `nix shell`:

```
lucasew@whiterun ~ 0$ nix shell github:Madouura/nixpkgs/df71e711026a37178f9a258f236db0e1a66e2f0b#legacyPackages.x86_64-linux.{python3Packages.torchWithRocm,roctracer,rccl,python3} -c python
Python 3.10.9 (main, Dec  6 2022, 18:44:57) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'torch'
```
I haven't gotten that problem, I may have linked you a bad build. Try https://github.com/Madouura/nixpkgs/commit/f6d4e98b49a52fe564b832e20527b527fa2c90a6.
Oh, this is interesting. I didn't realize `nix shell` was supposed to propagate. That explains a lot and may be linked to some of the issues I've had in #206995.
Tested with the following shell.nix (workaround for that issue):

```nix
{ pkgs ? import (builtins.fetchTarball "https://github.com/Madouura/nixpkgs/archive/f6d4e98b49a52fe564b832e20527b527fa2c90a6.tar.gz") {} }:
pkgs.mkShell {
  buildInputs = with pkgs; [
    python3Packages.torchWithRocm
  ];
}
```
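Another way to sketch roughly the same environment (untested, and assuming `torchWithRocm` exists in `python3Packages` at that commit) is to bake torch directly into the interpreter with `python3.withPackages`, so nothing depends on environment propagation:

```nix
{ pkgs ? import (builtins.fetchTarball "https://github.com/Madouura/nixpkgs/archive/f6d4e98b49a52fe564b832e20527b527fa2c90a6.tar.gz") {} }:
pkgs.mkShell {
  # withPackages builds a python whose site-packages already contains
  # torch, so no PATH/PYTHONPATH propagation from the shell is needed.
  buildInputs = [
    (pkgs.python3.withPackages (ps: [ ps.torchWithRocm ]))
  ];
}
```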
Same problem as with the container so far. But I returned to stable; I will try with the latest staging commit.
Try this:

```
nix-shell -I nixpkgs=${nixpkgs-at-f6d4e98b49a52fe564b832e20527b527fa2c90a6} -p python3Packages.torchWithRocm
python ./benchmark.py
```
```python
import torch, timeit

print(f"CUDA support: {torch.cuda.is_available()} (Should be \"True\")")
print(f"CUDA version: {torch.version.cuda} (Should be \"None\")")
print(f"HIP version: {torch.version.hip} (Should contain \"5.4\")")

# Storing ID of current CUDA device
cuda_id = torch.cuda.current_device()
print(f"Current CUDA device ID: {torch.cuda.current_device()}")
print(f"Current CUDA device name: {torch.cuda.get_device_name(cuda_id)} (Should be AMD, not NVIDIA)")

def batched_dot_mul_sum(a, b):
    '''Computes batched dot by multiplying and summing'''
    return a.mul(b).sum(-1)

def batched_dot_bmm(a, b):
    '''Computes batched dot by reducing to bmm'''
    a = a.reshape(-1, 1, a.shape[-1])
    b = b.reshape(-1, b.shape[-1], 1)
    return torch.bmm(a, b).flatten(-3)

x = torch.randn(10000, 1024, device='cuda')

t0 = timeit.Timer(
    stmt='batched_dot_mul_sum(x, x)',
    setup='from __main__ import batched_dot_mul_sum',
    globals={'x': x})

t1 = timeit.Timer(
    stmt='batched_dot_bmm(x, x)',
    setup='from __main__ import batched_dot_bmm',
    globals={'x': x})

# Run each twice to show the difference before/after warmup
print(f'mul_sum(x, x): {t0.timeit(100) / 100 * 1e6:>5.1f} us')
print(f'mul_sum(x, x): {t0.timeit(100) / 100 * 1e6:>5.1f} us')
print(f'bmm(x, x):     {t1.timeit(100) / 100 * 1e6:>5.1f} us')
print(f'bmm(x, x):     {t1.timeit(100) / 100 * 1e6:>5.1f} us')
```
If everything is working, the output should match what's in the parentheses, and if you have something like `corectrl`, you'll see a GPU frequency spike while it runs.
If that still doesn't work, it may honestly just be that the Ryzen 5600G isn't supported. It theoretically should be, though, since it's Vega IIRC.
@Flakebi If you have an AMD GPU, could you run this check/benchmark as well, to confirm it isn't working only for me?
Same problem.
Built my NixOS config against staging right after #206421 was merged, because the latest staging failed mid-build due to an unrelated package.
This is the shell.nix I am using to provision torch based on the commit you mentioned:
```nix
let
  nixpkgs = builtins.fetchTarball "https://github.com/NixOS/nixpkgs/archive/f6d4e98b49a52fe564b832e20527b527fa2c90a6.tar.gz";
  pkgs = import nixpkgs { };
in pkgs.mkShell {
  buildInputs = with pkgs; [ python3Packages.torchWithRocm ];
}
```
This is my Python prompt after entering the shell.nix above with `nix-shell`:

```
lucasew@whiterun ~/demo-hip-issue 0$ nix-shell
(shell:impure) lucasew@whiterun ~/demo-hip-issue 0$ python
Python 3.10.9 (main, Dec  6 2022, 18:44:57) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.tensor([[1,2],[3,4]]).to(torch.device('cuda')
...
... )
Memory access fault by GPU node-1 (Agent handle: 0x7817470) on address 0x735d000. Reason: Unknown.
Aborted (core dumped)
```
Whiterun is running https://github.com/lucasew/nixcfg/tree/811c58b6b9c743fab692fb3fc7817ded83974b6c
And this is what I got in the dmesg right after I ran that Python snippet.
[ 292.842655] gmc_v9_0_process_interrupt: 34 callbacks suppressed
[ 292.842658] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:157 vmid:8 pasid:32771, for process python pid 5974 thread python pid 5974)
[ 292.842662] amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x000000000735d000 from IH client 0x1b (UTCL2)
[ 292.842670] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00801031
[ 292.842670] amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
[ 292.842671] amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x1
[ 292.842672] amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
[ 292.842672] amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[ 292.842673] amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 292.842673] amdgpu 0000:07:00.0: amdgpu: RW: 0x0
[ 292.842675] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 5974 thread python pid 5974)
[ 292.842677] amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x000000000735d000 from IH client 0x1b (UTCL2)
[ 292.842683] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 292.842684] amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 292.842684] amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x0
[ 292.842685] amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
[ 292.842685] amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 292.842686] amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 292.842686] amdgpu 0000:07:00.0: amdgpu: RW: 0x0
[ 292.842687] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 5974 thread python pid 5974)
[ 292.842689] amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x000000000735d000 from IH client 0x1b (UTCL2)
[ 292.842695] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 292.842695] amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 292.842696] amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x0
[ 292.842696] amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
[ 292.842697] amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 292.842697] amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 292.842698] amdgpu 0000:07:00.0: amdgpu: RW: 0x0
[ 292.842698] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 5974 thread python pid 5974)
[ 292.842699] amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x000000000735d000 from IH client 0x1b (UTCL2)
[ 292.842705] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 292.842706] amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 292.842706] amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x0
[ 292.842707] amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
[ 292.842707] amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 292.842708] amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 292.842708] amdgpu 0000:07:00.0: amdgpu: RW: 0x0
[ 292.842709] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 5974 thread python pid 5974)
[ 292.842710] amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x000000000735d000 from IH client 0x1b (UTCL2)
[ 292.842716] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 292.842716] amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 292.842717] amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x0
[ 292.842717] amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
[ 292.842718] amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 292.842718] amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 292.842719] amdgpu 0000:07:00.0: amdgpu: RW: 0x0
[ 292.842720] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 5974 thread python pid 5974)
[ 292.842721] amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x000000000735d000 from IH client 0x1b (UTCL2)
[ 292.842726] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 292.842727] amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 292.842728] amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x0
[ 292.842728] amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
[ 292.842728] amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 292.842729] amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 292.842729] amdgpu 0000:07:00.0: amdgpu: RW: 0x0
[ 292.842730] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 5974 thread python pid 5974)
[ 292.842731] amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x000000000735d000 from IH client 0x1b (UTCL2)
[ 292.842737] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 292.842738] amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 292.842738] amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x0
[ 292.842739] amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
[ 292.842739] amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 292.842740] amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 292.842740] amdgpu 0000:07:00.0: amdgpu: RW: 0x0
[ 292.842741] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 5974 thread python pid 5974)
[ 292.842742] amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x000000000735d000 from IH client 0x1b (UTCL2)
[ 292.842745] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 292.842745] amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 292.842746] amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x0
[ 292.842746] amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
[ 292.842747] amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 292.842747] amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 292.842748] amdgpu 0000:07:00.0: amdgpu: RW: 0x0
[ 292.842750] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 5974 thread python pid 5974)
[ 292.842751] amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x000000000735d000 from IH client 0x1b (UTCL2)
[ 292.842754] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 292.842754] amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 292.842755] amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x0
[ 292.842755] amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
[ 292.842756] amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 292.842756] amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 292.842757] amdgpu 0000:07:00.0: amdgpu: RW: 0x0
[ 292.842758] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 5974 thread python pid 5974)
[ 292.842759] amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x000000000735d000 from IH client 0x1b (UTCL2)
[ 292.842762] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 292.842763] amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 292.842763] amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x0
[ 292.842764] amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
[ 292.842764] amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 292.842765] amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 292.842765] amdgpu 0000:07:00.0: amdgpu: RW: 0x0
[ 294.367109] ------------[ cut here ]------------
[ 294.367112] WARNING: CPU: 10 PID: 5999 at kernel/workqueue.c:3083 __flush_work.isra.0+0x21f/0x230
[ 294.367118] Modules linked in: af_packet nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter overlay xt_CHECKSUM xt_MASQUERADE ipt_REJECT nf_reject_ipv4 nft_chain_nat nf_nat rfkill nls_iso8859_1 nls_cp437 vfat fat ip6_tables snd_hda_codec_realtek xt_conntrack nf_conntrack snd_hda_codec_generic nf_defrag_ipv6 ledtrig_audio led_class nf_defrag_ipv4 snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi ip6t_rpfilter intel_rapl_msr ipt_rpfilter snd_hda_codec edac_mce_amd edac_core wmi_bmof snd_hda_core intel_rapl_common xt_pkttype crc32_pclmul ghash_clmulni_intel evdev snd_hwdep mac_hid aesni_intel xt_LOG snd_pcm nf_log_syslog r8169 libaes crypto_simd cryptd xt_tcpudp sp5100_tco watchdog realtek nft_compat snd_timer rapl mdio_devres nft_counter snd k10temp i2c_piix4 libphy wmi soundcore video gpio_amdpt tiny_power_button gpio_generic pinctrl_amd button acpi_cpufreq nf_tables libcrc32c nfnetlink sch_fq_codel ctr atkbd libps2 serio loop veth bridge stp llc tun
[ 294.367154] vboxnetflt(O) vboxnetadp(O) vboxdrv(O) kvm_amd ccp rng_core kvm irqbypass fuse deflate efi_pstore pstore configfs efivarfs dmi_sysfs ip_tables x_tables autofs4 ext4 crc32c_generic crc16 mbcache jbd2 sd_mod xhci_pci xhci_pci_renesas xhci_hcd ahci libahci libata usbcore nvme crc32c_intel scsi_mod nvme_core t10_pi crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common usb_common scsi_common rtc_cmos dm_mod amdgpu drm_ttm_helper ttm agpgart iommu_v2 gpu_sched i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm i2c_core backlight
[ 294.367178] CPU: 10 PID: 5999 Comm: python Tainted: G O 5.15.83 #1-NixOS
[ 294.367180] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450M Steel Legend, BIOS P4.00 05/06/2021
[ 294.367181] RIP: 0010:__flush_work.isra.0+0x21f/0x230
[ 294.367183] Code: 8b 4d 00 4c 8b 45 08 89 ca 48 c1 e9 04 83 e2 08 83 e1 0f 83 ca 02 89 c8 48 0f ba 6d 00 03 e9 13 ff ff ff 0f 0b e9 45 ff ff ff <0f> 0b 45 31 ed e9 3b ff ff ff e8 e2 31 81 00 66 90 0f 1f 44 00 00
[ 294.367184] RSP: 0018:ffffb6b381d9fb28 EFLAGS: 00010246
[ 294.367186] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 294.367186] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff953eb5a54718
[ 294.367187] RBP: ffff953eb5a54718 R08: 0000000000000000 R09: ffffffff99650b50
[ 294.367187] R10: 0000000000000000 R11: 0000000000000000 R12: ffff953eb5a54718
[ 294.367188] R13: 0000000000000001 R14: 0000000000000003 R15: ffff953e98cb7bac
[ 294.367189] FS: 0000000000000000(0000) GS:ffff95418e280000(0000) knlGS:0000000000000000
[ 294.367190] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 294.367190] CR2: 00007fb0193dbff8 CR3: 00000001014b6000 CR4: 0000000000750ee0
[ 294.367191] PKRU: 55555554
[ 294.367192] Call Trace:
[ 294.367194] <TASK>
[ 294.367196] ? del_timer+0x55/0x80
[ 294.367199] __cancel_work_timer+0x11a/0x1b0
[ 294.367201] kfd_process_notifier_release+0x8b/0x160 [amdgpu]
[ 294.367338] __mmu_notifier_release+0x73/0x210
[ 294.367342] exit_mmap+0x1ad/0x1f0
[ 294.367345] ? delayacct_add_tsk+0x63/0x1b0
[ 294.367347] ? exit_robust_list+0x5c/0x140
[ 294.367349] ? __cond_resched+0x16/0x50
[ 294.367351] ? mutex_lock+0xe/0x30
[ 294.367353] mmput+0x5a/0x140
[ 294.367356] do_exit+0x2f0/0xa40
[ 294.367357] do_group_exit+0x33/0xa0
[ 294.367358] get_signal+0x14a/0x910
[ 294.367360] arch_do_signal_or_restart+0x101/0x730
[ 294.367363] ? do_send_sig_info+0x6b/0xc0
[ 294.367364] ? do_tkill+0x88/0xb0
[ 294.367365] exit_to_user_mode_prepare+0x10e/0x230
[ 294.367367] syscall_exit_to_user_mode+0x18/0x40
[ 294.367369] do_syscall_64+0x48/0x90
[ 294.367371] entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 294.367373] RIP: 0033:0x7fb1e899cbc7
[ 294.367389] Code: Unable to access opcode bytes at RIP 0x7fb1e899cb9d.
[ 294.367389] RSP: 002b:00007fb0193deb30 EFLAGS: 00000246 ORIG_RAX: 00000000000000ea
[ 294.367390] RAX: 0000000000000000 RBX: 000000000000176f RCX: 00007fb1e899cbc7
[ 294.367391] RDX: 0000000000000006 RSI: 000000000000176f RDI: 0000000000001756
[ 294.367392] RBP: 0000000001e90d08 R08: 00007fb0193df948 R09: 0000000000000020
[ 294.367392] R10: 0000000000000008 R11: 0000000000000246 R12: 00007fb0193ded58
[ 294.367393] R13: 0000000000000000 R14: 0000000000000006 R15: 0000000001e90d88
[ 294.367394] </TASK>
[ 294.367394] ---[ end trace 511b8352d6af64c6 ]---
[ 294.382835] ------------[ cut here ]------------
[ 294.382836] WARNING: CPU: 10 PID: 1650 at drivers/gpu/drm/ttm/ttm_bo.c:409 ttm_bo_release+0x2db/0x300 [ttm]
[ 294.382843] Modules linked in: af_packet nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter overlay xt_CHECKSUM xt_MASQUERADE ipt_REJECT nf_reject_ipv4 nft_chain_nat nf_nat rfkill nls_iso8859_1 nls_cp437 vfat fat ip6_tables snd_hda_codec_realtek xt_conntrack nf_conntrack snd_hda_codec_generic nf_defrag_ipv6 ledtrig_audio led_class nf_defrag_ipv4 snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi ip6t_rpfilter intel_rapl_msr ipt_rpfilter snd_hda_codec edac_mce_amd edac_core wmi_bmof snd_hda_core intel_rapl_common xt_pkttype crc32_pclmul ghash_clmulni_intel evdev snd_hwdep mac_hid aesni_intel xt_LOG snd_pcm nf_log_syslog r8169 libaes crypto_simd cryptd xt_tcpudp sp5100_tco watchdog realtek nft_compat snd_timer rapl mdio_devres nft_counter snd k10temp i2c_piix4 libphy wmi soundcore video gpio_amdpt tiny_power_button gpio_generic pinctrl_amd button acpi_cpufreq nf_tables libcrc32c nfnetlink sch_fq_codel ctr atkbd libps2 serio loop veth bridge stp llc tun
[ 294.382866] vboxnetflt(O) vboxnetadp(O) vboxdrv(O) kvm_amd ccp rng_core kvm irqbypass fuse deflate efi_pstore pstore configfs efivarfs dmi_sysfs ip_tables x_tables autofs4 ext4 crc32c_generic crc16 mbcache jbd2 sd_mod xhci_pci xhci_pci_renesas xhci_hcd ahci libahci libata usbcore nvme crc32c_intel scsi_mod nvme_core t10_pi crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common usb_common scsi_common rtc_cmos dm_mod amdgpu drm_ttm_helper ttm agpgart iommu_v2 gpu_sched i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm i2c_core backlight
[ 294.382882] CPU: 10 PID: 1650 Comm: kworker/10:3 Tainted: G W O 5.15.83 #1-NixOS
[ 294.382883] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450M Steel Legend, BIOS P4.00 05/06/2021
[ 294.382884] Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu]
[ 294.382993] RIP: 0010:ttm_bo_release+0x2db/0x300 [ttm]
[ 294.382996] Code: e8 9a 46 2e d8 e9 bb fd ff ff 49 8b 7e 98 b9 30 75 00 00 31 d2 be 01 00 00 00 e8 a0 68 2e d8 49 8b 46 e8 eb 9e 48 89 e8 eb 99 <0f> 0b e9 46 fd ff ff e8 99 44 2e d8 e9 ed fe ff ff be 03 00 00 00
[ 294.382997] RSP: 0018:ffffb6b381df7cb8 EFLAGS: 00010202
[ 294.382998] RAX: 0000000000000001 RBX: ffffb6b381df7d00 RCX: 0000000080400035
[ 294.382999] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff953eb5a531b8
[ 294.382999] RBP: ffff953e8a285240 R08: ffff953eb5a531b8 R09: 0000000000000000
[ 294.383000] R10: ffff953e9e038540 R11: 0000000000000000 R12: ffff953eaffb7e30
[ 294.383000] R13: ffff953eb5a53058 R14: ffff953eb5a531b8 R15: dead000000000100
[ 294.383001] FS: 0000000000000000(0000) GS:ffff95418e280000(0000) knlGS:0000000000000000
[ 294.383002] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 294.383002] CR2: 00007fb0193dbff8 CR3: 000000004be10000 CR4: 0000000000750ee0
[ 294.383003] PKRU: 55555554
[ 294.383003] Call Trace:
[ 294.383005] <TASK>
[ 294.383006] amdgpu_bo_unref+0x1a/0x30 [amdgpu]
[ 294.383071] amdgpu_gem_object_free+0x30/0x50 [amdgpu]
[ 294.383135] amdgpu_amdkfd_gpuvm_free_memory_of_gpu+0x34f/0x3c0 [amdgpu]
[ 294.383211] kfd_process_device_free_bos+0x9d/0xe0 [amdgpu]
[ 294.383281] kfd_process_wq_release+0x20d/0x2d0 [amdgpu]
[ 294.383348] process_one_work+0x1f1/0x390
[ 294.383351] worker_thread+0x53/0x3e0
[ 294.383352] ? process_one_work+0x390/0x390
[ 294.383353] kthread+0x127/0x150
[ 294.383354] ? set_kthread_struct+0x50/0x50
[ 294.383355] ret_from_fork+0x22/0x30
[ 294.383357] </TASK>
[ 294.383358] ---[ end trace 511b8352d6af64c7 ]---
And this is your script output:
(shell:impure) lucasew@whiterun ~/demo-hip-issue 0$ python test-pytorch
CUDA support: True (Should be "True")
CUDA version: None (Should be "None")
HIP version: 5.4.22802-0 (Should contain "5.4")
Current CUDA device ID: 0
Current CUDA device name: AMD Radeon Graphics (Should be AMD, not NVIDIA)
Segmentation fault (core dumped)
So it's not torch itself, the commit, or nixpkgs then, everything as far as torch goes matches up. I honestly would suggest you take this up with AMD, the closest thing I can think of considering all the errors I've seen would be https://github.com/RadeonOpenCompute/ROCm-Device-Libs. You're still using the machine with the 5600G right?
I do have my user in the "video" and "render" groups, just in case that solves your issue, but I doubt it. https://www.gabriel.urdhr.fr/2022/08/28/trying-to-run-stable-diffusion-on-amd-ryzen-5-5600g also suggests adding your user to "render".
My user wasn't in the video and render groups at that time, so I added it. Same problem.
And yeah, a 5600G on a B450, less than a year old, and I got it working with Blender.
BTW those segfaults are hell to debug.
I think I got something \o/
(shell:impure) lucasew@whiterun ~/demo-hip-issue 139$ HSA_OVERRIDE_GFX_VERSION=9.0.0 ./test-pytorch
CUDA support: True (Should be "True")
CUDA version: None (Should be "None")
HIP version: 5.4.22802-0 (Should contain "5.4")
Current CUDA device ID: 0
Current CUDA device name: AMD Radeon Graphics (Should be AMD, not NVIDIA)
mul_sum(x, x): 131.0 us
mul_sum(x, x): 9.2 us
bmm(x, x): 330.2 us
bmm(x, x): 18.9 us
Ahh so it was `HSA_OVERRIDE_GFX_VERSION=9.0.0` and maybe the `render` group after all.
Try both of those (and `video`, for good measure) with the docker image; theoretically it should work.
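One detail worth calling out: the override has to be in the environment before torch initializes HIP. A minimal sketch (assuming a ROCm build of torch, which reads this variable at HIP initialization; the session shown is hypothetical):

```python
import os

# The override must be present before `import torch`, because HIP reads it
# once at initialization. Setting it after import has no effect.
os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "9.0.0")

# import torch  # torch would now select the gfx900 code objects
print(os.environ["HSA_OVERRIDE_GFX_VERSION"])  # -> 9.0.0
```

Equivalently, `export HSA_OVERRIDE_GFX_VERSION=9.0.0` in the shell before launching Python, as done above.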
Tried to replicate with a fresh reboot.
Same result.
We got it :clinking_glasses:
For the record, whiterun is running https://github.com/lucasew/nixcfg/commit/d98b0e24a0e17527457badfa221cff630e53ac26 and I added the group definitions in the `bootstrap` node, so it propagates to all the others.
Glad we got it working! Gonna close since this isn't a nixpkgs issue, but if there's anything else I can help with, let me know.
Well, the issue is actually about the official containers. These are still not working.
The official docker containers, right? That's not nixpkgs-related.
I'm not sure why those wouldn't be working.
Maybe docker itself needs to be added to `video` and `render` in your nix config?
...unless this is related to your `nix shell` issue, but I don't see how that could be...
You could also try adding `--ipc=host` to your docker arguments.
See: https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Install-and-Run-on-AMD-GPUs
The example that I got working is based on `nix-shell`, not `nix shell`.
`--ipc=host` is already there.
The full docker run command is `docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined 67a4`.
67a4 is a container generated from the `rocm/pytorch` image, but with the user added to the render group.
BTW, that `torch.tensor([[1,2],[3,4]]).to(torch.device('cuda'))` is still failing.
Ugh, reading comprehension again... Anyway, I've gotten the stable diffusion (webui) docker container working, so I'm not sure why the pytorch one isn't. I'm afraid I'm out of ideas as far as docker goes. I still don't think this is a nixpkgs issue, but in case it's an issue with docker... cc (docker maintainers) @offlinehacker @tailhook @vdemeester @periklis @mikroskeem @maxeaubrey
BTW, that `torch.tensor([[1,2],[3,4]]).to(torch.device('cuda'))` is still failing.
With `torchWithRocm`, right? It works for me, though.
I don't think Docker gets in the way much here anymore, because the right device nodes appear to be bound from the host, and the stock seccomp profile that could block syscalls is disabled as well (`seccomp=unconfined`). The `docker run` configuration follows what the upstream wiki says; unless that's out of date, it should work exactly the same.
Have you looked into the stable-diffusion-webui issues about those segfaults? Maybe these give a few pointers:
https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/6032
https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/4870
I'm afraid I'm not able to give much insight about this myself, as I don't have CUDA/ROCm capable GPU (...unless Steam Deck APU?).
I'm afraid I'm not able to give much insight about this myself, as I don't have CUDA/ROCm capable GPU (...unless Steam Deck APU?).
Well, isn't the steam deck gpu basically an RDNA2 GPU? That should work.
@Madouura what's your hardware, and where do you define the GPU stuff in your config? I may have made mistakes in mine. But yeah, it's based on that staging commit.
Hopefully this is enough. One is a 6900 XT, the other a 6800. These should be relevant:
Wait a minute... The likely reason our `torch` works while the official docker image doesn't is probably this...
https://github.com/NixOS/nixpkgs/blob/0f0929f4aa73b731130be5f9ebe7426eb4c0661d/pkgs/development/libraries/rocclr/default.nix#L19-L27
IIRC shouldn't the 5600g be gfx8? If so, that's definitely why.
The official docker image isn't an option for you.
Nope, I got that wrong. It's gfx9. See: https://github.com/RadeonOpenCompute/ROCm/blob/77cbac4abab13046ee93d8b5bf410684caf91145/README.md#library-target-matrix
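For reference, the override values discussed in this thread follow a simple convention: an LLVM gfx target name maps to a `major.minor.stepping` string, where the last two characters are single hex digits rendered in decimal. A hypothetical helper (not part of ROCm, just an illustration of the pattern) shows how `gfx90c` becomes `9.0.12`:

```python
# Hypothetical helper: derive an HSA_OVERRIDE_GFX_VERSION string from an
# LLVM gfx target name. Everything but the last two characters is the major
# version; the last two are the minor version and stepping, each a single
# hex digit converted to decimal.
def gfx_to_hsa_override(target: str) -> str:
    digits = target.removeprefix("gfx")   # e.g. "90c" or "1030"
    major, minor, step = digits[:-2], digits[-2], digits[-1]
    return f"{int(major)}.{int(minor, 16)}.{int(step, 16)}"

print(gfx_to_hsa_override("gfx90c"))   # -> 9.0.12
print(gfx_to_hsa_override("gfx1030"))  # -> 10.3.0
print(gfx_to_hsa_override("gfx900"))   # -> 9.0.0
```

Which code objects a given binary actually ships for each target is a separate question; the linked target matrix is the authoritative list.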
I just updated my kernel to linuxPackages_6_0. I was using the default (5.15).
It seems that the stuff is working now, even the container.
I suppose this issue can be closed now?
I just want to test tensorflow first. But if the ROCm layer is known to be working, then I suppose there's no more work needed from you on this issue. Thank you guys, you are awesome.
Looks like there were some AMD changes in 6.0, go figure. Glad we could help.
Describe the bug
I have a Ryzen 5600G APU and I am trying to use TensorFlow or PyTorch for some machine learning work. With either one, I am just trying to make it recognize the GPU and be usable; so far I was only able to use the GPU in Blender, with blender-hip or via a workaround with blender-bin.
Steps To Reproduce
Steps to reproduce the behavior:
For PyTorch
docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined rocm/pytorch:latest
python
import torch
"hipErrorNoBinaryForGpu: Unable to find code object for all current devices!" Aborted (core dumped)
For TensorFlow
docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined rocm/tensorflow:latest
python
import tensorflow as tf
tf.config.list_physical_devices()
"hipErrorNoBinaryForGpu: Unable to find code object for all current devices!" Aborted (core dumped)
If I do an `export HSA_OVERRIDE_GFX_VERSION=10.3.0` and do any activity that actually uses the GPU, like `torch.tensor([[1,2],[3,4]]).to(torch.device('cuda'))`, it crashes and dmesg shows the following:
Expected behavior
Machine learning working the same as it would in Google Colab, I guess.
Screenshots
If applicable, add screenshots to help explain your problem.
Additional context
Nixcfg revision used to replicate the issue: https://github.com/lucasew/nixcfg/tree/ff430dc0992d9247989f739a326536f87e345d98/nodes/whiterun
A PC with an i5 6400 + RX460 has the same problem, but I no longer have access to it to test any fixes.
Notify maintainers
@NixOS/rocm-maintainers
Metadata
Please run
nix-shell -p nix-info --run "nix-info -m"
and paste the result.