Closed lucasew closed 1 year ago
Can you use the hip from nixos-unstable and tell me if it still gives you that error? Looking through my PRs concerning the ROCm packages, I can't find anything that could cause this aside from possibly #206421, and that's not in master yet. Also try using hip from staging (#206421) if you can, and see if that works. From what I can see, you're using a Docker container, and that should have its own HIP, which may be the problem instead of nixpkgs' hip.
I should also mention that I am working on native ROCm support for pytorch and tensorflow in nixpkgs so you don't need to use those docker containers, but that's going to take some time.
Also try `export HSA_OVERRIDE_GFX_VERSION=9.0.0` instead.
As far as I can see, a Ryzen 5600G has a Vega GPU (gfx9), so I'm not surprised that everything crashes when you force gfx10.3 behavior (two generations later) with `HSA_OVERRIDE_GFX_VERSION=10.3.0` :)
It seems to be a gfx90c card, so `HSA_OVERRIDE_GFX_VERSION=9.0.12` should be more correct.
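For context, the override string encodes the gfx target's major version, minor version, and stepping, with the stepping read as hex (so gfx90c maps to 9.0.12). This helper is my own illustration of that naming convention, not an official ROCm API:

```python
def gfx_to_override(gfx_name: str) -> str:
    """Convert an LLVM gfx target name into an HSA_OVERRIDE_GFX_VERSION string.

    The last two characters of the target are the minor version and the
    stepping (a hex digit); everything before them is the major version.
    """
    ver = gfx_name.removeprefix("gfx")
    major, minor, stepping = ver[:-2], ver[-2], ver[-1]
    return f"{int(major)}.{int(minor)}.{int(stepping, 16)}"

print(gfx_to_override("gfx90c"))   # -> 9.0.12
print(gfx_to_override("gfx1030"))  # -> 10.3.0
```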
> As far as I see, a Ryzen 5600G has a Vega GPU (gfx9), so I'm not surprised that everything crashes when you force gfx10.3 behavior (two generations later) with `HSA_OVERRIDE_GFX_VERSION=10.3.0` :)
> It seems to be a gfx90c card, so `HSA_OVERRIDE_GFX_VERSION=9.0.12` should be more correct.
About this generation thing, I have no idea what I'm doing xD. I just saw people mentioning it on the Internet and decided to try.
> Can you use the hip from nixos-unstable and tell me if it still gives you that error? Looking through my PRs concerning the ROCm packages I can't find anything that could cause this aside from possibly #206421, and that's not in master yet. Also try using hip from staging (#206421) if you can, see if that works. From what I can see, you're using a docker container and that should have its own hip, which may be the problem instead of nixpkgs' hip.
Switched to latest unstable rn.

With `HSA_OVERRIDE_GFX_VERSION=9.0.12` and `HSA_OVERRIDE_GFX_VERSION=9.0.0`:

```
>>> import torch
"hipErrorNoBinaryForGpu: Unable to find code object for all current devices!"
Aborted (core dumped)
```

With `HSA_OVERRIDE_GFX_VERSION=10.3.0`:
[ 306.174866] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:221 vmid:8 pasid:32771, for process python pid 2315 thread python pid 2315)
[ 306.174872] amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[ 306.174879] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x008012B1
[ 306.174881] amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: SQC (inst) (0x9)
[ 306.174882] amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x1
[ 306.174883] amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
[ 306.174884] amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0xb
[ 306.174885] amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 306.174886] amdgpu 0000:07:00.0: amdgpu: RW: 0x0
[ 306.174889] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:221 vmid:8 pasid:32771, for process python pid 2315 thread python pid 2315)
[ 306.174891] amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[ 306.174898] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 306.174899] amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 306.174900] amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x0
[ 306.174901] amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
[ 306.174902] amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 306.174903] amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 306.174904] amdgpu 0000:07:00.0: amdgpu: RW: 0x0
[ 306.174906] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:221 vmid:8 pasid:32771, for process python pid 2315 thread python pid 2315)
[ 306.174907] amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[ 306.174914] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 306.174915] amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 306.174916] amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x0
[ 306.174917] amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
[ 306.174918] amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 306.174918] amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 306.174919] amdgpu 0000:07:00.0: amdgpu: RW: 0x0
[ 306.174922] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 2315 thread python pid 2315)
[ 306.174924] amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[ 306.174931] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 306.174931] amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 306.174932] amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x0
[ 306.174933] amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
[ 306.174934] amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 306.174935] amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 306.174936] amdgpu 0000:07:00.0: amdgpu: RW: 0x0
[ 306.174937] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 2315 thread python pid 2315)
[ 306.174939] amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[ 306.174945] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 306.174946] amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 306.174947] amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x0
[ 306.174948] amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
[ 306.174949] amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 306.174950] amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 306.174951] amdgpu 0000:07:00.0: amdgpu: RW: 0x0
[ 306.174952] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 2315 thread python pid 2315)
[ 306.174954] amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[ 306.174960] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 306.174961] amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 306.174962] amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x0
[ 306.174963] amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
[ 306.174964] amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 306.174965] amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 306.174965] amdgpu 0000:07:00.0: amdgpu: RW: 0x0
[ 306.174967] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 2315 thread python pid 2315)
[ 306.174968] amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[ 306.174975] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 306.174976] amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 306.174977] amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x0
[ 306.174977] amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
[ 306.174978] amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 306.174979] amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 306.174980] amdgpu 0000:07:00.0: amdgpu: RW: 0x0
[ 306.174981] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 2315 thread python pid 2315)
[ 306.174983] amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[ 306.174989] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 306.174990] amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 306.174991] amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x0
[ 306.174992] amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
[ 306.174993] amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 306.174994] amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 306.174995] amdgpu 0000:07:00.0: amdgpu: RW: 0x0
[ 310.174910] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[ 310.174915] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 1, cu_id 4, err_type 2
[ 310.174918] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 2, cu_id 4, err_type 2
[ 310.174919] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 3, cu_id 4, err_type 2
[ 310.174920] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 0, cu_id 4, err_type 2
[ 310.174921] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 1, cu_id 3, err_type 2
[ 310.174922] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 2, cu_id 3, err_type 2
[ 310.174923] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 1, cu_id 6, err_type 2
[ 310.174923] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 2, cu_id 6, err_type 2
[ 310.174924] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 3, cu_id 6, err_type 2
[ 310.174925] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 0, cu_id 6, err_type 2
[ 310.174926] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 1, cu_id 5, err_type 2
[ 310.174927] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 2, cu_id 5, err_type 2
[ 310.174927] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 3, cu_id 5, err_type 2
[ 310.174928] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 0, cu_id 5, err_type 2
[ 310.174929] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 3, cu_id 3, err_type 2
[ 310.174930] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 0, cu_id 3, err_type 2
[ 310.174931] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 1, cu_id 2, err_type 2
[ 310.174931] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 2, cu_id 2, err_type 2
[ 310.174932] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 0, cu_id 2, err_type 2
[ 310.174933] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 3, cu_id 2, err_type 2
[ 310.174934] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 1, cu_id 1, err_type 2
[ 310.174935] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 2, cu_id 1, err_type 2
[ 310.174936] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 3, cu_id 1, err_type 2
[ 310.174936] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 0, cu_id 1, err_type 2
[ 310.174937] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 1, cu_id 0, err_type 2
[ 310.174938] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 2, cu_id 0, err_type 2
[ 310.174939] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 3, cu_id 0, err_type 2
[ 310.174940] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 0, cu_id 0, err_type 2
[ 312.816528] ------------[ cut here ]------------
[ 312.816531] WARNING: CPU: 2 PID: 2329 at kernel/workqueue.c:3083 flush_work.isra.0+0x21f/0x230
[ 312.816537] Modules linked in: af_packet nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter overlay xt_CHECKSUM xt_MASQUERADE ipt_REJECT nf_reject_ipv4 nft_chain_nat nf_nat rfkill nls_iso8859_1 nls_cp437 vfat fat ip6_tables xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 snd_hda_codec_realtek ip6t_rpfilter ipt_rpfilter snd_hda_codec_generic ledtrig_audio led_class snd_hda_codec_hdmi xt_pkttype xt_LOG nf_log_syslog xt_tcpudp nft_compat snd_hda_intel snd_intel_dspcfg nft_counter intel_rapl_msr snd_intel_sdw_acpi snd_hda_codec edac_mce_amd evdev wmi_bmof mac_hid edac_core intel_rapl_common snd_hda_core crc32_pclmul ghash_clmulni_intel aesni_intel snd_hwdep snd_pcm libaes crypto_simd r8169 cryptd nf_tables rapl realtek snd_timer libcrc32c mdio_devres sp5100_tco watchdog snd sch_fq_codel nfnetlink libphy soundcore k10temp i2c_piix4 video gpio_amdpt gpio_generic pinctrl_amd tiny_power_button wmi acpi_cpufreq button ctr atkbd libps2 serio loop veth bridge stp llc tun
[ 312.816570] vboxnetflt(O) vboxnetadp(O) vboxdrv(O) kvm_amd ccp rng_core kvm irqbypass fuse deflate efi_pstore pstore configfs efivarfs dmi_sysfs ip_tables x_tables autofs4 ext4 crc32c_generic crc16 mbcache jbd2 sd_mod xhci_pci xhci_pci_renesas xhci_hcd ahci libahci libata nvme usbcore crc32c_intel scsi_mod nvme_core t10_pi crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common usb_common scsi_common rtc_cmos dm_mod amdgpu drm_ttm_helper ttm agpgart iommu_v2 gpu_sched i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm i2c_core backlight
[ 312.816595] CPU: 2 PID: 2329 Comm: python Tainted: G W O 5.15.83 #1-NixOS
[ 312.816597] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450M Steel Legend, BIOS P4.00 05/06/2021
[ 312.816598] RIP: 0010:flush_work.isra.0+0x21f/0x230
[ 312.816600] Code: 8b 4d 00 4c 8b 45 08 89 ca 48 c1 e9 04 83 e2 08 83 e1 0f 83 ca 02 89 c8 48 0f ba 6d 00 03 e9 13 ff ff ff 0f 0b e9 45 ff ff ff <0f> 0b 45 31 ed e9 3b ff ff ff e8 e2 31 81 00 66 90 0f 1f 44 00 00
[ 312.816601] RSP: 0018:ffffb14001cb7b28 EFLAGS: 00010246
[ 312.816602] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 312.816603] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff92872a69ab18
[ 312.816604] RBP: ffff92872a69ab18 R08: 0000000000000000 R09: ffffffff96450b50
[ 312.816604] R10: 0000000000000000 R11: 0000000000000000 R12: ffff92872a69ab18
[ 312.816605] R13: 0000000000000001 R14: 0000000000000003 R15: ffff928705e5272c
[ 312.816606] FS: 0000000000000000(0000) GS:ffff928a0e280000(0000) knlGS:0000000000000000
[ 312.816606] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 312.816607] CR2: 0000000000d133a0 CR3: 000000012388a000 CR4: 0000000000750ee0
[ 312.816608] PKRU: 55555554
[ 312.816609] Call Trace:
[ 312.816611] <TASK>
Edit 1: I am now switching it to staging. The build hasn't started screaming (yet).
> About this generation thing I have no idea what I am doing xD just saw people mentioning this on the Internet and decided to try.
I'm in the same boat, it's how #197885 started lol.
Anyway, I think I gave you bad advice. While you should still try staging and the other things, please try Flakebi's suggestion first, as it's likely the actual problem.
Nevermind, there it is, my bad reading comprehension again.
> rocm/pytorch:latest

Try without the `latest` tag; again, this should just be an issue with the Docker container.
Same problem on staging
I haven't gotten tensorflow working yet, but you should be able to use pytorch once the next staging-next cycle and #206995 are merged.
If you wanna test now, see: https://github.com/Madouura/nixpkgs/commit/df71e711026a37178f9a258f236db0e1a66e2f0b
You may need to add `roctracer` and `rccl` to `LD_LIBRARY_PATH`.
I think I found a bug in `nix shell`:

```
lucasew@whiterun ~ 0$ nix shell github:Madouura/nixpkgs/df71e711026a37178f9a258f236db0e1a66e2f0b#legacyPackages.x86_64-linux.{python3Packages.torchWithRocm,roctracer,rccl,python3} -c python
Python 3.10.9 (main, Dec  6 2022, 18:44:57) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'torch'
```
I haven't gotten that problem, I may have linked you a bad build. Try https://github.com/Madouura/nixpkgs/commit/f6d4e98b49a52fe564b832e20527b527fa2c90a6.
Oh, this is interesting. I didn't realize `nix shell` was supposed to propagate. That explains a lot and may be linked to some of the issues I've had in #206995.
Tested with the following shell.nix (workaround for that issue):

```nix
{ pkgs ? import (builtins.fetchTarball "https://github.com/Madouura/nixpkgs/archive/f6d4e98b49a52fe564b832e20527b527fa2c90a6.tar.gz") {} }:
pkgs.mkShell {
  buildInputs = with pkgs; [
    python3Packages.torchWithRocm
  ];
}
```
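Another way to sketch roughly the same environment (untested, and assuming `torchWithRocm` exists in `python3Packages` at that commit) is to bake torch directly into the interpreter with `python3.withPackages`, so nothing depends on environment propagation:

```nix
{ pkgs ? import (builtins.fetchTarball "https://github.com/Madouura/nixpkgs/archive/f6d4e98b49a52fe564b832e20527b527fa2c90a6.tar.gz") {} }:
pkgs.mkShell {
  # withPackages builds a python whose site-packages already contains
  # torch, so no PATH/PYTHONPATH propagation from the shell is needed.
  buildInputs = [
    (pkgs.python3.withPackages (ps: [ ps.torchWithRocm ]))
  ];
}
```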
Same problem as with the container so far. But I returned to stable; I will try with the latest staging commit.
Try this:

```
nix-shell -I nixpkgs=${nixpkgs-at-f6d4e98b49a52fe564b832e20527b527fa2c90a6} -p python3Packages.torchWithRocm
python ./benchmark.py
```
```python
import torch, timeit

print(f"CUDA support: {torch.cuda.is_available()} (Should be \"True\")")
print(f"CUDA version: {torch.version.cuda} (Should be \"None\")")
print(f"HIP version: {torch.version.hip} (Should contain \"5.4\")")

# Storing ID of current CUDA device
cuda_id = torch.cuda.current_device()
print(f"Current CUDA device ID: {torch.cuda.current_device()}")
print(f"Current CUDA device name: {torch.cuda.get_device_name(cuda_id)} (Should be AMD, not NVIDIA)")

def batched_dot_mul_sum(a, b):
    '''Computes batched dot by multiplying and summing'''
    return a.mul(b).sum(-1)

def batched_dot_bmm(a, b):
    '''Computes batched dot by reducing to bmm'''
    a = a.reshape(-1, 1, a.shape[-1])
    b = b.reshape(-1, b.shape[-1], 1)
    return torch.bmm(a, b).flatten(-3)

x = torch.randn(10000, 1024, device='cuda')

t0 = timeit.Timer(
    stmt='batched_dot_mul_sum(x, x)',
    setup='from __main__ import batched_dot_mul_sum',
    globals={'x': x})

t1 = timeit.Timer(
    stmt='batched_dot_bmm(x, x)',
    setup='from __main__ import batched_dot_bmm',
    globals={'x': x})

# Run each twice to show the difference before/after warmup
print(f'mul_sum(x, x): {t0.timeit(100) / 100 * 1e6:>5.1f} us')
print(f'mul_sum(x, x): {t0.timeit(100) / 100 * 1e6:>5.1f} us')
print(f'bmm(x, x):     {t1.timeit(100) / 100 * 1e6:>5.1f} us')
print(f'bmm(x, x):     {t1.timeit(100) / 100 * 1e6:>5.1f} us')
```
If everything is working, the output should match what's in the parentheses, and if you have something like `corectrl`, you'll see a GPU frequency spike while it runs.
If that still doesn't work, it may honestly just be that the Ryzen 5600G isn't supported. It theoretically should be, though, since it's Vega IIRC.
@Flakebi If you have an AMD GPU, could you run this check/benchmark as well, to confirm it isn't working only for me?
Same problem.
Built my NixOS config against staging right after #206421 was merged, because the latest staging failed mid-build due to an unrelated package.
This is the shell.nix I am using to provision torch based on the commit you mentioned:
```nix
let
  nixpkgs = builtins.fetchTarball "https://github.com/NixOS/nixpkgs/archive/f6d4e98b49a52fe564b832e20527b527fa2c90a6.tar.gz";
  pkgs = import nixpkgs { };
in pkgs.mkShell {
  buildInputs = with pkgs; [ python3Packages.torchWithRocm ];
}
```
This is my Python prompt after entering the shell.nix above with `nix-shell`:

```
lucasew@whiterun ~/demo-hip-issue 0$ nix-shell
(shell:impure) lucasew@whiterun ~/demo-hip-issue 0$ python
Python 3.10.9 (main, Dec  6 2022, 18:44:57) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.tensor([[1,2],[3,4]]).to(torch.device('cuda')
...
... )
Memory access fault by GPU node-1 (Agent handle: 0x7817470) on address 0x735d000. Reason: Unknown.
Aborted (core dumped)
```
Whiterun is running https://github.com/lucasew/nixcfg/tree/811c58b6b9c743fab692fb3fc7817ded83974b6c
And this is what I got in the dmesg right after I ran that Python snippet.
[ 292.842655] gmc_v9_0_process_interrupt: 34 callbacks suppressed
[ 292.842658] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:157 vmid:8 pasid:32771, for process python pid 5974 thread python pid 5974)
[ 292.842662] amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x000000000735d000 from IH client 0x1b (UTCL2)
[ 292.842670] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00801031
[ 292.842670] amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
[ 292.842671] amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x1
[ 292.842672] amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
[ 292.842672] amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[ 292.842673] amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 292.842673] amdgpu 0000:07:00.0: amdgpu: RW: 0x0
[ 292.842675] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 5974 thread python pid 5974)
[ 292.842677] amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x000000000735d000 from IH client 0x1b (UTCL2)
[ 292.842683] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 292.842684] amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 292.842684] amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x0
[ 292.842685] amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
[ 292.842685] amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 292.842686] amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 292.842686] amdgpu 0000:07:00.0: amdgpu: RW: 0x0
[ 292.842687] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 5974 thread python pid 5974)
[ 292.842689] amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x000000000735d000 from IH client 0x1b (UTCL2)
[ 292.842695] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 292.842695] amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 292.842696] amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x0
[ 292.842696] amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
[ 292.842697] amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 292.842697] amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 292.842698] amdgpu 0000:07:00.0: amdgpu: RW: 0x0
[ 292.842698] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 5974 thread python pid 5974)
[ 292.842699] amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x000000000735d000 from IH client 0x1b (UTCL2)
[ 292.842705] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 292.842706] amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 292.842706] amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x0
[ 292.842707] amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
[ 292.842707] amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 292.842708] amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 292.842708] amdgpu 0000:07:00.0: amdgpu: RW: 0x0
[ 292.842709] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 5974 thread python pid 5974)
[ 292.842710] amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x000000000735d000 from IH client 0x1b (UTCL2)
[ 292.842716] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 292.842716] amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 292.842717] amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x0
[ 292.842717] amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
[ 292.842718] amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 292.842718] amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 292.842719] amdgpu 0000:07:00.0: amdgpu: RW: 0x0
[ 292.842720] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 5974 thread python pid 5974)
[ 292.842721] amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x000000000735d000 from IH client 0x1b (UTCL2)
[ 292.842726] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 292.842727] amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 292.842728] amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x0
[ 292.842728] amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
[ 292.842728] amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 292.842729] amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 292.842729] amdgpu 0000:07:00.0: amdgpu: RW: 0x0
[ 292.842730] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 5974 thread python pid 5974)
[ 292.842731] amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x000000000735d000 from IH client 0x1b (UTCL2)
[ 292.842737] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 292.842738] amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 292.842738] amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x0
[ 292.842739] amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
[ 292.842739] amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 292.842740] amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 292.842740] amdgpu 0000:07:00.0: amdgpu: RW: 0x0
[ 292.842741] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 5974 thread python pid 5974)
[ 292.842742] amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x000000000735d000 from IH client 0x1b (UTCL2)
[ 292.842745] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 292.842745] amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 292.842746] amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x0
[ 292.842746] amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
[ 292.842747] amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 292.842747] amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 292.842748] amdgpu 0000:07:00.0: amdgpu: RW: 0x0
[ 292.842750] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 5974 thread python pid 5974)
[ 292.842751] amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x000000000735d000 from IH client 0x1b (UTCL2)
[ 292.842754] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 292.842754] amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 292.842755] amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x0
[ 292.842755] amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
[ 292.842756] amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 292.842756] amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 292.842757] amdgpu 0000:07:00.0: amdgpu: RW: 0x0
[ 292.842758] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 5974 thread python pid 5974)
[ 292.842759] amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x000000000735d000 from IH client 0x1b (UTCL2)
[ 292.842762] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 292.842763] amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 292.842763] amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x0
[ 292.842764] amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
[ 292.842764] amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 292.842765] amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 292.842765] amdgpu 0000:07:00.0: amdgpu: RW: 0x0
[ 294.367109] ------------[ cut here ]------------
[ 294.367112] WARNING: CPU: 10 PID: 5999 at kernel/workqueue.c:3083 __flush_work.isra.0+0x21f/0x230
[ 294.367118] Modules linked in: af_packet nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter overlay xt_CHECKSUM xt_MASQUERADE ipt_REJECT nf_reject_ipv4 nft_chain_nat nf_nat rfkill nls_iso8859_1 nls_cp437 vfat fat ip6_tables snd_hda_codec_realtek xt_conntrack nf_conntrack snd_hda_codec_generic nf_defrag_ipv6 ledtrig_audio led_class nf_defrag_ipv4 snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi ip6t_rpfilter intel_rapl_msr ipt_rpfilter snd_hda_codec edac_mce_amd edac_core wmi_bmof snd_hda_core intel_rapl_common xt_pkttype crc32_pclmul ghash_clmulni_intel evdev snd_hwdep mac_hid aesni_intel xt_LOG snd_pcm nf_log_syslog r8169 libaes crypto_simd cryptd xt_tcpudp sp5100_tco watchdog realtek nft_compat snd_timer rapl mdio_devres nft_counter snd k10temp i2c_piix4 libphy wmi soundcore video gpio_amdpt tiny_power_button gpio_generic pinctrl_amd button acpi_cpufreq nf_tables libcrc32c nfnetlink sch_fq_codel ctr atkbd libps2 serio loop veth bridge stp llc tun
[ 294.367154] vboxnetflt(O) vboxnetadp(O) vboxdrv(O) kvm_amd ccp rng_core kvm irqbypass fuse deflate efi_pstore pstore configfs efivarfs dmi_sysfs ip_tables x_tables autofs4 ext4 crc32c_generic crc16 mbcache jbd2 sd_mod xhci_pci xhci_pci_renesas xhci_hcd ahci libahci libata usbcore nvme crc32c_intel scsi_mod nvme_core t10_pi crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common usb_common scsi_common rtc_cmos dm_mod amdgpu drm_ttm_helper ttm agpgart iommu_v2 gpu_sched i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm i2c_core backlight
[ 294.367178] CPU: 10 PID: 5999 Comm: python Tainted: G O 5.15.83 #1-NixOS
[ 294.367180] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450M Steel Legend, BIOS P4.00 05/06/2021
[ 294.367181] RIP: 0010:__flush_work.isra.0+0x21f/0x230
[ 294.367183] Code: 8b 4d 00 4c 8b 45 08 89 ca 48 c1 e9 04 83 e2 08 83 e1 0f 83 ca 02 89 c8 48 0f ba 6d 00 03 e9 13 ff ff ff 0f 0b e9 45 ff ff ff <0f> 0b 45 31 ed e9 3b ff ff ff e8 e2 31 81 00 66 90 0f 1f 44 00 00
[ 294.367184] RSP: 0018:ffffb6b381d9fb28 EFLAGS: 00010246
[ 294.367186] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 294.367186] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff953eb5a54718
[ 294.367187] RBP: ffff953eb5a54718 R08: 0000000000000000 R09: ffffffff99650b50
[ 294.367187] R10: 0000000000000000 R11: 0000000000000000 R12: ffff953eb5a54718
[ 294.367188] R13: 0000000000000001 R14: 0000000000000003 R15: ffff953e98cb7bac
[ 294.367189] FS: 0000000000000000(0000) GS:ffff95418e280000(0000) knlGS:0000000000000000
[ 294.367190] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 294.367190] CR2: 00007fb0193dbff8 CR3: 00000001014b6000 CR4: 0000000000750ee0
[ 294.367191] PKRU: 55555554
[ 294.367192] Call Trace:
[ 294.367194] <TASK>
[ 294.367196] ? del_timer+0x55/0x80
[ 294.367199] __cancel_work_timer+0x11a/0x1b0
[ 294.367201] kfd_process_notifier_release+0x8b/0x160 [amdgpu]
[ 294.367338] __mmu_notifier_release+0x73/0x210
[ 294.367342] exit_mmap+0x1ad/0x1f0
[ 294.367345] ? delayacct_add_tsk+0x63/0x1b0
[ 294.367347] ? exit_robust_list+0x5c/0x140
[ 294.367349] ? __cond_resched+0x16/0x50
[ 294.367351] ? mutex_lock+0xe/0x30
[ 294.367353] mmput+0x5a/0x140
[ 294.367356] do_exit+0x2f0/0xa40
[ 294.367357] do_group_exit+0x33/0xa0
[ 294.367358] get_signal+0x14a/0x910
[ 294.367360] arch_do_signal_or_restart+0x101/0x730
[ 294.367363] ? do_send_sig_info+0x6b/0xc0
[ 294.367364] ? do_tkill+0x88/0xb0
[ 294.367365] exit_to_user_mode_prepare+0x10e/0x230
[ 294.367367] syscall_exit_to_user_mode+0x18/0x40
[ 294.367369] do_syscall_64+0x48/0x90
[ 294.367371] entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 294.367373] RIP: 0033:0x7fb1e899cbc7
[ 294.367389] Code: Unable to access opcode bytes at RIP 0x7fb1e899cb9d.
[ 294.367389] RSP: 002b:00007fb0193deb30 EFLAGS: 00000246 ORIG_RAX: 00000000000000ea
[ 294.367390] RAX: 0000000000000000 RBX: 000000000000176f RCX: 00007fb1e899cbc7
[ 294.367391] RDX: 0000000000000006 RSI: 000000000000176f RDI: 0000000000001756
[ 294.367392] RBP: 0000000001e90d08 R08: 00007fb0193df948 R09: 0000000000000020
[ 294.367392] R10: 0000000000000008 R11: 0000000000000246 R12: 00007fb0193ded58
[ 294.367393] R13: 0000000000000000 R14: 0000000000000006 R15: 0000000001e90d88
[ 294.367394] </TASK>
[ 294.367394] ---[ end trace 511b8352d6af64c6 ]---
[ 294.382835] ------------[ cut here ]------------
[ 294.382836] WARNING: CPU: 10 PID: 1650 at drivers/gpu/drm/ttm/ttm_bo.c:409 ttm_bo_release+0x2db/0x300 [ttm]
[ 294.382843] Modules linked in: af_packet nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter overlay xt_CHECKSUM xt_MASQUERADE ipt_REJECT nf_reject_ipv4 nft_chain_nat nf_nat rfkill nls_iso8859_1 nls_cp437 vfat fat ip6_tables snd_hda_codec_realtek xt_conntrack nf_conntrack snd_hda_codec_generic nf_defrag_ipv6 ledtrig_audio led_class nf_defrag_ipv4 snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi ip6t_rpfilter intel_rapl_msr ipt_rpfilter snd_hda_codec edac_mce_amd edac_core wmi_bmof snd_hda_core intel_rapl_common xt_pkttype crc32_pclmul ghash_clmulni_intel evdev snd_hwdep mac_hid aesni_intel xt_LOG snd_pcm nf_log_syslog r8169 libaes crypto_simd cryptd xt_tcpudp sp5100_tco watchdog realtek nft_compat snd_timer rapl mdio_devres nft_counter snd k10temp i2c_piix4 libphy wmi soundcore video gpio_amdpt tiny_power_button gpio_generic pinctrl_amd button acpi_cpufreq nf_tables libcrc32c nfnetlink sch_fq_codel ctr atkbd libps2 serio loop veth bridge stp llc tun
[ 294.382866] vboxnetflt(O) vboxnetadp(O) vboxdrv(O) kvm_amd ccp rng_core kvm irqbypass fuse deflate efi_pstore pstore configfs efivarfs dmi_sysfs ip_tables x_tables autofs4 ext4 crc32c_generic crc16 mbcache jbd2 sd_mod xhci_pci xhci_pci_renesas xhci_hcd ahci libahci libata usbcore nvme crc32c_intel scsi_mod nvme_core t10_pi crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common usb_common scsi_common rtc_cmos dm_mod amdgpu drm_ttm_helper ttm agpgart iommu_v2 gpu_sched i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm i2c_core backlight
[ 294.382882] CPU: 10 PID: 1650 Comm: kworker/10:3 Tainted: G W O 5.15.83 #1-NixOS
[ 294.382883] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450M Steel Legend, BIOS P4.00 05/06/2021
[ 294.382884] Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu]
[ 294.382993] RIP: 0010:ttm_bo_release+0x2db/0x300 [ttm]
[ 294.382996] Code: e8 9a 46 2e d8 e9 bb fd ff ff 49 8b 7e 98 b9 30 75 00 00 31 d2 be 01 00 00 00 e8 a0 68 2e d8 49 8b 46 e8 eb 9e 48 89 e8 eb 99 <0f> 0b e9 46 fd ff ff e8 99 44 2e d8 e9 ed fe ff ff be 03 00 00 00
[ 294.382997] RSP: 0018:ffffb6b381df7cb8 EFLAGS: 00010202
[ 294.382998] RAX: 0000000000000001 RBX: ffffb6b381df7d00 RCX: 0000000080400035
[ 294.382999] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff953eb5a531b8
[ 294.382999] RBP: ffff953e8a285240 R08: ffff953eb5a531b8 R09: 0000000000000000
[ 294.383000] R10: ffff953e9e038540 R11: 0000000000000000 R12: ffff953eaffb7e30
[ 294.383000] R13: ffff953eb5a53058 R14: ffff953eb5a531b8 R15: dead000000000100
[ 294.383001] FS: 0000000000000000(0000) GS:ffff95418e280000(0000) knlGS:0000000000000000
[ 294.383002] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 294.383002] CR2: 00007fb0193dbff8 CR3: 000000004be10000 CR4: 0000000000750ee0
[ 294.383003] PKRU: 55555554
[ 294.383003] Call Trace:
[ 294.383005] <TASK>
[ 294.383006] amdgpu_bo_unref+0x1a/0x30 [amdgpu]
[ 294.383071] amdgpu_gem_object_free+0x30/0x50 [amdgpu]
[ 294.383135] amdgpu_amdkfd_gpuvm_free_memory_of_gpu+0x34f/0x3c0 [amdgpu]
[ 294.383211] kfd_process_device_free_bos+0x9d/0xe0 [amdgpu]
[ 294.383281] kfd_process_wq_release+0x20d/0x2d0 [amdgpu]
[ 294.383348] process_one_work+0x1f1/0x390
[ 294.383351] worker_thread+0x53/0x3e0
[ 294.383352] ? process_one_work+0x390/0x390
[ 294.383353] kthread+0x127/0x150
[ 294.383354] ? set_kthread_struct+0x50/0x50
[ 294.383355] ret_from_fork+0x22/0x30
[ 294.383357] </TASK>
[ 294.383358] ---[ end trace 511b8352d6af64c7 ]---
And this is your script output:
(shell:impure) lucasew@whiterun ~/demo-hip-issue 0$ python test-pytorch
CUDA support: True (Should be "True")
CUDA version: None (Should be "None")
HIP version: 5.4.22802-0 (Should contain "5.4")
Current CUDA device ID: 0
Current CUDA device name: AMD Radeon Graphics (Should be AMD, not NVIDIA)
Segmentation fault (core dumped)
So it's not torch itself, the commit, or nixpkgs then, everything as far as torch goes matches up. I honestly would suggest you take this up with AMD, the closest thing I can think of considering all the errors I've seen would be https://github.com/RadeonOpenCompute/ROCm-Device-Libs. You're still using the machine with the 5600G right?
I do have my user in the "video" and "render" groups, just in case that solves your issue, but I doubt it. https://www.gabriel.urdhr.fr/2022/08/28/trying-to-run-stable-diffusion-on-amd-ryzen-5-5600g also suggests adding your user to "render".
My user wasn't in the video and render groups at that time, so I added it. Same problem.
And yeah, a 5600G on a B450, less than a year old, and I got it working with Blender.
BTW those segfaults are hell to debug.
I think I got something \o/
(shell:impure) lucasew@whiterun ~/demo-hip-issue 139$ HSA_OVERRIDE_GFX_VERSION=9.0.0 ./test-pytorch
CUDA support: True (Should be "True")
CUDA version: None (Should be "None")
HIP version: 5.4.22802-0 (Should contain "5.4")
Current CUDA device ID: 0
Current CUDA device name: AMD Radeon Graphics (Should be AMD, not NVIDIA)
mul_sum(x, x): 131.0 us
mul_sum(x, x): 9.2 us
bmm(x, x): 330.2 us
bmm(x, x): 18.9 us
Ahh so it was `HSA_OVERRIDE_GFX_VERSION=9.0.0` and maybe the `render` group after all.
Try both of those (and `video`, for good measure) with the docker image; theoretically it should work.
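One detail worth calling out: the override has to be in the environment before torch initializes HIP. A minimal sketch (assuming a ROCm build of torch, which reads this variable at HIP initialization; the session shown is hypothetical):

```python
import os

# The override must be present before `import torch`, because HIP reads it
# once at initialization. Setting it after import has no effect.
os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "9.0.0")

# import torch  # torch would now select the gfx900 code objects
print(os.environ["HSA_OVERRIDE_GFX_VERSION"])  # -> 9.0.0
```

Equivalently, `export HSA_OVERRIDE_GFX_VERSION=9.0.0` in the shell before launching Python, as done above.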
Tried to replicate with a fresh reboot.
Same result.
We got it :clinking_glasses:
For the record, whiterun is running https://github.com/lucasew/nixcfg/commit/d98b0e24a0e17527457badfa221cff630e53ac26 and I added the group definitions in the `bootstrap` node, so it propagates to all the others.
Glad we got it working! Gonna close since this isn't a nixpkgs issue, but if there's anything else I can help with, let me know.
Well, the issue is actually about the official containers. These are still not working.
The official docker containers, right? That's not nixpkgs-related.
I'm not sure why those wouldn't be working.
Maybe docker itself needs to be added to `video` and `render` in your nix config?
...unless this is related to your `nix shell` issue, but I don't see how that could be...
You could also try adding `--ipc=host` to your docker arguments.
See: https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Install-and-Run-on-AMD-GPUs
The example that I got working is based on `nix-shell`, not `nix shell`.
`--ipc=host` is already there.
The full docker run command is `docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined 67a4`.
67a4 is a container generated from the `rocm/pytorch` image, but with the user added to the render group.
BTW, that `torch.tensor([[1,2],[3,4]]).to(torch.device('cuda'))` is still failing.
Ugh, reading comprehension again... Anyway, I've gotten the stable diffusion (webui) docker container working, so I'm not sure why the pytorch one isn't. I'm afraid I'm out of ideas as far as docker goes. I still don't think this is a nixpkgs issue, but in case it's an issue with docker... cc (docker maintainers) @offlinehacker @tailhook @vdemeester @periklis @mikroskeem @maxeaubrey
BTW, that `torch.tensor([[1,2],[3,4]]).to(torch.device('cuda'))` is still failing.
With `torchWithRocm`, right? It works for me, though.
I don't think Docker gets in the way much here anymore, because the right device nodes appear to be bound from the host, and the stock seccomp profile that could block syscalls is disabled as well (`seccomp=unconfined`). The `docker run` configuration follows what the upstream wiki says; unless that's out of date, it should work exactly the same.
Have you looked into the stable-diffusion-webui issues about those segfaults? Maybe these give a few pointers:
https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/6032
https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/4870
I'm afraid I'm not able to give much insight about this myself, as I don't have CUDA/ROCm capable GPU (...unless Steam Deck APU?).
I'm afraid I'm not able to give much insight about this myself, as I don't have CUDA/ROCm capable GPU (...unless Steam Deck APU?).
Well, isn't the steam deck gpu basically an RDNA2 GPU? That should work.
@Madouura what's your hardware, and where do you define the GPU stuff in your config? I may have made mistakes in mine. But yeah, it's based on that staging commit.
Hopefully this is enough. One is a 6900 XT, the other a 6800. These should be relevant:
Wait a minute... The likely reason our `torch` works while the official docker image doesn't is probably this...
https://github.com/NixOS/nixpkgs/blob/0f0929f4aa73b731130be5f9ebe7426eb4c0661d/pkgs/development/libraries/rocclr/default.nix#L19-L27
IIRC shouldn't the 5600g be gfx8? If so, that's definitely why.
The official docker image isn't an option for you.
Nope, I got that wrong. It's gfx9. See: https://github.com/RadeonOpenCompute/ROCm/blob/77cbac4abab13046ee93d8b5bf410684caf91145/README.md#library-target-matrix
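For reference, the override values discussed in this thread follow a simple convention: an LLVM gfx target name maps to a `major.minor.stepping` string, where the last two characters are single hex digits rendered in decimal. A hypothetical helper (not part of ROCm, just an illustration of the pattern) shows how `gfx90c` becomes `9.0.12`:

```python
# Hypothetical helper: derive an HSA_OVERRIDE_GFX_VERSION string from an
# LLVM gfx target name. Everything but the last two characters is the major
# version; the last two are the minor version and stepping, each a single
# hex digit converted to decimal.
def gfx_to_hsa_override(target: str) -> str:
    digits = target.removeprefix("gfx")   # e.g. "90c" or "1030"
    major, minor, step = digits[:-2], digits[-2], digits[-1]
    return f"{int(major)}.{int(minor, 16)}.{int(step, 16)}"

print(gfx_to_hsa_override("gfx90c"))   # -> 9.0.12
print(gfx_to_hsa_override("gfx1030"))  # -> 10.3.0
print(gfx_to_hsa_override("gfx900"))   # -> 9.0.0
```

Which code objects a given binary actually ships for each target is a separate question; the linked target matrix is the authoritative list.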
I just updated my kernel to linuxPackages_6_0. I was using the default (5.15).
It seems that the stuff is working now, even the container.
I suppose this issue can be closed now?
I just want to test tensorflow first. But if the ROCm layer is known to be working, then I suppose there's no more work needed from you on this issue. Thank you guys, you are awesome.
Looks like there were some AMD changes in 6.0, go figure. Glad we could help.
Describe the bug
I have a Ryzen 5600G APU and I am trying to use TensorFlow or PyTorch for some machine learning work. With either one, I am just trying to make it recognize the GPU and be usable; so far I was only able to use the GPU in Blender, with blender-hip or via a workaround with blender-bin.
Steps To Reproduce
Steps to reproduce the behavior:
For PyTorch
docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined rocm/pytorch:latest
python
import torch
"hipErrorNoBinaryForGpu: Unable to find code object for all current devices!" Aborted (core dumped)
For TensorFlow
docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined rocm/tensorflow:latest
python
import tensorflow as tf
tf.config.list_physical_devices()
"hipErrorNoBinaryForGpu: Unable to find code object for all current devices!" Aborted (core dumped)
If I do an `export HSA_OVERRIDE_GFX_VERSION=10.3.0` and do any activity that actually uses the GPU, like `torch.tensor([[1,2],[3,4]]).to(torch.device('cuda'))`, it crashes and dmesg shows the following:
Expected behavior
Machine learning working the same as it would in Google Colab, I guess.
Screenshots
If applicable, add screenshots to help explain your problem.
Additional context
Nixcfg revision used to replicate the issue: https://github.com/lucasew/nixcfg/tree/ff430dc0992d9247989f739a326536f87e345d98/nodes/whiterun
A PC with an i5 6400 + RX460 has the same problem, but I no longer have access to it to test any fixes.
Notify maintainers
@NixOS/rocm-maintainers
Metadata
Please run
nix-shell -p nix-info --run "nix-info -m"
and paste the result.