ROCm / ROCR-Runtime

ROCm Platform Runtime: ROCr a HPC market enhanced HSA based runtime
https://rocm.docs.amd.com/projects/ROCR-Runtime/en/latest/
Other
218 stars 108 forks source link

SIGSEGV, Segmentation fault from /opt/rocm/hsa/lib/libhsa-runtime64.so.1 #68

Closed drwetter closed 1 month ago

drwetter commented 5 years ago

Hi,

recently with external programs I have difficulties:

prompt:~ 0#  clinfo 
Segmentation fault
prompt:~ 139# gdb clinfo 
NU gdb (GDB; openSUSE Tumbleweed) 8.3
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://bugs.opensuse.org/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from clinfo...
Missing separate debuginfo for /usr/bin/clinfo
Try: zypper install -C "debuginfo(build-id)=552db0e18d0b01ce77fea81502069ff4470c6f84"
(No debugging symbols found in clinfo)
(gdb) r
Starting program: /usr/bin/clinfo 
[..]
New Thread 0x7ffff1cfc700 (LWP 13343)]
[New Thread 0x7ffff14fb700 (LWP 13344)]
[New Thread 0x7ffff0cfa700 (LWP 13345)]
[New Thread 0x7ffee3fff700 (LWP 13346)]
[New Thread 0x7ffee37fe700 (LWP 13347)]
[New Thread 0x7ffee2ffd700 (LWP 13348)]
[New Thread 0x7ffee27fc700 (LWP 13349)]
[New Thread 0x7ffee1ffb700 (LWP 13350)]

Thread 1 "clinfo" received signal SIGSEGV, Segmentation fault.
0x00007ffff2cab8d5 in std::_Function_handler<core::Queue* (), amd::GpuAgent::InitDma()::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1

Same with hashcat e.g. .

prompt:~ 0#  rpm -qf /opt/rocm/hsa/lib/libhsa-runtime64.so.1
hsa-rocr-dev-1.1.9_99_g835b876a-1.x86_64
prompt:~ 0#  uname -rsp  
Linux 5.2.8-1-default x86_64
prompt:~ 0#  grep -w NAME /etc/os-release 
NAME="openSUSE Tumbleweed"
prompt:~ 0# cat /proc/cpuinfo  | grep 'model name' | uniq
model name      : AMD Ryzen 5 2400G with Radeon Vega Graphics
prompt:~ 0#

Stock AMDGPU driver is loaded. It did work with kernel 5.1.x and the previous set of RPMs from this project.

The code at https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/8ea15e12ee4760dc6ec394841a5de8e8b9e8c845/src/core/runtime/amd_gpu_agent.cpp#L570 didn't help me .

Any advice?

Thanks, Dirk

skeelyamd commented 5 years ago

Could you try running with a gdb and a debug build of ROCr? I'm wondering if perhaps a memory allocation has failed and where we expect an exception to be thrown it's returning null instead.

As to why something has gone wrong it's likely an install or configuration error. rocminfo checks for some common config issues. Can you run that and see if it gives some warning?

drwetter commented 5 years ago

Are there binaries which I can use or would I have to compile them myself with debug information?

Wrt to the exception: my C++ is a bit rusty but I was wondering about that too.

I'll get back later with the rocminfo output.

Cheers, Dirk

drwetter commented 5 years ago

rocminfo didn't display any warnings. Attached it for you...

rocm.txt

drwetter commented 5 years ago

FYI: Still the same with kernel 5.2.10

Aug 31 18:17:04 REDACTED kernel: [10732.387789] clinfo[9538]: segfault at 1000 ip 00007f3e299348d5 sp 00007ffc11040150 error 6 in libhsa-runtime64.so.1.1.9[7f3e29915000+c7000]
Aug 31 18:17:04 REDACTED kernel: [10732.387794] Code: de 4d 89 e0 50 6a 00 e8 09 62 00 00 49 8b 46 40 48 8b 93 18 04 00 00 59 49 8b 4e 18 4c 89 75 a8 48 8b 40 08 5e 25 f8 1f 00 00 <48> 89 0c 10 48 8b 45 a8 48 85 c0 74 54 48 8d 15 bf dd 2a 00 48 8b
Lucretia commented 5 years ago

I've been testing the builds on Gentoo and I'm getting the same as you @drwetter but with a different Code part as can be seen above.

I thought I'd add my machine details here also, as it's older than most people's and it mentioned as should be working but not tested well on the main docs page where it asks for bug reports.

Linux rogue 5.1.15-gentoo #3 SMP PREEMPT Tue Aug 6 12:48:09 BST 2019 x86_64 AMD FX(tm)-8350 Eight-Core Processor AuthenticAMD GNU/Linux

My GPU is an R9 390.

Lucretia commented 5 years ago

Ok, just built a new kernel, sys-kernel/gentoo-sources-5.2.9. I didn't add the patches I mentioned in my link above. Running clinfo from text console as root gave me:

Aug 31 19:52:47 rogue kernel: clinfo[7833]: segfault at 1000 ip 00007f1f5836b192 sp 00007ffdf9c807c0 error 6 in libhsa-runtime64.so.1.1.9[7f1f5834c000+cb000]
Aug 31 19:52:47 rogue kernel: Code: ff ff ff 48 8b 85 58 ff ff ff 48 8b 80 18 04 00 00 48 8b 95 78 ff ff ff 48 c1 e2 03 48 01 c2 48 8b 85 68 ff ff ff 48 8b 40 18 <48> 89 02 c6 45 b0 01 90 bb 00 00 00 00 0f b6 45 b0 83 f0 01 84 c0
Aug 31 19:52:47 rogue kernel: BUG: kernel NULL pointer dereference, address: 0000000000000038
Aug 31 19:52:47 rogue kernel: #PF: supervisor read access in kernel mode
Aug 31 19:52:47 rogue kernel: #PF: error_code(0x0000) - not-present page
Aug 31 19:52:47 rogue kernel: PGD 5fe852067 P4D 5fe852067 PUD 736f05067 PMD 0 
Aug 31 19:52:47 rogue kernel: Oops: 0000 [#1] PREEMPT SMP
Aug 31 19:52:47 rogue kernel: CPU: 6 PID: 7833 Comm: clinfo Not tainted 5.2.9-gentoo #1
Aug 31 19:52:47 rogue kernel: Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A99FX PRO R2.0, BIOS 2301 01/06/2014
Aug 31 19:52:47 rogue kernel: RIP: 0010:amdgpu_ib_schedule+0x48/0x449 [amdgpu]
Aug 31 19:52:47 rogue kernel: Code: 08 4c 89 44 24 28 45 85 ed 0f 84 0a 04 00 00 49 89 fe 48 89 cb 48 85 c9 74 1a 48 8b 81 90 00 00 00 48 89 44 24 18 48 8b 41 10 <48> 8b 40 38 48 89 04 24 eb 11 48 c7 44 24 18 00 00 00 00 48 c7 04
Aug 31 19:52:47 rogue kernel: RSP: 0018:ffff8e87c1e07ad8 EFLAGS: 00010286
Aug 31 19:52:47 rogue kernel: RAX: 0000000000000000 RBX: ffff8d8615201400 RCX: ffff8d8615201400
Aug 31 19:52:47 rogue kernel: RDX: ffff8d8615201610 RSI: 0000000000000001 RDI: ffff8d8738766c28
Aug 31 19:52:47 rogue kernel: RBP: ffff8d8738760000 R08: ffff8e87c1e07b50 R09: 0000000000000007
Aug 31 19:52:47 rogue kernel: R10: 071c71c71c71c71c R11: ffff8d883eba86c0 R12: 00000000ffffffea
Aug 31 19:52:47 rogue kernel: R13: 0000000000000001 R14: ffff8d8738766c28 R15: ffff8d861d764000
Aug 31 19:52:47 rogue kernel: FS:  00007f1f5888cb80(0000) GS:ffff8d883eb80000(0000) knlGS:0000000000000000
Aug 31 19:52:47 rogue kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 31 19:52:47 rogue kernel: CR2: 0000000000000038 CR3: 00000005fe902000 CR4: 00000000000406e0
Aug 31 19:52:47 rogue kernel: Call Trace:
Aug 31 19:52:47 rogue kernel:  amdgpu_amdkfd_submit_ib+0xda/0x15a [amdgpu]
Aug 31 19:52:47 rogue kernel:  deallocate_vmid+0x93/0xea [amdgpu]
Aug 31 19:52:47 rogue kernel:  destroy_queue_nocpsch_locked+0x153/0x170 [amdgpu]
Aug 31 19:52:47 rogue kernel:  process_termination_nocpsch+0x5b/0x10d [amdgpu]
Aug 31 19:52:47 rogue kernel:  kfd_process_dequeue_from_device+0x25/0x2f [amdgpu]
Aug 31 19:52:47 rogue kernel:  kfd_process_dequeue_from_all_devices+0x1d/0x26 [amdgpu]
Aug 31 19:52:47 rogue kernel:  kfd_process_notifier_release+0x119/0x158 [amdgpu]
Aug 31 19:52:47 rogue kernel:  __mmu_notifier_release+0x3c/0xb9
Aug 31 19:52:47 rogue kernel:  exit_mmap+0x29/0x147
Aug 31 19:52:47 rogue kernel:  ? ___cache_free+0x2c/0x16c
Aug 31 19:52:47 rogue kernel:  ? do_coredump+0xbf1/0xcaa
Aug 31 19:52:47 rogue kernel:  ? __khugepaged_exit+0x72/0x106
Aug 31 19:52:47 rogue kernel:  mmput+0x38/0xcf
Aug 31 19:52:47 rogue kernel:  do_exit+0x3bc/0x9b4
Aug 31 19:52:47 rogue kernel:  do_group_exit+0x95/0x95
Aug 31 19:52:47 rogue kernel:  get_signal+0x708/0x72e
Aug 31 19:52:47 rogue kernel:  ? _raw_spin_lock_irqsave+0x14/0x33
Aug 31 19:52:47 rogue kernel:  ? _raw_spin_unlock_irqrestore+0xf/0x20
Aug 31 19:52:47 rogue kernel:  ? try_to_wake_up+0x360/0x387
Aug 31 19:52:47 rogue kernel:  do_signal+0x2b/0x50e
Aug 31 19:52:47 rogue kernel:  ? kick_process+0x4f/0x62
Aug 31 19:52:47 rogue kernel:  ? __send_signal+0x214/0x30a
Aug 31 19:52:47 rogue kernel:  ? send_signal+0x64/0x10c
Aug 31 19:52:47 rogue kernel:  ? force_sig_info+0xbf/0xcd
Aug 31 19:52:47 rogue kernel:  exit_to_usermode_loop+0x38/0xa4
Aug 31 19:52:47 rogue kernel:  ? page_fault+0x8/0x30
Aug 31 19:52:47 rogue kernel:  prepare_exit_to_usermode+0x66/0x91
Aug 31 19:52:47 rogue kernel:  retint_user+0x8/0x8
Aug 31 19:52:47 rogue kernel: RIP: 0033:0x7f1f5836b192
Aug 31 19:52:47 rogue kernel: Code: ff ff ff 48 8b 85 58 ff ff ff 48 8b 80 18 04 00 00 48 8b 95 78 ff ff ff 48 c1 e2 03 48 01 c2 48 8b 85 68 ff ff ff 48 8b 40 18 <48> 89 02 c6 45 b0 01 90 bb 00 00 00 00 0f b6 45 b0 83 f0 01 84 c0
Aug 31 19:52:47 rogue kernel: RSP: 002b:00007ffdf9c807c0 EFLAGS: 00010206
Aug 31 19:52:47 rogue kernel: RAX: 000000000100b000 RBX: 0000558bba502e00 RCX: 0000000000000000
Aug 31 19:52:47 rogue kernel: RDX: 0000000000001000 RSI: 000000000000ffff RDI: 0000000000000005
Aug 31 19:52:47 rogue kernel: RBP: 00007ffdf9c80890 R08: 0000000000000000 R09: 0000000000000000
Aug 31 19:52:47 rogue kernel: R10: 00007f1f587ceca0 R11: 0000000000000000 R12: 0000000000000001
Aug 31 19:52:47 rogue kernel: R13: 0000558bba5d8c40 R14: 0000000000009568 R15: 0000000000000000
Aug 31 19:52:47 rogue kernel: Modules linked in: nfsd nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo br_netfilter cmac bnep snd_seq_dummy snd_seq_oss snd_pcm_oss snd_mixer_oss 8021q binfmt_misc ip6table_nat ip6_tables twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_common serpent_avx_x86_64 serpent_generic cast6_avx_x86_64 cast6_generic cast5_avx_x86_64 cast5_generic cast_common sha512_ssse3 sha256_ssse3 sha1_ssse3 hfsplus hfs cpufreq_userspace cpufreq_powersave cpufreq_conservative cpufreq_ondemand uinput tun bridge stp llc ipv6 crc_ccitt 9pnet_virtio 9p 9pnet vfio_pci vfio_virqfd snd_seq_midi snd_seq_midi_event snd_seq msr cpuid joydev input_leds hid_uclogic kvm_amd kvm it87 irqbypass hwmon_vid crct10dif_pclmul crc32_pclmul btusb btrtl btbcm btintel bluetooth ecdh_generic ecc rfkill snd_usb_audio snd_usbmidi_lib snd_rawmidi hid_steam snd_seq_device media uas aesni_intel crypto_simd cryptd glue_helper pcspkr k10temp fam15h_power snd_hda_codec_realtek snd_hda_codec_generic amdgpu i2c_piix4
Aug 31 19:52:47 rogue kernel:  snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm amd_iommu_v2 pcc_cpufreq snd_timer gpu_sched ttm snd button acpi_cpufreq virtio_pci virtio_scsi virtio_blk virtio_console virtio_balloon aes_x86_64 sha512_generic libiscsi scsi_transport_iscsi vxlan ip6_udp_tunnel udp_tunnel macvlan virtio_net net_failover failover virtio_ring virtio pcnet32 mii fuse nfs lockd grace sunrpc jfs multipath linear raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 dm_snapshot dm_bufio dm_crypt dm_mirror dm_region_hash dm_log hid_sony ff_memless hid_logitech hid_a4tech sl811_hcd uhci_hcd aic94xx libsas lpfc crc_t10dif qla2xxx megaraid_sas megaraid_mbox megaraid_mm megaraid aacraid sx8 3w_9xxx 3w_xxxx mptsas scsi_transport_sas mptfc scsi_transport_fc mptspi mptscsih mptbase atp870u dc395x qla1280 imm parport dmx3191d sym53c8xx gdth initio BusLogic arcmsr aic7xxx aic79xx scsi_transport_spi sg pdc_adma sata_inic162x sata_mv ata_piix sata_qstor sata_vsc
Aug 31 19:52:47 rogue kernel:  sata_uli sata_sis sata_sx4 sata_nv sata_via sata_svw sata_sil24 sata_sil sata_promise pata_sl82c105 pata_via pata_jmicron pata_marvell pata_sis pata_netcell pata_pdc202xx_old pata_triflex pata_atiixp pata_opti pata_amd pata_ali pata_it8213 pata_pcmcia pcmcia pcmcia_core pata_ns87415 pata_ns87410 pata_serverworks pata_artop pata_it821x pata_optidma pata_hpt3x2n pata_hpt3x3 pata_hpt37x pata_hpt366 pata_cmd64x pata_efar pata_rz1000 pata_sil680 pata_radisys pata_pdc2027x pata_mpiix dm_mod dax led_class usb_storage crct10dif_common r8169 ahci libahci libphy libata
Aug 31 19:52:47 rogue kernel: CR2: 0000000000000038
Aug 31 19:52:47 rogue kernel: ---[ end trace 9ffc0c42faacf8cc ]---
Aug 31 19:52:47 rogue kernel: RIP: 0010:amdgpu_ib_schedule+0x48/0x449 [amdgpu]
Aug 31 19:52:47 rogue kernel: Code: 08 4c 89 44 24 28 45 85 ed 0f 84 0a 04 00 00 49 89 fe 48 89 cb 48 85 c9 74 1a 48 8b 81 90 00 00 00 48 89 44 24 18 48 8b 41 10 <48> 8b 40 38 48 89 04 24 eb 11 48 c7 44 24 18 00 00 00 00 48 c7 04
Aug 31 19:52:47 rogue kernel: RSP: 0018:ffff8e87c1e07ad8 EFLAGS: 00010286
Aug 31 19:52:47 rogue kernel: RAX: 0000000000000000 RBX: ffff8d8615201400 RCX: ffff8d8615201400
Aug 31 19:52:47 rogue kernel: RDX: ffff8d8615201610 RSI: 0000000000000001 RDI: ffff8d8738766c28
Aug 31 19:52:47 rogue kernel: RBP: ffff8d8738760000 R08: ffff8e87c1e07b50 R09: 0000000000000007
Aug 31 19:52:47 rogue kernel: R10: 071c71c71c71c71c R11: ffff8d883eba86c0 R12: 00000000ffffffea
Aug 31 19:52:47 rogue kernel: R13: 0000000000000001 R14: ffff8d8738766c28 R15: ffff8d861d764000
Aug 31 19:52:47 rogue kernel: FS:  00007f1f5888cb80(0000) GS:ffff8d883eb80000(0000) knlGS:0000000000000000
Aug 31 19:52:47 rogue kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 31 19:52:47 rogue kernel: CR2: 0000000000000038 CR3: 00000005fe902000 CR4: 00000000000406e0
Aug 31 19:52:47 rogue kernel: Fixing recursive fault but reboot is needed!
fxkamd commented 5 years ago

@Lucretia, the kernel oops is happening on a code path specific to Hawaii GPUs. It doesn't get any regular testing here. I can try to reproduce it locally. This problem is unrelated to the one reported by @drwetter.

Lucretia commented 5 years ago

@fxkamd yes, if I apply those patches from my machine status 72 above, I get the same segfault at @drwetter above with the 2 lines including the "code: ..."

skeelyamd commented 5 years ago

@drwetter, you would need to compile a debug build yourself. It should be a straightforward process. The cmake only needs to be pointed at libhsakmt (usually /opt/rocm/include and /opt/rocm/lib). Standard cmake cache config tools can be used (ccmake, cmake-gui) to set those parameters.

I'll also see if I can reproduce this error locally.

Lucretia commented 5 years ago

hsakm

But the message says the error is coming from libhsa-runtime64.so.1 which isn't open.

skeelyamd commented 5 years ago

libhsa-runtime64.so is ROCr (this project).


From: Luke A. Guest notifications@github.com Sent: Saturday, August 31, 2019 10:15:15 PM To: RadeonOpenCompute/ROCR-Runtime ROCR-Runtime@noreply.github.com Cc: Keely, Sean Sean.Keely@amd.com; Comment comment@noreply.github.com Subject: Re: [RadeonOpenCompute/ROCR-Runtime] SIGSEGV, Segmentation fault from /opt/rocm/hsa/lib/libhsa-runtime64.so.1 (#68)

hsakm

But the message says the error is coming from libhsa-runtime64.so.1 which isn't open.

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/RadeonOpenCompute/ROCR-Runtime/issues/68?email_source=notifications&email_token=ADBZXXZ4I6CLRGRAUFIFD5DQHMXUHA5CNFSM4IOVDXC2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5TZKXA#issuecomment-526882140, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ADBZXX4ZKDXBXGEBLPLTNGTQHMXUHANCNFSM4IOVDXCQ.

Lucretia commented 5 years ago

@skeelyamd I was getting confused with another package on Gentoo that's binary only. I'll see if I can build it in debug and get a trace back for you.

Lucretia commented 5 years ago

@skeelyamd I recompiled dev-util/clinfo and dev-libs/rocr-runtime with debugging enabled, GDB gives the folllowing:

(gdb) r
Starting program: /usr/bin/clinfo 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff744d700 (LWP 21391)]
[New Thread 0x7fffea517700 (LWP 21392)]
[New Thread 0x7fffe9d16700 (LWP 21393)]
[New Thread 0x7fffe9515700 (LWP 21394)]
[New Thread 0x7fffe8d14700 (LWP 21395)]
[New Thread 0x7fffe3fff700 (LWP 21396)]
[New Thread 0x7fffe37fe700 (LWP 21397)]
[New Thread 0x7fffe2ffd700 (LWP 21398)]
[New Thread 0x7fffe27fc700 (LWP 21399)]

Thread 1 "clinfo" received signal SIGSEGV, Segmentation fault.
0x00007ffff7aca192 in amd::GpuAgent::QueueCreate(unsigned long, unsigned int, void (*)(hsa_status_t, hsa_queue_s*, void*), void*, unsigned int, unsigned int, core::Queue**) () from /usr/lib64/libhsa-runtime64.so.1
(gdb) bt
#0  0x00007ffff7aca192 in amd::GpuAgent::QueueCreate(unsigned long, unsigned int, void (*)(hsa_status_t, hsa_queue_s*, void*), void*, unsigned int, unsigned int, core::Queue**) () from /usr/lib64/libhsa-runtime64.so.1
#1  0x00007ffff7ac8951 in amd::GpuAgent::CreateInterceptibleQueue() () from /usr/lib64/libhsa-runtime64.so.1
#2  0x00007ffff7ac8bb4 in amd::GpuAgent::InitDma()::{lambda()#1}::operator()() const () from /usr/lib64/libhsa-runtime64.so.1
#3  0x00007ffff7acb977 in std::_Function_handler<core::Queue* (), amd::GpuAgent::InitDma()::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /usr/lib64/libhsa-runtime64.so.1
#4  0x00007ffff7ad0200 in std::function<core::Queue* ()>::operator()() const () from /usr/lib64/libhsa-runtime64.so.1
#5  0x00007ffff7acfaf1 in lazy_ptr<core::Queue>::make_body(bool) const () from /usr/lib64/libhsa-runtime64.so.1
#6  0x00007ffff7ace746 in lazy_ptr<core::Queue>::operator->() const () from /usr/lib64/libhsa-runtime64.so.1
#7  0x00007ffff7acb147 in amd::GpuAgent::InvalidateCodeCaches() () from /usr/lib64/libhsa-runtime64.so.1
#8  0x00007ffff7ad799f in amd::LoaderContext::SegmentAlloc(amdgpu_hsa_elf_segment_t, hsa_agent_s, unsigned long, unsigned long, bool) () from /usr/lib64/libhsa-runtime64.so.1
#9  0x00007ffff7b39187 in amd::hsa::loader::ExecutableImpl::LoadSegmentsV2(hsa_agent_s, amd::hsa::code::AmdHsaCode const*) () from /usr/lib64/libhsa-runtime64.so.1
#10 0x00007ffff7b38ff5 in amd::hsa::loader::ExecutableImpl::LoadSegments(hsa_agent_s, amd::hsa::code::AmdHsaCode const*, unsigned int) () from /usr/lib64/libhsa-runtime64.so.1
#11 0x00007ffff7b38a82 in amd::hsa::loader::ExecutableImpl::LoadCodeObject(hsa_agent_s, hsa_code_object_s, unsigned long, char const*, hsa_loaded_code_object_s*) () from /usr/lib64/libhsa-runtime64.so.1
#12 0x00007ffff7b37dcd in amd::hsa::loader::ExecutableImpl::LoadCodeObject(hsa_agent_s, hsa_code_object_s, char const*, hsa_loaded_code_object_s*) () from /usr/lib64/libhsa-runtime64.so.1
#13 0x00007ffff7af8026 in HSA::hsa_executable_load_agent_code_object(hsa_executable_s, hsa_agent_s, hsa_code_object_reader_s, char const*, hsa_loaded_code_object_s*) () from /usr/lib64/libhsa-runtime64.so.1
#14 0x00007ffff7b335b6 in hsa_executable_load_agent_code_object () from /usr/lib64/libhsa-runtime64.so.1
#15 0x00007ffff7ca72b7 in ?? () from /usr/lib64/libamdocl64.so
#16 0x00007ffff7c81fb6 in ?? () from /usr/lib64/libamdocl64.so
#17 0x00007ffff7c7e077 in ?? () from /usr/lib64/libamdocl64.so
#18 0x00007ffff7c9036f in ?? () from /usr/lib64/libamdocl64.so
#19 0x00007ffff7c6b24a in ?? () from /usr/lib64/libamdocl64.so
#20 0x00007ffff7cada7e in ?? () from /usr/lib64/libamdocl64.so
#21 0x00007ffff7caeb99 in ?? () from /usr/lib64/libamdocl64.so
#22 0x00007ffff7c68841 in ?? () from /usr/lib64/libamdocl64.so
#23 0x00007ffff7c89cd6 in ?? () from /usr/lib64/libamdocl64.so
#24 0x00007ffff7cdf0c5 in ?? () from /usr/lib64/libamdocl64.so
#25 0x00007ffff7d45e57 in __pthread_once_slow () from /lib64/libpthread.so.0
#26 0x00007ffff7cdf1dc in clIcdGetPlatformIDsKHR () from /usr/lib64/libamdocl64.so
#27 0x00007ffff7f2f15d in ?? () from /usr/lib64/OpenCL/vendors/ocl-icd/libOpenCL.so.1
#28 0x00007ffff7d45e57 in __pthread_once_slow () from /lib64/libpthread.so.0
#29 0x00007ffff7f2e428 in ?? () from /usr/lib64/OpenCL/vendors/ocl-icd/libOpenCL.so.1
#30 0x00007ffff7f31234 in clGetPlatformIDs () from /usr/lib64/OpenCL/vendors/ocl-icd/libOpenCL.so.1
#31 0x000055555555a420 in main (argc=<optimized out>, argv=<optimized out>) at src/clinfo.c:3190
JHirte commented 5 years ago

I see the same on Raven Ridge:

Thread 1 "clinfo" received signal SIGSEGV, Segmentation fault.
0x00007ffff7af51c2 in amd::GpuAgent::QueueCreate(unsigned long, unsigned int, void (*)(hsa_status_t, hsa_queue_s*, void*), void*, unsigned int, unsigned int, core::Queue**) () from /usr/lib64/libhsa-runtime64.so.1
(gdb) bt
#0  0x00007ffff7af51c2 in amd::GpuAgent::QueueCreate(unsigned long, unsigned int, void (*)(hsa_status_t, hsa_queue_s*, void*), void*, unsigned int, unsigned int, core::Queue**) () from /usr/lib64/libhsa-runtime64.so.1
#1  0x00007ffff7af3981 in amd::GpuAgent::CreateInterceptibleQueue() () from /usr/lib64/libhsa-runtime64.so.1
#2  0x00007ffff7af3be4 in amd::GpuAgent::InitDma()::{lambda()#1}::operator()() const () from /usr/lib64/libhsa-runtime64.so.1
#3  0x00007ffff7af69a7 in std::_Function_handler<core::Queue* (), amd::GpuAgent::InitDma()::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /usr/lib64/libhsa-runtime64.so.1
#4  0x00007ffff7afb230 in std::function<core::Queue* ()>::operator()() const () from /usr/lib64/libhsa-runtime64.so.1
#5  0x00007ffff7afab21 in lazy_ptr<core::Queue>::make_body(bool) const () from /usr/lib64/libhsa-runtime64.so.1
#6  0x00007ffff7af9776 in lazy_ptr<core::Queue>::operator->() const () from /usr/lib64/libhsa-runtime64.so.1
#7  0x00007ffff7af6177 in amd::GpuAgent::InvalidateCodeCaches() () from /usr/lib64/libhsa-runtime64.so.1
#8  0x00007ffff7b029cf in amd::LoaderContext::SegmentAlloc(amdgpu_hsa_elf_segment_t, hsa_agent_s, unsigned long, unsigned long, bool) () from /usr/lib64/libhsa-runtime64.so.1
#9  0x00007ffff7b641b7 in amd::hsa::loader::ExecutableImpl::LoadSegmentsV2(hsa_agent_s, amd::hsa::code::AmdHsaCode const*) () from /usr/lib64/libhsa-runtime64.so.1
#10 0x00007ffff7b64025 in amd::hsa::loader::ExecutableImpl::LoadSegments(hsa_agent_s, amd::hsa::code::AmdHsaCode const*, unsigned int) () from /usr/lib64/libhsa-runtime64.so.1
#11 0x00007ffff7b63ab2 in amd::hsa::loader::ExecutableImpl::LoadCodeObject(hsa_agent_s, hsa_code_object_s, unsigned long, char const*, hsa_loaded_code_object_s*) () from /usr/lib64/libhsa-runtime64.so.1
#12 0x00007ffff7b62dfd in amd::hsa::loader::ExecutableImpl::LoadCodeObject(hsa_agent_s, hsa_code_object_s, char const*, hsa_loaded_code_object_s*) () from /usr/lib64/libhsa-runtime64.so.1
#13 0x00007ffff7b23056 in HSA::hsa_executable_load_agent_code_object(hsa_executable_s, hsa_agent_s, hsa_code_object_reader_s, char const*, hsa_loaded_code_object_s*) () from /usr/lib64/libhsa-runtime64.so.1
#14 0x00007ffff7b5e5e6 in hsa_executable_load_agent_code_object () from /usr/lib64/libhsa-runtime64.so.1
#15 0x00007ffff7cd5c5f in roc::LightningProgram::setKernels (this=0x55555567be90, options=0x7fffffffced0, binary=<optimized out>, binSize=39728)
    at /var/tmp/portage/dev-libs/rocm-opencl-runtime-2.6.0-r1/work/ROCm-OpenCL-Runtime-roc-2.6.0/runtime/device/rocm/rocprogram.cpp:473
#16 0x00007ffff7cae85e in device::Program::linkImplLC (this=0x55555567be90, options=0x7fffffffced0) at /usr/lib/gcc/x86_64-pc-linux-gnu/9.2.0/include/g++-v9/ext/new_allocator.h:89
#17 0x00007ffff7cac092 in device::Program::build (this=0x55555567be90, 
    sourceCode="extern void __amd_copyBufferRect(__global uchar*, __global uchar*, ulong4, ulong4, ulong4); extern void __amd_copyBufferRectAligned(__global uint*, __global uint*, ulong4, ulong4, ulong4); extern void"..., 
    origOptions=origOptions@entry=0x555555676da0 "-cl-internal-kernel  -fno-enable-dump", options=options@entry=0x7fffffffced0)
    at /var/tmp/portage/dev-libs/rocm-opencl-runtime-2.6.0-r1/work/ROCm-OpenCL-Runtime-roc-2.6.0/runtime/device/devprogram.cpp:2189
#18 0x00007ffff7cbe647 in amd::Program::build (this=0x55555567b4b0, devices=std::vector of length 1, capacity 1 = {...}, options=0x555555676da0 "-cl-internal-kernel  -fno-enable-dump", notifyFptr=notifyFptr@entry=0x0, 
    data=data@entry=0x0, optionChangable=<optimized out>) at /var/tmp/portage/dev-libs/rocm-opencl-runtime-2.6.0-r1/work/ROCm-OpenCL-Runtime-roc-2.6.0/runtime/platform/program.cpp:519
#19 0x00007ffff7c98882 in amd::Device::BlitProgram::create (this=0x55555567b470, device=<optimized out>, device@entry=0x55555567a7f0, 
    extraKernels=extraKernels@entry=0x55555567b230 "\n extern void __amd_scheduler_rocm(__global void*); \n __kernel void scheduler(__global void* params) { __amd_scheduler_rocm(params); } \n\n __kernel void gwsInit(uint value) { unsigned int m0_backup, ne"..., extraOptions=extraOptions@entry=0x0) at /usr/lib/gcc/x86_64-pc-linux-gnu/9.2.0/include/g++-v9/bits/basic_string.h:2300
#20 0x00007ffff7cdc635 in roc::Device::create (this=0x55555567a7f0, sramEccEnabled=<optimized out>) at /var/tmp/portage/dev-libs/rocm-opencl-runtime-2.6.0-r1/work/ROCm-OpenCL-Runtime-roc-2.6.0/runtime/device/rocm/rocdevice.cpp:692
#21 0x00007ffff7cdd716 in roc::Device::init () at /var/tmp/portage/dev-libs/rocm-opencl-runtime-2.6.0-r1/work/ROCm-OpenCL-Runtime-roc-2.6.0/runtime/device/rocm/rocdevice.cpp:542
#22 0x00007ffff7c95dc1 in amd::Device::init () at /var/tmp/portage/dev-libs/rocm-opencl-runtime-2.6.0-r1/work/ROCm-OpenCL-Runtime-roc-2.6.0/runtime/device/device.cpp:161
#23 0x00007ffff7cb7afe in amd::Runtime::init () at /usr/lib/gcc/x86_64-pc-linux-gnu/9.2.0/include/g++-v9/bits/atomic_base.h:212
#24 0x00007ffff7d0fa25 in ShouldLoadPlatform () at /var/tmp/portage/dev-libs/rocm-opencl-runtime-2.6.0-r1/work/ROCm-OpenCL-Runtime-roc-2.6.0/api/opencl/amdocl/cl_icd.cpp:205
#25 <lambda()>::operator() (__closure=<optimized out>) at /var/tmp/portage/dev-libs/rocm-opencl-runtime-2.6.0-r1/work/ROCm-OpenCL-Runtime-roc-2.6.0/api/opencl/amdocl/cl_icd.cpp:255
#26 std::__invoke_impl<void, clIcdGetPlatformIDsKHR(cl_uint, _cl_platform_id**, cl_uint*)::<lambda()> > (__f=...) at /usr/lib/gcc/x86_64-pc-linux-gnu/9.2.0/include/g++-v9/bits/invoke.h:60
#27 std::__invoke<clIcdGetPlatformIDsKHR(cl_uint, _cl_platform_id**, cl_uint*)::<lambda()> > (__fn=...) at /usr/lib/gcc/x86_64-pc-linux-gnu/9.2.0/include/g++-v9/bits/invoke.h:95
#28 std::<lambda()>::operator() (this=<optimized out>) at /usr/lib/gcc/x86_64-pc-linux-gnu/9.2.0/include/g++-v9/mutex:671
#29 std::<lambda()>::operator() (this=0x0) at /usr/lib/gcc/x86_64-pc-linux-gnu/9.2.0/include/g++-v9/mutex:676
#30 std::<lambda()>::_FUN(void) () at /usr/lib/gcc/x86_64-pc-linux-gnu/9.2.0/include/g++-v9/mutex:676
#31 0x00007ffff7d77f97 in __pthread_once_slow () from /lib64/libpthread.so.0
#32 0x00007ffff7d0fb3c in __gthread_once (__func=<optimized out>, __once=0x7ffff7d63020 <clIcdGetPlatformIDsKHR::initOnce>) at /usr/lib/gcc/x86_64-pc-linux-gnu/9.2.0/include/g++-v9/x86_64-pc-linux-gnu/bits/gthr-default.h:700
#33 std::call_once<clIcdGetPlatformIDsKHR(cl_uint, _cl_platform_id**, cl_uint*)::<lambda()> > (__once=..., __f=...) at /usr/lib/gcc/x86_64-pc-linux-gnu/9.2.0/include/g++-v9/mutex:683
#34 clIcdGetPlatformIDsKHR (num_entries=<optimized out>, platforms=0x0, num_platforms=0x7fffffffd66c) at /var/tmp/portage/dev-libs/rocm-opencl-runtime-2.6.0-r1/work/ROCm-OpenCL-Runtime-roc-2.6.0/api/opencl/amdocl/cl_icd.cpp:255
#35 0x00007ffff7f6b456 in _find_and_check_platforms (num_icds=1) at /var/tmp/portage/dev-libs/ocl-icd-2.2.12-r1/work/ocl-icd-2.2.12/ocl_icd_loader.c:445
#36 __initClIcd () at /var/tmp/portage/dev-libs/ocl-icd-2.2.12-r1/work/ocl-icd-2.2.12/ocl_icd_loader.c:652
#37 0x00007ffff7d77f97 in __pthread_once_slow () from /lib64/libpthread.so.0
#38 0x00007ffff7f6a470 in _initClIcd_real () at /var/tmp/portage/dev-libs/ocl-icd-2.2.12-r1/work/ocl-icd-2.2.12/ocl_icd_loader.c:694
#39 0x00007ffff7f6cd14 in _initClIcd () at /var/tmp/portage/dev-libs/ocl-icd-2.2.12-r1/work/ocl-icd-2.2.12/ocl_icd_loader.c:724
#40 clGetPlatformIDs (num_entries=0, platforms=0x0, num_platforms=0x7fffffffd7f0) at /var/tmp/portage/dev-libs/ocl-icd-2.2.12-r1/work/ocl-icd-2.2.12/ocl_icd_loader.c:846
#41 0x000055555555a410 in main (argc=<optimized out>, argv=<optimized out>) at src/clinfo.c:319
skeelyamd commented 5 years ago

I have reproduced this internally on Raven and the issue accounts for the faults on Hawaii as well. It does not account for the kernel errors being reported on Hawaii though. That remains a separate issue that the driver team is examining.

I'm testing a partial fix that restores functionality for correct code. The issue impacts fault handling and it may be that completing the fix will require some kernel involvement. If so that is likely to take a bit longer.

Thanks for the reports and traces.

Lucretia commented 5 years ago

I have reproduced this internally on Raven and the issue accounts for the faults on Hawaii as well. It does not account for the kernel errors being reported on Hawaii though. That remains a separate issue that the driver team is examining.

It'd be nice if you get keep testing Hawaii cards as there are a lot of people on older hardware who feel like they are being left behind with newer software, i.e. AMD keeps changing direction wrt to OpenCL screwing us over. It's not always possible for people to upgrade, I currently can't, so I'm stuck on Vishera and R9 390 for the forseeable future.

There are also serious kernel issues recently with regards to pci passthrough as well, rx580 (which doesn't have the reset bug), as in, not working in newer kernels. I'm on Gentoo and test the most recent kernels as they are released.

I'm testing a partial fix that restores functionality for correct code. The issue impacts fault handling and it may be that completing the fix will require some kernel involvement. If so that is likely to take a bit longer.

Thanks for the reports and traces.

if you can send patches here, we can test them too.

fxkamd commented 5 years ago

@Lucretia, this quick patch should fix your kernel oops: https://lists.freedesktop.org/archives/amd-gfx/2019-September/039702.html

Lucretia commented 5 years ago

@fxamd is that on top of the other two I applied to stop the original crash, or in place of?

skeelyamd commented 5 years ago

@Lucretia, expanded testing is something we are pushing for as well.

PR https://github.com/RadeonOpenCompute/ROCR-Runtime/pull/71 holds the patch to resolve the segfaults.

Lucretia commented 5 years ago

Doesn't answer the question though, does it?

skeelyamd commented 5 years ago

If you are referring to the question about kernel patches you will have to wait for @fxkamd to reply.

fxkamd commented 5 years ago

@fxamd is that on top of the other two I applied to stop the original crash, or in place of?

In place of. It looks like one of the patches is essentially the same fix I came up with. Shame that I didn't see that before I spent a day debugging it. The other patch is not needed. I tried something like that on the current code and it didn't work.

JHirte commented 5 years ago

@Lucretia, expanded testing is something we are pushing for as well.

PR #71 holds the patch to resolve the segfaults.

I can confirm that this fixed it on Raven.

Lucretia commented 5 years ago

@fxamd is that on top of the other two I applied to stop the original crash, or in place of?

In place of. It looks like one of the patches is essentially the same fix I came up with. Shame that I didn't see that before I spent a day debugging it. The other patch is not needed. I tried something like that on the current code and it didn't work.

I did point them out a number of times.

Anyway, I'll rebuild rocr-runtime with PR #71 and without those 2 kernel patches. Thanks. I'll report back with results.

Lucretia commented 5 years ago

@fxamd @skeelyamd Ok, just booted into a new kernel build (no patches on the kernel):

Linux rogue 5.2.11-gentoo #1 SMP PREEMPT Fri Sep 6 15:11:40 BST 2019 x86_64 AMD FX(tm)-8350 Eight-Core Processor AuthenticAMD GNU/Linux

and ran clinfo, it paused and nothing was happening, pressed ctrl-c a few times, wouldn't die, then finally, dmesg -Hw showed:

[  +0.000005] Resetting wave fronts (nocpsch) on dev 00000000b5e360eb
[  +0.000015] BUG: kernel NULL pointer dereference, address: 0000000000000038
[  +0.000002] #PF: supervisor read access in kernel mode
[  +0.000001] #PF: error_code(0x0000) - not-present page
[  +0.000001] PGD 0 P4D 0 
[  +0.000002] Oops: 0000 [#1] PREEMPT SMP
[  +0.000003] CPU: 7 PID: 9875 Comm: clinfo Not tainted 5.2.11-gentoo #1
[  +0.000001] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A99FX PRO R2.0, BIOS 2301 01/06/2014
[  +0.000046] RIP: 0010:amdgpu_ib_schedule+0x48/0x449 [amdgpu]
[  +0.000001] Code: 08 4c 89 44 24 28 45 85 ed 0f 84 0a 04 00 00 49 89 fe 48 89 cb 48 85 c9 74 1a 48 8b 81 90 00 00 00 48 89 44 24 18 48 8b 41 10 <48> 8b 40 38 48 89 04 24 eb 11 48 c7 44 24 18 00 00 00 00 48 c7 04
[  +0.000001] RSP: 0018:ffffabb644b0bad8 EFLAGS: 00010286
[  +0.000002] RAX: 0000000000000000 RBX: ffffa2a372c60000 RCX: ffffa2a372c60000
[  +0.000001] RDX: ffffa2a372c60210 RSI: 0000000000000001 RDI: ffffa2a373976c28
[  +0.000000] RBP: ffffa2a373970000 R08: ffffabb644b0bb50 R09: 0000000000000007
[  +0.000001] R10: 0000000000000000 R11: 0000000000000058 R12: 00000000ffffffea
[  +0.000001] R13: 0000000000000001 R14: ffffa2a373976c28 R15: ffffa2a16fd72000
[  +0.000001] FS:  00007f1876ffd700(0000) GS:ffffa2a47ebc0000(0000) knlGS:0000000000000000
[  +0.000001] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000001] CR2: 0000000000000038 CR3: 00000005f9fc8000 CR4: 00000000000406e0
[  +0.000001] Call Trace:
[  +0.000032]  amdgpu_amdkfd_submit_ib+0xda/0x15a [amdgpu]
[  +0.000031]  deallocate_vmid+0x93/0xea [amdgpu]
[  +0.000031]  destroy_queue_nocpsch_locked+0x153/0x170 [amdgpu]
[  +0.000030]  process_termination_nocpsch+0x5b/0x10d [amdgpu]
[  +0.000030]  kfd_process_dequeue_from_device+0x25/0x2f [amdgpu]
[  +0.000031]  kfd_process_dequeue_from_all_devices+0x1d/0x26 [amdgpu]
[  +0.000035]  kfd_process_notifier_release+0x119/0x158 [amdgpu]
[  +0.000004]  __mmu_notifier_release+0x3c/0xb9
[  +0.000003]  exit_mmap+0x29/0x147
[  +0.000002]  ? ___cache_free+0x2c/0x16c
[  +0.000002]  ? __unqueue_futex+0x40/0x46
[  +0.000002]  ? _raw_spin_unlock+0xd/0x1e
[  +0.000002]  ? __khugepaged_exit+0x72/0x106
[  +0.000002]  mmput+0x38/0xcf
[  +0.000002]  do_exit+0x3bc/0x9b4
[  +0.000002]  do_group_exit+0x95/0x95
[  +0.000002]  get_signal+0x708/0x72e
[  +0.000003]  do_signal+0x2b/0x50e
[  +0.000002]  ? __se_sys_futex+0x12c/0x151
[  +0.000001]  exit_to_usermode_loop+0x38/0xa4
[  +0.000002]  do_syscall_64+0xcb/0xf6
[  +0.000001]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.000002] RIP: 0033:0x7f18949aef6c
[  +0.000002] Code: Bad RIP value.
[  +0.000001] RSP: 002b:00007f1876ffcd70 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[  +0.000002] RAX: fffffffffffffe00 RBX: 00007f18917e59a8 RCX: 00007f18949aef6c
[  +0.000000] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007f18917e59d0
[  +0.000001] RBP: 0000000000000000 R08: 0000000000000000 R09: 00007f186c000b20
[  +0.000001] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000014
[  +0.000001] R13: 00007f18917e5980 R14: 0000000000000000 R15: 00007f18917e59d0
[  +0.000001] Modules linked in: ebtable_filter ebtables ip6table_filter rfcomm nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo br_netfilter nfsd cmac bnep snd_seq_dummy snd_seq_oss snd_pcm_oss snd_mixer_oss 8021q binfmt_misc ip6table_nat ip6_tables twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_common serpent_avx_x86_64 serpent_generic cast6_avx_x86_64 cast6_generic cast5_avx_x86_64 cast5_generic cast_common sha512_ssse3 sha256_ssse3 sha1_ssse3 hfsplus hfs cpufreq_userspace cpufreq_powersave cpufreq_conservative cpufreq_ondemand uinput tun bridge stp llc ipv6 crc_ccitt 9pnet_virtio 9p 9pnet vfio_pci vfio_virqfd snd_seq_midi snd_seq_midi_event snd_seq msr cpuid kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul aesni_intel crypto_simd cryptd glue_helper it87 hwmon_vid pcspkr btusb btrtl btbcm snd_hda_codec_realtek btintel input_leds joydev bluetooth hid_uclogic snd_hda_codec_generic snd_usb_audio ecdh_generic snd_usbmidi_lib snd_rawmidi ecc snd_hda_intel snd_seq_device
[  +0.000025]  rfkill hid_steam media uas fam15h_power k10temp usblp snd_hda_codec amdgpu snd_hwdep i2c_piix4 snd_hda_core snd_pcm amd_iommu_v2 gpu_sched snd_timer ttm snd pcc_cpufreq acpi_cpufreq button virtio_pci virtio_scsi virtio_blk virtio_console virtio_balloon aes_x86_64 sha512_generic libiscsi scsi_transport_iscsi vxlan ip6_udp_tunnel udp_tunnel macvlan virtio_net net_failover failover virtio_ring virtio pcnet32 mii fuse nfs lockd grace sunrpc jfs multipath linear raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 dm_snapshot dm_bufio dm_crypt dm_mirror dm_region_hash dm_log hid_sony ff_memless hid_logitech hid_a4tech sl811_hcd uhci_hcd aic94xx libsas lpfc crc_t10dif qla2xxx megaraid_sas megaraid_mbox megaraid_mm megaraid aacraid sx8 3w_9xxx 3w_xxxx mptsas scsi_transport_sas mptfc scsi_transport_fc mptspi mptscsih mptbase atp870u dc395x qla1280 imm parport dmx3191d sym53c8xx gdth initio BusLogic arcmsr aic7xxx aic79xx scsi_transport_spi sg pdc_adma
[  +0.000032]  sata_inic162x sata_mv ata_piix sata_qstor sata_vsc sata_uli sata_sis sata_sx4 sata_nv sata_via sata_svw sata_sil24 sata_sil sata_promise pata_sl82c105 pata_via pata_jmicron pata_marvell pata_sis pata_netcell pata_pdc202xx_old pata_triflex pata_atiixp pata_opti pata_amd pata_ali pata_it8213 pata_pcmcia pcmcia pcmcia_core pata_ns87415 pata_ns87410 pata_serverworks pata_artop pata_it821x pata_optidma pata_hpt3x2n pata_hpt3x3 pata_hpt37x pata_hpt366 pata_cmd64x pata_efar pata_rz1000 pata_sil680 pata_radisys pata_pdc2027x pata_mpiix dm_mod dax led_class usb_storage crct10dif_common r8169 ahci libphy libahci libata
[  +0.000020] CR2: 0000000000000038
[  +0.000001] ---[ end trace 881fcf7587445750 ]---
[  +0.000026] RIP: 0010:amdgpu_ib_schedule+0x48/0x449 [amdgpu]
[  +0.000001] Code: 08 4c 89 44 24 28 45 85 ed 0f 84 0a 04 00 00 49 89 fe 48 89 cb 48 85 c9 74 1a 48 8b 81 90 00 00 00 48 89 44 24 18 48 8b 41 10 <48> 8b 40 38 48 89 04 24 eb 11 48 c7 44 24 18 00 00 00 00 48 c7 04
[  +0.000002] RSP: 0018:ffffabb644b0bad8 EFLAGS: 00010286
[  +0.000001] RAX: 0000000000000000 RBX: ffffa2a372c60000 RCX: ffffa2a372c60000
[  +0.000000] RDX: ffffa2a372c60210 RSI: 0000000000000001 RDI: ffffa2a373976c28
[  +0.000001] RBP: ffffa2a373970000 R08: ffffabb644b0bb50 R09: 0000000000000007
[  +0.000001] R10: 0000000000000000 R11: 0000000000000058 R12: 00000000ffffffea
[  +0.000001] R13: 0000000000000001 R14: ffffa2a373976c28 R15: ffffa2a16fd72000
[  +0.000001] FS:  00007f1876ffd700(0000) GS:ffffa2a47ebc0000(0000) knlGS:0000000000000000
[  +0.000001] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000001] CR2: 00007f18949aef42 CR3: 00000005f9fc8000 CR4: 00000000000406e0
[  +0.000001] Fixing recursive fault but reboot is needed!

So, probably still need those two patches applied.

Now on latest kernel 5.3.0-rc6:

I just ran clinfo again, it literally just sits there doing nothing, so I ran another instance inside gdb and same, ctrl-c, bt:

Thread 1 "clinfo" received signal SIGINT, Interrupt.
0x00007ffff7e3f8b7 in sched_yield () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff7e3f8b7 in sched_yield () from /lib64/libc.so.6
#1  0x00007ffff7aab496 in os::YieldThread() () from /usr/lib64/libhsa-runtime64.so.1
#2  0x00007ffff7ad2fa8 in amd::AqlQueue::ExecutePM4(unsigned int*, unsigned long) () from /usr/lib64/libhsa-runtime64.so.1
#3  0x00007ffff7aca16a in amd::GpuAgent::InvalidateCodeCaches() () from /usr/lib64/libhsa-runtime64.so.1
#4  0x00007ffff7ad699b in amd::LoaderContext::SegmentAlloc(amdgpu_hsa_elf_segment_t, hsa_agent_s, unsigned long, unsigned long, bool) () from /usr/lib64/libhsa-runtime64.so.1
#5  0x00007ffff7b38183 in amd::hsa::loader::ExecutableImpl::LoadSegmentsV2(hsa_agent_s, amd::hsa::code::AmdHsaCode const*) () from /usr/lib64/libhsa-runtime64.so.1
#6  0x00007ffff7b37ff1 in amd::hsa::loader::ExecutableImpl::LoadSegments(hsa_agent_s, amd::hsa::code::AmdHsaCode const*, unsigned int) () from /usr/lib64/libhsa-runtime64.so.1
#7  0x00007ffff7b37a7e in amd::hsa::loader::ExecutableImpl::LoadCodeObject(hsa_agent_s, hsa_code_object_s, unsigned long, char const*, hsa_loaded_code_object_s*) () from /usr/lib64/libhsa-runtime64.so.1
#8  0x00007ffff7b36dc9 in amd::hsa::loader::ExecutableImpl::LoadCodeObject(hsa_agent_s, hsa_code_object_s, char const*, hsa_loaded_code_object_s*) () from /usr/lib64/libhsa-runtime64.so.1
#9  0x00007ffff7af7022 in HSA::hsa_executable_load_agent_code_object(hsa_executable_s, hsa_agent_s, hsa_code_object_reader_s, char const*, hsa_loaded_code_object_s*) () from /usr/lib64/libhsa-runtime64.so.1
#10 0x00007ffff7b325b2 in hsa_executable_load_agent_code_object () from /usr/lib64/libhsa-runtime64.so.1
#11 0x00007ffff7ca62b7 in ?? () from /usr/lib64/libamdocl64.so
#12 0x00007ffff7c80fb6 in ?? () from /usr/lib64/libamdocl64.so
#13 0x00007ffff7c7d077 in ?? () from /usr/lib64/libamdocl64.so
#14 0x00007ffff7c8f36f in ?? () from /usr/lib64/libamdocl64.so
#15 0x00007ffff7c6a24a in ?? () from /usr/lib64/libamdocl64.so
#16 0x00007ffff7caca7e in ?? () from /usr/lib64/libamdocl64.so
#17 0x00007ffff7cadb99 in ?? () from /usr/lib64/libamdocl64.so
#18 0x00007ffff7c67841 in ?? () from /usr/lib64/libamdocl64.so
#19 0x00007ffff7c88cd6 in ?? () from /usr/lib64/libamdocl64.so
#20 0x00007ffff7cde0c5 in ?? () from /usr/lib64/libamdocl64.so
#21 0x00007ffff7d44e57 in __pthread_once_slow () from /lib64/libpthread.so.0
#22 0x00007ffff7cde1dc in clIcdGetPlatformIDsKHR () from /usr/lib64/libamdocl64.so
#23 0x00007ffff7f2e15d in ?? () from /usr/lib64/OpenCL/vendors/ocl-icd/libOpenCL.so.1
#24 0x00007ffff7d44e57 in __pthread_once_slow () from /lib64/libpthread.so.0
#25 0x00007ffff7f2d428 in ?? () from /usr/lib64/OpenCL/vendors/ocl-icd/libOpenCL.so.1
#26 0x00007ffff7f30234 in clGetPlatformIDs () from /usr/lib64/OpenCL/vendors/ocl-icd/libOpenCL.so.1
#27 0x000055555555a420 in main (argc=<optimized out>, argv=<optimized out>) at src/clinfo.c:3190

So, yeah, looks like this seg fault is fixed and my issue is in the kernel in amdgpu_ib, only get the above oops when trying to kill clinfo.

fxkamd commented 5 years ago

You need the patch I pointed to yesterday. It's not submitted to any branch yet. It only exists as an email code review. I'm attaching it for your convenience.

0001-drm-amdgpu-Fix-KFD-related-kernel-oops-on-Hawaii.patch.txt

Lucretia commented 5 years ago

You need the patch I pointed to yesterday. It's not submitted to any branch yet. It only exists as an email code review. I'm attaching it for your convenience.

0001-drm-amdgpu-Fix-KFD-related-kernel-oops-on-Hawaii.patch.txt

Ah, missed that. Thanks, I'll apply it now.

Lucretia commented 5 years ago

Ok, tested on 5.2.9 and 5.3.0-rc6, clinfo now just hangs until it is interrupted, then dumps:

[Sep 6 17:53] cp queue preemption time out
[  +0.000009] Resetting wave fronts (nocpsch) on dev 000000004cbf5f94
[  +0.000044] ------------[ cut here ]------------
[  +0.000002] FW bug: No PASID in KFD interrupt
[  +0.000197] WARNING: CPU: 7 PID: 0 at drivers/gpu/drm/amd/amdgpu/../amdkfd/cik_event_interrupt.c:70 cik_event_interrupt_isr+0xee/0x130 [amdgpu]
[  +0.000001] Modules linked in: ebtable_filter ebtables ip6table_filter rfcomm nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo nfsd br_netfilter cmac bnep snd_seq_dummy snd_seq_oss snd_pcm_oss snd_mixer_oss 8021q binfmt_misc ip6table_nat ip6_tables twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_common serpent_avx_x86_64 serpent_generic cast6_avx_x86_64 cast6_generic cast5_avx_x86_64 cast5_generic cast_common sha512_ssse3 sha256_ssse3 sha1_ssse3 hfsplus hfs cpufreq_userspace cpufreq_powersave cpufreq_conservative cpufreq_ondemand uinput tun bridge stp llc ipv6 crc_ccitt 9pnet_virtio 9p 9pnet kvm_amd vfio_pci vfio_virqfd kvm irqbypass crct10dif_pclmul crc32_pclmul snd_seq_midi snd_seq_midi_event snd_seq msr aesni_intel crypto_simd cryptd glue_helper cpuid pcspkr fam15h_power k10temp joydev input_leds hid_uclogic btusb btrtl btbcm btintel snd_hda_codec_realtek snd_usb_audio bluetooth snd_usbmidi_lib snd_rawmidi ecdh_generic snd_seq_device it87 ecc hwmon_vid hid_steam mc
[  +0.000037]  rfkill snd_hda_codec_generic r8152 snd_hda_intel uas usblp snd_hda_codec snd_hwdep amdgpu snd_hda_core i2c_piix4 snd_pcm amd_iommu_v2 snd_timer gpu_sched ttm snd acpi_cpufreq button virtio_pci virtio_scsi virtio_blk virtio_console virtio_balloon aes_x86_64 sha512_generic libiscsi scsi_transport_iscsi vxlan ip6_udp_tunnel udp_tunnel macvlan virtio_net net_failover failover virtio_ring virtio pcnet32 fuse nfs lockd grace sunrpc jfs multipath linear raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 dm_snapshot dm_bufio dm_crypt dm_mirror dm_region_hash dm_log hid_sony ff_memless hid_logitech hid_a4tech sl811_hcd uhci_hcd aic94xx libsas lpfc crc_t10dif qla2xxx megaraid_sas megaraid_mbox megaraid_mm megaraid aacraid sx8 3w_9xxx 3w_xxxx mptsas scsi_transport_sas mptfc scsi_transport_fc mptspi mptscsih mptbase atp870u dc395x qla1280 imm parport dmx3191d sym53c8xx gdth initio BusLogic arcmsr aic7xxx aic79xx scsi_transport_spi sg pdc_adma
[  +0.000052]  sata_inic162x sata_mv ata_piix sata_qstor sata_vsc sata_uli sata_sis sata_sx4 sata_nv sata_via sata_svw sata_sil24 sata_sil sata_promise pata_sl82c105 pata_via pata_jmicron pata_marvell pata_sis pata_netcell pata_pdc202xx_old pata_triflex pata_atiixp pata_opti pata_amd pata_ali pata_it8213 pata_pcmcia pcmcia pcmcia_core pata_ns87415 pata_ns87410 pata_serverworks pata_artop pata_it821x pata_optidma pata_hpt3x2n pata_hpt3x3 pata_hpt37x pata_hpt366 pata_cmd64x pata_efar pata_rz1000 pata_sil680 pata_radisys pata_pdc2027x pata_mpiix led_class dm_mod dax usb_storage mii crct10dif_common ahci r8169 libahci libphy libata
[  +0.000030] CPU: 7 PID: 0 Comm: swapper/7 Not tainted 5.3.0-rc6+ #2
[  +0.000002] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A99FX PRO R2.0, BIOS 2301 01/06/2014
[  +0.000036] RIP: 0010:cik_event_interrupt_isr+0xee/0x130 [amdgpu]
[  +0.000002] Code: 94 00 00 00 72 5a c1 e9 10 75 28 44 8a 25 dd 7a 21 00 45 84 e4 75 17 48 c7 c7 44 16 de c0 c6 05 ca 7a 21 00 01 e8 a1 aa 3f d1 <0f> 0b eb 32 45 31 e4 eb 2d 8d 88 4b ff ff ff b0 01 83 f9 3a 77 13
[  +0.000002] RSP: 0018:ffffad9e001f4dd8 EFLAGS: 00010086
[  +0.000002] RAX: 0000000000000000 RBX: ffff898331dc5400 RCX: 0000000000000000
[  +0.000001] RDX: 0000000000010003 RSI: ffffffff9391f2e1 RDI: 00000000ffffffff
[  +0.000001] RBP: ffffad9e001f4df8 R08: 0000006ae466592e R09: 0000000000000021
[  +0.000002] R10: 0000000000000000 R11: 0000000000000044 R12: 0000000000000000
[  +0.000001] R13: ffff898331dc55e0 R14: ffffad9e001f4e10 R15: 0000000000000000
[  +0.000002] FS:  0000000000000000(0000) GS:ffff89843ebc0000(0000) knlGS:0000000000000000
[  +0.000002] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000001] CR2: 00007fd6b4ecf000 CR3: 00000005de0a8000 CR4: 00000000000406e0
[  +0.000002] Call Trace:
[  +0.000003]  <IRQ>
[  +0.000035]  kgd2kfd_interrupt+0x9b/0xfc [amdgpu]
[  +0.000034]  amdgpu_irq_dispatch+0x17b/0x199 [amdgpu]
[  +0.000033]  amdgpu_ih_process+0x85/0xe0 [amdgpu]
[  +0.000034]  amdgpu_irq_handler+0x16/0x34 [amdgpu]
[  +0.000004]  __handle_irq_event_percpu+0x90/0x17e
[  +0.000002]  handle_irq_event_percpu+0x2c/0x6f
[  +0.000003]  handle_irq_event+0x2f/0x4c
[  +0.000003]  handle_edge_irq+0x105/0x122
[  +0.000003]  handle_irq+0x19/0x1c
[  +0.000004]  do_IRQ+0x62/0x108
[  +0.000002]  common_interrupt+0xf/0xf
[  +0.000001]  </IRQ>
[  +0.000004] RIP: 0010:cpuidle_enter_state+0x227/0x333
[  +0.000002] Code: 6d 75 05 e8 bd 53 8e ff 31 ff e8 df 83 95 ff 45 84 ed 74 12 9c 58 0f ba e0 09 73 03 0f 0b fa 31 ff e8 bf b4 99 ff fb 45 85 e4 <0f> 88 ed 00 00 00 49 63 cc be ff ff ff 7f 48 ba ff ff ff ff f3 01
[  +0.000001] RSP: 0018:ffffad9e000b7e78 EFLAGS: 00000202 ORIG_RAX: ffffffffffffffd9
[  +0.000003] RAX: ffff89843ebc0000 RBX: ffff898335f07400 RCX: 000000000000001f
[  +0.000001] RDX: 0000000000000000 RSI: 000000001fe466fc RDI: 0000000000000000
[  +0.000001] RBP: ffffffff934c15a0 R08: 0000006ae4660d5d R09: 000000007fffffff
[  +0.000002] R10: 071c71c71c71c71c R11: 0000000000000020 R12: 0000000000000002
[  +0.000001] R13: 0000000000000000 R14: ffffffff934c1678 R15: 0000000000000000
[  +0.000003]  ? cpuidle_enter_state+0x20c/0x333
[  +0.000002]  ? menu_select+0x423/0x4aa
[  +0.000002]  cpuidle_enter+0x25/0x31
[  +0.000002]  do_idle+0x17a/0x1ed
[  +0.000002]  cpu_startup_entry+0x18/0x1a
[  +0.000008]  start_secondary+0x151/0x16c
[  +0.000002]  secondary_startup_64+0xa4/0xb0
[  +0.000003] ---[ end trace 6ec999382ed68a9b ]---
PhillCli commented 5 years ago

@Lucretia, expanded testing is something we are pushing for as well.

PR #71 holds the patch to resolve the segfaults.

after re-compiling ROCR-Runtime with mentioned PR clinfo completes with

clinfo
LoadLib(libhsa-ext-finalize64.so.1) failed: libhsa-ext-finalize64.so.1: cannot open shared object file: No such file or directory
Number of platforms                               1
  Platform Name                                   AMD Accelerated Parallel Processing
  Platform Vendor                                 Advanced Micro Devices, Inc.
  Platform Version                                OpenCL 2.1 AMD-APP (2949.0)
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd cl_amd_event_callback cl_amd_offline_devices 
  Platform Host timer resolution                  1ns
  Platform Extensions function suffix             AMD

  Platform Name                                   AMD Accelerated Parallel Processing
Number of devices                                 1
  Device Name                                     gfx902+xnack
  Device Vendor                                   Advanced Micro Devices, Inc.
  Device Vendor ID                                0x1002
  Device Version                                  OpenCL 2.0 
  Driver Version                                  2949.0 (HSA1.1,LC)
  Device OpenCL C Version                         OpenCL C 2.0 
  Device Type                                     GPU
  Device Board Name (AMD)                         AMD Ryzen 5 3500U with Radeon Vega Mobile Gfx
  Device Topology (AMD)                           PCI-E, 05:00.0
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               11
  SIMD per compute unit (AMD)                     4
  SIMD width (AMD)                                16
  SIMD instruction width (AMD)                    1
  Max clock frequency                             1200MHz
  Graphics IP (AMD)                               9.2
  Device Partition                                (core)
    Max number of sub-devices                     11
    Supported partition types                     None
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x1024
  Max work group size                             256
  Preferred work group size (AMD)                 256
  Max work group size (AMD)                       1024
  Preferred work group size multiple              64
  Wavefront width (AMD)                           64
  Preferred / native vector sizes                 
    char                                                 4 / 4       
    short                                                2 / 2       
    int                                                  1 / 1       
    long                                                 1 / 1       
    half                                                 1 / 1        (cl_khr_fp16)
    float                                                1 / 1       
    double                                               1 / 1        (cl_khr_fp64)
  Half-precision Floating-point support           (cl_khr_fp16)
    Denormals                                     No
    Infinity and NANs                             No
    Round to nearest                              No
    Round to zero                                 No
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    64, Little-Endian
  Global memory size                              15750221824 (14.67GiB)
  Global free memory (AMD)                        15381076 (14.67GiB)
  Global memory channels (AMD)                    2
  Global memory banks per channel (AMD)           4
  Global memory bank width (AMD)                  256 bytes
  Error Correction support                        No
  Max memory allocation                           13387688550 (12.47GiB)
  Unified memory for Host and Device              Yes
  Shared Virtual Memory (SVM) capabilities        (core)
    Coarse-grained buffer sharing                 Yes
    Fine-grained buffer sharing                   Yes
    Fine-grained system sharing                   Yes
    Atomics                                       No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       1024 bits (128 bytes)
  Preferred alignment for atomics                 
    SVM                                           0 bytes
    Global                                        0 bytes
    Local                                         0 bytes
  Max size for global variable                    13387688550 (12.47GiB)
  Preferred total size of global vars             15750221824 (14.67GiB)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        16384 (16KiB)
  Global Memory cache line size                   64 bytes
  Image support                                   Yes
    Max number of samplers per kernel             5592
    Max size for 1D images from buffer            65536 pixels
    Max 1D or 2D image array size                 2048 images
    Base address alignment for 2D image buffers   256 bytes
    Pitch alignment for 2D image buffers          256 pixels
    Max 2D image size                             16384x16384 pixels
    Max 3D image size                             2048x2048x2048 pixels
    Max number of read image args                 128
    Max number of write image args                8
    Max number of read/write image args           64
  Max number of pipe args                         16
  Max active pipe reservations                    16
  Max pipe packet size                            502786662 (479.5MiB)
  Local memory type                               Local
  Local memory size                               65536 (64KiB)
  Local memory syze per CU (AMD)                  65536 (64KiB)
  Local memory banks (AMD)                        32
  Max number of constant args                     8
  Max constant buffer size                        13387688550 (12.47GiB)
  Preferred constant buffer size (AMD)            16384 (16KiB)
  Max size of kernel argument                     1024
  Queue properties (on host)                      
    Out-of-order execution                        No
    Profiling                                     Yes
  Queue properties (on device)                    
    Out-of-order execution                        Yes
    Profiling                                     Yes
    Preferred size                                262144 (256KiB)
    Max size                                      8388608 (8MiB)
  Max queues on device                            1
  Max events on device                            1024
  Prefer user sync for interop                    Yes
  Number of P2P devices (AMD)                     0
  P2P devices (AMD)                               
  Profiling timer resolution                      1ns
  Profiling timer offset since Epoch (AMD)        0ns (Thu Jan  1 01:00:00 1970)
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    Thread trace supported (AMD)                  No
    Number of async queues (AMD)                  8
    Max real-time compute queues (AMD)            8
    Max real-time compute units (AMD)             11
  printf() buffer size                            4194304 (4MiB)
  Built-in kernels                                
  Device Extensions                               cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program 

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   No platform
  clCreateContext(NULL, ...) [default]            No platform
  clCreateContext(NULL, ...) [other]              Success [AMD]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name                                 AMD Accelerated Parallel Processing
    Device Name                                   gfx902+xnack
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name                                 AMD Accelerated Parallel Processing
    Device Name                                   gfx902+xnack
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name                                 AMD Accelerated Parallel Processing
    Device Name                                   gfx902+xnack

but after running some OpenCL kernels I get this and then clinfo becomes broken

LoadLib(libhsa-ext-finalize64.so.1) failed: libhsa-ext-finalize64.so.1: cannot open shared object file: No such file or directory
HSA exception: Queue create failed at hsaKmtCreateQueue

running 5.3.0 kernel on ubuntu 18.04.3, dmesg says

[pon wrz 16 18:56:41 2019] clblast_client_[10393]: segfault at 0 ip 00007fe775532bb7 sp 00007ffd38ace180 error 4 in libamdocl64.so[7fe773924000+4543000]
[pon wrz 16 18:56:41 2019] Code: 01 c6 44 8b 0e 44 39 ca 0f 85 4a 02 00 00 48 c1 e1 05 4c 01 c1 48 39 ce 0f 84 31 02 00 00 48 8b 46 08 4c 89 f7 be 71 10 00 00 <8b> 18 48 63 68 04 e8 2e 1c 04 00 89 c1 48 8b 7c 24 38 48 8d 94 24
[pon wrz 16 18:56:41 2019] Evicting PASID 32775 queues
[pon wrz 16 18:56:47 2019] ------------[ cut here ]------------
[pon wrz 16 18:56:47 2019] Runlist IB overflow
[pon wrz 16 18:56:47 2019] WARNING: CPU: 7 PID: 10430 at drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_packet_manager.c:35 pm_create_runlist_ib+0x46f/0x500 [amdgpu]
[pon wrz 16 18:56:47 2019] Modules linked in: btrfs xor zstd_compress raid6_pq ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs veth xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat xt_addrtype iptable_filter bpfilter xt_conntrack nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c br_netfilter bridge stp llc ccm cmac rfcomm overlay bnep binfmt_misc nls_iso8859_1 edac_mce_amd kvm_amd ccp kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel joydev input_leds aes_x86_64 crypto_simd snd_hda_codec_conexant cryptd amdgpu glue_helper serio_raw snd_hda_codec_generic uvcvideo amd_iommu_v2 snd_hda_codec_hdmi snd_seq_midi rtwpci btusb gpu_sched snd_seq_midi_event rtw88 videobuf2_vmalloc btrtl snd_rawmidi ttm videobuf2_memops btbcm drm_kms_helper videobuf2_v4l2 btintel thinkpad_acpi mac80211 snd_hda_intel videobuf2_common snd_hda_codec nvram wmi_bmof k10temp ledtrig_audio snd_hda_core bluetooth snd_seq videodev snd_hwdep drm snd_pcm mc cfg80211 ecdh_generic
[pon wrz 16 18:56:47 2019]  ecc snd_pci_acp3x i2c_algo_bit fb_sys_fops rtsx_pci_ms snd_seq_device syscopyarea sysfillrect ucsi_acpi sysimgblt snd_timer libarc4 memstick typec_ucsi snd typec soundcore mac_hid sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid rtsx_pci_sdmmc psmouse nvme ahci i2c_piix4 libahci r8169 rtsx_pci realtek nvme_core wmi video i2c_scmi
[pon wrz 16 18:56:47 2019] CPU: 7 PID: 10430 Comm: clblast_client_ Not tainted 5.3.0-050300-generic #201909152230
[pon wrz 16 18:56:47 2019] Hardware name: LENOVO 20NE000JPB/20NE000JPB, BIOS R11ET25W (1.05 ) 04/04/2019
[pon wrz 16 18:56:47 2019] RIP: 0010:pm_create_runlist_ib+0x46f/0x500 [amdgpu]
[pon wrz 16 18:56:47 2019] Code: 1f fc ff ff 49 8b 45 48 44 0f af 70 44 44 0f af 78 38 43 8d 04 3e 89 45 bc e9 27 fc ff ff 48 c7 c7 ce 4d c3 c0 e8 8c a2 68 fb <0f> 0b e9 17 fd ff ff 41 8b 55 00 48 c7 c6 e2 4d c3 c0 48 c7 c7 c0
[pon wrz 16 18:56:47 2019] RSP: 0018:ffffbb758834faf0 EFLAGS: 00010286
[pon wrz 16 18:56:47 2019] RAX: 0000000000000000 RBX: 0000000000000010 RCX: 0000000000000000
[pon wrz 16 18:56:47 2019] RDX: 0000000000000003 RSI: ffffffffbdb80f73 RDI: 0000000000000246
[pon wrz 16 18:56:47 2019] RBP: ffffbb758834fb58 R08: ffffffffbdb80f60 R09: 0000000000000013
[pon wrz 16 18:56:47 2019] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8fa4c609e420
[pon wrz 16 18:56:47 2019] R13: ffff8fa4d7c430d0 R14: 0000000000000010 R15: ffff8fa4d7c430e0
[pon wrz 16 18:56:47 2019] FS:  00007f96b1966340(0000) GS:ffff8fa4e09c0000(0000) knlGS:0000000000000000
[pon wrz 16 18:56:47 2019] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[pon wrz 16 18:56:47 2019] CR2: 00007f96b1979038 CR3: 000000079b294000 CR4: 00000000003406e0
[pon wrz 16 18:56:47 2019] Call Trace:
[pon wrz 16 18:56:47 2019]  ? ww_mutex_unlock+0x26/0x30
[pon wrz 16 18:56:47 2019]  pm_send_runlist+0x32/0x120 [amdgpu]
[pon wrz 16 18:56:47 2019]  map_queues_cpsch+0x42/0x80 [amdgpu]
[pon wrz 16 18:56:47 2019]  execute_queues_cpsch.constprop.0+0x3a/0x50 [amdgpu]
[pon wrz 16 18:56:47 2019]  create_queue_cpsch+0x336/0x340 [amdgpu]
[pon wrz 16 18:56:47 2019]  pqm_create_queue+0x181/0x500 [amdgpu]
[pon wrz 16 18:56:47 2019]  kfd_ioctl_create_queue+0xc2/0x2b0 [amdgpu]
[pon wrz 16 18:56:47 2019]  kfd_ioctl+0x10e/0x410 [amdgpu]
[pon wrz 16 18:56:47 2019]  ? kfd_ioctl_dbg_address_watch+0x190/0x190 [amdgpu]
[pon wrz 16 18:56:47 2019]  do_vfs_ioctl+0x407/0x670
[pon wrz 16 18:56:47 2019]  ksys_ioctl+0x67/0x90
[pon wrz 16 18:56:47 2019]  __x64_sys_ioctl+0x1a/0x20
[pon wrz 16 18:56:47 2019]  do_syscall_64+0x5a/0x130
[pon wrz 16 18:56:47 2019]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[pon wrz 16 18:56:47 2019] RIP: 0033:0x7f96b03235d7
[pon wrz 16 18:56:47 2019] Code: b3 66 90 48 8b 05 b1 48 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 48 2d 00 f7 d8 64 89 01 48
[pon wrz 16 18:56:47 2019] RSP: 002b:00007ffde4191708 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[pon wrz 16 18:56:47 2019] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f96b03235d7
[pon wrz 16 18:56:47 2019] RDX: 00007ffde4191780 RSI: 00000000c0584b02 RDI: 0000000000000005
[pon wrz 16 18:56:47 2019] RBP: 00007ffde4191780 R08: 00007f96b1923000 R09: 0000000000040000
[pon wrz 16 18:56:47 2019] R10: 0000000000000022 R11: 0000000000000246 R12: 00000000c0584b02
[pon wrz 16 18:56:47 2019] R13: 0000000000000005 R14: 00007f96b1979000 R15: 0000000000000064
[pon wrz 16 18:56:47 2019] ---[ end trace 455479d1afd8a161 ]---
[pon wrz 16 18:56:47 2019] ------------[ cut here ]------------
[pon wrz 16 18:56:47 2019] Runlist IB overflow
[pon wrz 16 18:56:47 2019] WARNING: CPU: 7 PID: 10430 at drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_packet_manager.c:35 pm_create_runlist_ib+0x3b9/0x500 [amdgpu]
[pon wrz 16 18:56:47 2019] Modules linked in: btrfs xor zstd_compress raid6_pq ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs veth xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat xt_addrtype iptable_filter bpfilter xt_conntrack nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c br_netfilter bridge stp llc ccm cmac rfcomm overlay bnep binfmt_misc nls_iso8859_1 edac_mce_amd kvm_amd ccp kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel joydev input_leds aes_x86_64 crypto_simd snd_hda_codec_conexant cryptd amdgpu glue_helper serio_raw snd_hda_codec_generic uvcvideo amd_iommu_v2 snd_hda_codec_hdmi snd_seq_midi rtwpci btusb gpu_sched snd_seq_midi_event rtw88 videobuf2_vmalloc btrtl snd_rawmidi ttm videobuf2_memops btbcm drm_kms_helper videobuf2_v4l2 btintel thinkpad_acpi mac80211 snd_hda_intel videobuf2_common snd_hda_codec nvram wmi_bmof k10temp ledtrig_audio snd_hda_core bluetooth snd_seq videodev snd_hwdep drm snd_pcm mc cfg80211 ecdh_generic
[pon wrz 16 18:56:47 2019]  ecc snd_pci_acp3x i2c_algo_bit fb_sys_fops rtsx_pci_ms snd_seq_device syscopyarea sysfillrect ucsi_acpi sysimgblt snd_timer libarc4 memstick typec_ucsi snd typec soundcore mac_hid sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid rtsx_pci_sdmmc psmouse nvme ahci i2c_piix4 libahci r8169 rtsx_pci realtek nvme_core wmi video i2c_scmi
[pon wrz 16 18:56:47 2019] CPU: 7 PID: 10430 Comm: clblast_client_ Tainted: G        W         5.3.0-050300-generic #201909152230
[pon wrz 16 18:56:47 2019] Hardware name: LENOVO 20NE000JPB/20NE000JPB, BIOS R11ET25W (1.05 ) 04/04/2019
[pon wrz 16 18:56:47 2019] RIP: 0010:pm_create_runlist_ib+0x3b9/0x500 [amdgpu]
[pon wrz 16 18:56:47 2019] Code: 8b 47 48 8b 40 44 c1 e8 02 44 8d 34 03 4c 89 f3 4a 8d 14 b5 00 00 00 00 48 39 55 c8 73 a2 48 c7 c7 ce 4d c3 c0 e8 42 a3 68 fb <0f> 0b eb 92 41 0f b6 4c 24 38 41 8b 95 c8 00 00 00 48 c7 c6 78 74
[pon wrz 16 18:56:47 2019] RSP: 0018:ffffbb758834faf0 EFLAGS: 00010286
[pon wrz 16 18:56:47 2019] RAX: 0000000000000000 RBX: 0000000000000017 RCX: 0000000000000000
[pon wrz 16 18:56:47 2019] RDX: 0000000000000003 RSI: ffffffffbdb80f73 RDI: 0000000000000246
[pon wrz 16 18:56:47 2019] RBP: ffffbb758834fb58 R08: ffffffffbdb80f60 R09: 0000000000000013
[pon wrz 16 18:56:47 2019] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8fa4c609e420
[pon wrz 16 18:56:47 2019] R13: ffff8fa3b1181d00 R14: 0000000000000017 R15: ffff8fa4d7c430d0
[pon wrz 16 18:56:47 2019] FS:  00007f96b1966340(0000) GS:ffff8fa4e09c0000(0000) knlGS:0000000000000000
[pon wrz 16 18:56:47 2019] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[pon wrz 16 18:56:47 2019] CR2: 00007f96b1979038 CR3: 000000079b294000 CR4: 00000000003406e0
[pon wrz 16 18:56:47 2019] Call Trace:
[pon wrz 16 18:56:47 2019]  pm_send_runlist+0x32/0x120 [amdgpu]
[pon wrz 16 18:56:47 2019]  map_queues_cpsch+0x42/0x80 [amdgpu]
[pon wrz 16 18:56:47 2019]  execute_queues_cpsch.constprop.0+0x3a/0x50 [amdgpu]
[pon wrz 16 18:56:47 2019]  create_queue_cpsch+0x336/0x340 [amdgpu]
[pon wrz 16 18:56:47 2019]  pqm_create_queue+0x181/0x500 [amdgpu]
[pon wrz 16 18:56:47 2019]  kfd_ioctl_create_queue+0xc2/0x2b0 [amdgpu]
[pon wrz 16 18:56:47 2019]  kfd_ioctl+0x10e/0x410 [amdgpu]
[pon wrz 16 18:56:47 2019]  ? kfd_ioctl_dbg_address_watch+0x190/0x190 [amdgpu]
[pon wrz 16 18:56:47 2019]  do_vfs_ioctl+0x407/0x670
[pon wrz 16 18:56:47 2019]  ksys_ioctl+0x67/0x90
[pon wrz 16 18:56:47 2019]  __x64_sys_ioctl+0x1a/0x20
[pon wrz 16 18:56:47 2019]  do_syscall_64+0x5a/0x130
[pon wrz 16 18:56:47 2019]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[pon wrz 16 18:56:47 2019] RIP: 0033:0x7f96b03235d7
[pon wrz 16 18:56:47 2019] Code: b3 66 90 48 8b 05 b1 48 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 48 2d 00 f7 d8 64 89 01 48
[pon wrz 16 18:56:47 2019] RSP: 002b:00007ffde4191708 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[pon wrz 16 18:56:47 2019] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f96b03235d7
[pon wrz 16 18:56:47 2019] RDX: 00007ffde4191780 RSI: 00000000c0584b02 RDI: 0000000000000005
[pon wrz 16 18:56:47 2019] RBP: 00007ffde4191780 R08: 00007f96b1923000 R09: 0000000000040000
[pon wrz 16 18:56:47 2019] R10: 0000000000000022 R11: 0000000000000246 R12: 00000000c0584b02
[pon wrz 16 18:56:47 2019] R13: 0000000000000005 R14: 00007f96b1979000 R15: 0000000000000064
[pon wrz 16 18:56:47 2019] ---[ end trace 455479d1afd8a162 ]---
[pon wrz 16 18:56:47 2019] Runlist is getting oversubscribed. Expect reduced ROCm performance.
[pon wrz 16 19:01:11 2019] qcm fence wait loop timeout expired
[pon wrz 16 19:01:11 2019] The cp might be in an unrecoverable state due to an unsuccessful queues preemption
[pon wrz 16 19:01:11 2019] [drm] GPU recovery disabled.
[pon wrz 16 19:01:20 2019] Pasid 32775 DQM create queue 0 failed. ret -5
fxkamd commented 5 years ago

@PhillCli it looks like you're still getting a segfault in user mode.

And it looks like there is a problem with runlist handling resulting the Hardware scheduler hanging. Maybe a race condition while dealing with the segfault from user mode. The messages about oversubscription and the fact that the problem occurs after running a few tests in a row indicates that there maybe come processes hanging around in the background and leaking queue resources. That doesn't excuse the kernel warnings but points at some other problems that should be investigated.

You can see KFDs view of allocated queues and processes in /sys/kernel/debug/kfd/mqds. It should be empty after bootup and after finishing a test with no more compute processes running.

Lucretia commented 5 years ago

@fxkamd any idea what's going on with my setup?

fxkamd commented 5 years ago

@Lucretia no, I'm not sure what's happening. I also see a CP hang at the start of your log snippet but not enough info to say why that's happening.

The no-PASID error is weird. The driver has a workaround for handling that for VM faults. Either the workaround is broken or there is a different type of interrupt also missing a PASID. Would need to add more instrumentation in cik_event_interrupt_isr to find out more.

Lucretia commented 5 years ago

@fxkamd if you can post a patch with what you need, I can run it and post the results. I'm on 5.2.11-gentoo kernel, can try latest rc and 5.2.9-gentoo.

Lucretia commented 5 years ago

@fxkamd ok, according to the source (cik_event_interrupt_isr):

/* If there is no valid PASID, it's likely a firmware bug */
        pasid = (ihre->ring_id & 0xffff0000) >> 16;
        if (WARN_ONCE(pasid == 0, "FW bug: No PASID in KFD interrupt"))
                return 0;

I just grabbed the latest gentoo firmware and nothing seems to have changed on the amdgpu side of things and they come from git.

fxkamd commented 5 years ago

@Lucretia I'd go after the CP hang first because the interrupt problem may be a consequence and because fixing the interrupt handling would still leave you with a hanging GPU. For that I don't need instrumentation but probably more of the log before what you posted and if that doesn't give me any more clues, steps to reproduce.

Also, can you post your GPU firmware versions? There should be a firmware version file in /sys/kernel/debug/dri/?/

Lucretia commented 5 years ago

What is the "CP?" Co-processor? Command Processor?

# cat /sys/kernel/debug/dri/0/amdgpu_firmware_info 
VCE feature version: 0, firmware version: 0x320a0200
UVD feature version: 0, firmware version: 0x01400900
MC feature version: 0, firmware version: 0x00c79110
ME feature version: 29, firmware version: 0x000000bb
PFP feature version: 29, firmware version: 0x000000e5
CE feature version: 29, firmware version: 0x0000007a
RLC feature version: 1, firmware version: 0x00000011
RLC SRLC feature version: 0, firmware version: 0x00000000
RLC SRLG feature version: 0, firmware version: 0x00000000
RLC SRLS feature version: 0, firmware version: 0x00000000
MEC feature version: 29, firmware version: 0x000001a5
SOS feature version: 0, firmware version: 0x00000000
ASD feature version: 0, firmware version: 0x00000000
TA XGMI feature version: 0, firmware version: 0x00000000
TA RAS feature version: 0, firmware version: 0x00000000
SMC feature version: 0, firmware version: 0x000f0600
SDMA0 feature version: 9, firmware version: 0x0000004c
SDMA1 feature version: 0, firmware version: 0x0000004c
VCN feature version: 0, firmware version: 0x00000000
DMCU feature version: 0, firmware version: 0x00000000
VBIOS version: 113-2E32430-X4J

# cat /sys/kernel/debug/dri/1/amdgpu_firmware_info 
VCE feature version: 0, firmware version: 0x351a0300
UVD feature version: 0, firmware version: 0x01821000
MC feature version: 0, firmware version: 0x03b4dc40
ME feature version: 49, firmware version: 0x000000a7
PFP feature version: 49, firmware version: 0x000000fe
CE feature version: 49, firmware version: 0x0000008c
RLC feature version: 1, firmware version: 0x0000011e
RLC SRLC feature version: 0, firmware version: 0x00000000
RLC SRLG feature version: 0, firmware version: 0x00000000
RLC SRLS feature version: 0, firmware version: 0x00000000
MEC feature version: 49, firmware version: 0x000002da
MEC2 feature version: 49, firmware version: 0x000002da
SOS feature version: 0, firmware version: 0x00000000
ASD feature version: 0, firmware version: 0x00000000
TA XGMI feature version: 0, firmware version: 0x00000000
TA RAS feature version: 0, firmware version: 0x00000000
SMC feature version: 0, firmware version: 0x00171100
SDMA0 feature version: 31, firmware version: 0x0000003a
SDMA1 feature version: 0, firmware version: 0x0000003a
VCN feature version: 0, firmware version: 0x00000000
DMCU feature version: 0, firmware version: 0x00000000
VBIOS version: 113-1E3870U-O49

# lspci

01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Hawaii PRO [Radeon R9 290/390] (rev 80)
01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Hawaii HDMI Audio [Radeon R9 290/290X / 390/390X]
03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev e7)
03:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere HDMI Audio [Radeon RX 470/480 / 570/580/590]

The RX580 is for VFIO GPU passthrough, it's not used on the Linux side, and wouldn't work anyway as there is no PCIe atomics on this machine.

fxkamd commented 5 years ago

The driver is showing firmware for both GPUs. So both GPUs have been initialized by the driver. That may problematic for pass-through. Not sure if there is a good way to hide one of the GPUs from the kernel driver to leave it pristine for pass-through to another OS.

The firmware versions I'm looking for are MEC (micro engine compute) and MEC2. That's the compute part of the CP (command processor). I'll compare that with my test system later.

Lucretia commented 5 years ago

I can turn off the RX 580 with a script, it just boots initialised.

I've just diffed the versions I have from git with the versions from the latest amdgpu-pro drivers and they appear to be the same.

fxkamd commented 5 years ago

@Lucretia I think the MEC firmware version on your system doesn't support the AQL packet format. You have MEC FW version 0x1a5, I have 0x1a7. I attached the version that's included in ROCm-releases. You'll need to gunzip it, replace /lib/firmware/amdgpu/hawaii_mec.bin, update your initrd and reboot.

I'm trying to remember why this updated firmware never got published to linux-firmware or included in our amdgpu-pro releasese. There may have been some graphics regression ...

hawaii_mec.bin.gz

Lucretia commented 5 years ago

@fxkamd Ok, just tried it, success:

$ clinfo 
Number of platforms                               1
  Platform Name                                   AMD Accelerated Parallel Processing
  Platform Vendor                                 Advanced Micro Devices, Inc.
  Platform Version                                OpenCL 2.0 AMD-APP.internal (2949.0)
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd cl_amd_object_metadata cl_amd_event_callback 
  Platform Max metadata object keys (AMD)         8
  Platform Extensions function suffix             AMD

  Platform Name                                   AMD Accelerated Parallel Processing
Number of devices                                 1
  Device Name                                     gfx701
  Device Vendor                                   Advanced Micro Devices, Inc.
  Device Vendor ID                                0x1002
  Device Version                                  OpenCL 1.2 
  Driver Version                                  2949.0 (HSA1.1,LC)
  Device OpenCL C Version                         OpenCL C 2.0 
  Device Type                                     GPU
  Device Board Name (AMD)                         Hawaii PRO [Radeon R9 290/390]
  Device Topology (AMD)                           PCI-E, 01:00.0
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               40
  SIMD per compute unit (AMD)                     4
  SIMD width (AMD)                                16
  SIMD instruction width (AMD)                    1
  Max clock frequency                             1040MHz
  Graphics IP (AMD)                               7.1
  Device Partition                                (core)
    Max number of sub-devices                     40
    Supported partition types                     None
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x1024
  Max work group size                             256
  Preferred work group size (AMD)                 256
  Max work group size (AMD)                       1024
  Preferred work group size multiple              64
  Wavefront width (AMD)                           64
  Preferred / native vector sizes                 
    char                                                 4 / 4       
    short                                                2 / 2       
    int                                                  1 / 1       
    long                                                 1 / 1       
    half                                                 1 / 1        (cl_khr_fp16)
    float                                                1 / 1       
    double                                               1 / 1        (cl_khr_fp64)
  Half-precision Floating-point support           (cl_khr_fp16)
    Denormals                                     No
    Infinity and NANs                             No
    Round to nearest                              No
    Round to zero                                 No
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
  Single-precision Floating-point support         (core)
    Denormals                                     No
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    64, Little-Endian
  Global memory size                              8589934592 (8GiB)
  Global free memory (AMD)                        8386560 (7.998GiB)
  Global memory channels (AMD)                    16
  Global memory banks per channel (AMD)           4
  Global memory bank width (AMD)                  256 bytes
  Error Correction support                        No
  Max memory allocation                           7301444403 (6.8GiB)
  Unified memory for Host and Device              No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       1024 bits (128 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        16384 (16KiB)
  Global Memory cache line size                   64 bytes
  Image support                                   No
    Base address alignment for 2D image buffers   0 bytes
    Pitch alignment for 2D image buffers          0 pixels
  Local memory type                               Local
  Local memory size                               65536 (64KiB)
  Local memory syze per CU (AMD)                  65536 (64KiB)
  Local memory banks (AMD)                        32
  Max number of constant args                     8
  Max constant buffer size                        7301444403 (6.8GiB)
  Preferred constant buffer size (AMD)            16384 (16KiB)
  Max size of kernel argument                     1024
  Queue properties                                
    Out-of-order execution                        No
    Profiling                                     Yes
  Prefer user sync for interop                    Yes
  Number of P2P devices (AMD)                     0
  P2P devices (AMD)                               (n/a)
  Profiling timer resolution                      1ns
  Profiling timer offset since Epoch (AMD)        0ns (Thu Jan  1 01:00:00 1970)
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    Thread trace supported (AMD)                  No
    Number of async queues (AMD)                  8
    Max real-time compute queues (AMD)            8
    Max real-time compute units (AMD)             40
  printf() buffer size                            4194304 (4MiB)
  Built-in kernels                                (n/a)
  Device Extensions                               cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program 

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  AMD Accelerated Parallel Processing
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [AMD]
  clCreateContext(NULL, ...) [default]            Success [AMD]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name                                 AMD Accelerated Parallel Processing
    Device Name                                   gfx701
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name                                 AMD Accelerated Parallel Processing
    Device Name                                   gfx701
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name                                 AMD Accelerated Parallel Processing
    Device Name                                   gfx701

ICD loader properties
  ICD loader Name                                 OpenCL ICD Loader
  ICD loader Vendor                               OCL Icd free software
  ICD loader Version                              2.2.12
  ICD loader Profile                              OpenCL 2.2
skeelyamd commented 5 years ago

I believe the initial issue has been addressed now and it seems that the kernel issues have also been addressed. PR 71 will be merged as soon as I know it will be in the next binary release. I'm closing the issue, please let me know if it is still needed.

drwetter commented 5 years ago

Hi @skeelyamd ,

not sure whether you meant the recent update but just wanted to let you know there's still the same issue:

prompt:~ 0# gdb clinfo
GNU gdb (GDB; openSUSE Tumbleweed) 8.3
[..]
(gdb) r
[..]
[New Thread 0x7ffee9cc7700 (LWP 14519)]

Thread 1 "clinfo" received signal SIGSEGV, Segmentation fault.
0x00007ffff7256605 in std::_Function_handler<core::Queue* (), amd::GpuAgent::InitDma()::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
   from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
prompt:~ 0#  rpm -qf /opt/rocm/hsa/lib/libhsa-runtime64.so.1
hsa-rocr-dev-1.1.9_112_g3d9d98f5-1.x86_64
prompt:~ 0# uname -a
Linux REDACTED 5.2.14-1-default #1 SMP Tue Sep 10 10:52:01 UTC 2019 (374b0ae) x86_64 x86_64 x86_64 GNU/Linux
prompt:~ 0# 
skeelyamd commented 5 years ago

The fix will be in the next binary release, ROCm 2.9. It's also been merged to the master branch so if you build from there you should see the fix.

drwetter commented 5 years ago

ok, thx

Lucretia commented 5 years ago

The fix will be in the next binary release, ROCm 2.9. It's also been merged to the master branch so if you build from there you should see the fix.

Merged into which master branch? ROCR_Runtime? There's been nothing merged into the github tree since 27/09/2019

skeelyamd commented 5 years ago

ROCr_Runtime, yes. The merge is https://github.com/RadeonOpenCompute/ROCR-Runtime/commit/f446e05ed4c86c57d43d5cf8cb3a07bf8b71d8d0

drwetter commented 4 years ago

I just updated my binaries/libraries to the ones from Nov 20. It worked before. Unfortunately now I have the same problem.

ion:~ 0# gdb clinfo 
GNU gdb (GDB; openSUSE Tumbleweed) 8.3.1
[..]
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from clinfo...
Reading symbols from /usr/lib/debug/usr/bin/clinfo-2.2.18.04.06-1.6.x86_64.debug...
(gdb) r
Starting program: /usr/bin/clinfo 
Missing separate debuginfos, use: zypper install glibc-debuginfo-2.30-1.2.x86_64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff71ae700 (LWP 16778)]

Thread 1 "clinfo" received signal SIGSEGV, Segmentation fault.
0x00007fffefd0e068 in ?? () from /opt/rocm/hsa/lib/libhsa-ext-image64.so.1
(gdb) 

Also other user programs just segfault.

prompt:~ 0# rpm -qf /opt/rocm/hsa/lib/libhsa-ext-image64.so.1 
hsa-ext-rocr-dev-1.1.9_139_g0d1ca36b-1.x86_64
prompt:~ 0# 
PhillCli commented 4 years ago

I just updated my binaries/libraries to the ones from Nov 20. It worked before. Unfortunately now I have the same problem.

ion:~ 0# gdb clinfo 
GNU gdb (GDB; openSUSE Tumbleweed) 8.3.1
[..]
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from clinfo...
Reading symbols from /usr/lib/debug/usr/bin/clinfo-2.2.18.04.06-1.6.x86_64.debug...
(gdb) r
Starting program: /usr/bin/clinfo 
Missing separate debuginfos, use: zypper install glibc-debuginfo-2.30-1.2.x86_64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff71ae700 (LWP 16778)]

Thread 1 "clinfo" received signal SIGSEGV, Segmentation fault.
0x00007fffefd0e068 in ?? () from /opt/rocm/hsa/lib/libhsa-ext-image64.so.1
(gdb) 

Also other user programs just segfault.

prompt:~ 0# rpm -qf /opt/rocm/hsa/lib/libhsa-ext-image64.so.1 
hsa-ext-rocr-dev-1.1.9_139_g0d1ca36b-1.x86_64
prompt:~ 0# 

can confirm I experience the same issue with 2.10 version

drwetter commented 4 years ago

Can you guys please keep previous versions of the RPM @ http://repo.radeon.com/rocm/yum/rpm/ ??

And a bit of testing for other environments would be appreciated.

I am using the software, that was one of the reasons I bought an AMD machine. And it's the second time I cannot use the software, don't know for how long?

pqyptixa commented 4 years ago

Any news on this? Guess https://github.com/RadeonOpenCompute/ROCm/issues/962 could be related, too.