ROCm / ROCm-CompilerSupport

The compiler support repository provides various Lightning Compiler related services.
45 stars 31 forks source link

libamd_comgr.so.2.4 segfault with amd_comgr_do_action (training a model stops at epoch 1) #47

Closed 7910f6ba7ee4 closed 7 months ago

7910f6ba7ee4 commented 2 years ago

Hello, I've just recently installed ROCm 5.2.0 on arch with the rocm-arch repository. Everything has worked up to this point (no initial errors, tensorflow works, clinfo, rocm-smi, and rocminfo produce outputs).

When trying to train a network, python stops at epoch 1 for a few minutes before ending with: Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

The specific output when running the program (before segfault):

2022-07-09 15:51:14.487552: I tensorflow/stream_executor/rocm/rocm_gpu_executor.cc:838] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-09 15:51:14.536717: I tensorflow/stream_executor/rocm/rocm_gpu_executor.cc:838] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-09 15:51:14.536765: I tensorflow/stream_executor/rocm/rocm_gpu_executor.cc:838] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-09 15:51:14.536968: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-07-09 15:51:14.538588: I tensorflow/stream_executor/rocm/rocm_gpu_executor.cc:838] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-09 15:51:14.538746: I tensorflow/stream_executor/rocm/rocm_gpu_executor.cc:838] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-09 15:51:14.538794: I tensorflow/stream_executor/rocm/rocm_gpu_executor.cc:838] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-09 15:51:14.538883: I tensorflow/stream_executor/rocm/rocm_gpu_executor.cc:838] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-09 15:51:14.538919: I tensorflow/stream_executor/rocm/rocm_gpu_executor.cc:838] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-09 15:51:14.538953: I tensorflow/stream_executor/rocm/rocm_gpu_executor.cc:838] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-09 15:51:14.538973: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 15868 MB memory:  -> device: 0, name: AMD Radeon RX 6950 XT, pci bus id: 0000:03:00.0
2022-07-09 15:51:14.918116: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-07-09 15:51:14.920464: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-07-09 15:51:14.922409: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
Epoch 1/20
2022-07-09 15:51:15.167784: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.

dmesg errors (each segfault is an attempt):

[   39.531376] python[3629]: segfault at 7f0c73ffe000 ip 00007f13ab7dd849 sp 00007f0cbaff88a0 error 4 in libamd_comgr.so.2.4[7f13a8bb2000+4d8a000]
[ 1405.871611] python[6057]: segfault at 7f787bff4000 ip 00007f7fd37dd84e sp 00007f7d2085e8a0 error 4 in libamd_comgr.so.2.4[7f7fd0bb2000+4d8a000]
[ 2553.650927] python[7475]: segfault at 7f5a9bff4000 ip 00007f61db7dd84e sp 00007f5acaff88a0 error 4 in libamd_comgr.so.2.4[7f61d8bb2000+4d8a000]
[ 4000.239291] python[8074]: segfault at 7fb64bff4000 ip 00007fbd8b7dd860 sp 00007fb67b7f98a0 error 4 in libamd_comgr.so.2.4[7fbd88bb2000+4d8a000]
[ 4063.327011] python[8284]: segfault at 7f3f3fff4000 ip 00007f46877dd84e sp 00007f43a0ff48a0 error 4 in libamd_comgr.so.2.4[7f4684bb2000+4d8a000]
[ 4068.349537] python[8426]: segfault at 7fabbbff4000 ip 00007fb1f37dd84e sp 00007fac237f98a0 error 4 in libamd_comgr.so.2.4[7fb1f0bb2000+4d8a000]
[ 4206.906834] python[8761]: segfault at 7f4f77ff4000 ip 00007f56af7dd860 sp 00007f4f9fffa8a0 error 4 in libamd_comgr.so.2.4[7f56acbb2000+4d8a000]
[ 4212.201227] python[8903]: segfault at 7f982fff4000 ip 00007f9f6f7dd860 sp 00007f986d7f58a0 error 4 in libamd_comgr.so.2.4[7f9f6cbb2000+4d8a000]
[ 4219.067074] python[9038]: segfault at 7f4073ff4000 ip 00007f46b37dd860 sp 00007f40a6ff88a0 error 4 in libamd_comgr.so.2.4[7f46b0bb2000+4d8a000]
[ 4225.052739] python[9184]: segfault at 7f3aebff4000 ip 00007f42237dd860 sp 00007f3b3cff48a0 error 4 in libamd_comgr.so.2.4[7f4220bb2000+4d8a000]
[ 4230.810220] python[9321]: segfault at 7f539fffe000 ip 00007f5adb7dd849 sp 00007f53caff88a0 error 4 in libamd_comgr.so.2.4[7f5ad8bb2000+4d8a000]

possibly associated dmesg stacktrace:

[  419.962259] ------------[ cut here ]------------
[  419.962261] WARNING: CPU: 6 PID: 264 at drivers/gpu/drm/ttm/ttm_bo.c:409 ttm_bo_release+0x2e9/0x310 [ttm]
[  419.962267] Modules linked in: rfcomm snd_seq_dummy snd_hrtimer snd_seq snd_seq_device cmac algif_hash algif_skcipher af_alg ccm xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat snd_sof_pci_intel_cnl snd_sof_intel_hda_common soundwire_intel intel_rapl_msr nft_chain_nat intel_rapl_common soundwire_generic_allocation bnep nf_nat soundwire_cadence snd_sof_intel_hda nf_conntrack snd_sof_pci btusb snd_sof_xtensa_dsp btrtl snd_sof btbcm nf_defrag_ipv6 btintel nf_defrag_ipv4 soundwire_bus bluetooth snd_soc_skl snd_soc_hdac_hda snd_hda_ext_core snd_soc_sst_ipc nft_counter ecdh_generic iwlmvm snd_soc_sst_dsp snd_soc_acpi_intel_match snd_soc_acpi snd_soc_core snd_hda_codec_realtek intel_tcc_cooling x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_generic snd_compress coretemp ac97_bus ledtrig_audio snd_hda_codec_hdmi snd_pcm_dmaengine kvm_intel snd_hda_intel mousedev joydev nf_tables mac80211 iTCO_wdt libcrc32c snd_intel_dspcfg snd_intel_sdw_acpi
[  419.962290]  intel_pmc_bxt mei_hdcp ee1004 kvm amdgpu iTCO_vendor_support snd_hda_codec libarc4 nfnetlink crct10dif_pclmul intel_wmi_thunderbolt mxm_wmi wmi_bmof bridge squashfs snd_hda_core crc32_pclmul stp qrtr ghash_clmulni_intel llc ns loop iwlwifi i915 aesni_intel snd_hwdep intel_spi_pci snd_pcm crypto_simd cryptd r8169 ucsi_ccg intel_spi rapl realtek gpu_sched spi_nor snd_timer typec_ucsi mei_me vfat intel_cstate mdio_devres drm_ttm_helper cfg80211 fat snd typec i2c_i801 intel_uncore pcspkr libphy mtd mei i2c_smbus rfkill roles ttm soundcore intel_pch_thermal intel_gtt wmi video acpi_tad acpi_pad mac_hid dm_multipath dm_mod ipmi_devintf ipmi_msghandler sg fuse crypto_user ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 usbhid crc32c_intel xhci_pci vfio_pci vfio_pci_core irqbypass vfio_virqfd vfio_iommu_type1 vfio
[  419.962319] CPU: 6 PID: 264 Comm: kworker/6:1 Not tainted 5.15.50-1-MANJARO #1 fffffd25ed6fe5b8459d1f2fe9b1fccc660ede08
[  419.962321] Hardware name: Micro-Star International Co., Ltd. MS-7C75/Z490-A PRO (MS-7C75), BIOS 2.80 01/30/2021
[  419.962322] Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu]
[  419.962494] RIP: 0010:ttm_bo_release+0x2e9/0x310 [ttm]
[  419.962498] Code: e8 dc ff 39 fb e9 b2 fd ff ff 49 8b 7e 98 b9 28 23 00 00 31 d2 be 01 00 00 00 e8 22 22 3a fb 49 8b 46 e8 eb 9e 48 89 e8 eb 99 <0f> 0b e9 38 fd ff ff e8 7b fd 39 fb e9 ef fe ff ff be 03 00 00 00
[  419.962499] RSP: 0018:ffffbbbcc076bcc8 EFLAGS: 00010202
[  419.962500] RAX: 0000000000000001 RBX: ffffbbbcc076bd10 RCX: 0000000000000001
[  419.962501] RDX: ffffa011d3f531b8 RSI: 0000000000000000 RDI: ffffa011d3f531b8
[  419.962501] RBP: ffffa0116e965270 R08: 0000000000000000 R09: 0000000000000000
[  419.962502] R10: 0000000000000000 R11: 0000000000000000 R12: ffffa011d3f53000
[  419.962502] R13: ffffa011d3f53058 R14: ffffa011d3f531b8 R15: ffffa01160618630
[  419.962503] FS:  0000000000000000(0000) GS:ffffa0187e180000(0000) knlGS:0000000000000000
[  419.962504] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  419.962505] CR2: 00007fdc809b4000 CR3: 000000017aabe004 CR4: 00000000007706e0
[  419.962505] PKRU: 55555554
[  419.962506] Call Trace:
[  419.962507]  <TASK>
[  419.962509]  amdgpu_bo_unref+0x1a/0x30 [amdgpu 4e9ddf1f39fdbedfe96056dc66601a4b2d649f8a]
[  419.962598]  amdgpu_gem_object_free+0x30/0x50 [amdgpu 4e9ddf1f39fdbedfe96056dc66601a4b2d649f8a]
[  419.962687]  amdgpu_amdkfd_gpuvm_free_memory_of_gpu+0x35e/0x3c0 [amdgpu 4e9ddf1f39fdbedfe96056dc66601a4b2d649f8a]
[  419.962797]  kfd_process_device_free_bos+0xa1/0xf0 [amdgpu 4e9ddf1f39fdbedfe96056dc66601a4b2d649f8a]
[  419.962903]  kfd_process_wq_release+0x20d/0x2e0 [amdgpu 4e9ddf1f39fdbedfe96056dc66601a4b2d649f8a]
[  419.963008]  process_one_work+0x1c7/0x390
[  419.963011]  worker_thread+0x4d/0x3a0
[  419.963012]  ? process_one_work+0x390/0x390
[  419.963013]  kthread+0x120/0x150
[  419.963015]  ? set_kthread_struct+0x50/0x50
[  419.963016]  ret_from_fork+0x1f/0x30
[  419.963018]  </TASK>
[  419.963019] ---[ end trace 6a8afef99fdf3d07 ]---

locations of (base address - ip address):

2C2B849
2C2B84E
2C2B84E
2C2B860
2C2B84E
2C2B84E
2C2B860
2C2B860
2C2B860
2C2B860
2C2B849

output of every addr2line -e /opt/rocm/lib/libamd_comgr.so.2.4 -fCi {the locations above}:

amd_comgr_do_action
??:?

gdb bt:

#0  0x00007f525f7dd860 in ?? () from /opt/rocm/lib/libamd_comgr.so.2
[Current thread is 1 (Thread 0x7f4b637fe640 (LWP 3650))]
(gdb) bt
#0  0x00007f525f7dd860 in ?? () from /opt/rocm/lib/libamd_comgr.so.2
#1  0x00007f525f80adc6 in ?? () from /opt/rocm/lib/libamd_comgr.so.2
#2  0x00007f525f80c696 in ?? () from /opt/rocm/lib/libamd_comgr.so.2
#3  0x00007f525f80c751 in ?? () from /opt/rocm/lib/libamd_comgr.so.2
#4  0x00007f525f8119da in ?? () from /opt/rocm/lib/libamd_comgr.so.2
#5  0x00007f525f8131ed in ?? () from /opt/rocm/lib/libamd_comgr.so.2
#6  0x00007f525f84f187 in ?? () from /opt/rocm/lib/libamd_comgr.so.2
#7  0x00007f525f3d11d7 in ?? () from /opt/rocm/lib/libamd_comgr.so.2
#8  0x00007f525f3d182f in ?? () from /opt/rocm/lib/libamd_comgr.so.2
#9  0x00007f525f45e135 in ?? () from /opt/rocm/lib/libamd_comgr.so.2
#10 0x00007f525f3d537c in ?? () from /opt/rocm/lib/libamd_comgr.so.2
#11 0x00007f525d6acfdf in ?? () from /opt/rocm/lib/libamd_comgr.so.2
#12 0x00007f525cc69b22 in ?? () from /opt/rocm/lib/libamd_comgr.so.2
#13 0x00007f525cc6ac2f in ?? () from /opt/rocm/lib/libamd_comgr.so.2
#14 0x00007f525cc6b2f0 in ?? () from /opt/rocm/lib/libamd_comgr.so.2
#15 0x00007f525cc768a8 in amd_comgr_do_action () from /opt/rocm/lib/libamd_comgr.so.2
#16 0x00007f512a8924d7 in ?? () from /opt/rocm/lib/libMIOpen.so
#17 0x00007f512a8868b6 in miopen::comgr::BuildHip(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, miopen::TargetProperties const&, std::vector<char, std::allocator<char> >&) () from /opt/rocm/lib/libMIOpen.so
#18 0x00007f512a883ef4 in miopen::HIPOCProgramImpl::BuildCodeObjectInMemory(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /opt/rocm/lib/libMIOpen.so
#19 0x00007f512a883caf in miopen::HIPOCProgramImpl::BuildCodeObject(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /opt/rocm/lib/libMIOpen.so
#20 0x00007f512a8835a2 in miopen::HIPOCProgramImpl::HIPOCProgramImpl(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, miopen::TargetProperties const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /opt/rocm/lib/libMIOpen.so
#21 0x00007f512a884e20 in ?? () from /opt/rocm/lib/libMIOpen.so
#22 0x00007f512a8841ee in miopen::HIPOCProgram::HIPOCProgram(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, miopen::TargetProperties const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /opt/rocm/lib/libMIOpen.so
#23 0x00007f512a8809db in miopen::Handle::LoadProgram(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::ba

Please tell me if there's more diagnostic data I can provide.

searlmc1 commented 2 years ago
lamb-j commented 2 years ago

Hi, thanks for looking into this! It looks like you're using the following: AMD Radeon RX 6950 XT Arch Linux rocm 5.2.0 installed via arch linux packages (paru?)

I'll see if I can recreate your error and track down what's causing it. In the meantime, can you try setting the following and pasting the output logs? It may give some insight into where/how Comgr is failing.

export AMD_COMGR_REDIRECT_LOGS="stdout" export AMD_COMGR_EMIT_VERBOSE_LOGS=1

7910f6ba7ee4 commented 2 years ago

Yep, that's what I'm using.

Where would the output logs be shown? After pasting the commands I checked dmesg, journalctl, the coredump, and the output of the program, but did not find anything different. Is there a specific logfile I should check?

lamb-j commented 2 years ago

The log should write to the file path assigned to AMD_COMGR_REDIRECT_LOGS (I usually use AMD_COMGR_REDIRECT_LOGS=stdout, but you can pick any file).

Another thing you can try that would be helpful is to save and upload temporary files generated during compilation. I can then try to recreate the failing step locally and track down the issue. You can do this as follows:

- clear out any comgr directories in /tmp (typically /tmp/comgr-* on linux) between executions
- export AMD_COMGR_SAVE_TEMPS=1

Intermediate files generated during compilation should then be logged in the log file and visible in /tmp.

7910f6ba7ee4 commented 2 years ago

Thanks for the help, here are the logs and the temp files.

Let me know if I can provide anything else!

lamb-j commented 2 years ago

Looking into this now! In your log file it looks like the input file name is cut off right at the end. Is this just an artifact of the application seg-faulting? I might be able to figure out what that whole command should look like, but figured I'd double check to make sure it wasn't a copy/paste issue or something similar.

7910f6ba7ee4 commented 2 years ago

I believe this is an artifact of the segfault since I just tested the issue again and it's cut off at the same line and word.

lamb-j commented 2 years ago

I believe I've been able to recreate this issue locally now. A minimal reproducer based on your temporary files:

clang "-cc1" \
      "-include-pch" "./comgr-ee6420/include/hip.pch" "-fno-validate-pch" \
      "-I" "./comgr-ee6420/include" \
      "-D" "HIP_PACKAGE_VERSION_FLAT=5002022266" \
      "-o" "./comgr-ee6420/output/naive_conv.cpp.bc" \
      "./comgr-ee6420/input/naive_conv.cpp"

PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
...
Segmentation fault (core dumped)

Building LLVM with assertions enabled and re-running gives the following:

clang: /home/lambj/git/lightning/llvm-project/llvm/include/llvm/ADT/SmallVector.h:277: llvm::SmallVectorTemplateCommon::const_reference llvm::SmallVectorTemplateCommon<unsigned long, void>::operator[](llvm::SmallVectorTemplateCommon::size_type) const [T = unsigned long]: Assertion `idx < size()' failed.
Aborted (core dumped)

I'm going to keep investigating to see if I can figure out what's happening (presumably with the llvm::SmallVectors).

7910f6ba7ee4 commented 2 years ago

Here's an updated updated_errors.log from the output of the program with more information after upgrading to 5.2.3. I assume the temp files would be the same but let me know if you need me to upload them again.

lamb-j commented 7 months ago

Is this still an issue with recent versions of ROCm? If so can you reopen here and I'll take another look?

https://github.com/ROCm/llvm-project/tree/amd-staging/amd/comgr

Quickly testing the following doesn't give me any errors, but it may not be recreating the issue:

hipcc -c -I ./include naive_conv.cpp