Closed 7910f6ba7ee4 closed 7 months ago
Hi, thanks for looking into this! It looks like you're using the following:
- AMD Radeon RX 6950 XT
- Arch Linux
- ROCm 5.2.0 installed via Arch Linux packages (paru?)
I'll see if I can recreate your error and track down what's causing it. In the meantime, can you try setting the following and pasting the output logs? It may give some insight into where/how Comgr is failing.
export AMD_COMGR_REDIRECT_LOGS="stdout"
export AMD_COMGR_EMIT_VERBOSE_LOGS=1
Yep, that's what I'm using.
Where would the output logs be shown? After pasting the commands I checked dmesg, journalctl, the coredump, and the output of the program, but did not find anything different. Is there a specific logfile I should check?
The log should be written to the file path assigned to AMD_COMGR_REDIRECT_LOGS (I usually use AMD_COMGR_REDIRECT_LOGS=stdout, but you can pick any file).
Another thing you can try that would be helpful is to save and upload temporary files generated during compilation. I can then try to recreate the failing step locally and track down the issue. You can do this as follows:
- clear out any comgr directories in /tmp (typically /tmp/comgr-* on linux) between executions
- export AMD_COMGR_SAVE_TEMPS=1
Intermediate files generated during compilation should then be logged in the log file and visible in /tmp.
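For convenience, the steps above could be combined into one script along these lines. This is only a sketch: the log path in COMGR_LOG and the way the failing program is passed in as arguments are placeholders I've assumed, not anything Comgr requires.

```shell
#!/bin/sh
# Sketch of the capture steps. COMGR_LOG and the program arguments are
# placeholders (assumptions); any writable path and any command will do.

COMGR_LOG="${COMGR_LOG:-/tmp/comgr.log}"

# 1. Remove temp directories left over from earlier runs.
rm -rf /tmp/comgr-*

# 2. Ask Comgr for verbose logs and to keep its intermediate files.
export AMD_COMGR_REDIRECT_LOGS="$COMGR_LOG"
export AMD_COMGR_EMIT_VERBOSE_LOGS=1
export AMD_COMGR_SAVE_TEMPS=1

# 3. Run the failing program, if one was given as arguments to this script.
if [ "$#" -gt 0 ]; then
    "$@" || echo "program exited with status $?" >&2
fi

# 4. List whatever Comgr left behind so it can be uploaded.
ls -d /tmp/comgr-* 2>/dev/null || echo "no comgr temp directories found"
```

Saved as, say, capture.sh, it would be run as `sh capture.sh python train.py`, where train.py stands in for whatever script triggers the crash.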
Thanks for the help, here are the logs and the temp files.
Let me know if I can provide anything else!
Looking into this now! In your log file it looks like the input file name is cut off right at the end. Is this just an artifact of the application seg-faulting? I might be able to figure out what that whole command should look like, but figured I'd double check to make sure it wasn't a copy/paste issue or something similar.
I believe this is an artifact of the segfault since I just tested the issue again and it's cut off at the same line and word.
I believe I've been able to recreate this issue locally now. A minimal reproducer based on your temporary files:
clang "-cc1" \
"-include-pch" "./comgr-ee6420/include/hip.pch" "-fno-validate-pch" \
"-I" "./comgr-ee6420/include" \
"-D" "HIP_PACKAGE_VERSION_FLAT=5002022266" \
"-o" "./comgr-ee6420/output/naive_conv.cpp.bc" \
"./comgr-ee6420/input/naive_conv.cpp"
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
...
Segmentation fault (core dumped)
Building LLVM with assertions enabled and re-running gives the following:
clang: /home/lambj/git/lightning/llvm-project/llvm/include/llvm/ADT/SmallVector.h:277: llvm::SmallVectorTemplateCommon::const_reference llvm::SmallVectorTemplateCommon<unsigned long, void>::operator[](llvm::SmallVectorTemplateCommon::size_type) const [T = unsigned long]: Assertion `idx < size()' failed.
Aborted (core dumped)
I'm going to keep investigating to see if I can figure out what's happening (presumably with the llvm::SmallVectors).
Here's an updated log, updated_errors.log, taken from the program's output after upgrading to 5.2.3; it has more information. I assume the temp files would be the same, but let me know if you need me to upload them again.
Is this still an issue with recent versions of ROCm? If so, can you reopen the issue here and I'll take another look?
https://github.com/ROCm/llvm-project/tree/amd-staging/amd/comgr
Quickly testing the following doesn't give me any errors, but it may not be recreating the issue:
hipcc -c -I ./include naive_conv.cpp
Hello, I've just recently installed ROCm 5.2.0 on Arch with the rocm-arch repository. Everything has worked up to this point (no initial errors, TensorFlow works, and clinfo, rocm-smi, and rocminfo produce outputs). When trying to train a network, Python stops at epoch 1 for a few minutes before ending with:
Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
The specific output when running the program (before segfault):
dmesg errors (each segfault is an attempt):
possibly associated dmesg stacktrace:
locations of (base address - ip address):
output of every addr2line -e /opt/rocm/lib/libamd_comgr.so.2.4 -fCi {the locations above}:
gdb bt:
Please tell me if there's more diagnostic data I can provide.
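In case it helps anyone repeating the addr2line step, here is a rough sketch of how the offsets can be derived: each one is the faulting ip relative to the base address at which the library was mapped. The helper names offset_of and resolve are made up for illustration, and the ip and base values have to be taken from your own dmesg output.

```shell
#!/bin/sh
# Sketch: turn a faulting instruction pointer into an offset inside the
# mapped library, then hand that offset to addr2line.
# offset_of and resolve are hypothetical helper names.

# offset_of IP BASE: print IP - BASE as a hex offset.
offset_of() {
    printf '0x%x\n' "$(( $1 - $2 ))"
}

# resolve IP BASE [LIB]: run addr2line on the offset if the library exists.
resolve() {
    lib="${3:-/opt/rocm/lib/libamd_comgr.so.2.4}"
    off="$(offset_of "$1" "$2")"
    if [ -e "$lib" ]; then
        addr2line -e "$lib" -fCi "$off"
    else
        echo "library not found: $lib (offset $off)" >&2
    fi
}
```

Usage would be `resolve <ip> <base>` once per segfault line, with both values given in 0x-prefixed hex.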