Closed rafaqz closed 1 year ago
@maleadt
I'm aware, but unsure what to do. It seems like a Julia bug triggered by CUDNN's artifact selection, but that doesn't involve doing anything weird with LLVM (only loading CUDA drivers and libraries, which has worked out fine with other CUDA JLLs). Without a consistent reproducer, this is almost impossible to debug.
I can semi-consistently reproduce this on my system. It seems to happen when I add anything while some packages are already imported in the session.
It doesn't happen in a fresh session, and it doesn't happen after I delete the compiled folder.
I will try to pin down more of the logic behind it, but triggering it intentionally requires deleting the compiled folder, so the iteration time is very slow.
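For iterating on this, a minimal sketch of moving the compiled folder aside instead of deleting it, so it can be restored later. The helper name `stash_compiled_cache` is hypothetical, and it assumes the precompilation cache lives in `compiled/` under the depot (default `~/.julia`):

```shell
# Hypothetical helper: move <depot>/compiled out of the way instead of deleting it.
# Assumes the precompile cache lives under <depot>/compiled (default depot: ~/.julia).
stash_compiled_cache() {
    depot="${1:-$HOME/.julia}"
    src="$depot/compiled"
    if [ ! -d "$src" ]; then
        echo "stash_compiled_cache: no compiled cache at $src" >&2
        return 1
    fi
    # Timestamped name so repeated stashes do not collide.
    dest="$depot/compiled.stash.$(date +%s)"
    # Print the stash location on success so it can be moved back later.
    mv "$src" "$dest" && echo "$dest"
}
```

Calling `stash_compiled_cache` with no argument would act on `~/.julia`; moving the printed directory back to `compiled` restores the old cache.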
Ideally we'd capture this in rr. Normally that just requires running Julia with --bug-report, but in the case of a precompilation process that doesn't really work...
Because it's happening in a separate Julia process?
Can we just add --bug-report to the call to the julia process in Pkg.Operations.collect_artifacts? (I don't really use rr, so just a guess.)
A very consistent failure I'm getting is on resolve when I already have the packages loaded in the session.
Can we just add --bug-report to the call to the julia process in Pkg.Operations.collect_artifacts? (I don't really use rr, so just a guess.)
We could, but --bug-report ends with an interactive part where it prompts you to do something in the browser in order to upload a report. Although I guess you could have it run with --bug-report=rr-local, which disables the upload and just puts the trace in ~/.rr. One potential issue with that is that --bug-report will attempt to install BugReporting.jl, and thus perform Pkg operations and do some precompilation, which may conflict with the environment that the precompilation process runs in (and/or cause recursion).
It may be easier to install rr locally and just prefix the precompilation spawn with rr record and do some of the BugReporting.jl postprocessing manually afterwards.
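As a rough sketch of the manual route, the wrapper below prefixes an arbitrary command with `rr record`. The function name `record_with_rr` and the `RR_BIN` override are hypothetical conveniences, not part of rr or Julia; it assumes rr is installed and on `PATH`:

```shell
# Hypothetical helper: run any command under `rr record`.
# RR_BIN is an assumed override hook (not an rr feature) for pointing at a specific rr binary.
record_with_rr() {
    rr_bin="${RR_BIN:-rr}"
    if ! command -v "$rr_bin" >/dev/null 2>&1; then
        echo "record_with_rr: '$rr_bin' not found on PATH" >&2
        return 127
    fi
    # Everything after the function name is passed through unchanged as the command to record.
    "$rr_bin" record "$@"
}
```

One possible use is `record_with_rr julia --project=. -e 'using Pkg; Pkg.resolve()'`, followed by `rr replay` to inspect the trace. Whether the crashing child process ends up in that recording is untested here; wrapping the spawn inside Pkg, as suggested above, may still be needed.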
Ok probably manually then, not sure I have the time or experience to do this currently.
What I'm finding strange is that resolve in a fresh session never seems to break, but nearly always breaks when packages are loaded.
So why does the main process interact with the artifacts Julia process in any way? I didn't think they would be sharing anything that's in memory?
Ok probably manually then, not sure I have the time or experience to do this currently.
Alternatively, if you can come up with something that reproduces deterministically (e.g., starting from a fresh depot by setting JULIA_DEPOT_PATH=$(mktemp -d)) I can take a look at trying to shoehorn rr into it.
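A minimal sketch of the fresh-depot idea: run a command with `JULIA_DEPOT_PATH` pointing at a new temporary directory, so that no previously installed packages, artifacts, or compiled caches are reused. The helper name `fresh_depot_run` is hypothetical:

```shell
# Hypothetical helper: run a command with JULIA_DEPOT_PATH set to a brand-new temporary depot.
fresh_depot_run() {
    depot="$(mktemp -d)" || return 1
    # The assignment applies only to the command being run, not to the calling shell.
    JULIA_DEPOT_PATH="$depot" "$@"
    status=$?
    # Report where the depot was created so it can be inspected or removed afterwards.
    echo "fresh_depot_run: depot at $depot (exit status $status)" >&2
    return "$status"
}
```

For example, `fresh_depot_run julia --project=. -e 'using Pkg; Pkg.resolve()'` would resolve the current project against an empty depot. Expect this to be slow, since the registry and all packages are downloaded again.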
I just got the segfault in a fresh session so that theory is gone too. Will see if I can find some time to make something deterministic.
(WVZAnalysis) pkg> up
Updating registry at `~/.julia/registries/General.toml`
Installed XGBoost_jll ─ v1.7.5+0
error: <inline asm>:1:2: invalid character in input
4�
[2837042] signal (11.1): Segmentation fault
in expression starting at none:0
__run_exit_handlers at /lib64/libc.so.6 (unknown line)
exit at /lib64/libc.so.6 (unknown line)
*** Error in `/cvmfs/sft-nightlies.cern.ch/lcg/latest/julia/1.9.0-30f63/x86_64-centos7-gcc11-opt/bin/julia': corrupted double-linked list: 0x0000000001ae10c0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x80a4f)[0x7f3c67c80a4f]
/lib64/libc.so.6(+0x82135)[0x7f3c67c82135]
/lib64/libc.so.6(__libc_calloc+0xb4)[0x7f3c67c86214]
/home/jiling/.julia/artifacts/ebadc1bf983003ca3f714f062af4451365761171/lib/libcublasLt.so.11(+0x5a224c3)[0x7f39ebe3e4c3]
/home/jiling/.julia/artifacts/ebadc1bf983003ca3f714f062af4451365761171/lib/libcublasLt.so.11(+0x5a23b78)[0x7f39ebe3fb78]
/lib64/libc.so.6(+0x39ce9)[0x7f3c67c39ce9]
/lib64/libc.so.6(+0x39d37)[0x7f3c67c39d37]
/cvmfs/sft-nightlies.cern.ch/lcg/latest/julia/1.9.0-30f63/x86_64-centos7-gcc11-opt/bin/julia(_start+0x0)[0x401070]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f3c67c22555]
/cvmfs/sft-nightlies.cern.ch/lcg/latest/julia/1.9.0-30f63/x86_64-centos7-gcc11-opt/bin/julia[0x401099]
======= Memory map: ========
00400000-00401000 r--p 00000000 00:69 3713014 /cvmfs/sft-nightlies.cern.ch/lcg/latest/julia/1.9.0-30f63/x86_64-centos7-gcc11-opt/bin/julia
00401000-00402000 r-xp 00001000 00:69 3713014 /cvmfs/sft-nightlies.cern.ch/lcg/latest/julia/1.9.0-30f63/x86_64-centos7-gcc11-opt/bin/julia
00402000-00403000 r--p 00002000 00:69 3713014 /cvmfs/sft-nightlies.cern.ch/lcg/latest/julia/1.9.0-30f63/x86_64-centos7-gcc11-opt/bin/julia
00403000-00404000 r--p 00002000 00:69 3713014 /cvmfs/sft-nightlies.cern.ch/lcg/latest/julia/1.9.0-30f63/x86_64-centos7-gcc11-opt/bin/julia
00404000-00405000 rw-p 00003000 00:69 3713014 /cvmfs/sft-nightlies.cern.ch/lcg/latest/julia/1.9.0-30f63/x86_64-centos7-gcc11-opt/bin/julia
011e0000-020a3000 rw-p 00000000 00:00 0 [heap]
7f398c000000-7f398c021000 rw-p 00000000 00:00 0
7f398c021000-7f3990000000 ---p 00000000 00:00 0
7f3990a00000-7f3_ZN4llvm11LLVMContext8diagnoseERKNS_14DiagnosticInfoE at /cvmfs/sft-nightlies.cern.ch/lcg/views/dev4/Thu/x86_64-centos7-gcc11-opt/lib/julia/libLLVM-14jl.so (unknown line)
[2837042] signal (6.-6): Aborted
in expression starting at none:0
gsignal at /lib64/libc.so.6 (unknown line)
abort at /lib64/libc.so.6 (unknown line)
__libc_message at /lib64/libc.so.6 (unknown line)
malloc_consolidate at /lib64/libc.so.6 (unknown line)
_int_malloc at /lib64/libc.so.6 (unknown line)
__libc_calloc at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x7f39ebe3e4c2)
unknown function (ip: 0x7f39ebe3fb77)
__run_exit_handlers at /lib64/libc.so.6 (unknown line)
exit at /lib64/libc.so.6 (unknown line)
main at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/cli/loader_exe.c:62
__libc_start_main at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x401098)
Allocations: 2997 (Pool: 2984; Big: 13); GC: 0
ERROR: failed process: Process(`/cvmfs/sft-nightlies.cern.ch/lcg/latest/julia/1.9.0-30f63/x86_64-centos7-gcc11-opt/bin/julia -Cnative -J/cvmfs/sft-nightlies.cern.ch/lcg/latest/julia/1.9.0-30f63/x86_64-centos7-gcc11-opt/lib/julia/sys.so -g1 -O0 --color=no --history-file=no --startup-file=no --compiled-modules=yes --project=/home/jiling/.julia/dev/WVZAnalysis/Project.toml --eval 'append!(empty!(Base.DEPOT_PATH), ["/home/jiling/.julia", "/cvmfs/sft-nightlies.cern.ch/lcg/latest/julia/1.9.0-30f63/x86_64-centos7-gcc11-opt/local/share/julia", "/cvmfs/sft-nightlies.cern.ch/lcg/latest/julia/1.9.0-30f63/x86_64-centos7-gcc11-opt/share/julia", "/cvmfs/sft-nightlies.cern.ch/lcg/views/dev4/Thu/x86_64-centos7-gcc11-opt/share/julia"])
append!(empty!(Base.DL_LOAD_PATH), String[])
This is now hitting XGBoost_jll, starting with versions 1.7.4 and 1.7.5.
It stopped happening for me after updating my Julia version.
I haven't encountered it recently either.
I keep hitting this with anything to do with Pkg, when it runs collect_artifacts. It happens from many commands but always segfaults in CUDNN_jll builds.
And the entry for CUDNN_jll in my manifest: