Open pfarndt opened 1 year ago
116 is ESTALE
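A quick way to check this mapping from Julia itself is sketched below. It assumes a Linux system, where errno 116 is ESTALE; the number differs on other platforms. The negative sign in the reported code comes from libuv, which returns negated errno values on Unix.

```julia
# Minimal sketch, assuming Linux (errno 116 == ESTALE there).
# libuv reports errors as negated errno values, so -116 corresponds to errno 116.
uv_code = -116
msg = Libc.strerror(-uv_code)   # ask the C library for the errno description
println("errno ", -uv_code, ": ", msg)
```

On a typical glibc system the message mentions a stale file handle; the exact wording depends on the C library version.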
I just removed `using Pkg; Pkg.status()` from my scripts. Now they start up smoothly. The error definitely occurs when `Pkg.status()` runs on many CPUs at once, which then triggers an ESTALE error.
I can certainly live with this workaround for now, but I suspect that something isn't working quite right under the hood.
Can you tell us more about your filesystem? I would only expect this to happen on relatively ancient systems (circa 1995 NFS technology): https://github.com/vtjnash/Pidfile.jl/issues/4
We are using NFS with Linux 5.15.x.
@pfarndt, since you can reproduce this, bisecting to find the Julia commit that broke it might be a fast way to find the culprit.
@pfarndt, I can't find /project/julia. What is the correct path, so I can find more information about the system?
I think this issue is caused by concurrent julia processes, which try to precompile code on different CPU types while sharing the same DEPOT on a shared filesystem.
I can circumvent this issue by setting JULIA_CPU_TARGET = generic. However, I then run into other problems with specific packages: https://github.com/BioJulia/FASTX.jl
In the end, both should work.
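To check whether workers really differ in CPU type while sharing one depot, each worker could print a short report at startup. The following is only a sketch: the function name `worker_report` is hypothetical, and it uses nothing beyond `gethostname`, `Sys.CPU_NAME`, `DEPOT_PATH` and `ENV`.

```julia
# Hypothetical helper (not part of Julia or Pkg): collect the facts that matter
# for the "different CPU types, one shared depot" hypothesis.
function worker_report()
    return (
        host       = gethostname(),
        cpu        = Sys.CPU_NAME,                                   # CPU name as detected by LLVM
        depot      = isempty(DEPOT_PATH) ? "(none)" : first(DEPOT_PATH),
        cpu_target = get(ENV, "JULIA_CPU_TARGET", "(unset)"),
    )
end

report = worker_report()
println(report)
```

If the `cpu` field differs between nodes while `depot` is the same path, the workers are candidates for the concurrent precompilation scenario described above.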
Is there any progress on this? I get the same issue when activating an environment on Julia 1.10.1 (Linux 4.18, NFS 4.2).
import Pkg;
Pkg.activate(".");
Error message:
error in running finalizer: Base.IOError(msg="stat(RawFD(16)): Unknown system error -116 (Unknown system error -116)", code=-116)
uv_error at ./libuv.jl:100 [inlined]
stat at ./stat.jl:152
stat at ./filesystem.jl:281 [inlined]
close at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/FileWatching/src/pidfile.jl:336
jfptr_close_50891.1 at /sw/rl8/zen/app/Julia/1.10.1-linux-x86_64/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:3076
run_finalizer at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gc.c:318
jl_gc_run_finalizers_in_list at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gc.c:408
run_finalizers at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gc.c:454
jl_mutex_unlock at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/julia_locks.h:80 [inlined]
jl_generate_fptr_impl at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/jitlayers.cpp:525
jl_compile_method_internal at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:2480 [inlined]
jl_compile_method_internal at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:2368
_jl_invoke at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:2886 [inlined]
ijl_apply_generic at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:3076
#showerror#920 at ./errorshow.jl:103
showerror at ./errorshow.jl:101
unknown function (ip: 0x1555312833c5)
_jl_invoke at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:3076
show_exception_stack at ./errorshow.jl:975
display_error at ./client.jl:111
unknown function (ip: 0x155531282e09)
_jl_invoke at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:3076
display_error at ./client.jl:114
jfptr_display_error_82458.1 at /sw/rl8/zen/app/Julia/1.10.1-linux-x86_64/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
jl_f__call_latest at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/builtins.c:812
#invokelatest#2 at ./essentials.jl:892 [inlined]
invokelatest at ./essentials.jl:889 [inlined]
exec_options at ./client.jl:321
_start at ./client.jl:552
jfptr__start_82662.1 at /sw/rl8/zen/app/Julia/1.10.1-linux-x86_64/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
true_main at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/jlapi.c:582
jl_repl_entrypoint at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/jlapi.c:731
main at julia (unknown line)
__libc_start_main at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
Activating project at `~/Simulations/KALJ21/Analysis`
┌ Warning: attempting to remove probably stale pidfile
│ path = "/home/phys/s158686/.julia/logs/manifest_usage.toml.pid"
└ @ FileWatching.Pidfile /sw/rl8/zen/app/Julia/1.10.1-linux-x86_64/share/julia/stdlib/v1.10/FileWatching/src/pidfile.jl:273
ERROR: LoadError: IOError: stat(RawFD(16)): Unknown system error -116 (Unknown system error -116)
Stacktrace:
[1] uv_error
@ ./libuv.jl:100 [inlined]
[2] stat(fd::RawFD)
@ Base.Filesystem ./stat.jl:152
[3] stat
@ ./filesystem.jl:281 [inlined]
[4] close(lock::FileWatching.Pidfile.LockMonitor)
@ FileWatching.Pidfile /sw/rl8/zen/app/Julia/1.10.1-linux-x86_64/share/julia/stdlib/v1.10/FileWatching/src/pidfile.jl:336
[5] mkpidlock(f::Pkg.Types.var"#51#54"{String, String, Dates.DateTime, String}, at::String, pid::Int32; kwopts::@Kwargs{stale_age::Int64})
@ FileWatching.Pidfile /sw/rl8/zen/app/Julia/1.10.1-linux-x86_64/share/julia/stdlib/v1.10/FileWatching/src/pidfile.jl:95
[6] mkpidlock
@ /sw/rl8/zen/app/Julia/1.10.1-linux-x86_64/share/julia/stdlib/v1.10/FileWatching/src/pidfile.jl:90 [inlined]
[7] mkpidlock
@ /sw/rl8/zen/app/Julia/1.10.1-linux-x86_64/share/julia/stdlib/v1.10/FileWatching/src/pidfile.jl:88 [inlined]
[8] write_env_usage(source_file::String, usage_filepath::String)
@ Pkg.Types /sw/rl8/zen/app/Julia/1.10.1-linux-x86_64/share/julia/stdlib/v1.10/Pkg/src/Types.jl:539
[9] Pkg.Types.EnvCache(env::Nothing)
@ Pkg.Types /sw/rl8/zen/app/Julia/1.10.1-linux-x86_64/share/julia/stdlib/v1.10/Pkg/src/Types.jl:377
[10] EnvCache
@ /sw/rl8/zen/app/Julia/1.10.1-linux-x86_64/share/julia/stdlib/v1.10/Pkg/src/Types.jl:356 [inlined]
[11] add_snapshot_to_undo(env::Nothing)
@ Pkg.API /sw/rl8/zen/app/Julia/1.10.1-linux-x86_64/share/julia/stdlib/v1.10/Pkg/src/API.jl:2191
[12] add_snapshot_to_undo
@ /sw/rl8/zen/app/Julia/1.10.1-linux-x86_64/share/julia/stdlib/v1.10/Pkg/src/API.jl:2187 [inlined]
[13] activate(path::String; shared::Bool, temp::Bool, io::IOStream)
@ Pkg.API /sw/rl8/zen/app/Julia/1.10.1-linux-x86_64/share/julia/stdlib/v1.10/Pkg/src/API.jl:1973
[14] activate(path::String)
@ Pkg.API /sw/rl8/zen/app/Julia/1.10.1-linux-x86_64/share/julia/stdlib/v1.10/Pkg/src/API.jl:1932
[15] top-level scope
@ /var/spool/slurmd/job47022/slurm_script:16
in expression starting at /var/spool/slurmd/job47022/slurm_script:16
For distributed computing I usually start up about 500 Julia workers (version 1.9.0 on Linux), which share the same filesystem. After the switch from version 1.8.5 to 1.9.0, about 5% of the sessions exit with Unknown system error -116. I think the culprit is the command `using Pkg; Pkg.status()`, which is on line 3 of the script that is run by all workers and which I included solely for debugging purposes. The full stack trace is below.
Is it safe to start up many Julia workers that load some packages and probably have to (pre)compile them at the same time?
Thanks
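Until the underlying cause is understood, one possible stopgap is to retry the failing call when it throws an `IOError` with code -116. This is a sketch under assumptions, not a fix: `retry_on_estale` is a hypothetical helper, the code -116 is taken from the error messages in this thread, and retrying only hides the race on the shared filesystem rather than resolving it.

```julia
# Error code reported in this thread (negated errno 116, ESTALE on Linux).
const ESTALE_CODE = -116

# Hypothetical helper: call `f`, retrying with a growing delay if it fails
# with a stale-file-handle IOError. Any other error is rethrown immediately.
function retry_on_estale(f; attempts::Int = 5, delay::Real = 0.5)
    for attempt in 1:attempts
        try
            return f()
        catch err
            is_stale = err isa Base.IOError && err.code == ESTALE_CODE
            # Give up on unrelated errors or once the attempts are exhausted.
            (is_stale && attempt < attempts) || rethrow()
            sleep(delay * attempt)
        end
    end
end
```

A worker script could then wrap the call from the reports above, for example `retry_on_estale(() -> Pkg.activate("."))` after `import Pkg`. Whether this is enough in practice is untested here.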