Simultaneous writing to manifest_usage on NFS crashes Julia

lhupe commented 1 year ago

We are having issues with Julia processes randomly crashing when simultaneously activating environments from different machines that share an NFS file system, due to IO issues when writing to the manifest usage log.

The error messages generally look something like this (this particular example is on julia 1.9.0-rc2, but we've been seeing similar issues on different versions):

  Activating project at `/path/to/project`
┌ Warning: attempting to remove probably stale pidfile
│   path = "/path/to/.julia/logs/manifest_usage.toml.pid"
└ @ FileWatching.Pidfile /path/to/.julia/juliaup/julia-1.9.0-rc2+0.x64.linux.gnu/share/julia/stdlib/v1.9/FileWatching/src/pidfile.jl:244
ERROR: LoadError: IOError: stat(RawFD(16)): Unknown system error -116 (Unknown system error -116)
Stacktrace:
  [1] uv_error
    @ ./libuv.jl:100 [inlined]
  [2] stat(fd::RawFD)
    @ Base.Filesystem ./stat.jl:152
  [3] stat
    @ ./filesystem.jl:280 [inlined]
  [4] close(lock::FileWatching.Pidfile.LockMonitor)
    @ FileWatching.Pidfile /path/to/.julia/juliaup/julia-1.9.0-rc2+0.x64.linux.gnu/share/julia/stdlib/v1.9/FileWatching/src/pidfile.jl:307
  [5] mkpidlock(f::Pkg.Types.var"#51#54"{String, Dates.DateTime, String}, at::String, pid::Int32; kwopts::Base.Pairs{Symbol, Int64, Tuple{Symbol}, NamedTuple{(:stale_age,), Tuple{Int64}}})
    @ FileWatching.Pidfile /path/to/.julia/juliaup/julia-1.9.0-rc2+0.x64.linux.gnu/share/julia/stdlib/v1.9/FileWatching/src/pidfile.jl:84
  [6] mkpidlock
    @ /path/to/.julia/juliaup/julia-1.9.0-rc2+0.x64.linux.gnu/share/julia/stdlib/v1.9/FileWatching/src/pidfile.jl:79 [inlined]
  [7] #mkpidlock#6
    @ /path/to/.julia/juliaup/julia-1.9.0-rc2+0.x64.linux.gnu/share/julia/stdlib/v1.9/FileWatching/src/pidfile.jl:77 [inlined]
  [8] mkpidlock
    @ /path/to/.julia/juliaup/julia-1.9.0-rc2+0.x64.linux.gnu/share/julia/stdlib/v1.9/FileWatching/src/pidfile.jl:77 [inlined]
  [9] write_env_usage(source_file::String, usage_filepath::String)
    @ Pkg.Types /path/to/.julia/juliaup/julia-1.9.0-rc2+0.x64.linux.gnu/share/julia/stdlib/v1.9/Pkg/src/Types.jl:511
 [10] Pkg.Types.EnvCache(env::Nothing)
    @ Pkg.Types /path/to/.julia/juliaup/julia-1.9.0-rc2+0.x64.linux.gnu/share/julia/stdlib/v1.9/Pkg/src/Types.jl:366
 [11] EnvCache
    @ /path/to/.julia/juliaup/julia-1.9.0-rc2+0.x64.linux.gnu/share/julia/stdlib/v1.9/Pkg/src/Types.jl:345 [inlined]
 [12] add_snapshot_to_undo(env::Nothing)
    @ Pkg.API /path/to/.julia/juliaup/julia-1.9.0-rc2+0.x64.linux.gnu/share/julia/stdlib/v1.9/Pkg/src/API.jl:2045
 [13] add_snapshot_to_undo
    @ /path/to/.julia/juliaup/julia-1.9.0-rc2+0.x64.linux.gnu/share/julia/stdlib/v1.9/Pkg/src/API.jl:2041 [inlined]
 [14] activate(path::String; shared::Bool, temp::Bool, io::IOStream)
    @ Pkg.API /path/to/.julia/juliaup/julia-1.9.0-rc2+0.x64.linux.gnu/share/julia/stdlib/v1.9/Pkg/src/API.jl:1828
 [15] activate(path::String)
    @ Pkg.API /path/to/.julia/juliaup/julia-1.9.0-rc2+0.x64.linux.gnu/share/julia/stdlib/v1.9/Pkg/src/API.jl:1787
 [16] top-level scope
    @ /path/to/script.jl:2
in expression starting at /path/to/script.jl:2

With rare exceptions (e.g. when the error occurs while loading a @required package, where it is caught by a try-catch block), this error is fatal.

This is especially annoying on our HPC cluster, as a significant number of jobs don't make it past the first Pkg.activate.
We are currently dealing with this by overwriting Pkg.Types.write_env_usage at the top of our scripts, but it would be nice to have a "proper" way of disabling usage logging in situations where we know it will cause trouble, similar to the options for the REPL history file.

IanButterworth commented 1 year ago

I can't speak to whether this is a Pidfile limitation or a bug (@vtjnash ?), but as a mitigation we may want to just make write_env_usage and Pkg.gc() not throw but give helpful warnings.
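To make the "warn instead of throw" idea concrete, here is a minimal sketch of the pattern. The helper name `warn_instead_of_throw` is hypothetical and not part of Pkg; it only illustrates catching an `IOError` and downgrading it to a warning, and says nothing about how Pkg would actually implement this.

```julia
# Hypothetical helper (not a Pkg API): run `f`, and if it fails with an
# IOError, emit a warning and return `nothing` instead of propagating it.
function warn_instead_of_throw(f, what::AbstractString)
    try
        return f()
    catch err
        # Only swallow IO errors; anything else is still rethrown.
        err isa Base.IOError || rethrow()
        @warn "Failed to $what; continuing without it" exception = (err, catch_backtrace())
        return nothing
    end
end

# Simulate the failure from the report with a hand-built IOError.
result = warn_instead_of_throw("update the manifest usage log") do
    throw(Base.IOError("stat(RawFD(16))", -116))
end
```

Restricting the catch to `IOError` keeps unrelated bugs visible while letting a failed usage-log update degrade gracefully.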

vtjnash commented 1 year ago

This seems rather odd, since -0x74 isn't a value that function should be capable of returning. All of the codepaths should be returning some value from a list of constants, so that shouldn't be a possible value to see. Perhaps someone could go hook up uv_fs_get_system_error, so that we get more complete error information for this case.

JonasIsensee commented 1 year ago

According to my quick research, -116 refers to ESTALE (stale file handle), which is not so uncommon on NFS file systems. Given the warning at the beginning of the stacktrace, I would assume that closing / deleting the lockfile fails because a different julia process on another machine has already done so an instant before, hence invalidating the cache.
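For anyone who wants to check the errno mapping locally, here is a small sketch. It assumes Linux, where errno 116 is ESTALE; the numeric value differs on other platforms, and the exact message text depends on the C library.

```julia
# On Unix, libuv reports errors as negated errno values,
# so the -116 from the error message corresponds to errno 116.
uv_code = -116
errno_value = -uv_code

# Libc.strerror asks the C library for the human-readable message of an errno.
message = Libc.strerror(errno_value)
println("errno ", errno_value, ": ", message)
```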

rojizo commented 3 months ago

there is no workaround for this?

JonasIsensee commented 3 months ago

> there is no workaround for this?

There are different options:

Manifest usage is written to the "first" depot in the list of depots. In principle, one can start all cluster jobs with an additional first depot in a tmp folder unique to each job. This is probably the cleanest solution. (Not sure about file system usage etc., though)
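A minimal sketch of the depot approach, to be run at the very top of a job script before any `Pkg.activate`. The helper name `use_private_first_depot!` and the `julia_depot_` prefix are made up for illustration, and the sketch assumes the usage log really is written under the first entry of `DEPOT_PATH`, as described above.

```julia
# Hypothetical helper: create a fresh job-private directory and put it at the
# front of the depot list, so it becomes the "first" depot for this process.
function use_private_first_depot!(parent::AbstractString = tempdir())
    depot = mktempdir(parent; prefix = "julia_depot_")
    pushfirst!(DEPOT_PATH, depot)
    return depot
end

depot = use_private_first_depot!()

# The shared depots stay in the list, so already-installed packages are still found.
# Afterwards, activate as usual, e.g.:
#   import Pkg; Pkg.activate("/path/to/project")
```

As far as I understand, the first depot is also where newly installed packages and fresh precompile caches end up, so this fits best when everything the job needs is already installed in the shared depot. I believe the same effect can be had without touching the script by setting the `JULIA_DEPOT_PATH` environment variable per job.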

Our solution is to override the relevant section of code in Pkg by adding
```
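using Pkg
# Replace Pkg's usage logging with a no-op so that nothing is written to manifest_usage.toml.
# Note: this overrides a Pkg internal and may stop working with other Pkg versions.
Pkg.Types.write_env_usage(source_file::AbstractString, usage_filepath::AbstractString) = nothing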
```
at the top of every cluster job. This currently solves the problem as well, but should probably not be generally recommended.

JuliaLang / Pkg.jl

Simultaneous writing to manifest_usage on NFS crashes Julia #3453