Open lhupe opened 1 year ago
I can't speak to whether this is a Pidfile limitation or a bug (@vtjnash?), but as a mitigation we may want to just make `write_env_usage` and `Pkg.gc()` not throw but give helpful warnings.
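A minimal sketch of the "warn instead of throw" idea, written as a generic wrapper rather than as a patch to Pkg itself. The helper name `warn_on_failure` is hypothetical and not part of Pkg; it only illustrates the control flow.

```julia
# Hypothetical helper (not part of Pkg): run `f` and turn any exception
# into a warning instead of letting it propagate to the caller.
function warn_on_failure(f, what::AbstractString)
    try
        f()
        return true
    catch err
        @warn "$what failed; continuing without it" exception = (err, catch_backtrace())
        return false
    end
end

# Stand-in for a failing usage-log write on a flaky file system.
ok = warn_on_failure(() -> error("stale file handle"), "usage logging")
println(ok)
```

The return value lets the caller decide whether a skipped step matters, while the warning keeps the original exception and backtrace visible in the job log.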
This seems rather odd, since `-0x74` isn't a value that function should be capable of returning. All of the codepaths should be returning some value from a list of constants, so that shouldn't be a possible value to see. Perhaps someone could go hook up `uv_fs_get_system_error`, so that we get more complete error information for this case.
According to my quick research, `-116` refers to ESTALE (stale file handle), which is not so uncommon on NFS filesystems.
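For reference, the two codes quoted in this thread are the same number, and the message can be looked up from Julia. This is only a sketch: it assumes Linux errno numbering (where 116 is ESTALE), and the text returned by `Libc.strerror` depends on the platform.

```julia
# Convert the hex literal to a signed Int before negating; a bare `-0x74`
# would wrap around because `0x74` is a UInt8 in Julia.
code = -Int(0x74)
println(code)                       # -116

# Assumption: Linux errno numbering. The message text is platform dependent.
println(Libc.strerror(abs(code)))
```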
Given the warning at the beginning of the stacktrace, I would assume that closing / deleting the lockfile fails because a different julia process on another machine has already done so an instant before, hence invalidating the cache.
there is no workaround for this?
There are different options:

- Use a `tmp` folder unique to each job. This is probably the cleanest solution. (Not sure about file system usage etc., though.)
- Put

      using Pkg
      Pkg.Types.write_env_usage(source_file::AbstractString, usage_filepath::AbstractString) = nothing

  at the top of every cluster job. This currently solves the problem as well, but should probably not be generally recommended.
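One possible reading of the first option is to give each job its own first depot. The sketch below rests on an assumption that is not confirmed in this thread: that Pkg writes its usage logs under the first entry of `DEPOT_PATH`. Treat it as an illustration, not as a tested recommendation.

```julia
# Assumption: usage logs go under the first depot, so a job-local first
# depot keeps them off the shared NFS directory.
job_depot = mktempdir()            # unique directory for this process
pushfirst!(DEPOT_PATH, job_depot)  # shared depots stay later in the path

println(first(DEPOT_PATH))
```

A side effect to keep in mind is that anything else Julia writes to the first depot (for example newly installed packages or precompile caches) would also land in the job-local directory, which is where the file system usage concern comes in.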
We are having issues with Julia processes randomly crashing when simultaneously activating environments from different machines that share an NFS file system, due to IO issues when writing to the manifest usage log.
The error messages generally look something like this (this particular example is on julia 1.9.0-rc2, but we've been seeing similar issues on different versions):
With rare exceptions, e.g. when this error occurs while loading a `@require`d package and is caught by a try-catch block, this error is fatal.

This is especially annoying on our HPC cluster, as a significant number of jobs don't make it past the first `Pkg.activate`.

We are currently dealing with this by overwriting `Pkg.Types.write_env_usage` at the top of our scripts, but it would be nice to have a "proper" way of disabling usage logging in situations where we know it will cause trouble, similar to the options for the REPL history file.