calebwin closed this issue 2 years ago
Another time I got this issue:
ERROR: LoadError: unable to determine if efs/banyan_dataset_10244564515853285896 is accessible in the HDF5 format (file may not exist)
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:33
[2] h5open(filename::String, mode::String; swmr::Bool, pv::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ HDF5 ~/.julia/packages/HDF5/pIJra/src/HDF5.jl:430
[3] h5open(filename::String, mode::String)
@ HDF5 ~/.julia/packages/HDF5/pIJra/src/HDF5.jl:412
[4] ReadBlock(src::Nothing, params::Dict{String, Any}, batch_idx::Int64, nbatches::Int64, comm::MPI.Comm, loc_name::String, loc_params::Dict{String, Any})
@ Banyan ~/17d11e66635497fe6fa18ecd7b2364743c53d52ab7270e07d9fb5e1556e5a3e6/banyan-julia/Banyan/src/pfs.jl:50
[5] exec_code(banyan_data::Dict{Any, Any})
@ Main ./string:39
[6] top-level scope
@ ~/executor.jl:104
in expression starting at /home/ec2-user/executor.jl:83
srun: error: compute-dy-t3large-2: task 0: Exited with exit code 1
slurmstepd: error: compute-dy-t3large-2 [0] pmixp_client_v2.c:210 [_errhandler] mpi/pmix: ERROR: Error handler invoked: status = -25: Interrupted system call (4)
slurmstepd: error: *** STEP 2600.0 ON compute-dy-t3large-2 CANCELLED AT 2021-11-22T01:12:01 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: compute-dy-t3large-2: task 1: Killed
[ Info: Destroying job with ID 2021-11-22-01093521e12ec925f60ef4f3a060e4baf06820
[ Info: Destroying job with ID 2021-11-22-01093521e12ec925f60ef4f3a060e4baf06820
Another instance:
ERROR: LoadError: unable to determine if efs/banyan_dataset_10244564515853285896 is accessible in the HDF5 format (file may not exist)
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:33
[2] h5open(filename::String, mode::String; swmr::Bool, pv::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ HDF5 ~/.julia/packages/HDF5/pIJra/src/HDF5.jl:430
[3] h5open(filename::String, mode::String)
@ HDF5 ~/.julia/packages/HDF5/pIJra/src/HDF5.jl:412
[4] ReadBlock(src::Nothing, params::Dict{String, Any}, batch_idx::Int64, nbatches::Int64, comm::MPI.Comm, loc_name::String, loc_params::Dict{String, Any})
@ Banyan ~/17d11e66635497fe6fa18ecd7b2364743c53d52ab7270e07d9fb5e1556e5a3e6/banyan-julia/Banyan/src/pfs.jl:50
[5] exec_code(banyan_data::Dict{Any, Any})
@ Main ./string:39
[6] top-level scope
@ ~/executor.jl:104
in expression starting at /home/ec2-user/executor.jl:83
srun: error: compute-dy-t3large-1: task 0: Exited with exit code 1
slurmstepd: error: compute-dy-t3large-1 [0] pmixp_client_v2.c:210 [_errhandler] mpi/pmix: ERROR: Error handler invoked: status = -25: Interrupted system call (4)
slurmstepd: error: *** STEP 2614.0 ON compute-dy-t3large-1 CANCELLED AT 2021-11-22T04:03:50 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: compute-dy-t3large-1: task 1: Killed
[ Info: Destroying job with ID 2021-11-22-04011057fe598031c50960ab5c834d2b42f286
[ Info: Destroying job with ID 2021-11-22-04011057fe598031c50960ab5c834d2b42f286
Variation of this error:
Building MPI `~/.julia/scratchspaces/44cfe95a-1eb2-52ea-b672-e2afdf69b78f/340d8dc89e1c85a846d3f38ee294bfdd1684055a/build.log`
Building HDF5 `~/.julia/scratchspaces/44cfe95a-1eb2-52ea-b672-e2afdf69b78f/698c099c6613d7b7f151832868728f426abe698b/build.log`
Activating environment at `~/17d11e66635497fe6fa18ecd7b2364743c53d52ab7270e07d9fb5e1556e5a3e6/banyan-julia/BanyanArrays/test/Project.toml`
Activating environment at `~/17d11e66635497fe6fa18ecd7b2364743c53d52ab7270e07d9fb5e1556e5a3e6/banyan-julia/BanyanArrays/test/Project.toml`
Getting next execution request
Getting next execution request
Getting next execution request
HDF5-DIAG: Error detected in HDF5 (1.12.1) MPI-process 0:
#000: H5F.c line 620 in H5Fopen(): unable to open file
major: File accessibility
minor: Unable to open file
#001: H5VLcallback.c line 3502 in H5VL_file_open(): failed to iterate over available VOL connector plugins
major: Virtual Object Layer
minor: Iteration failed
#002: H5PLpath.c line 579 in H5PL__path_table_iterate(): can't iterate over plugins in plugin path '(null)'
major: Plugin for dynamically loaded library
minor: Iteration failed
#003: H5PLpath.c line 620 in H5PL__path_table_iterate_process_path(): can't open directory: /usr/local/hdf5/lib/plugin
major: Plugin for dynamically loaded library
minor: Can't open directory or file
#004: H5VLcallback.c line 3351 in H5VL__file_open(): open failed
major: Virtual Object Layer
minor: Can't open object
#005: H5VLnative_file.c line 97 in H5VL__native_file_open(): unable to open file
major: File accessibility
minor: Unable to open file
#006: H5Fint.c line 1990 in H5F_open(): unable to read superblock
major: File accessibility
minor: Read failed
#007: H5Fsuper.c line 617 in H5F__super_read(): truncated file: eof = 2026732, sblock->base_addr = 0, stored_eof = 2402848
major: File accessibility
minor: File has been truncated
ERROR: LoadError: Error opening file efs/banyan_dataset_10244564515853285896
Stacktrace:
[1] error(::String, ::String)
@ Base ./error.jl:42
[2] h5f_open(pathname::String, flags::UInt16, fapl_id::HDF5.Properties)
@ HDF5 ~/.julia/packages/HDF5/pIJra/src/api.jl:761
[3] h5open(filename::String, mode::String; swmr::Bool, pv::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ HDF5 ~/.julia/packages/HDF5/pIJra/src/HDF5.jl:436
[4] h5open(filename::String, mode::String)
@ HDF5 ~/.julia/packages/HDF5/pIJra/src/HDF5.jl:412
[5] ReadBlock(src::Nothing, params::Dict{String, Any}, batch_idx::Int64, nbatches::Int64, comm::MPI.Comm, loc_name::String, loc_params::Dict{String, Any})
@ Banyan ~/17d11e66635497fe6fa18ecd7b2364743c53d52ab7270e07d9fb5e1556e5a3e6/banyan-julia/Banyan/src/pfs.jl:50
[6] exec_code(banyan_data::Dict{Any, Any})
@ Main ./string:32
[7] top-level scope
@ ~/executor.jl:104
in expression starting at /home/ec2-user/executor.jl:83
srun: error: compute-dy-t3large-1: task 0: Exited with exit code 1
slurmstepd: error: compute-dy-t3large-1 [0] pmixp_client_v2.c:210 [_errhandler] mpi/pmix: ERROR: Error handler invoked: status = -25: Interrupted system call (4)
slurmstepd: error: *** STEP 2618.0 ON compute-dy-t3large-1 CANCELLED AT 2021-11-22T04:21:50 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: compute-dy-t3large-1: task 1: Killed
[ Info: Destroying job with ID 2021-11-22-041923e06d7d18a7fa51cc3d537261d385b9ab
[ Info: Destroying job with ID 2021-11-22-041923e06d7d18a7fa51cc3d537261d385b9ab
This issue could be because we are downloading to EFS multiple times on different nodes, potentially causing conflicts. It might be a good idea to go through all places where we interact with the shared NFS, EFS, or S3FS in `pfs.jl` and `utils_pfs.jl` and ensure that we are not writing to the same location from different nodes, and not reading without the necessary barriers or `fsync`s; see the sketch below.
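A minimal sketch of the pattern that audit should enforce, assuming MPI.jl and the Julia 1.6-era `fd`, which returns an integer descriptor; `download_dataset` is a hypothetical stand-in for whatever routine materializes the dataset, not a real Banyan or AWS API:

```julia
using MPI

# Sketch only: one rank downloads to EFS, fsyncs, and publishes the file;
# every rank waits at a barrier before touching it.
function fetch_to_efs(comm::MPI.Comm, remote_id::String, efs_path::String)
    if MPI.Comm_rank(comm) == 0 && !isfile(efs_path)
        tmp = efs_path * ".tmp"              # rank-private temporary name
        open(tmp, "w") do io
            download_dataset(remote_id, io)  # hypothetical download routine
            flush(io)                        # drain Julia's userspace buffers
            ccall(:fsync, Cint, (Cint,), fd(io))  # POSIX fsync(2): commit to EFS
        end
        mv(tmp, efs_path; force=true)        # publish atomically under the final name
    end
    MPI.Barrier(comm)                        # no rank reads before rank 0 publishes
    return efs_path
end
```

The rename-after-fsync step matters: it keeps other ranks from ever seeing a half-written file under the name they will open, which is exactly the "truncated file: eof = ..." failure mode in the third trace above.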
Note that we have also had similar issues when interacting with EFS with recursive, forced `rm` and `cp`.
Based on https://docs.aws.amazon.com/efs/latest/ug/how-it-works.html, we need to make sure we `fsync` or close files to ensure consistency.
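As a concrete illustration of that rule, here is a hedged sketch of the write side under EFS's close-to-open consistency model (plain Julia, no Banyan code assumed):

```julia
# Data written here is only guaranteed visible to other nodes after an
# fsync or a close, per EFS's close-to-open consistency semantics.
function write_consistent(path::String, bytes::Vector{UInt8})
    open(path, "w") do io
        write(io, bytes)
        flush(io)                                  # userspace buffers -> kernel
        rc = ccall(:fsync, Cint, (Cint,), fd(io))  # kernel -> EFS server
        rc == 0 || error("fsync($path) failed")
    end  # the implicit close here also triggers close-to-open synchronization
end
```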
This is actually because of `.nfs*` files [1] being created in the process of recursively force-deleting files. However, this does not explain the failures with `cp`.
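One possible mitigation for the `rm` side is to retry while the only leftovers are `.nfs*` silly-rename files; this is a sketch under that assumption, not the fix we shipped:

```julia
# Retry a recursive delete as long as the failure is plausibly caused by
# transient .nfs* files, which disappear once the last open handle on
# another client is closed.
function rm_rf_retry(path::String; attempts::Int=5, delay::Real=1.0)
    for attempt in 1:attempts
        try
            rm(path; recursive=true, force=true)
            return
        catch err
            leftovers = isdir(path) ? readdir(path) : String[]
            if attempt == attempts || !all(startswith.(leftovers, ".nfs"))
                rethrow(err)  # not a silly-rename problem (or out of retries)
            end
            sleep(delay)      # give other clients time to release handles
        end
    end
end
```

A cleaner fix would be to guarantee every rank has closed its handles (e.g. with an MPI barrier) before any rank deletes the directory, so the `.nfs*` files never appear.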
Closing for now but will reopen if an issue comes up with `cp`.
Unable to recursively delete directory: