banyan-team / banyan-julia

A suite of familiar Julia APIs for bigger datasets with less cloud and lower costs.
https://banyancomputing.com
Apache License 2.0

Job sometimes crashes with EFS-related message #79

Closed: calebwin closed this issue 2 years ago

calebwin commented 2 years ago

Unable to recursively delete directory:

[ec2-user@ip-172-31-43-16 ~]$ cat banyan-output-for-job-2021-11-20-082050016fec17e589901c38f6cc0067f1d1bc
Sent end message
Getting next execution request
Getting next execution request
ERROR: LoadError: SystemError (with efs/job_2021-11-20-082050016fec17e589901c38f6cc0067f1d1bc_val_1360): rmdir: Directory not empty
Stacktrace:
 [1] systemerror(p::Symbol, errno::Int32; extrainfo::String)
   @ Base ./error.jl:168
 [2] #systemerror#62
   @ ./error.jl:167 [inlined]
 [3] rm(path::String; force::Bool, recursive::Bool)
   @ Base.Filesystem ./file.jl:290
 [4] Write(src::Nothing, part::DataFrames.DataFrame, params::Dict{String, Any}, batch_idx::Int64, nbatches::Int64, comm::MPI.Comm, loc_name::String, loc_params::Dict{String, Any})
   @ Banyan ~/3cdb527a673d0fb779b82f7d87defea517cb009d7076c3a26fc951f51fa89334/banyan-julia/Banyan/src/pfs.jl:458
 [5] exec_code(banyan_data::Dict{Any, Any})
   @ Main ./string:346
 [6] top-level scope
   @ ~/executor.jl:104
in expression starting at /home/ec2-user/executor.jl:83
srun: error: compute-dy-t3large-1: task 0: Exited with exit code 1
slurmstepd: error: compute-dy-t3large-1 [0] pmixp_client_v2.c:210 [_errhandler] mpi/pmix: ERROR: Error handler invoked: status = -25: Interrupted system call (4)
slurmstepd: error: *** STEP 2556.0 ON compute-dy-t3large-1 CANCELLED AT 2021-11-20T09:39:32 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: compute-dy-t3large-1: task 1: Killed
calebwin commented 2 years ago

Another time I got this issue:

ERROR: LoadError: unable to determine if efs/banyan_dataset_10244564515853285896 is accessible in the HDF5 format (file may not exist)
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] h5open(filename::String, mode::String; swmr::Bool, pv::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ HDF5 ~/.julia/packages/HDF5/pIJra/src/HDF5.jl:430
 [3] h5open(filename::String, mode::String)
   @ HDF5 ~/.julia/packages/HDF5/pIJra/src/HDF5.jl:412
 [4] ReadBlock(src::Nothing, params::Dict{String, Any}, batch_idx::Int64, nbatches::Int64, comm::MPI.Comm, loc_name::String, loc_params::Dict{String, Any})
   @ Banyan ~/17d11e66635497fe6fa18ecd7b2364743c53d52ab7270e07d9fb5e1556e5a3e6/banyan-julia/Banyan/src/pfs.jl:50
 [5] exec_code(banyan_data::Dict{Any, Any})
   @ Main ./string:39
 [6] top-level scope
   @ ~/executor.jl:104
in expression starting at /home/ec2-user/executor.jl:83
srun: error: compute-dy-t3large-2: task 0: Exited with exit code 1
slurmstepd: error: compute-dy-t3large-2 [0] pmixp_client_v2.c:210 [_errhandler] mpi/pmix: ERROR: Error handler invoked: status = -25: Interrupted system call (4)
slurmstepd: error: *** STEP 2600.0 ON compute-dy-t3large-2 CANCELLED AT 2021-11-22T01:12:01 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: compute-dy-t3large-2: task 1: Killed

[ Info: Destroying job with ID 2021-11-22-01093521e12ec925f60ef4f3a060e4baf06820
[ Info: Destroying job with ID 2021-11-22-01093521e12ec925f60ef4f3a060e4baf06820

Another instance:

ERROR: LoadError: unable to determine if efs/banyan_dataset_10244564515853285896 is accessible in the HDF5 format (file may not exist)
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] h5open(filename::String, mode::String; swmr::Bool, pv::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ HDF5 ~/.julia/packages/HDF5/pIJra/src/HDF5.jl:430
 [3] h5open(filename::String, mode::String)
   @ HDF5 ~/.julia/packages/HDF5/pIJra/src/HDF5.jl:412
 [4] ReadBlock(src::Nothing, params::Dict{String, Any}, batch_idx::Int64, nbatches::Int64, comm::MPI.Comm, loc_name::String, loc_params::Dict{String, Any})
   @ Banyan ~/17d11e66635497fe6fa18ecd7b2364743c53d52ab7270e07d9fb5e1556e5a3e6/banyan-julia/Banyan/src/pfs.jl:50
 [5] exec_code(banyan_data::Dict{Any, Any})
   @ Main ./string:39
 [6] top-level scope
   @ ~/executor.jl:104
in expression starting at /home/ec2-user/executor.jl:83
srun: error: compute-dy-t3large-1: task 0: Exited with exit code 1
slurmstepd: error: compute-dy-t3large-1 [0] pmixp_client_v2.c:210 [_errhandler] mpi/pmix: ERROR: Error handler invoked: status = -25: Interrupted system call (4)
slurmstepd: error: *** STEP 2614.0 ON compute-dy-t3large-1 CANCELLED AT 2021-11-22T04:03:50 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: compute-dy-t3large-1: task 1: Killed

[ Info: Destroying job with ID 2021-11-22-04011057fe598031c50960ab5c834d2b42f286
[ Info: Destroying job with ID 2021-11-22-04011057fe598031c50960ab5c834d2b42f286
calebwin commented 2 years ago

A variation of this error:

    Building MPI   `~/.julia/scratchspaces/44cfe95a-1eb2-52ea-b672-e2afdf69b78f/340d8dc89e1c85a846d3f38ee294bfdd1684055a/build.log`
    Building HDF5   `~/.julia/scratchspaces/44cfe95a-1eb2-52ea-b672-e2afdf69b78f/698c099c6613d7b7f151832868728f426abe698b/build.log`
  Activating environment at `~/17d11e66635497fe6fa18ecd7b2364743c53d52ab7270e07d9fb5e1556e5a3e6/banyan-julia/BanyanArrays/test/Project.toml`
  Activating environment at `~/17d11e66635497fe6fa18ecd7b2364743c53d52ab7270e07d9fb5e1556e5a3e6/banyan-julia/BanyanArrays/test/Project.toml`
Getting next execution request
Getting next execution request
Getting next execution request
HDF5-DIAG: Error detected in HDF5 (1.12.1) MPI-process 0:
  #000: H5F.c line 620 in H5Fopen(): unable to open file
    major: File accessibility
    minor: Unable to open file
  #001: H5VLcallback.c line 3502 in H5VL_file_open(): failed to iterate over available VOL connector plugins
    major: Virtual Object Layer
    minor: Iteration failed
  #002: H5PLpath.c line 579 in H5PL__path_table_iterate(): can't iterate over plugins in plugin path '(null)'
    major: Plugin for dynamically loaded library
    minor: Iteration failed
  #003: H5PLpath.c line 620 in H5PL__path_table_iterate_process_path(): can't open directory: /usr/local/hdf5/lib/plugin
    major: Plugin for dynamically loaded library
    minor: Can't open directory or file
  #004: H5VLcallback.c line 3351 in H5VL__file_open(): open failed
    major: Virtual Object Layer
    minor: Can't open object
  #005: H5VLnative_file.c line 97 in H5VL__native_file_open(): unable to open file
    major: File accessibility
    minor: Unable to open file
  #006: H5Fint.c line 1990 in H5F_open(): unable to read superblock
    major: File accessibility
    minor: Read failed
  #007: H5Fsuper.c line 617 in H5F__super_read(): truncated file: eof = 2026732, sblock->base_addr = 0, stored_eof = 2402848
    major: File accessibility
    minor: File has been truncated
ERROR: LoadError: Error opening file efs/banyan_dataset_10244564515853285896
Stacktrace:
 [1] error(::String, ::String)
   @ Base ./error.jl:42
 [2] h5f_open(pathname::String, flags::UInt16, fapl_id::HDF5.Properties)
   @ HDF5 ~/.julia/packages/HDF5/pIJra/src/api.jl:761
 [3] h5open(filename::String, mode::String; swmr::Bool, pv::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ HDF5 ~/.julia/packages/HDF5/pIJra/src/HDF5.jl:436
 [4] h5open(filename::String, mode::String)
   @ HDF5 ~/.julia/packages/HDF5/pIJra/src/HDF5.jl:412
 [5] ReadBlock(src::Nothing, params::Dict{String, Any}, batch_idx::Int64, nbatches::Int64, comm::MPI.Comm, loc_name::String, loc_params::Dict{String, Any})
   @ Banyan ~/17d11e66635497fe6fa18ecd7b2364743c53d52ab7270e07d9fb5e1556e5a3e6/banyan-julia/Banyan/src/pfs.jl:50
 [6] exec_code(banyan_data::Dict{Any, Any})
   @ Main ./string:32
 [7] top-level scope
   @ ~/executor.jl:104
in expression starting at /home/ec2-user/executor.jl:83
srun: error: compute-dy-t3large-1: task 0: Exited with exit code 1
slurmstepd: error: compute-dy-t3large-1 [0] pmixp_client_v2.c:210 [_errhandler] mpi/pmix: ERROR: Error handler invoked: status = -25: Interrupted system call (4)
slurmstepd: error: *** STEP 2618.0 ON compute-dy-t3large-1 CANCELLED AT 2021-11-22T04:21:50 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: compute-dy-t3large-1: task 1: Killed

[ Info: Destroying job with ID 2021-11-22-041923e06d7d18a7fa51cc3d537261d385b9ab
[ Info: Destroying job with ID 2021-11-22-041923e06d7d18a7fa51cc3d537261d385b9ab
calebwin commented 2 years ago

This issue could be because we are downloading to EFS multiple times on different nodes, potentially causing conflicts. It would be worth going through all of the places where we interact with the shared NFS, EFS, or S3FS in pfs.jl and utils_pfs.jl and making sure that we are not writing to the same location from different nodes, and not reading without the necessary barriers or fsyncs.
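
For illustration, a minimal sketch of the intended pattern (this is not Banyan's actual code in pfs.jl; the download! callback and the path are hypothetical placeholders):

using MPI

# Hedged sketch: only one rank materializes the shared file on EFS, and every
# rank waits at a barrier before opening it, so no two nodes write the same
# path and no node reads a half-written file.
function fetch_shared(download!, comm::MPI.Comm, path::String)
    if MPI.Comm_rank(comm) == 0 && !isfile(path)
        download!(path)              # exactly one node writes the shared copy
    end
    MPI.Barrier(comm)                # readers wait until the writer has finished
    return read(path)                # every node now sees the same, complete bytes
end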

calebwin commented 2 years ago

Note that we have also hit similar issues when interacting with EFS via recursive, forced rm and cp.

calebwin commented 2 years ago

Based on https://docs.aws.amazon.com/efs/latest/ug/how-it-works.html, we need to make sure we fsync or close files before other nodes read them in order to ensure consistency.
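
For example, a minimal sketch (not taken from the codebase; the path and payload are illustrative) of publishing a file so that other nodes can safely open it afterwards:

# Hedged sketch: flush Julia's buffer and close the handle before any other
# node opens the file; closing is what EFS/NFS close-to-open consistency keys on.
function publish_to_efs(path::String, payload::Vector{UInt8})
    open(path, "w") do io
        write(io, payload)
        flush(io)   # drain Julia's user-space buffer to the OS
    end             # the do-block closes the file here, publishing it to other clients
end

publish_to_efs("efs/banyan_dataset_example", rand(UInt8, 1024))  # path is illustrative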

calebwin commented 2 years ago

This is actually because of .nfs* files [1]: NFS creates these when a file is deleted while a client still has it open, so they linger in the directory during a recursive, forced delete and cause rmdir to report that the directory is not empty. However, this does not explain the failures with cp. One possible mitigation is sketched below.

[1] https://www.ibm.com/support/pages/what-are-nfs-files-accumulate-and-why-cant-they-be-deleted-even-after-stopping-cognos-8
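
A minimal sketch of such a mitigation, assuming the stale .nfs* entries go away once the clients holding them close their handles (this helper is hypothetical, not something the codebase currently contains):

# Hedged sketch: recursively delete a directory on NFS/EFS while tolerating the
# transient .nfs* "silly rename" files that appear when files are deleted while
# other clients still hold them open. Retries until the directory can be removed.
function rm_nfs_tolerant(dir::String; retries::Int = 10, delay::Float64 = 0.5)
    for (root, _, files) in walkdir(dir; topdown = false)
        for f in files
            startswith(f, ".nfs") && continue        # leave silly-renamed files to NFS
            rm(joinpath(root, f); force = true)
        end
    end
    for _ in 1:retries
        try
            rm(dir; force = true, recursive = true)  # may still hit "Directory not empty"
            return true
        catch e
            (e isa Base.IOError || e isa SystemError) || rethrow()
            sleep(delay)                             # wait for remote handles to be closed
        end
    end
    return false                                     # caller decides how to handle leftovers
end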

calebwin commented 2 years ago

Closing for now but will reopen if an issue comes up with cp.