JuliaData / MemPool.jl

High-performance parallel and distributed datastore for Julia

Caching error on linux when removing cache files #64

Open krynju opened 1 year ago

krynju commented 1 year ago

This error sometimes appears when the process exits:

IOError: unlink("/home/krynju/.mempool/sess-utvz1V-1/h2x1LD/jl_N2bctMjqbi"): no such file or directory (ENOENT)
Stacktrace:
 [1] uv_error
   @ ./libuv.jl:97 [inlined]
 [2] unlink(p::String)
   @ Base.Filesystem ./file.jl:972
 [3] rm(path::String; force::Bool, recursive::Bool)
   @ Base.Filesystem ./file.jl:283
 [4] rm(path::String; force::Bool, recursive::Bool) (repeats 2 times)
   @ Base.Filesystem ./file.jl:294
 [5] (::MemPool.var"#203#206"{Int64})()
   @ MemPool ~/.julia/packages/MemPool/Ggdm4/src/MemPool.jl:163
 [6] _atexit()
   @ Base ./initdefs.jl:372

jpsamaroo commented 1 year ago

This sounds like the rm(...; recursive=true) call in our atexit cleanup hook is racing with the eviction process. It isn't technically possible to ensure that all files are cleaned up in time, so we could pass force=true to ignore these errors, but that makes me slightly uncomfortable for reasons I can't pin down. Thoughts?
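A minimal sketch of what passing force=true would do, assuming a plain directory removal. The helper name `cleanup_dir` and the temporary directory are hypothetical stand-ins, not MemPool's actual exit hook or session path. The sketch does not reproduce the race itself; it only shows that rm with force=true treats an already-missing path as success, while the same call without it throws the IOError seen above.

```julia
# Hypothetical helper: `cleanup_dir` and `dir` are illustrative names,
# not MemPool's actual exit hook or session directory.
function cleanup_dir(dir::AbstractString)
    # force=true makes rm treat an already-missing path (ENOENT) as success,
    # and the flag is passed down to every entry during the recursive walk.
    rm(dir; force=true, recursive=true)
    return nothing
end

dir = mktempdir(; cleanup=false)
write(joinpath(dir, "jl_example"), "cached bytes")

cleanup_dir(dir)   # removes the directory and its contents
cleanup_dir(dir)   # path is already gone; no error is raised

# Without force=true, the same call on a missing path throws an IOError.
threw = try
    rm(dir; recursive=true)
    false
catch err
    err isa Base.IOError
end
```

One trade-off of this approach is that force=true also hides a missing path that was never expected to be missing, which may be part of the discomfort mentioned above.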

StevenWhitaker commented 11 months ago

FYI I have also sometimes seen this issue on WSL 2 Ubuntu when exiting Julia.

Also, it may actually be reproducible: I've gotten this error three times in a row with the MWE in https://github.com/JuliaParallel/DTables.jl/issues/60#issuecomment-1808665528, with enable_disk_caching!(50, 10^2 * 20) inserted after loading packages. (I just realized that's a typo; I meant 2^10 * 20.) Here is the output:

julia> include("mwe.jl")

julia> for i = 1:100 main() end
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
ERROR: On worker 2:
AssertionError: Failed to migrate 183.839 MiB for ref 349
Stacktrace:
  [1] #105
    @ ~/.julia/packages/MemPool/l9nLj/src/storage.jl:887
  [2] with_lock
    @ ~/.julia/packages/MemPool/l9nLj/src/lock.jl:80
  [3] #sra_migrate!#103
    @ ~/.julia/packages/MemPool/l9nLj/src/storage.jl:849
  [4] sra_migrate!
    @ ~/.julia/packages/MemPool/l9nLj/src/storage.jl:826 [inlined]
  [5] write_to_device!
    @ ~/.julia/packages/MemPool/l9nLj/src/storage.jl:817
  [6] #poolset#160
    @ ~/.julia/packages/MemPool/l9nLj/src/datastore.jl:386
  [7] #tochunk#139
    @ ~/.julia/packages/Dagger/M13n0/src/chunks.jl:267
  [8] tochunk (repeats 2 times)
    @ ~/.julia/packages/Dagger/M13n0/src/chunks.jl:259 [inlined]
  [9] #DTable#1
    @ ~/.julia/packages/DTables/BjdY2/src/table/dtable.jl:38
 [10] DTable
    @ ~/.julia/packages/DTables/BjdY2/src/table/dtable.jl:28
 [11] #create_dt_from_cols#9
    @ ~/tmp/mwe.jl:76
 [12] create_dt_from_cols
    @ ~/tmp/mwe.jl:68 [inlined]
 [13] update_value_col!
    @ ~/tmp/mwe.jl:88
 [14] query
    @ ~/tmp/mwe.jl:27
 [15] #invokelatest#2
    @ ./essentials.jl:819 [inlined]
 [16] invokelatest
    @ ./essentials.jl:816
 [17] #110
    @ ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:285
 [18] run_work_thunk
    @ ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:70
 [19] macro expansion
    @ ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:285 [inlined]
 [20] #109
    @ ./task.jl:514
Stacktrace:
 [1] remotecall_fetch(::Function, ::Distributed.Worker; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Distributed ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:465
 [2] remotecall_fetch(::Function, ::Distributed.Worker)
   @ Distributed ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:454
 [3] #remotecall_fetch#162
   @ ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:492 [inlined]
 [4] remotecall_fetch
   @ ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:492 [inlined]
 [5] main
   @ ~/tmp/mwe.jl:19 [inlined]
 [6] top-level scope
   @ ./REPL[2]:1

julia> # Exit Julia
┌ Warning: Worker 3 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/M13n0/src/sch/Sch.jl:529
┌ Warning: Worker 5 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/M13n0/src/sch/Sch.jl:529
┌ Warning: Worker 4 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/M13n0/src/sch/Sch.jl:529
      From worker 2:    IOError: unlink("/home/steven/.mempool/sess-Qsvl77-2/RHtbsR/jl_JWnIX2z29e"): no such file or directory (ENOENT)
      From worker 2:    Stacktrace:
      From worker 2:      [1]┌ Error: Fatal error on process 2
      From worker 2:    │   exception =
      From worker 2:    │    attempt to send to unknown socket
      From worker 2:    │    Stacktrace:
      From worker 2:    │     [1] error(s::String)
      From worker 2:    │       @ Base ./error.jl:35
      From worker 2:    │     [2] send_msg_unknown(s::Sockets.TCPSocket, header::Distributed.MsgHeader, msg::Distributed.ResultMsg)
      From worker 2:    │       @ Distributed ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/messages.jl:99
      From worker 2:    │     [3] send_msg_now(s::Sockets.TCPSocket, header::Distributed.MsgHeader, msg::Distributed.ResultMsg)
      From worker 2:    │       @ Distributed ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/messages.jl:115
      From worker 2:    │     [4] deliver_result(sock::Sockets.TCPSocket, msg::Symbol, oid::Distributed.RRID, value::Nothing)
      From worker 2:    │       @ Distributed ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:102
      From worker 2:    │     [5] macro expansion
      From worker 2:    │       @ ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:302 [inlined]
      From worker 2:    │     [6] (::Distributed.var"#113#115"{Distributed.CallWaitMsg, Distributed.MsgHeader, Sockets.TCPSocket})()
      From worker 2:    │       @ Distributed ./task.jl:514
      From worker 2:    └ @ Distributed ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:106
      From worker 2:     uv_error
      From worker 2:        @ ./libuv.jl:100 [inlined]
      From worker 2:      [2] unlink(p::String)
      From worker 2:        @ Base.Filesystem ./file.jl:972
      From worker 2:      [3] rm(path::String; force::Bool, recursive::Bool)
      From worker 2:        @ Base.Filesystem ./file.jl:283
      From worker 2:      [4] rm(path::String; force::Bool, recursive::Bool) (repeats 2 times)
      From worker 2:        @ Base.Filesystem ./file.jl:294
      From worker 2:      [5] rm
      From worker 2:        @ ./file.jl:273 [inlined]
      From worker 2:      [6] exit_hook()
      From worker 2:        @ MemPool ~/.julia/packages/MemPool/l9nLj/src/MemPool.jl:152
      From worker 2:      [7] _atexit(exitcode::Int32)
      From worker 2:        @ Base ./initdefs.jl:416
      From worker 2:      [8] exit
      From worker 2:        @ ./initdefs.jl:28 [inlined]
      From worker 2:      [9] exit()
      From worker 2:        @ Base ./initdefs.jl:29
      From worker 2:     [10] #invokelatest#2
      From worker 2:        @ ./essentials.jl:819 [inlined]
      From worker 2:     [11] invokelatest(::Any)
      From worker 2:        @ Base ./essentials.jl:816
      From worker 2:     [12] (::Distributed.var"#118#120"{Distributed.RemoteDoMsg})()
      From worker 2:        @ Distributed ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:308
      From worker 2:     [13] run_work_thunk(thunk::Distributed.var"#118#120"{Distributed.RemoteDoMsg}, print_error::Bool)
      From worker 2:        @ Distributed ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:70
      From worker 2:     [14] (::Distributed.var"#117#119"{Distributed.RemoteDoMsg})()
      From worker 2:        @ Distributed ./task.jl:514
┌ Warning: Worker 2 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/M13n0/src/sch/Sch.jl:529

EDIT: I corrected my typo. Now I don't get the AssertionError, but I still get the IOError when exiting Julia.
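
For scale, the mistyped second argument is roughly a tenth of the intended one. The variable names below are illustrative only, and the unit of this argument is not stated in this thread.

```julia
# Second argument passed to enable_disk_caching! in the two runs.
typo_limit     = 10^2 * 20   # 2000
intended_limit = 2^10 * 20   # 20480

# The intended limit is a little over ten times larger than the typo.
ratio = intended_limit / typo_limit
```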

jpsamaroo commented 11 months ago

The IOError is generally harmless; the file will be removed one way or the other (if it isn't, let me know!). The AssertionError should be mostly "fixed" on master, but we might need to be a bit more eager about freeing data to stay within the size bounds we've set.