JuliaParallel / DistributedArrays.jl

Distributed Arrays in Julia
Other
197 stars 35 forks source link

fix `close` error due to race in `d_closeall` #248

Closed tanmaykm closed 1 year ago

tanmaykm commented 1 year ago

It seems possible that DistributedArrays.d_closeall() may encounter a condition where it finds a darray id in the registry, but the corresponding weakref value is nothing because the referenced darray got garbage collected. It has been enountered many times in CI and elsewhere, but is hard to replicate normally. Adding a check for the weakref value, before actually invoking close on it, to fix it.

fixes #246

tanmaykm commented 1 year ago

The below code that adds some code to regulate GC behavior can recreate the situation.

julia> using Distributed

julia> addprocs(4);

julia> @everywhere begin
           using Distributed, DistributedArrays
       end

julia> A = rand(1:100, (100,100));

julia> DA = distribute(A);

julia> GC.enable(false)
true

julia> DA = nothing

julia> function DistributedArrays.d_from_weakref_or_d(id)
           d = get(DistributedArrays.registry, id, nothing)
           GC.enable(true)
           GC.gc()
           isa(d, WeakRef) && return d.value
           return d
       end

julia> DistributedArrays.d_closeall()
ERROR: MethodError: no method matching close(::Nothing)

Closest candidates are:
  close(::Union{Base.AsyncCondition, Timer})
   @ Base asyncevent.jl:162
  close(::Union{FileWatching.FileMonitor, FileWatching.FolderMonitor, FileWatching.PollingFileWatcher})
   @ FileWatching /data/Work/julia/binaries/julia-1.9.3/share/julia/stdlib/v1.9/FileWatching/src/FileWatching.jl:328
  close(::DArray)
   @ DistributedArrays ~/.julia/dev/DistributedArrays/src/core.jl:34
  ...

Stacktrace:
 [1] d_closeall()
   @ DistributedArrays ~/.julia/dev/DistributedArrays/src/core.jl:47
 [2] top-level scope
   @ REPL[9]:1
andreasnoack commented 1 year ago

Can we be sure that the remote memory is properly freed if the reference has been garbage collected?

tanmaykm commented 1 year ago

Yes, I think so, because the finalizer invokes close. Did a small test to confirm that:

julia> using Distributed

julia> addprocs(4);

julia> @everywhere begin
           using Distributed, DistributedArrays

           function print_registry_entries()
               for k in keys(DistributedArrays.registry)
                   hasval = !(DistributedArrays.d_from_weakref_or_d(k) === nothing)
                   println("key $k hasval $hasval")
               end
           end
       end

julia> @everywhere print_registry_entries()

julia> A = rand(1:100, (100,100));

julia> DA = distribute(A);

julia> GC.enable(false)
true

julia> DA = nothing

julia> @everywhere print_registry_entries()
key (1, 1) hasval true
      From worker 5:    key (1, 1) hasval true
      From worker 3:    key (1, 1) hasval true
      From worker 2:    key (1, 1) hasval true
      From worker 4:    key (1, 1) hasval true

julia> function DistributedArrays.d_from_weakref_or_d(id)
           d = get(DistributedArrays.registry, id, nothing)
           GC.enable(true)
           GC.gc()
           isa(d, WeakRef) && return d.value
           return d
       end

julia> @everywhere print_registry_entries()
key (1, 1) hasval false
      From worker 2:    key (1, 1) hasval true
      From worker 3:    key (1, 1) hasval true
      From worker 5:    key (1, 1) hasval true
      From worker 4:    key (1, 1) hasval true

julia> DistributedArrays.d_closeall()

julia> @everywhere print_registry_entries()

Would it be good to add this as a test?