JuliaLang / Distributed.jl

Create and control multiple Julia processes remotely for distributed computing. Ships as a Julia stdlib.
https://docs.julialang.org/en/v1/stdlib/Distributed/
MIT License

RemoteRef memory leak when serialized to a different worker #25

Open amitmurthy opened 9 years ago

amitmurthy commented 9 years ago

Scenario:

julia> addprocs(2)
2-element Array{Int64,1}:
 2
 3

julia> # create a reference on pid 2
       rr = RemoteRef(2)
RemoteRef(2,1,4)

julia> # See if anything has actually been created on worker 2
       Base.remote_do(2, ()->println(keys(Base.PGRP.refs)))

        From worker 2:  Any[]

julia> # Nope, nothing
       put!(rr, :OK)
RemoteRef(2,1,4)

julia> # Now, see again
       Base.remote_do(2, ()->println(keys(Base.PGRP.refs)))

        From worker 2:  Any[(1,4)]

julia> # It exists.

       # Let us send this reference to a 3rd worker.
       Base.remote_do(3, x->nothing, rr)

julia> # Check which workers are supposed to hold references to this RemoteRef
       Base.remote_do(2, ()->println(Base.PGRP.refs[(1,4)].clientset))

        From worker 2:  IntSet([1, 3])

julia> # 2 believes that 1 and 3 hold a reference

julia> # Clear locally and run gc()
       rr=nothing

julia> @everywhere gc()
julia> @everywhere gc()
julia> @everywhere gc()

julia> # 1 is cleared, but worker 2 believes that 3 continues to hold a reference
       Base.remote_do(2, ()->println(Base.PGRP.refs[(1,4)].clientset))

        From worker 2:  IntSet([3])

I have tracked it down to finalizers not being called on the RemoteRef. The finalizer sends a del_msg to the processes actually holding the value.
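
To make the bookkeeping concrete, here is a toy model of the clientset accounting visible in the transcript above. It is a sketch for Julia 1.x, not the Distributed.jl implementation: ToyRemoteValue, toy_refs, toy_add_client! and toy_del_client! are hypothetical names, and freeing the entry once its clientset is empty is an assumption made for illustration.

```julia
# Toy model only; all names here are hypothetical, not Distributed.jl API.
struct ToyRemoteValue
    clientset::Set{Int}   # pids believed to hold a reference
end

# Owner-side table, keyed by a reference id such as (1, 4).
const toy_refs = Dict{Tuple{Int,Int},ToyRemoteValue}()

# Record that worker `pid` holds a reference to `id`.
function toy_add_client!(id::Tuple{Int,Int}, pid::Int)
    rv = get!(() -> ToyRemoteValue(Set{Int}()), toy_refs, id)
    push!(rv.clientset, pid)
    return rv
end

# Handle a delete message from `pid`; drop the entry once no client remains
# (assumed behaviour for this sketch).
function toy_del_client!(id::Tuple{Int,Int}, pid::Int)
    rv = get(toy_refs, id, nothing)
    rv === nothing && return nothing
    delete!(rv.clientset, pid)
    isempty(rv.clientset) && delete!(toy_refs, id)
    return nothing
end

toy_add_client!((1, 4), 1)    # pid 1 created the reference
toy_add_client!((1, 4), 3)    # the reference was serialized to pid 3
toy_del_client!((1, 4), 1)    # pid 1's finalizer ran and sent its delete message
println(toy_refs[(1, 4)].clientset)   # pid 3 never sent one, so the entry survives
```

In this model the entry can only disappear once every client has sent its delete message, which is why a finalizer that never runs on one worker keeps the value alive on the owner.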

Finalizers are also not being called for regular objects when they are serialized to a remote worker.

julia> addprocs(2)
2-element Array{Int64,1}:
 2
 3

julia> # creates workers with pids 2 and 3

       @everywhere begin

       function finalize_foo(f)
           v = f.foo
           @schedule println("FOO finalized $v")
       end

       type Foo
           foo
           Foo(x) = (f=new(x); finalizer(f, finalize_foo); f)
       end

       function Base.serialize(s::SerializationState, f::Foo)
           invoke(serialize, Tuple{SerializationState, Any}, s, f)
       end

       function Base.deserialize(s::SerializationState, t::Type{Foo})
           f = invoke(deserialize, Tuple{SerializationState, DataType}, s, t)
           Foo(myid())
       end

       end

julia> Base.remote_do(3, x->nothing, Foo(0))
RemoteRef(3,1,6)

julia> @everywhere gc()
FOO finalized 0

julia> @everywhere gc()
julia> @everywhere gc()

As can be seen, only the local copy is reported ("FOO finalized 0"). The Foo created by deserialize on worker 3 was never finalized.
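
For anyone re-running this on Julia 1.x, note that the syntax has changed since this report: type is now mutable struct, gc() is GC.gc(), and finalizer takes the function first and the object second. Below is a minimal local sketch of the same finalizer bookkeeping under those assumptions. FooNew and finalized_count are hypothetical names, and finalize is called explicitly so the result does not depend on when the collector happens to run.

```julia
# Sketch for Julia 1.x; FooNew and finalized_count are illustrative names.
const finalized_count = Ref(0)

mutable struct FooNew
    foo
    function FooNew(x)
        f = new(x)
        # Julia 1.x argument order: finalizer(function, object).
        finalizer(_ -> (finalized_count[] += 1), f)
        return f
    end
end

f = FooNew(0)
finalize(f)     # run the registered finalizer immediately
println("finalizers run: ", finalized_count[])
```

Counting in a Ref instead of printing from the finalizer avoids task switches inside the finalizer, so a check like this can be asserted on directly.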

cc: @carnaval , @JeffBezanson

amitmurthy commented 9 years ago

Some progress:

addprocs(2)
rr = RemoteRef(2)
put!(rr, :OK)
Base.remote_do(3, x->nothing, rr)
rr=nothing
@everywhere gc()
@everywhere gc()
Base.remote_do(2, ()->println(Base.PGRP.refs[(1,4)].clientset))

# Execute a dummy remote_do again. This collects the previous ref
Base.remote_do(3, myid)
@everywhere gc()
Base.remote_do(2, ()->println(Base.PGRP.refs))

The second remote_do results in the reference finally being collected.

I tried changing https://github.com/JuliaLang/julia/blob/dbe94d156bbb07f0c30af6b49a42ab09416f5df7/base/multi.jl#L838-L846 to

            elseif is(msg, :do)
                f = deserialize(r_stream)
                args = deserialize(r_stream)
                #print("got args: $args\n")
                let f=f, args=args
                    @schedule begin
                        run_work_thunk(RemoteValue(), ()->f(args...))
                        f = nothing
                        args = nothing
                    end
                end
                f = nothing
                args = nothing

but that doesn't help.

Do let blocks also keep references? How do we clear them?

amitmurthy commented 9 years ago

Simpler example:

function foo(rr)
    while true
        b=take!(rr)
        let b=b
            f = x->nothing
            @schedule ()->f(b)
            b = nothing
        end
        b=nothing
    end
end

rr = RemoteRef()
@schedule foo(rr)

put!(rr, ones(10^8));
gc()
gc()
gc()

A reference to the array is held until the loop body runs again, for example after a put!(rr, :OK). The remote ref itself does not hold a reference, as evidenced by:

julia> isready(rr)
false

julia> Base.PGRP.refs
Dict{Any,Any} with 3 entries:
  (1,0) => Base.RemoteValue(false,nothing,Condition(Any[Task (waiting) @0x00007f8d3187f850]),Condition(Any[]),IntSet([1]),0)
  (1,2) => Base.RemoteValue(false,nothing,Condition(Any[Task (waiting) @0x00007f8d325fb6c0]),Condition(Any[]),IntSet([1]),0)
  (1,1) => Base.RemoteValue(false,nothing,Condition(Any[]),Condition(Any[]),IntSet([1]),0)

Removing the let statement makes the problem go away.

carnaval commented 9 years ago

Forgot to add a comment: as we discussed yesterday, this seems to be because the value is stored in a temporary gensym. @vtjnash?

amitmurthy commented 7 years ago

@yuyichao this - https://github.com/JuliaLang/Distributed.jl/issues/25 - is still an issue. Any ideas?