JuliaLang / Distributed.jl

Create and control multiple Julia processes remotely for distributed computing. Ships as a Julia stdlib.
https://docs.julialang.org/en/v1/stdlib/Distributed/
MIT License

Distributed remote references + precompilation #53

Open quinnj opened 6 years ago

quinnj commented 6 years ago

Currently, remote refs (RemoteChannel, Future) are a bit of a silent killer if used in a precompiled module. Though our precompilation docs spell out in great detail the various constructors/patterns that should be avoided, they say nothing about remote refs.
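For context, the precompilation docs recommend deferring runtime-only state to a module's `__init__` function. Applied to a remote ref, that pattern might look like the sketch below. The module name `ConfigHolder` and the global `CONFIG` are hypothetical, and this is only a possible workaround under those assumptions, not something Distributed enforces or documents for remote refs.

```julia
module ConfigHolder   # hypothetical module name, for illustration only

using Distributed

# Empty placeholder: no remote ref is created while the module is precompiled.
const CONFIG = Ref{RemoteChannel{Channel{Any}}}()

function __init__()
    # Runs every time the module is loaded into a live session,
    # so the channel gets its id from that session, not from the cache.
    CONFIG[] = RemoteChannel(() -> Channel{Any}(1), 1)
end

end # module

# Usage on the process that loaded the module:
put!(ConfigHolder.CONFIG[], :some_setting)
value = take!(ConfigHolder.CONFIG[])
```

Note that every process loading the module would run `__init__` and create its own channel on pid 1, so sharing a single channel across workers would still need the ref to be passed around explicitly.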

Digging into the implementation, however, reveals that remote refs use a global counter to identify themselves. There are also likely to be issues with the worker id of the remote ref, but I didn't personally run into that (in my use-case, I was always creating RemoteChannels on pid 1 as "config variables" that all other worker processes could reference).

The problem manifests as invalid results: e.g. the first Future created on pid 1 at runtime will probably have the exact RRID of a precompiled RemoteChannel, and hence isready, wait, and fetch return results from the precompiled RemoteChannel instead of the Future. Quite a lovely surprise!
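To make the collision concrete, here is a toy model of a counter-based id. It is only a sketch: `ToyRRID`, `COUNTER`, and `next_rrid` are made-up names, not the actual Distributed internals, and the "sessions" are simulated by resetting the counter by hand.

```julia
# Toy model only; these names are illustrative, not Distributed.jl internals.
struct ToyRRID
    whence::Int   # pid that created the ref
    id::Int       # value taken from a per-process counter
end

const COUNTER = Ref(0)

function next_rrid(pid::Int)
    COUNTER[] += 1
    return ToyRRID(pid, COUNTER[])
end

# "Precompile session": a module-level ref is created and baked into the cache.
cached = next_rrid(1)

# "Runtime session": a fresh process starts its counter from scratch,
# while the cached ref comes back with its old id.
COUNTER[] = 0
fresh = next_rrid(1)

cached == fresh   # true: the two refs cannot be told apart by id
```

Any lookup keyed on such an id would treat the cached ref and the new one as the same object, which matches the symptom described above.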

Obviously we want to avoid this situation of things seeming completely broken, so if that involves throwing explicit errors when precompiling a module w/ global RemoteChannel or Future variables, that seems safest to me. I'm not aware of all the precompilation magic that's available, though, in case we could actually make this work.
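As a rough idea of what an explicit error could look like, the sketch below wraps the constructor in a guard. `is_precompiling` and `checked_remotechannel` are hypothetical helpers, not part of Distributed, and the guard assumes the commonly used `ccall(:jl_generating_output, Cint, ())` check for "currently generating precompile output". It is also stricter than what I described: it rejects any RemoteChannel created during precompilation, not just ones stored in globals.

```julia
using Distributed

# Hypothetical helper: true while Julia is generating precompile output.
is_precompiling() = ccall(:jl_generating_output, Cint, ()) == 1

# Hypothetical wrapper, not a Distributed.jl API.
function checked_remotechannel(pid::Integer = myid())
    if is_precompiling()
        error("RemoteChannel created during precompilation; create it in __init__ instead")
    end
    return RemoteChannel(pid)
end

# In a normal session the guard is a no-op:
rc = checked_remotechannel()
```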

Ultimately, I think we need something more than just docs here.

iamed2 commented 5 years ago

Hmm, this sounds like the RemoteRef is being serialized using a ClusterSerializer. See https://github.com/JuliaLang/julia/pull/22836

I would expect the RRID to be 0 and for operations on the deserialized RemoteRef to fail.

Here's how this functions on non-ClusterSerializer Serializers on 0.6:

julia [invenia]> serialize(io, Future())
ERROR: ArgumentError: elements of IntSet must be between 1 and typemax(Int)
Stacktrace:
 [1] _throw_intset_bounds_err() at ./intset.jl:64
 [2] push! at ./intset.jl:68 [inlined]
 [3] (::Base.Distributed.##133#134{Base.Distributed.RRID,Int64})() at ./distributed/remotecall.jl:249
 [4] lock(::Base.Distributed.##133#134{Base.Distributed.RRID,Int64}, ::Base.Threads.RecursiveTatasLock) at ./lock.jl:101
 [5] add_client at ./distributed/remotecall.jl:247 [inlined]
 [6] send_add_client(::Future, ::Int64) at ./distributed/remotecall.jl:262
 [7] serialize(::SerializationState{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::Future, ::Bool) at ./distributed/remotecall.jl:281
 [8] serialize(::Base.AbstractIOBuffer{Array{UInt8,1}}, ::Future) at ./serialize.jl:630

and 0.7:

julia> using Serialization

julia> io = IOBuffer()
IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=0, maxsize=Inf, ptr=1, mark=-1)

julia> serialize(io, Future())
WARNING: Base.Future is deprecated: it has been moved to the standard library package `Distributed`.
Add `using Distributed` to your imports.
 in module Main

julia> seekstart(io)
IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=54, maxsize=Inf, ptr=1, mark=-1)

julia> deserialize(io)
Distributed.Future(0, 0, 0, nothing)

julia> fetch(ans)
ERROR: no process with id 0 exists
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] worker_from_id(::Distributed.ProcessGroup, ::Int64) at /Users/ericdavies/repos/juliamaster/usr/share/julia/stdlib/v0.7/Distributed/src/cluster.jl:913
 [3] worker_from_id at /Users/ericdavies/repos/juliamaster/usr/share/julia/stdlib/v0.7/Distributed/src/cluster.jl:905 [inlined]
 [4] #remotecall_fetch#152 at /Users/ericdavies/repos/juliamaster/usr/share/julia/stdlib/v0.7/Distributed/src/remotecall.jl:392 [inlined]
 [5] remotecall_fetch at /Users/ericdavies/repos/juliamaster/usr/share/julia/stdlib/v0.7/Distributed/src/remotecall.jl:392 [inlined]
 [6] call_on_owner at /Users/ericdavies/repos/juliamaster/usr/share/julia/stdlib/v0.7/Distributed/src/remotecall.jl:465 [inlined]
 [7] fetch(::Distributed.Future) at /Users/ericdavies/repos/juliamaster/usr/share/julia/stdlib/v0.7/Distributed/src/remotecall.jl:497
 [8] top-level scope at none:0