JuliaLang / Distributed.jl

Create and control multiple Julia processes remotely for distributed computing. Ships as a Julia stdlib.
https://docs.julialang.org/en/v1/stdlib/Distributed/
MIT License

`Distributed.send_msg()` silently drops messages containing methods not defined on workers #58

Open staticfloat opened 5 years ago

staticfloat commented 5 years ago

Example:

using Distributed
import Distributed: worker_from_id, MsgHeader, RemoteDoMsg, send_msg
w = worker_from_id(addprocs(1)[1])

# Put `@everywhere` before this to get it to work
print_message(args...) = println(args...)

# This works
send_msg(w, MsgHeader(), RemoteDoMsg(println, ("hello!",), ()))

# This doesn't
send_msg(w, MsgHeader(), RemoteDoMsg(print_message, ("hello!",), ()))

If you put @everywhere before the definition of print_message, everything works. Without it there is no error; the worker just silently ignores the message. I would expect an error telling me what went wrong. I'm not sure at what point things go wrong. It does not appear to be a runtime error on the worker, because the same thing happens when the message calls a function that is defined on the worker but is passed a callback that may not be. Example:

using Distributed
import Distributed: worker_from_id, MsgHeader, RemoteDoMsg, send_msg
w = worker_from_id(addprocs(1)[1])

@everywhere function do_work(callback, args...)
    @info("Within do_work!")
    callback(args...)
end

# Put `@everywhere` before this to get it to work
print_message(args...) = println(args...)

# This works
send_msg(w, MsgHeader(), RemoteDoMsg(do_work, (println, "hello!",), ()))

# This doesn't
send_msg(w, MsgHeader(), RemoteDoMsg(do_work, (print_message, "hello!",), ()))

Notice how the "Within do_work" message is not emitted in the second case. My best guess is that something in the serialization code is throwing an error and that error is getting swallowed, but I haven't had time to debug this fully.

affans commented 5 years ago

My possibly related issue is here: https://github.com/JuliaLang/julia/issues/30558 In my case, the problem was that the Julia versions on the head node and the compute nodes were different. The error message identifies this correctly, but what's printed on screen (and stdout) is vastly different.

staticfloat commented 5 years ago

In this case the versions of Julia are the same, so I don’t think these two issues are related.

affans commented 5 years ago

I should really edit that issue. I was trying to highlight that the error I was supposed to be seeing, error("Version read failed. Connection closed by peer."), was getting swallowed somewhere in the sequence of syncs and asyncs, so what I eventually saw was completely uninformative about the actual issue.

Reading your issue again, it does seem it's not related. But from Discourse discussions, it does seem that Distributed has a few of these "disappearing" problems.

amitmurthy commented 5 years ago

This is because remote_do does not return any response to the caller. In your example above, a remotecall would result in an error when the response is fetched, unlike remote_do:

julia> fetch(remotecall(print_message, 2, ("dd",)))
ERROR: On worker 2:
UndefVarError: #print_message not defined
deserialize_datatype at /Users/amitm/Julia/julia/usr/share/julia/stdlib/v1.1/Serialization/src/Serialization.jl:1051
handle_deserialize at /Users/amitm/Julia/julia/usr/share/julia/stdlib/v1.1/Serialization/src/Serialization.jl:743
deserialize at /Users/amitm/Julia/julia/usr/share/julia/stdlib/v1.1/Serialization/src/Serialization.jl:703
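The same check can be applied to the callback case from the original report. The following is a minimal sketch, assuming one freshly added local worker; it swaps the `send_msg`/`RemoteDoMsg` call for `remotecall_fetch`, so it is a diagnostic substitute rather than a fix for `remote_do`. The variable name `outcome` is only illustrative.

```julia
using Distributed

pid = first(addprocs(1))

# Defined on every process, as in the report.
@everywhere function do_work(callback, args...)
    @info("Within do_work!")
    callback(args...)
end

# Defined on the master only (no @everywhere), as in the report.
print_message(args...) = println(args...)

# remotecall_fetch waits for a response, so a failure on the worker is
# rethrown on the caller instead of being dropped.
outcome = try
    remotecall_fetch(do_work, pid, print_message, "hello!")
catch err
    err
end

if outcome isa RemoteException
    # `captured.ex` holds the original exception raised on the worker.
    println("worker ", outcome.pid, " failed: ", outcome.captured.ex)
end

rmprocs(pid)
```

If the worker cannot deserialize `print_message`, the exception arrives before `do_work` ever runs, which would match the missing "Within do_work!" log line.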

Regular errors in the case of remote_do are printed to stderr, as mentioned in the docs: https://docs.julialang.org/en/v1/stdlib/Distributed/#Distributed.remote_do-Tuple{Any,Integer,Vararg{Any,N}%20where%20N} .

julia> remote_do(()->error("dd"), 2)
Any[]

julia>       From worker 2: dd
      From worker 2:    error(::String) at ./error.jl:33
      From worker 2:    (::getfield(Serialization.__deserialized_types__, Symbol("##29#30")))() at ./REPL[97]:1
      From worker 2:    (::getfield(Distributed, Symbol("##120#122")){Distributed.RemoteDoMsg})() at /Users/amitm/Julia/julia/usr/share/julia/stdlib/v1.1/Distributed/src/process_messages.jl:313
      From worker 2:    run_work_thunk(::getfield(Distributed, Symbol("##120#122")){Distributed.RemoteDoMsg}, ::Bool) at /Users/amitm/Julia/julia/usr/share/julia/stdlib/v1.1/Distributed/src/process_messages.jl:79

However, it appears that deserialization errors are not being printed to stderr for remote_do. It should have the same behavior as for regular errors.
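Until deserialization errors are reported for remote_do, one possible workaround is to probe the payload with a call that does return a response. The sketch below is only an illustration: `checked_remote_do` is a hypothetical helper, not part of Distributed, and it costs an extra round trip and sends the payload twice.

```julia
using Distributed

# Hypothetical helper, not part of Distributed: send the payload through
# remotecall_fetch first, so that a worker-side deserialization failure is
# thrown on the caller as a RemoteException before remote_do is attempted.
function checked_remote_do(f, pid::Integer, args...)
    # The worker has to deserialize `f` and `args` to run this no-op.
    remotecall_fetch(_ -> nothing, pid, (f, args...))
    remote_do(f, pid, args...)
end

pid = first(addprocs(1))

# Defined on the master only (no @everywhere), as in the report.
print_message(args...) = println(args...)

err = try
    checked_remote_do(print_message, pid, "hello!")
    nothing
catch e
    e
end

println(err isa RemoteException)

rmprocs(pid)
```

The probe only validates that the arguments can be deserialized on the worker; errors raised later while `f` runs are still reported through the worker's stderr, as described above.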