Open staticfloat opened 5 years ago
My possibly related issue is here: https://github.com/JuliaLang/julia/issues/30558. In my case, the problem was that the Julia versions differed between the head node and the compute nodes. The error message identifies this correctly, but what is printed on screen (and to stdout) is vastly different.
In this case the versions of Julia are the same, so I don’t think these two issues are related.
I should really edit that issue. I was trying to highlight that the error I was supposed to be seeing, error("Version read failed. Connection closed by peer."), was getting swallowed somewhere in the sequence of syncs and asyncs, so that what I eventually saw was completely uninformative about the actual issue.
Reading your issue again, it does seem it's not related. But from Discourse discussions, it does seem that Distributed has a few of these "disappearing error" problems.
This is because remote_do does not return any response to the caller. In your example above, remotecall will result in an error when the response is fetched, unlike in the case of remote_do.
julia> fetch(remotecall(print_message, 2, ("dd",)))
ERROR: On worker 2:
UndefVarError: #print_message not defined
deserialize_datatype at /Users/amitm/Julia/julia/usr/share/julia/stdlib/v1.1/Serialization/src/Serialization.jl:1051
handle_deserialize at /Users/amitm/Julia/julia/usr/share/julia/stdlib/v1.1/Serialization/src/Serialization.jl:743
deserialize at /Users/amitm/Julia/julia/usr/share/julia/stdlib/v1.1/Serialization/src/Serialization.jl:703
Regular errors in the case of remote_do are printed to stderr, as mentioned in the docs: https://docs.julialang.org/en/v1/stdlib/Distributed/#Distributed.remote_do-Tuple{Any,Integer,Vararg{Any,N}%20where%20N} .
julia> remote_do(()->error("dd"), 2)
Any[]
julia> From worker 2: dd
From worker 2: error(::String) at ./error.jl:33
From worker 2: (::getfield(Serialization.__deserialized_types__, Symbol("##29#30")))() at ./REPL[97]:1
From worker 2: (::getfield(Distributed, Symbol("##120#122")){Distributed.RemoteDoMsg})() at /Users/amitm/Julia/julia/usr/share/julia/stdlib/v1.1/Distributed/src/process_messages.jl:313
From worker 2: run_work_thunk(::getfield(Distributed, Symbol("##120#122")){Distributed.RemoteDoMsg}, ::Bool) at /Users/amitm/Julia/julia/usr/share/julia/stdlib/v1.1/Distributed/src/process_messages.jl:79
However, it appears that deserialization errors are not being printed to stderr for remote_do. It should have the same behavior as for regular errors.
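As a stopgap while remote_do drops these errors, a call that needs to report failures can go through remotecall_fetch instead, since that path returns a response to the caller. The following is only a sketch under assumptions: checked_call is a hypothetical helper (not part of Distributed), print_message stands in for the function from the report, and a single local worker is assumed.

```julia
using Distributed

# Assumption: one local worker is enough for the illustration.
nprocs() == 1 && addprocs(1)
w = first(workers())

# Defined on the master process only (no @everywhere), as in the report.
print_message(msg) = println("message: ", msg)

# Hypothetical helper: run f on worker w and hand any failure back to the
# caller as a value instead of letting it disappear.
function checked_call(f, w, args...)
    try
        return remotecall_fetch(f, w, args...)
    catch err
        return err
    end
end

result = checked_call(print_message, w, "dd")
if result isa Exception
    @warn "remote call failed" exception = result
end
```

The trade-off is that remotecall_fetch blocks until the worker replies, so this gives up the fire-and-forget behavior that remote_do is meant to provide.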
Example:
If you put @everywhere before the definition of print_message then everything works; but if you don't, there is no error; the worker just silently ignores it. I would expect an error to tell me what went wrong. I'm not sure at what point things are going wrong; it does not appear to be a runtime error, as this occurs even when I am sending a message to call a function that is defined, but which is supposed to invoke a callback that may not be defined. Example:
Notice how the "Within do_work" message is not emitted in the second case. My best guess is that something in the serialization code is throwing an error and that error is getting swallowed, but I haven't had time to debug this fully.
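One way to narrow down this kind of silent failure is to ask the worker whether the name exists there before sending the message. This is a rough sketch, not the original reproduction: defined_on is a hypothetical helper, print_message stands in for the function from the report, and it only checks top-level names in Main on the worker.

```julia
using Distributed

# Assumption: one local worker is enough for the illustration.
nprocs() == 1 && addprocs(1)
w = first(workers())

# Defined on the master process only (no @everywhere).
print_message(msg) = println("message: ", msg)

# Hypothetical helper: ask worker w whether `name` is defined in its Main.
defined_on(w, name::Symbol) = remotecall_fetch(isdefined, w, Main, name)

if defined_on(w, :print_message)
    remote_do(print_message, w, "dd")
else
    @warn "print_message is not defined on worker $w; define it with @everywhere first"
end
```

This costs one extra round trip per check, so it is better suited to debugging than to a hot path.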