JuliaLang / Distributed.jl

Create and control multiple Julia processes remotely for distributed computing. Ships as a Julia stdlib.
https://docs.julialang.org/en/v1/stdlib/Distributed/
MIT License

Distributed.jl should verify Julia version between primary-worker #49

Open Naikless opened 2 years ago

Naikless commented 2 years ago

As described already here, running

using Distributed
addprocs(["<remotename>"], exename="<pathToRemoteJuliaExe>", dir="<pathToRemoteHomeDir>")

@fetch myid()

on a host with Julia 1.6.1 and connecting to a remote with Julia 1.7.2 leads to

ERROR: On worker 2:
TypeError: non-boolean (Nothing) used in boolean context
Stacktrace:
  [1] deserialize
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Serialization/src/Serialization.jl:1166
  [2] handle_deserialize
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Serialization/src/Serialization.jl:947
  [3] deserialize
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Serialization/src/Serialization.jl:801 [inlined]
  [4] deserialize
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Serialization/src/Serialization.jl:1018
  [5] handle_deserialize
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Serialization/src/Serialization.jl:947
  [6] deserialize_fillarray!
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Serialization/src/Serialization.jl:1230
  [7] deserialize_array
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Serialization/src/Serialization.jl:1222
  [8] handle_deserialize
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Serialization/src/Serialization.jl:852
  [9] deserialize
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Serialization/src/Serialization.jl:801 [inlined]
 [10] deserialize_typename
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Serialization/src/Serialization.jl:1296
 [11] deserialize
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Distributed/src/clusterserialize.jl:68
 [12] handle_deserialize
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Serialization/src/Serialization.jl:947
 [13] deserialize
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Serialization/src/Serialization.jl:801
 [14] handle_deserialize
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Serialization/src/Serialization.jl:858
 [15] deserialize
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Serialization/src/Serialization.jl:801
 [16] handle_deserialize
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Serialization/src/Serialization.jl:861
 [17] deserialize
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Serialization/src/Serialization.jl:801 [inlined]
 [18] deserialize_msg
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Distributed/src/messages.jl:87
 [19] #invokelatest#2
    @ ./essentials.jl:716 [inlined]
 [20] invokelatest
    @ ./essentials.jl:714 [inlined]
 [21] message_handler_loop
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:169
 [22] process_tcp_streams
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:126
 [23] #99
    @ ./task.jl:423
Stacktrace:
 [1] #remotecall_fetch#143
   @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:394 [inlined]
 [2] remotecall_fetch(::Function, ::Distributed.Worker)
   @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:386
 [3] remotecall_fetch(::Function, ::Int64; kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:421
 [4] remotecall_fetch(::Function, ::Int64)
   @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:421
 [5] top-level scope
   @ none:1

However, when I use Julia 1.7.2 instead of 1.6.1 on the host, everything works as expected. Both systems run CentOS 7.

vchuravy commented 2 years ago

The serialization format is not stable between versions, and thus mixed versions are not supported (see https://docs.julialang.org/en/v1/stdlib/Serialization/).

I think what would be a great addition is during initial connection to check that the Julia version is the same and otherwise error.
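A minimal sketch of what such a connection-time check might look like. The helper name `require_same_version` and its `strict` keyword are hypothetical and not part of Distributed.jl. How the worker's version reaches the primary is deliberately left open here, since an ordinary remote call would go through the same serialization path that fails above.

```julia
# Hypothetical helper, not part of Distributed.jl: compare the primary's
# version with the version a worker reported while connecting.
function require_same_version(primary::VersionNumber, worker::VersionNumber;
                              strict::Bool = false)
    # With strict = false only major.minor must agree. Whether patch releases
    # are always wire-compatible is an assumption of this sketch.
    same = if strict
        primary == worker
    else
        (primary.major, primary.minor) == (worker.major, worker.minor)
    end
    same || error("Julia version mismatch: primary runs $primary, " *
                  "worker runs $worker; mixed-version clusters are not supported")
    return nothing
end

require_same_version(v"1.7.2", v"1.7.2")  # same version: returns nothing
```

With the versions from this report, `require_same_version(v"1.6.1", v"1.7.2")` would throw immediately with a message naming both versions, instead of failing later inside `deserialize`.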

Naikless commented 2 years ago

Thanks for clearing this up!

At the very least, this should be mentioned in the documentation of the Distributed module, where I currently couldn’t find any hint about this kind of issue.

For someone not as familiar with the inner workings of the remote function calls this is especially confusing, because Julia in general claims to be mostly compatible for all 1.x versions.

giordano commented 2 years ago

> because Julia in general claims to be mostly compatible for all 1.x versions.

The code you write is mostly backward-compatible. Communication between different processes, which is the problem here, is a different matter.
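To make the distinction concrete: a remote call ships the function and its arguments as bytes produced by the Serialization stdlib, which is where the stack trace above fails. The following single-process round trip is only an approximation of that path (Distributed adds its own message framing on top), and the name `add_one` is purely illustrative.

```julia
using Serialization

# An anonymous function, similar in kind to what @fetch sends to a worker.
add_one = x -> x + 1

io = IOBuffer()
serialize(io, add_one)       # roughly what the sending process writes
seekstart(io)
restored = deserialize(io)   # roughly what the receiving process does

# invokelatest guards against the restored function living in a newer world age.
result = Base.invokelatest(restored, 3)
```

Within one Julia version the writer and the reader agree on the byte layout, so the round trip succeeds. Across versions that agreement is not guaranteed, even when the source code itself is compatible.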

Naikless commented 2 years ago

> I think what would be a great addition is during initial connection to check that the Julia version is the same and otherwise error.

It might be sufficient to improve the error message. Since https://github.com/JuliaLang/julia/pull/35376 introduced an explicit check for binaries coming from a Julia version higher than the local one, this could be extended to only allow the same version. However, the above conversation at least indicates that backward compatibility should be expected.

> The code you write is mostly backward-compatible. Communication between different processes, which is the problem here, is a different matter.

Yes, I see that now. However, I feel this is not reflected sufficiently in the documentation, so I filed PR https://github.com/JuliaLang/julia/pull/45368 to improve it.

giordano commented 2 years ago

Maybe instead of (or in addition to?) the Julia version we should check the serialization format version? https://github.com/JuliaLang/julia/blob/0f2ed77dca88785c9ae0fb1cf1a77593d1527c18/stdlib/Serialization/src/Serialization.jl#L82

Naikless commented 2 years ago

As I said, that check already exists for future versions: https://github.com/JuliaLang/julia/blob/0f2ed77dca88785c9ae0fb1cf1a77593d1527c18/stdlib/Serialization/src/Serialization.jl#L742

I would probably either

davpayne commented 1 year ago

Does this still need work now that JuliaLang/julia#45368 is merged? If so, I'm thinking of just adding an elseif branch to Serialization.jl that warns when the stream's version is lower than the local one, to put into practice what @Naikless suggested.
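The shape of that proposal can be sketched as follows. This is an illustration only, not the code in Serialization.jl: the function name `check_stream_version` is hypothetical, and both arguments are plain integers standing in for the version read from a stream header and the local format version.

```julia
# Illustrative stand-in for the proposed branch; not the stdlib implementation.
function check_stream_version(stream_version::Integer, local_version::Integer)
    if stream_version > local_version
        # Behaviour the thread says already exists: newer data is rejected.
        error("cannot read stream serialized with a newer version " *
              "($stream_version > $local_version)")
    elseif stream_version < local_version
        # Proposed addition: older data is still read, but flagged.
        @warn "stream was serialized with an older format; deserialization may fail" stream_version local_version
    end
    return stream_version
end

check_stream_version(4, 4)  # arbitrary example numbers; equal versions pass silently
```

A warning only helps when the two Julia versions actually differ in format version, which this sketch does not establish for 1.6 and 1.7.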

Naikless commented 1 year ago

My PR only addressed the documentation. If the checks are still the same, I believe this could still improve error identification.