JuliaParallel / ClusterManagers.jl

Getting errors with Slurm #109

Closed affans closed 5 years ago

affans commented 5 years ago

I am getting the following errors on STDOUT from the workers when using Slurm:

==> job0000.out <==
MethodError(convert, (Tuple, :all_to_all), 0x0000000000005549)CapturedException(MethodError(convert, (Tuple, :all_to_all), 0x0000000000005549), Any[(setindex!(::Array{Tuple,1}, ::Symbol, ::Int64) at array.jl:583, 1), ((::Base.Distributed.##99#100{TCPSocket,TCPSocket,Bool})() at event.jl:73, 1)])
Process(1) - Unknown remote, closing connection.
Master process (id 1) could not connect within 60.0 seconds.
exiting.

==> job0001.out <==
TypeError(:deserialize_module, "typeassert", Module, ===)CapturedException(TypeError(:deserialize_module, "typeassert", Module, ===), Any[((::Base.Distributed.##99#100{TCPSocket,TCPSocket,Bool})() at event.jl:73, 1)])
Process(1) - Unknown remote, closing connection.
Master process (id 1) could not connect within 60.0 seconds.
exiting.

I cannot make sense of the error or where it is originating.

Related Discourse topic: https://discourse.julialang.org/t/there-is-a-bug-in-this-function-and-i-cant-figure-out-what-it-is/19150/4

@vchuravy your help would be greatly appreciated.

affans commented 5 years ago

This is not a ClusterManagers issue, so it can be closed. The related issue is posted at https://github.com/JuliaLang/julia/issues/30558