JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

Segfault for @everywhere importall/using #12558

Closed rened closed 9 years ago

rened commented 9 years ago

When using the following code (in a file funcd.jl):

  module funcd
  using Compat

  export len, range

  len(a) = length(a)
  len{T,N}(a::AbstractArray{T,N}) = size(a,N)

  import Base.range
  range(a) = 1:len(a)

  end

running it with julia run.jl, where run.jl contains:

addprocs(3)
@everywhere importall funcd

results in a segfault:

WARNING: replacing module funcd
WARNING: replacing module funcd
WARNING: replacing module funcd
exception on exception on WARNING: Method definition range(Any) in module funcd at /Users/rene/BTSync/code/git/funcd/funcd.jl:10 overwritten in module funcd at /Users/rene/BTSync/code/git/funcd/funcd.jl:10.
4: 3:
signal (11): Segmentation fault: 11
mtcache_hash_lookup at /Users/rene/local/devjulia/src/gf.c:152
jl_apply_generic at /Users/rene/local/devjulia/src/gf.c:1630
showerror at replutil.jl:72
showerror at replutil.jl:83
jlcall___showerror#160___21217 at  (unknown line)
jl_apply at /Users/rene/local/devjulia/src/gf.c:1658
julia_showerror_21216 at  (unknown line)
jlcall_showerror_21216 at  (unknown line)
showerror at replutil.jl:91
julia_showerror_21214 at  (unknown line)
jlcall_showerror_21214 at  (unknown line)
jl_apply at /Users/rene/local/devjulia/src/gf.c:1658
anonymous at client.jl:88
with_output_color at util.jl:330
jl_apply at /Users/rene/local/devjulia/src/gf.c:1658
display_error at client.jl:86
jl_apply at /Users/rene/local/devjulia/src/gf.c:1658
run_work_thunk at multi.jl:651
run_work_thunk at multi.jl:657

signal (11): Segmentation fault: 11
jlcall_run_work_thunk_21149 at  (unknown line)
jl_apply at /Users/rene/local/devjulia/src/gf.c:1658
anonymous at task.jl:11
jl_apply at /Users/rene/local/devjulia/src/task.c:233
mtcache_hash_lookup at /Users/rene/local/devjulia/src/gf.c:152
jl_apply_generic at /Users/rene/local/devjulia/src/gf.c:1630
showerror at replutil.jl:72
showerror at replutil.jl:83
jlcall___showerror#160___21221 at  (unknown line)
jl_apply at /Users/rene/local/devjulia/src/gf.c:1658
julia_showerror_21220 at  (unknown line)
jlcall_showerror_21220 at  (unknown line)
showerror at replutil.jl:91
julia_showerror_21218 at  (unknown line)
jlcall_showerror_21218 at  (unknown line)
jl_apply at /Users/rene/local/devjulia/src/gf.c:1658
anonymous at client.jl:88
with_output_color at util.jl:330

A git bisect points to 7207a8a43e076576d6d6a6161ac75d2ae3391a6e

commit 7207a8a43e076576d6d6a6161ac75d2ae3391a6e
Author: Amit Murthy <amit.murthy@gmail.com>
Date:   Wed Jun 10 21:20:12 2015 +0530

    added support for different topologies

When executing the following code directly (i.e. in the REPL):

  module funcd
  using Compat

  export len, range

  len(a) = length(a)
  len{T,N}(a::AbstractArray{T,N}) = size(a,N)

  import Base.range
  range(a) = 1:len(a)

  end
addprocs(3)
@everywhere importall funcd

the code passes. cc @amitmurthy

jakebolewski commented 9 years ago

Isn't this a dup of #12381?

rened commented 9 years ago

I don't think so - the error occurs at a different place in the C code, and the workaround in https://github.com/JuliaLang/julia/issues/12381#issuecomment-127182636 (adding sleep(0.5)) does not help.

This time the git bisect at least seems to point to a reasonable commit for causing this.

rened commented 9 years ago

PS: this also happens with using instead of importall.

rened commented 9 years ago

This error occurs on OSX - I can't reproduce it on Linux.

rened commented 9 years ago

One last comment: it seems that @everywhere is no longer necessary for imports anyway? Everything works nicely when I omit @everywhere.

jakebolewski commented 9 years ago

Sure, the code will get loaded on the master process, but none of the workers should load the code (this also assumes a shared file system).

parpwhick commented 9 years ago

I get a similar error, but if I precompile the module with compilecache, then @everywhere using works correctly.

rened commented 9 years ago

@jakebolewski I thought so too, therefore the @everywhere. But this works (which I think did not work in the 0.3 / early 0.4 days):

julia> addprocs(3)
3-element Array{Int64,1}:
 2
 3
 4

julia> using JSON

julia> @fetchfrom 2 json(1)
"1"

So while a new function still has to be defined with @everywhere func() = "hi" (otherwise it is not visible on the workers), loading modules now seems to happen across all processes.

So basically, everything is usable, but it would still be good not to crash on an @everywhere import statement, which is a no-op anyway?

amitmurthy commented 9 years ago

I can see it on Linux in the current master.

WARNING: replacing module funcd
WARNING: replacing module funcd
WARNING: Method definition range(Any) in module funcd at /tmp/funcd.jl:10 overwritten in module funcd at /tmp/funcd.jl:10.
WARNING: replacing module funcd
WARNING: Method definition range(Any) in module funcd at /tmp/funcd.jl:10 overwritten in module funcd at /tmp/funcd.jl:10.

signal (11): Segmentation fault
jl_object_id at /home/amitm/Work/julia/julia/usr/bin/../lib/libjulia.so (unknown line)
unknown function (ip: 0x7f514d94cc78)
unknown function (ip: 0x7f514d952bb5)
unknown function (ip: 0x7f514d95b58c)
jl_apply_generic at /home/amitm/Work/julia/julia/usr/bin/../lib/libjulia.so (unknown line)
serialize at serialize.jl:414
jl_apply_generic at /home/amitm/Work/julia/julia/usr/bin/../lib/libjulia.so (unknown line)
serialize at serialize.jl:414
jl_apply_generic at /home/amitm/Work/julia/julia/usr/bin/../lib/libjulia.so (unknown line)
serialize at serialize.jl:414
jl_apply_generic at /home/amitm/Work/julia/julia/usr/bin/../lib/libjulia.so (unknown line)
serialize at serialize.jl:414
jl_apply_generic at /home/amitm/Work/julia/julia/usr/bin/../lib/libjulia.so (unknown line)
serialize at serialize.jl:414
jl_apply_generic at /home/amitm/Work/julia/julia/usr/bin/../lib/libjulia.so (unknown line)
send_msg_ at multi.jl:222
send_msg_now at multi.jl:173
jl_apply_generic at /home/amitm/Work/julia/julia/usr/bin/../lib/libjulia.so (unknown line)
deliver_result at multi.jl:805
jlcall_deliver_result_21311 at  (unknown line)
jl_apply_generic at /home/amitm/Work/julia/julia/usr/bin/../lib/libjulia.so (unknown line)
anonymous at task.jl:890
unknown function (ip: 0x7f514d9bf560)
unknown function (ip: (nil))

I suspect it is the same as #12381, specifically https://github.com/JuliaLang/julia/issues/12381#issuecomment-126816290

rened commented 9 years ago

Ok, true. So the only (perhaps) valuable info from this issue is that it does not occur before 7207a8a. But then again, perhaps this bisect is a red herring as well. Please feel free to close this issue if you think #12381 is enough for tracking this.

amitmurthy commented 9 years ago

Replacing addprocs(3) with addprocs(2) results in the following error printed (no segfault in this case):

WARNING: replacing module funcd
WARNING: replacing module funcd
WARNING: Method definition range(Any) in module funcd at /tmp/funcd.jl:10 overwritten in module funcd at /tmp/funcd.jl:10.
ERROR: LoadError: On worker 3:
LoadError("/tmp/funcd.jl",7,TypeError(:getfield,"",DataType,Any[:( # serialize.jl, line 400:),NewvarNode(:t),NewvarNode(:nf),NewvarNode(symbol("#s332")),:(tag = (Base.Serializer.sertag)(x::TypeError)::Int32),:( # line 401:),
:(unless (Base.slt_int)(0,(Base.box)(Int64,(Base.sext_int)(Int64,tag::Int32))::Int64)::Bool goto 0),:( # line 402:),:(GenSym(2) = (top(getfield))
(s::SerializationState{TCPSocket},:io)::TCPSocket),
:(unless (Base.slt_int)(tag::Int32,Base.Serializer.VALUE_TAGS)::Bool goto 15),:((Base.write)(GenSym(2),(top(vect))((Base.box)(UInt8,(Base.checked_trunc_uint)(UInt8,0))::UInt8)::Array{UInt8,1})::Int64),:(goto 15),:(15: ),:(return (Base.write)(GenSym(2),(top(vect))((Base.box)(UInt8,(Base.checked_trunc_uint)(UInt8,tag::Int32))::UInt8)::Array{UInt8,1})::Int64),:(0: ),:( # line 404:),:(t = (Base.Serializer.typeof)(x::TypeError)::Type{TypeError}),:( # line 405:),:(nf = (Base.Serializer.nfields)(t::Type{TypeError})::Int64),:( # line 406:),
:(unless nf::Int64 === 0::Bool goto 1),:(#s332 = (Base.slt_int)(0,(Base.box)(Int64,(Base.sext_int)(Int64,(top(getfield))(t::Type{TypeError},:size)::Int32))::Int64)::Bool),:(goto 2),:(1: ),:(#s332 = false),:(2: ),
:(unless #s332::Bool goto 3),:( # line 407:),:((Base.Serializer.serialize_type)
(s::SerializationState{TCPSocket},t::Type{TypeError})::Union{Int64,Void}),:( # line 408:),:(GenSym(3) = (top(getfield))
(s::SerializationState{TCPSocket},:io)::TCPSocket),:(return (Base.throw)($(Expr(:new, :((top(getfield))(Base,:MethodError)::Type{MethodError}), :(Base.write), :((top(tuple))(GenSym(3),x::TypeError)::Tuple{TCPSocket,TypeError}))))::Union{}),:(goto 12),:(3: ),:( # line 410:),
:(unless (top(getfield))(t::Type{TypeError},:mutable)::Bool goto 5),
:(unless (Base.Serializer.serialize_cycle)
(s::SerializationState{TCPSocket},x::TypeError)::Bool goto 4),:(return),:(4: ),:(goto 5),:(5: ),:( # line 411:),:((Base.Serializer.serialize_type)
(s::SerializationState{TCPSocket},t::Type{TypeError})::Union{Int64,Void}),:( # line 412:),:(GenSym(0) = $(Expr(:new, UnitRange{Int64}, 1, :(((top(getfield))(Base.Intrinsics,:select_value)::I)((Base.sle_int)(1,nf::Int64)::Bool,nf::Int64,(Base.box)(Int64,(Base.sub_int)(1,1))::Int64)::Int64)))),:(#s333 = (top(getfield))(GenSym(0),:start)::Int64),
:(unless (Base.box)(Base.Bool,(Base.not_int)(#s333::Int64 === (Base.box)(Base.Int,(Base.add_int)((top(getfield))(GenSym(0),:stop)::Int64,1))::Int64::Bool))::Bool goto 7),:(8: ),:(GenSym(5) = #s333::Int64),:(GenSym(6) = (Base.box)(Base.Int,(Base.add_int)(#s333::Int64,1))::Int64),:(i = GenSym(5)),:(#s333 = GenSym(6)),:( # line 413:),
:(unless (Base.Serializer.isdefined)(x::TypeError,i::Int64)::Bool goto 10),:( # line 414:),:((Base.Serializer.serialize)
(s::SerializationState{TCPSocket},(Base.Serializer.getfield)(x::TypeError,i::Int64))),:(goto 11),:(10: ),:( # line 416:),:(GenSym(4) = (top(getfield))
(s::SerializationState{TCPSocket},:io)::TCPSocket),:((Base.write)(GenSym(4),(top(vect))((Base.box)(UInt8,(Base.checked_trunc_uint)(UInt8,Base.Serializer.UNDEFREF_TAG))::UInt8)::Array{UInt8,1})::Int64),:(11: ),:(9: ),
:(unless (Base.box)(Base.Bool,(Base.not_int)((Base.box)(Base.Bool,(Base.not_int)(#s333::Int64 === (Base.box)(Base.Int,(Base.add_int)((top(getfield))(GenSym(0),:stop)::Int64,1))::Int64::Bool))::Bool))::Bool goto 8),:(7: ),:(6: ),:(return),:(12: )]))
 in include_string at loading.jl:225
 in include_from_node1 at ./loading.jl:266
 in require at ./loading.jl:202
 in eval at sysimg.jl:14
 in anonymous at multi.jl:1349
 in anonymous at multi.jl:889
 in run_work_thunk at multi.jl:642
 in anonymous at task.jl:889
 in remotecall_fetch at multi.jl:728
 in anonymous at task.jl:447
 in sync_end at ./task.jl:413
 in anonymous at multi.jl:422
 in include at ./boot.jl:254
 in include_from_node1 at ./loading.jl:263
 in process_options at ./client.jl:308
 in _start at ./client.jl:411
while loading /tmp/run.jl, in expression starting on line 3

I don't know how to interpret it. Does it help in identifying the cause of the segfault?

amitmurthy commented 9 years ago

FWIW, this is Linux on a MacBook Pro, so maybe the segfault has some relation to the hardware too?

  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

rened commented 9 years ago

mine is

  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

amitmurthy commented 9 years ago

This warning, "WARNING: replacing module funcd", means that the module is being loaded twice. That is probably a pointer to what is going wrong.

rened commented 9 years ago

@amitmurthy I believe each import statement is executed on all workers? Running using X on the master loads the package on all workers. The redundant @everywhere then triggers the load once more from each worker (which in turn loads it on all other workers as well). So @everywhere seems to be completely redundant (and racy) for importing?

amitmurthy commented 9 years ago

Ah! OK.

rened commented 9 years ago

I can no longer reproduce this using current master (4d8ca6b), neither on OSX nor Linux.

alyst commented 9 years ago

It was never segfaulting for me, but with the very latest master (f3217a8) I still get similar exceptions when trying to do @everywhere on 12 workers:

ERROR: On worker 5:
LoadError("<...>",61,LoadError("<...>",4,LoadError("<...>",4,UndefVarError(:<...>))))
 in include_string at loading.jl:226
 in include_from_node1 at ./loading.jl:267
 in require at ./loading.jl:203
 in include_string at loading.jl:226
 in include_from_node1 at ./loading.jl:267
 in anonymous at no file:28
 in include_string at loading.jl:226
 in include_from_node1 at ./loading.jl:267
 in eval at ./sysimg.jl:14
 in anonymous at multi.jl:1348
 in anonymous at multi.jl:889
 in run_work_thunk at multi.jl:642
 in anonymous at task.jl:889
 in remotecall_fetch at multi.jl:728
 in remotecall_fetch at multi.jl:731
 in anonymous at multi.jl:1350