JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

Segmentation fault with Distributed when --threads is set #54253

Open Socob opened 6 months ago

Socob commented 6 months ago

I’m getting segmentation faults when using Distributed while passing --threads to Julia, even when I’m not actually using any of those threads (see the MWE below). Needless to say, this is a huge problem when doing hybrid distributed- and shared-memory parallelization!

$ julia test.jl
start
      From worker 12: 
      From worker 12: [58424] signal (11.1): Segmentation fault
      From worker 12: in expression starting at none:1
      From worker 12: Allocations: 101999211 (Pool: 93311196; Big: 8688015); GC: 1591
Worker 12 terminated.
Unhandled Task ERROR: EOFError: read end of file
Stacktrace:
 [1] (::Base.var"#wait_locked#739")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
   @ Base ./stream.jl:947
 [2] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
   @ Base ./stream.jl:955
 [3] unsafe_read
   @ ./io.jl:774 [inlined]
 [4] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
   @ Base ./io.jl:773
 [5] read!
   @ ./io.jl:775 [inlined]
 [6] deserialize_hdr_raw
   @ ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/messages.jl:167 [inlined]
 [7] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
   @ Distributed ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:172
 [8] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
   @ Distributed ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:133
 [9] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
   @ Distributed ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:121
ERROR: LoadError: ProcessExitedException(12)
Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base ./task.jl:448
 [2] macro expansion
   @ ./task.jl:480 [inlined]
 [3] remotecall_eval(m::Module, procs::Vector{Int64}, ex::Expr)
   @ Distributed ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/macros.jl:219
 [4] macro expansion
   @ ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/macros.jl:203 [inlined]
 [5] main()
   @ Main ~/test.jl:9
 [6] top-level scope
   @ ~/keeper/Documents/docs/postdocs/work/parity_violation/analytic4PC/run3.jl:30
in expression starting at ~/test.jl:29

With the commented-out exeflags line instead (i.e. without --threads), I don’t get any segmentation faults.

Whether the segfault is triggered does seem to depend on the number of worker processes: with a small number of workers, it is not triggered (or at least not consistently). It also doesn’t appear immediately, but only after some non-deterministic amount of time. The details may be machine-specific, but I’ve reproduced this on several different machines.

I don’t have any attempts at an explanation, since I don’t see how merely setting the number of Julia threads would affect this code.


  1. The output of versioninfo():
    Julia Version 1.10.2
    Commit bd47eca2c8a (2024-03-01 10:14 UTC)
    Build Info:
      Official https://julialang.org/ release
    Platform Info:
      OS: Linux (x86_64-linux-gnu)
      CPU: 16 × AMD Ryzen 7 4800H with Radeon Graphics
      WORD_SIZE: 64
      LIBM: libopenlibm
      LLVM: libLLVM-15.0.7 (ORCJIT, znver2)
    Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)
  2. How you installed Julia: juliaup
  3. A minimal working example (MWE), also known as a minimum reproducible example:

    using Distributed
    
    function main()
        arr = zeros(1000, 10000)
        arr .= 1.0
        println("start"); flush(stdout)
        @everywhere workers() begin
            # dummy calculation
            arr = $arr
            for i in 1:size(arr, 2)
                sum(
                    sum(1.1 .* @view arr[:, i])
                    for _ in 1:5000
                )
            end
        end
        println("DONE"); flush(stdout)
    end
    
    addprocs(
        15;
        # results in segfault
        exeflags=`--startup-file=no --threads=16`
        # no segfault with this line instead:
        # exeflags=`--startup-file=no`
    )
    main()
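
To confirm that the workers really start with the requested thread count, even though the calculation above never uses those threads, a check along the following lines may help. This is only a sketch: the worker count and thread count (2 and 2) and the name nthreads_per_worker are arbitrary choices for illustration, not the values from the MWE.

```julia
using Distributed

# Assumption: 2 workers with 2 threads each, small values chosen only for illustration.
pids = addprocs(2; exeflags=`--startup-file=no --threads=2`)

# Ask each worker how many Julia threads it was started with.
nthreads_per_worker = Dict(p => remotecall_fetch(Threads.nthreads, p) for p in pids)

for (p, n) in sort(collect(nthreads_per_worker))
    println("worker ", p, ": ", n, " thread(s)")
end

# Shut the illustrative workers down again.
rmprocs(pids)
```

Note that versioninfo() on the master process only reports the master’s own thread count, so it says nothing about what the workers were launched with.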

Socob commented 6 months ago

Sorry for the edits, but right after creating this issue I thought I’d also observed this without using SharedArrays. However, I can’t reproduce that right now…

Socob commented 6 months ago

OK, it’s definitely happening even for a normal array, so this has nothing to do with SharedArrays! That makes it much worse!

danspielman commented 5 months ago

I am having a similar error. My code runs fine on a Mac, but gives me errors like this when I run it on a Linux cluster. I’ll try to construct an MFE (minimal failing example).

LHerviou commented 5 months ago

I am also having similar issues on a Linux cluster (the code runs fine on my old Julia install on my laptop). I put some data in https://discourse.julialang.org/t/segmentation-fault-using-multithreaded-julia-on-new-server/114557 but was not able to get a clean MFE.