JuliaLang / julia

Segfault: sgemm_itcopy_SKYLAKEX at julia/libopenblas64_.so (unknown line) #52154

Open kerim371 opened 12 months ago

kerim371 commented 12 months ago

Hi,

I use a third-party library that relies on LinearAlgebra and BLAS. I intermittently hit a segmentation fault:

 From worker 2:    [1781] signal (11.1): Segmentation fault
      From worker 2:    in expression starting at none:0
      From worker 2:    sgemm_itcopy_SKYLAKEX at /home/kerim/shared_app/julia/julia-1.9.3/bin/../lib/julia/libopenblas64_.so (unknown line)
      From worker 2:    sgemm_nn at /home/kerim/shared_app/julia/julia-1.9.3/bin/../lib/julia/libopenblas64_.so (unknown line)
      From worker 2:    sgemm_64_ at /home/kerim/shared_app/julia/julia-1.9.3/bin/../lib/julia/libopenblas64_.so (unknown line)
      From worker 2:    gemm! at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/usr/share/julia/stdlib/v1.9/LinearAlgebra/src/blas.jl:1524
      From worker 2:    gemm_wrapper! at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/usr/share/julia/stdlib/v1.9/LinearAlgebra/src/matmul.jl:674
      From worker 2:    mul! at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/usr/share/julia/stdlib/v1.9/LinearAlgebra/src/matmul.jl:161 [inlined]
      From worker 2:    mul! at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/usr/share/julia/stdlib/v1.9/LinearAlgebra/src/matmul.jl:276 [inlined]
      From worker 2:    * at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/usr/share/julia/stdlib/v1.9/LinearAlgebra/src/matmul.jl:148 [inlined]
      From worker 2:    SincInterpolation at /home/kerim/.julia/packages/JUDI/JEsVr/src/TimeModeling/Utils/auxiliaryFunctions.jl:553
      From worker 2:    macro expansion at /home/kerim/.julia/packages/JUDI/JEsVr/src/TimeModeling/Utils/auxiliaryFunctions.jl:527 [inlined]
      From worker 2:    macro expansion at ./timing.jl:393 [inlined]
      From worker 2:    macro expansion at /home/kerim/.julia/packages/JUDI/JEsVr/src/JUDI.jl:141 [inlined]
      From worker 2:    time_resample at /home/kerim/.julia/packages/JUDI/JEsVr/src/TimeModeling/Utils/auxiliaryFunctions.jl:523
      From worker 2:    time_resample at /home/kerim/.julia/packages/JUDI/JEsVr/src/TimeModeling/Utils/auxiliaryFunctions.jl:547 [inlined]
      From worker 2:    post_process at /home/kerim/.julia/packages/JUDI/JEsVr/src/TimeModeling/Modeling/time_modeling_serial.jl:61
      From worker 2:    unknown function (ip: 0x7f4dfa10e672)
      From worker 2:    _jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
      From worker 2:    ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
      From worker 2:    time_modeling at /home/kerim/.julia/packages/JUDI/JEsVr/src/TimeModeling/Modeling/time_modeling_serial.jl:52
      From worker 2:    unknown function (ip: 0x7f4dfa0dcdb8)
      From worker 2:    _jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
      From worker 2:    ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
      From worker 2:    propagate at /home/kerim/.julia/packages/JUDI/JEsVr/src/TimeModeling/Modeling/propagation.jl:9
      From worker 2:    unknown function (ip: 0x7f4e52362366)
      From worker 2:    _jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
      From worker 2:    ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
      From worker 2:    jl_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia.h:1880 [inlined]
      From worker 2:    jl_f__call_latest at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/builtins.c:774
      From worker 2:    _jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
      From worker 2:    ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
      From worker 2:    jl_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia.h:1880 [inlined]
      From worker 2:    do_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/builtins.c:730
      From worker 2:    #invokelatest#2 at ./essentials.jl:819
      From worker 2:    _jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
      From worker 2:    ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
      From worker 2:    jl_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia.h:1880 [inlined]
      From worker 2:    do_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/builtins.c:730
      From worker 2:    invokelatest at ./essentials.jl:816
      From worker 2:    _jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
      From worker 2:    ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
      From worker 2:    jl_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia.h:1880 [inlined]
      From worker 2:    do_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/builtins.c:730
      From worker 2:    #107 at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/usr/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:281
      From worker 2:    run_work_thunk at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/usr/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:70
      From worker 2:    unknown function (ip: 0x7f4e52360769)
      From worker 2:    _jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
      From worker 2:    ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
      From worker 2:    run_work_thunk at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/usr/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:79
      From worker 2:    #100 at ./task.jl:514
      From worker 2:    unknown function (ip: 0x7f4e5236032f)
      From worker 2:    _jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
      From worker 2:    ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
      From worker 2:    jl_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia.h:1880 [inlined]
      From worker 2:    start_task at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/task.c:1092
      From worker 2:    Allocations: 150647639 (Pool: 150584661; Big: 62978); GC: 454
      From worker 3:    Operator `forward` ran in 3.55 s
Worker 2 terminated.
Unhandled Task ERROR: EOFError: read end of file
Stacktrace:
 [1] (::Base.var"#wait_locked#715")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
   @ Base ./stream.jl:947
 [2] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
   @ Base ./stream.jl:955
 [3] unsafe_read
   @ ./io.jl:761 [inlined]
 [4] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
   @ Base ./io.jl:760
 [5] read!
   @ ./io.jl:762 [inlined]
 [6] deserialize_hdr_raw
   @ ~/shared_app/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/messages.jl:167 [inlined]
 [7] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
   @ Distributed ~/shared_app/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:172
 [8] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
   @ Distributed ~/shared_app/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:133
 [9] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
   @ Distributed ./task.jl:514

The problem seems to be similar to that one #43309.

I've tried calling @everywhere BLAS.set_num_threads(1) after the third-party library is loaded; sometimes this helps and sometimes it doesn't.
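
For reference, a minimal sketch of applying that setting to every process and then checking that it took effect. The two local workers are an assumption standing in for the real compute nodes; the BLAS thread count is a per-process setting, so it has to be applied after the workers exist.

```julia
using Distributed

# Assumption: two local workers stand in for the real compute nodes.
addprocs(2)

# Load LinearAlgebra on the master and on every worker.
@everywhere using LinearAlgebra

# Restrict BLAS to a single thread in every process.
@everywhere BLAS.set_num_threads(1)

# Ask each process how many BLAS threads it is actually using.
thread_counts = Dict(p => remotecall_fetch(BLAS.get_num_threads, p) for p in procs())

for p in sort(collect(keys(thread_counts)))
    println("process ", p, ": ", thread_counts[p], " BLAS thread(s)")
end
```

Workers added after the `@everywhere` call do not inherit the setting, so it needs to be repeated for them.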

I run the calculations in the cloud with one master node and 4 computational nodes. Each computational node has 4 cores (Intel Ice Lake).

versioninfo() output:

julia> versioninfo()
Julia Version 1.9.3
Commit bed2cd540a1 (2023-08-24 14:43 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 2 × Intel Xeon Processor (Icelake)
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, icelake-server)
  Threads: 1 on 2 virtual cores
Environment:
  LD_LIBRARY_PATH = /home/kerim/shared_app/python/python-3.9/lib:/home/kerim/shared_app/gcc/10.2/lib64:/home/kerim/shared_app/python/python-3.9/lib:/home/kerim/shared_app/gcc/10.2/lib64:/home/kerim/shared_app/python/python-3.9/lib:/home/kerim/shared_app/gcc/10.2/lib64:

I installed Julia LTS 1.6.7 with jill and then updated it by running using UpdateJulia; update_julia().

The error is very annoying :( The computation runs for a long time and then suddenly crashes...

I'd appreciate any help on how to solve this.

giordano commented 11 months ago

> The problem seems to be similar to that one #43309.

What makes you think so? The only thing in common is a segmentation fault in the same external library, but the function triggering the segmentation fault is completely different.

There's next to nothing we can do about this if you don't provide a reliable reproducer, but this is very likely an upstream bug in OpenBLAS, so there's even less we can do about it. Since you're using an Intel CPU, you might want to try MKL.jl to use MKL instead of OpenBLAS for running BLAS operations.
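
A minimal sketch of the MKL.jl route, assuming the package has already been installed with `import Pkg; Pkg.add("MKL")`. On Julia 1.7 and later, loading MKL.jl switches the BLAS/LAPACK backend for the current process; the matrix sizes here are arbitrary.

```julia
# Assumption: MKL.jl is already installed in the active environment.
using MKL            # loading the package swaps the BLAS/LAPACK backend to MKL
using LinearAlgebra

# Show which libraries currently back BLAS/LAPACK.
println(BLAS.get_config())

# A Float32 product like the one in the stack trace now goes through MKL.
A = rand(Float32, 256, 256)
B = rand(Float32, 256, 256)
C = A * B
println(size(C))
```

The backend is selected per process, so with Distributed the package would need to be loaded on the workers too, for example with `@everywhere using MKL`.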

kerim371 commented 11 months ago

@giordano hi,

Since the segfault occurs in the same libopenblas library, I thought the two cases might be connected.

> There's next to nothing we can do about this if you don't provide a reliable reproducer,

I understand :( BLAS is used by a third-party library, so the crash is not easy to reproduce. The segfault also appears to be completely random: sometimes it happens and sometimes it doesn't.

But anyway, thank you for suggesting MKL.jl; it will probably help.
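
One possible starting point for a reproducer is to isolate the operation at the top of the stack trace (a Float32 matrix product running on a Distributed worker) in a stress loop. This is only a sketch: `stress_sgemm` is a hypothetical helper, the matrix size and iteration count are arbitrary assumptions, and nothing here is known to trigger the crash.

```julia
using Distributed

# Assumption: two local workers; the original report used remote compute nodes.
addprocs(2)
@everywhere using LinearAlgebra

# Hypothetical stress function: repeated Float32 products, which dispatch to sgemm.
@everywhere function stress_sgemm(iterations::Int, n::Int)
    checksum = 0.0
    for _ in 1:iterations
        A = rand(Float32, n, n)
        B = rand(Float32, n, n)
        C = A * B
        checksum += Float64(sum(C))
    end
    return checksum
end

# Start the loop on every worker at once, then wait for all of them.
futures = [remotecall(stress_sgemm, w, 20, 200) for w in workers()]
results = fetch.(futures)
println("completed on ", length(results), " workers")
```

If a loop like this ever crashes with the same backtrace, it could be reported upstream without involving the third-party library.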

PallHaraldsson commented 11 months ago

You can also consider BLISBLAS.jl. I think Julia should drop OpenBLAS (not because of your bug) and provide only a generic matmul etc. (which would have "fixed" your bug) until you opt into BLISBLAS or whatever, or possibly bundle it.
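
To illustrate what a generic matmul means here, a minimal sketch of a matrix product written in plain Julia that never calls into libopenblas. `naive_matmul` is a hypothetical helper, not Julia's built-in fallback, and it will be much slower than a tuned BLAS.

```julia
using LinearAlgebra

# Hypothetical helper: a plain-Julia matrix product with no BLAS call.
function naive_matmul(A::Matrix{T}, B::Matrix{T}) where {T}
    m, k = size(A)
    k2, n = size(B)
    k == k2 || throw(DimensionMismatch("inner dimensions differ: $k vs $k2"))
    C = zeros(T, m, n)
    # The innermost loop runs down a column, matching Julia's column-major layout.
    @inbounds for j in 1:n, l in 1:k
        b = B[l, j]
        for i in 1:m
            C[i, j] += A[i, l] * b
        end
    end
    return C
end

A = rand(Float32, 64, 32)
B = rand(Float32, 32, 16)
C = naive_matmul(A, B)

# Compare against the BLAS-backed product.
println(isapprox(C, A * B; rtol = 1f-4))
```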

kerim371 commented 11 months ago

@PallHaraldsson thank you for the suggestion! I hope this will help.