JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

thread seg fault with error "concurrency violation detected" #31702

Closed EthanAnderes closed 5 years ago

EthanAnderes commented 5 years ago

I'm totally confused by the segmentation fault I'm getting on the dev branch. I started seeing it perhaps a couple of weeks back, but had a hard time finding a small test case that reliably triggers the error.

The following snippet works fine if threads are off, and works on v1.1 with threads on or off. For some reason importing PyCall seems to matter, but I have no idea why (BTW: I have PyCall v1.91.2).

Can anybody else reproduce this?

Threads.nthreads() # <-- 4
import PyCall
abstract type Flat end
struct X{P<:Flat, T<:Real}
    t::Array{T,1}  
end
function foo(x::X{P,T}) where {P<:Flat, T<:Real}
    v = zeros(T, size(x.t))
    Threads.@threads for i=1:length(x.t)
        v[i] = x.t[i]^2 + sin(x.t[i]) - cos(x.t[i])
    end
    return v
end
T = Float64
x = X{Flat,T}(rand(T,1000))
@time foo(x) #<--- Segmentation fault

Here is the error I get

julia> @time foo(x) #<--- Segmentation fault

  fatal: error thrown and no exception handler available.
ErrorException("concurrency violation detected")
0.203108rec_backtrace at /Users/ethananderes/Software/juliaMaster/src/stackwalk.c:94
record_backtrace at /Users/ethananderes/Software/juliaMaster/src/task.c:210
jl_throw at /Users/ethananderes/Software/juliaMaster/src/task.c:417
 seconds (error at ./error.jl:33
assert_havelock at ./condition.jl:20 [inlined]
assert_havelock at ./condition.jl:43 [inlined]
assert_havelock at ./condition.jl:67 [inlined]
notify at ./condition.jl:118
#notify#463 at ./condition.jl:116 [inlined]
notify at ./condition.jl:116 [inlined]
notify at ./condition.jl:116 [inlined]
readcb_specialized at ./stream.jl:558
jfptr_readcb_specialized_4955 at /Users/ethananderes/Software/juliaMaster/usr/lib/julia/sys.dylib (unknown line)
528.70uv_readcb at ./stream.jl:599
 kunknown function (ip: 0x1119208a4)
 allocations: 26.948 MiB, 3.00% gc time)jlcapi_uv_readcb_4173_gfthunk at /Users/ethananderes/Software/juliaMaster/usr/lib/julia/sys.dylib (unknown line)
jlcapi_uv_readcb_4173 at /Users/ethananderes/Software/juliaMaster/usr/lib/julia/sys.dylib (unknown line)

uv__read at /workspace/srcdir/libuv/src/unix/stream.c:1179
uv__stream_io at /workspace/srcdir/libuv/src/unix/stream.c:1339
uv__io_poll at /workspace/srcdir/libuv/src/unix/kqueue.c:311
uv_run at /workspace/srcdir/libuv/src/unix/core.c:361
jl_task_get_next at /Users/ethananderes/Software/juliaMaster/src/partr.c:303
poptaskref at ./task.jl:564
wait at ./task.jl:591
task_done_hook at ./task.jl:327
jl_apply at /Users/ethananderes/Software/juliaMaster/src/./julia.h:1604 [inlined]
jl_finish_task at /Users/ethananderes/Software/juliaMaster/src/task.c:167
start_task at /Users/ethananderes/Software/juliaMaster/src/task.c:593
1000-element Array{Float64,1}:

signal (11): Segmentation fault: 11
in expression starting at REPL[8]:0
Segmentation fault: 11
julia> versioninfo()
Julia Version 1.3.0-DEV.8
Commit 20834c3176* (2019-04-12 07:18 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.5.0)
  CPU: Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 4
  JULIA_FFTW_PROVIDER = MKL

JeffBezanson commented 5 years ago

Please try #31709 and see if that fixes it.

Jutho commented 5 years ago

Probably the same as I reported here: https://github.com/JuliaLang/julia/pull/30899 which I believed was the PR that caused this in the first place. Great to see that a fix is around the corner.

EthanAnderes commented 5 years ago

Ok, so jb/fix31702 fixed that snippet for me, but my original code still had the same seg fault. It took me a while, but I found I can reliably trigger it again by adding using PyPlot, as in the snippet below.

Threads.nthreads() # <--- 4
using PyCall
using PyPlot       # <--- need to add this to get seg fault on jb/fix31702/6dbd5ec44f
abstract type Flat end
struct X{P<:Flat, T<:Real}
    t::Array{T,1}  
end
function foo(x::X{P,T}) where {P<:Flat, T<:Real}
    v = zeros(T, size(x.t))
    Threads.@threads for i=1:length(x.t)
        v[i] = x.t[i]^2 + sin(x.t[i]) - cos(x.t[i])
    end
    return v
end
T = Float64
x = X{Flat,T}(rand(T,1000))
@time foo(x) # <--- segmentation fault

EthanAnderes commented 5 years ago

Just to record what I'm still seeing, I'll post it here. If no one else can reproduce it, I guess something is weird on my end. Anyway, thanks for looking into it so quickly.

julia> versioninfo()
Julia Version 1.3.0-DEV.22
Commit c4841ca6e0* (2019-04-13 18:26 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.5.0)
  CPU: Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 4
  JULIA_FFTW_PROVIDER = MKL

julia> Threads.nthreads() # <--- 4
4

julia> using PyCall
[ Info: Recompiling stale cache file /Users/ethananderes/.julia/compiled/v1.3/PyCall/GkzkC.ji for PyCall [438e738f-606a-5dbb-bf0a-cddfbfd45ab0]

julia> using PyPlot       # <--- need to add this to get seg fault on jb/fix31702/6dbd5ec44f
[ Info: Recompiling stale cache file /Users/ethananderes/.julia/compiled/v1.3/PyPlot/oatAj.ji for PyPlot [d330b81b-6aea-500a-939a-2ce795aea3ee]

julia> abstract type Flat end

julia> struct X{P<:Flat, T<:Real}
           t::Array{T,1}  
       end

julia> function foo(x::X{P,T}) where {P<:Flat, T<:Real}
           v = zeros(T, size(x.t))
           Threads.@threads for i=1:length(x.t)
               v[i] = x.t[i]^2 + sin(x.t[i]) - cos(x.t[i])
           end
           return v
       end
foo (generic function with 1 method)

julia> T = Float64
Float64

julia> x = X{Flat,T}(rand(T,1000))
X{Flat,Float64}([0.7245885961981229, 0.3383104857554391, 0.676555825916445, 0.558426289965362, 0.5530476330156351, 0.37882649143300196, 0.897409874921808, 0.8964334927209676, 0.29800859872573726, 0.5404695098549857  …  0.8914687582775509, 0.6738835181174165, 0.903603579069667, 0.44743595337482334, 0.2609432922325392, 0.26049829407694514, 0.8903264779207407, 0.8244344549248652, 0.2042527657723534, 0.7499361443047876])

julia> @time foo(x) # <--- segmentation fault
fatal: error thrown and no exception handler available.
ErrorException("concurrency violation detected")
rec_backtrace at /Users/ethananderes/Software/juliaMaster/src/stackwalk.c:94
record_backtrace at /Users/ethananderes/Software/juliaMaster/src/task.c:210
jl_throw at /Users/ethananderes/Software/juliaMaster/src/task.c:417
error at ./error.jl:33
assert_havelock at ./condition.jl:20 [inlined]
assert_havelock at ./condition.jl:43 [inlined]
assert_havelock at ./condition.jl:67 [inlined]
notify at ./condition.jl:118
#notify#463 at ./condition.jl:116 [inlined]
notify at ./condition.jl:116 [inlined]
notify at ./condition.jl:116 [inlined]
uv_timercb at ./asyncevent.jl:148
jlcapi_uv_timercb_4215 at /Users/ethananderes/Software/juliaMaster/usr/lib/julia/sys.dylib (unknown line)
uv__run_timers at /workspace/srcdir/libuv/src/timer.c:162
  uv_run at /workspace/srcdir/libuv/src/unix/core.c:352
jl_task_get_next at /Users/ethananderes/Software/juliaMaster/src/partr.c:303
poptaskref at ./task.jl:564
0.249855wait at ./task.jl:591
task_done_hook at ./task.jl:327
jl_apply at /Users/ethananderes/Software/juliaMaster/src/./julia.h:1604 [inlined]
jl_finish_task at /Users/ethananderes/Software/juliaMaster/src/task.c:167
 secondsstart_task at /Users/ethananderes/Software/juliaMaster/src/task.c:593
 (529.47 k allocations: 27.174 MiB, 6.03% gc time)
Segmentation fault: 11

vchuravy commented 5 years ago

Ethan, is this with Jeff's fix?

EthanAnderes commented 5 years ago

Yea, I'm able to get the seg fault even with Jeff's fix, but I needed to add using PyPlot, as above, to see it semi-regularly.

EthanAnderes commented 5 years ago

Here is a reduced test case that still crashes often.

function foo(x::Vector{T}) where T
    v = fill(T(0), length(x))
    Threads.@threads for i=1:length(x)
        v[i] = exp(x[i])
    end
    v
end
x = rand(1000)

import PyPlot
foo(x) 

It is quite finicky and seems to depend on loading PyPlot before the first execution of foo. I was worried it had something to do with my system Anaconda distribution of Python, but I can also trigger the crash using the default Miniconda install of PyPlot.jl. With Miniconda, however, I get a little more printout of the error:

julia> foo(x)
fatal: error thrown and no exception handler available.
ErrorException("concurrency violation detected")
rec_backtrace at /Users/ethananderes/Software/juliaMaster/src/stackwalk.c:94
record_backtrace at /Users/ethananderes/Software/juliaMaster/src/task.c:210
jl_throw at /Users/ethananderes/Software/juliaMaster/src/task.c:417
error at ./error.jl:33
assert_havelock at ./condition.jl:20 [inlined]
assert_havelock at ./condition.jl:43 [inlined]
assert_havelock at ./condition.jl:67 [inlined]
notify at ./condition.jl:118
#notify#463 at ./condition.jl:116 [inlined]
notify at ./condition.jl:116 [inlined]
notify at ./condition.jl:116 [inlined]
fatal: error thrown and no exception handler available.
uv_timercb at ./asyncevent.jl:148
ErrorException("concurrency violation detected")
jlcapi_uv_timercb_4210 at /Users/ethananderes/Software/juliaMaster/usr/lib/julia/sys.dylib (unknown line)
rec_backtrace at /Users/ethananderes/Software/juliaMaster/src/stackwalk.c:94
uv__run_timers at /workspace/srcdir/libuv/src/timer.c:162
record_backtrace at /Users/ethananderes/Software/juliaMaster/src/task.c:210
uv_run at /workspace/srcdir/libuv/src/unix/core.c:352
jl_throw at /Users/ethananderes/Software/juliaMaster/src/task.c:417
jl_task_get_next at /Users/ethananderes/Software/juliaMaster/src/partr.c:303
error at ./error.jl:33
poptaskref at ./task.jl:564
assert_havelock at ./condition.jl:20 [inlined]
assert_havelock at ./condition.jl:43 [inlined]
assert_havelock at ./condition.jl:67 [inlined]
notify at ./condition.jl:118
wait at ./task.jl:591
#notify#463 at ./condition.jl:116 [inlined]
notify at ./condition.jl:116 [inlined]
notify at ./condition.jl:116 [inlined]
uv_timercb at ./asyncevent.jl:148
task_done_hook at ./task.jl:327
jlcapi_uv_timercb_4210 at /Users/ethananderes/Software/juliaMaster/usr/lib/julia/sys.dylib (unknown line)
jl_apply at /Users/ethananderes/Software/juliaMaster/src/./julia.h:1604 [inlined]
jl_finish_task at /Users/ethananderes/Software/juliaMaster/src/task.c:167
uv__run_timers at /workspace/srcdir/libuv/src/timer.c:162
jl_threadfun at /Users/ethananderes/Software/juliaMaster/src/partr.c:217
uv_run at /workspace/srcdir/libuv/src/unix/core.c:375
jl_task_get_next at /Users/ethananderes/Software/juliaMaster/src/partr.c:303
_pthread_body at /usr/lib/system/libsystem_pthread.dylib (unknown line)
_pthread_start at /usr/lib/s_pthread_start at /usr/lib/system/libsystem_pthread.dylib (unknown line)
wait at ./task.jl:591
task_done_hook at ./task.jl:327
jl_apply at /Users/ethananderes/Software/juliaMaster/src/./julia.h:1604 [inlined]
jl_finish_task at /Users/ethananderes/Software/juliaMaster/src/task.c:167
start_task at /Users/ethananderes/Software/juliaMaster/src/task.c:593

signal (11): Segmentation fault: 11
in expression starting at REPL[4]:1
_PyMethodDef_RawFastCallKeywords at /Users/ethananderes/.julia/conda/3/lib/libpython3.7m.dylib (unknown line)
_PyMethodDescr_FastCallKeywords at /Users/ethananderes/.julia/conda/3/lib/libpython3.7m.dylib (unknown line)
call_function at /Users/ethananderes/.julia/conda/3/lib/libpython3.7m.dylib (unknown line)
_PyEval_EvalFrameDefault at /Users/ethananderes/.julia/conda/3/lib/libpython3.7m.dylib (unknown line)
function_code_fastcall at /Users/ethananderes/.julia/conda/3/lib/libpython3.7m.dylib (unknown line)
call_function at /Users/ethananderes/.julia/conda/3/lib/libpython3.7m.dylib (unknown line)
_PyEval_EvalFrameDefault at /Users/ethananderes/.julia/conda/3/lib/libpython3.7m.dylib (unknown line)
function_code_fastcall at /Users/ethananderes/.julia/conda/3/lib/libpython3.7m.dylib (unknown line)
call_function at /Users/ethananderes/.julia/conda/3/lib/libpython3.7m.dylib (unknown line)
_PyEval_EvalFrameDefault at /Users/ethananderes/.julia/conda/3/lib/libpython3.7m.dylib (unknown line)
function_code_fastcall at /Users/ethananderes/.julia/conda/3/lib/libpython3.7m.dylib (unknown line)
atexit_callfuncs at /Users/ethananderes/.julia/conda/3/lib/libpython3.7m.dylib (unknown line)
Py_FinalizeEx at /Users/ethananderes/.julia/conda/3/lib/libpython3.7m.dylib (unknown line)
Py_Finalize at /Users/ethananderes/.julia/packages/PyCall/ttONZ/src/pyinit.jl:125
_atexit at ./initdefs.jl:309
jl_apply at /Users/ethananderes/Software/juliaMaster/src/./julia.h:1604 [inlined]
jl_atexit_hook at /Users/ethananderes/Software/juliaMaster/src/init.c:243
jl_exit at /Users/ethananderes/Software/juliaMaster/src/jl_uv.c:613
jl_no_exc_handler at /Users/ethananderes/Software/juliaMaster/src/task.c:369
jl_finish_task at /Users/ethananderes/Software/juliaMaster/src/task.c:170
jl_threadfun at /Users/ethananderes/Software/juliaMaster/src/partr.c:217
_pthread_body at /usr/lib/system/libsystem_pthread.dylib (unknown line)
_pthread_start at /usr/lib/system/libsystem_pthread.dylib (unknown line)
Allocations: 17932081 (Pool: 17928490; Big: 3591); GC: 40
Segmentation fault: 11

Not sure if that helps. Here is my versioninfo

julia> versioninfo()
Julia Version 1.3.0-DEV.23
Commit 1c88c0e2be* (2019-04-13 20:05 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.5.0)
  CPU: Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 4
  JULIA_FFTW_PROVIDER = FFTW

(v1.3) pkg> st PyPlot
    Status `~/.julia/environments/v1.3/Project.toml`
  [438e738f] PyCall v1.91.2
  [d330b81b] PyPlot v2.8.1  

EthanAnderes commented 5 years ago

I'm trying to help pin down which commit triggers the error by pattern matching between the error and the commit messages (and by tracing back to about the time I started seeing the seg faults). Here is what I've got so far; I hope this helps.

No error: small fixes to libuv locks (#31454)

git checkout 1be1deb10c90b612a57d3541d3bf1aae1a42e104

No error but some weird errors during compilation: appveyor: fix binary-builder-based builds (#31592)

git checkout a86aab3777d07b00404659f67f35ff2b3615ed29

Sysimage built. Summary:
Total ───────  84.667477 seconds 
Base: ───────  28.926627 seconds 34.165%
Stdlibs: ────  55.738737 seconds 65.8325%
    JULIA usr/lib/julia/sys-o.a
Internal error: encountered unexpected error in runtime:
UndefVarError(var=:li)
rec_backtrace at /Users/ethananderes/Software/juliaMaster/src/stackwalk.c:94
record_backtrace at /Users/ethananderes/Software/juliaMaster/src/task.c:210
jl_throw at /Users/ethananderes/Software/juliaMaster/src/task.c:417
jl_undefined_var_error at /Users/ethananderes/Software/juliaMaster/src/rtutils.c:130
jl_get_binding_or_error at /Users/ethananderes/Software/juliaMaster/src/module.c:290
_uncompressed_ast at ./reflection.jl:906
typeinf_ext at ./compiler/typeinfer.jl:560
typeinf_ext at ./compiler/typeinfer.jl:599
unknown function (ip: 0x10d7c0416)
jl_apply_generic at /Users/ethananderes/Software/juliaMaster/src/gf.c:2191 [inlined]
jl_apply at /Users/ethananderes/Software/juliaMaster/src/./julia.h:1604 [inlined]
jl_type_infer at /Users/ethananderes/Software/juliaMaster/src/gf.c:207
jl_compile_method_internal at /Users/ethananderes/Software/juliaMaster/src/gf.c:1773
jl_apply_generic at /Users/ethananderes/Software/juliaMaster/src/gf.c:2196
jl_apply at /Users/ethananderes/Software/juliaMaster/src/./julia.h:1604 [inlined]
jl_finish_task at /Users/ethananderes/Software/juliaMaster/src/task.c:167
jl_threadfun at /Users/ethananderes/Software/juliaMaster/src/partr.c:217
_pthread_body at /usr/lib/system/libsystem_pthread.dylib (unknown line)
_pthread_start at /usr/lib/system/libsystem_pthread.dylib (unknown line)
Generating precompile statements... 913 generated in 240.093567 seconds (overhead 212.618210 seconds)
    LINK usr/lib/julia/sys.dylib

No error but same weird errors during compilation: loading: simplifications of code logic (#29807)

git checkout f31b6e5fec63ab17d99d9cf2b27cd3020eae904b

No error but same weird errors during compilation: Add ProcessExitedError rather than using error (#27900)

git checkout b3b6d030860a55ffd329f24aeafac01988e560fb

Error here: allow any thread, one at a time, to block in the event loop (#31438)

git checkout 0136fa10d5ba6782465ea68de37acefcb548bd9c

julia> versioninfo()
Julia Version 1.2.0-DEV.646
Commit 0136fa10d5* (2019-04-04 03:09 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.5.0)
  CPU: Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 4
  JULIA_FFTW_PROVIDER = MKL

julia> function foo(x::Vector{T}) where T
           v = fill(T(0), length(x))
           Threads.@threads for i=1:length(x)
               v[i] = exp(x[i])
           end
           v
       end;

julia> x = rand(1000);

julia> import PyPlot

julia> foo(x);
fatal: error thrown and no exception handler available.
ErrorException("concurrency violation detected")
rec_backtrace at /Users/ethananderes/Software/juliaMaster/src/stackwalk.c:94
record_backtrace at /Users/ethananderes/Software/juliaMaster/src/task.c:210
jl_throw at /Users/ethananderes/Software/juliaMaster/src/task.c:417
error at ./error.jl:33
assert_havelock at ./condition.jl:20 [inlined]
assert_havelock at ./condition.jl:43 [inlined]
assert_havelock at ./condition.jl:67 [inlined]
notify at ./condition.jl:118
#notify#463 at ./condition.jl:116 [inlined]
notify at ./condition.jl:116 [inlined]
notify at ./condition.jl:116 [inlined]
uv_timercb at ./asyncevent.jl:148
jlcapi_uv_timercb_4198 at /Users/ethananderes/Software/juliaMaster/usr/lib/julia/sys.dylib (unknown line)
uv__run_timers at /workspace/srcdir/libuv/src/timer.c:162
uv_run at /workspace/srcdir/libuv/src/unix/core.c:352
jl_task_get_next at /Users/ethananderes/Software/juliaMaster/src/partr.c:303
poptaskref at ./task.jl:564
wait at ./task.jl:591
task_done_hook at ./task.jl:327
jl_apply at /Users/ethananderes/Software/juliaMaster/src/./julia.h:1604 [inlined]
jl_finish_task at /Users/ethananderes/Software/juliaMaster/src/task.c:167
jl_threadfun at /Users/ethananderes/Software/juliaMaster/src/partr.c:217
fatal: error thrown and no exception handler available.
ErrorException("concurrency violation detected")
rec_backtrace at /Users/ethananderes/Software/juliaMaster/src/stackwalk.c:94
record_backtrace at /Users/ethananderes/Software/juliaMaster/src/task.c:210
jl_throw at /Users/ethananderes/Software/juliaMaster/src/task.c:417
_pthread_body at /usr/lib/system/libsystem_pthread.dylib (unknown line)
_pthread_start at /usr/lib/system/libsystem_pthread.dylib (unknown line)
error at ./error.jl:33
assert_havelock at ./condition.jl:20 [inlined]
assert_havelock at ./condition.jl:43 [inlined]
assert_havelock at ./condition.jl:67 [inlined]
notify at ./condition.jl:118
#notify#463 at ./condition.jl:116 [inlined]
notify at ./condition.jl:116 [inlined]
notify at ./condition.jl:116 [inlined]
uv_timercb at ./asyncevent.jl:148
jlcapi_uv_timercb_4198 at /Users/ethananderes/Software/juliaMaster/usr/lib/julia/sys.dylib (unknown line)
uv__run_timers at /workspace/srcdir/libuv/src/timer.c:162
uv_run at /workspace/srcdir/libuv/src/unix/core.c:375
jl_task_get_next at /Users/ethananderes/Software/juliaMaster/src/partr.c:303
poptaskref at ./task.jl:564
wait at ./task.jl:591
task_done_hook at ./task.jl:327
jl_apply at /Users/ethananderes/Software/juliaMaster/src/./julia.h:1604 [inlined]
jl_finish_task at /Users/ethananderes/Software/juliaMaster/src/task.c:167
start_task at /Users/ethananderes/Software/juliaMaster/src/task.c:593
fatal: error thrown and no exception handler available.
ErrorException("concurrency violation detected")
rec_backtrace at /Users/ethananderes/Software/juliaMaster/src/stackwalk.c:94
record_backtrace at /Users/ethananderes/Software/juliaMaster/src/task.c:210
jl_throw at /Users/ethananderes/Software/juliaMaster/src/task.c:417
error at ./error.jl:33
assert_havelock at ./condition.jl:20 [inlined]
assert_havelock at ./condition.jl:43 [inlined]
assert_havelock at ./condition.jl:67 [inlined]
notify at ./condition.jl:118
notify_error at ./condition.jl:129 [inlined]
_uv_hook_close at ./asyncevent.jl:132
jl_apply at /Users/ethananderes/Software/juliaMaster/src/./julia.h:1604 [inlined]
jl_uv_call_close_callback at /Users/ethananderes/Software/juliaMaster/src/jl_uv.c:81 [inlined]
jl_uv_closeHandle at /Users/ethananderes/Software/juliaMaster/src/jl_uv.c:100
Segmentation fault: 11

nlw0 commented 5 years ago

I believe I have had this same issue; I hope this can help. At first I thought it could be related to SDL, to using BenchmarkTools, or to running Julia from within Emacs, but I have been able to reproduce it from the Julia command line without using any libraries, just by running the following code:

using Base.Threads

function fun()
    for f in 1:1024
        @threads for j in 1:2
        end
    end
    nothing
end

fun()

You just need to include it a few times before it happens. So I guess it depends on some sort of race condition that just turns out to happen more frequently if you load any libraries, etc.

julia> while true
         @show include("thtest.jl")
       end
include("thtest.jl") = nothing
include("thtest.jl") = nothing
include("thtest.jl") = nothing
include("thtest.jl") = nothing

fatal: error thrown and no exception handler available.
ErrorException("concurrency violation detected")
rec_backtrace at /home/user/src/julia/src/stackwalk.c:94
record_backtrace at /home/user/src/julia/src/task.c:210 [inlined]
jl_throw at /home/user/src/julia/src/task.c:417
error at ./error.jl:33
assert_havelock at ./condition.jl:20 [inlined]
assert_havelock at ./condition.jl:43 [inlined]
assert_havelock at ./condition.jl:67 [inlined]
notify at ./condition.jl:118
#notify#463 at ./condition.jl:116 [inlined]
notify at ./condition.jl:116 [inlined]
notify at ./condition.jl:116 [inlined]
readcb_specialized at ./stream.jl:558
jfptr_readcb_specialized_4895 at /home/user/src/julia/usr/lib/julia/sys.so (unknown line)
jl_apply_generic at /home/user/src/julia/src/gf.c:2191
uv_readcb at ./stream.jl:599
jlcapi_uv_readcb_4104 at /home/user/src/julia/usr/lib/julia/sys.so (unknown line)
uv__read at /workspace/srcdir/libuv/src/unix/stream.c:1179
uv__stream_io at /workspace/srcdir/libuv/src/unix/stream.c:1339
uv__io_poll at /workspace/srcdir/libuv/src/unix/linux-core.c:361
uv_run at /workspace/srcdir/libuv/src/unix/core.c:361
jl_task_get_next at /home/user/src/julia/src/partr.c:303
poptaskref at ./task.jl:564
wait at ./task.jl:591
task_done_hook at ./task.jl:327
jl_apply_generic at /home/user/src/julia/src/gf.c:2191
jl_apply at /home/user/src/julia/src/julia.h:1604 [inlined]
jl_finish_task at /home/user/src/julia/src/task.c:167
start_task at /home/user/src/julia/src/task.c:593
unknown function (ip: 0xffffffffffffffff)

And this is with the jl_wake_libuv(); fix (or at least I think it is). Also, the outer loop seems to be necessary, or at least helps trigger it. EDIT3: everything works fine for me with 1be1deb10c90b612a57d3541d3bf1aae1a42e104
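
Since the failure is intermittent and takes the whole process down, one way to estimate how often it occurs (for example, when comparing commits) is to run the reproducer in fresh child processes. A minimal sketch, assuming the reproducer is saved as a script such as thtest.jl; count_crashes and its keyword names are hypothetical helpers, not part of Julia or of this thread:

```julia
# Hypothetical helper: run `script` in a fresh Julia process `ntrials` times
# and count how many of those runs exit abnormally (non-zero exit or signal).
function count_crashes(script::AbstractString; ntrials::Integer = 20, nthreads::Integer = 4)
    crashes = 0
    # JULIA_NUM_THREADS is inherited by the child processes started below.
    withenv("JULIA_NUM_THREADS" => string(nthreads)) do
        for _ in 1:ntrials
            # ignorestatus keeps `run` from throwing when the child crashes.
            proc = run(ignorestatus(`$(Base.julia_cmd()) --startup-file=no $script`))
            success(proc) || (crashes += 1)
        end
    end
    return crashes
end
```

For example, count_crashes("thtest.jl"; ntrials = 50) returns the number of runs that exited abnormally. Zero crashes in a small number of trials does not show that the race is gone.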

chriselrod commented 5 years ago

I just ran into this segfault

julia> @time chains, tuned_sampler = NUTS_init_tune_mcmc(ada, 1000);
MCMC, adapting ϵ (75 steps)
5.5 s/step ...done
MCMC, adapting ϵ (25 steps)
5.7 s/step ...done
MCMC, adapting ϵ (50 steps)
52.0 s/step ...done
MCMC, adapting ϵ (100 steps)
step 100 (of 100), 52.0 s/step
52.0 s/step ...done
MCMC, adapting ϵ (200 steps)
step 100 (of 200), 52.0 s/step
step 200 (of 200), 52.0 s/step
52.0 s/step ...done
MCMC, adapting ϵ (400 steps)
step 100 (of 400), 52.0 s/step
step 200 (of 400), 52.0 s/step
step 300 (of 400), 52.0 s/step
step 400 (of 400), 52.0 s/step
52.0 s/step ...done
MCMC, adapting ϵ (50 steps)
53.0 s/step ...done
MCMC (1000 steps)
step 100 (of 1000), 53.0 s/step
fatal: error thrown and no exception handler available.
ErrorException("concurrency violation detected")
rec_backtrace at /home/chriselrod/Documents/languages/julia/src/stackwalk.c:94
record_backtrace at /home/chriselrod/Documents/languages/julia/src/task.c:210 [inlined]
jl_throw at /home/chriselrod/Documents/languages/julia/src/task.c:417
error at ./error.jl:33
concurrency_violation at ./condition.jl:8
assert_havelock at ./condition.jl:26 [inlined]
assert_havelock at ./condition.jl:49 [inlined]
assert_havelock at ./condition.jl:73 [inlined]
notify at ./condition.jl:124
#notify#468 at ./condition.jl:122 [inlined]
notify at ./condition.jl:122 [inlined]
uv_fseventscb_file at /home/chriselrod/Documents/languages/julia/usr/share/julia/stdlib/v1.3/FileWatching/src/FileWatching.jl:318
unknown function (ip: 0x7f9da2588e5a)
jlcapi_uv_fseventscb_file_24324_gfthunk at /home/chriselrod/Documents/languages/julia/usr/lib/julia/sys.so (unknown line)
jlcapi_uv_fseventscb_file_24324 at /home/chriselrod/Documents/languages/julia/usr/lib/julia/sys.so (unknown line)
uv__inotify_read at /home/chriselrod/Documents/languages/julia/deps/srccache/libuv-2348256acf5759a544e5ca7935f638d2bc091d60/src/unix/linux-inotify.c:193
uv__io_poll at /home/chriselrod/Documents/languages/julia/deps/srccache/libuv-2348256acf5759a544e5ca7935f638d2bc091d60/src/unix/linux-core.c:361
uv_run at /home/chriselrod/Documents/languages/julia/deps/srccache/libuv-2348256acf5759a544e5ca7935f638d2bc091d60/src/unix/core.c:361
jl_task_get_next at /home/chriselrod/Documents/languages/julia/src/partr.c:306
poptaskref at ./task.jl:564
wait at ./task.jl:591
task_done_hook at ./task.jl:327
jl_apply at /home/chriselrod/Documents/languages/julia/src/julia.h:1610 [inlined]
jl_finish_task at /home/chriselrod/Documents/languages/julia/src/task.c:167
start_task at /home/chriselrod/Documents/languages/julia/src/task.c:593
unknown function (ip: 0xffffffffffffffff)

This ran for about 13 hours before segfaulting (52 seconds/step * 900 steps / (60^2 seconds/hour)). The code has an @threads loop, along with a global const logdensity_lock = Threads.SpinLock() that the threads lock and unlock around their updates to shared state. The specific part of the function:

    nthreads = Threads.nthreads()
    Threads.@threads for thread ∈ 1:nthreads
        thread_target = 0.0
        thread_∂invσ = 0.0
        thread_∂β = 0.0
        sat_start = 1 + (thread - 1)*S ÷ nthreads
        sat_end = thread*S ÷ nthreads
        gradp = @MVector zeros(4)
        gradm = @MVector zeros(7)
        obsstart = thread == 1 ? 1 : SatInds[sat_start - 1] + 1
        @inbounds for sat ∈ sat_start:sat_end
            obsend = SatInds[sat]
            sat_μ = b_sat[sat]
            for obsind ∈ obsstart:obsend
                p, m = PropMethods[obsind]
                l2n = log2Ns[obsind]
                ad = SatADs[obsind]
                μ = p < Pp1 ?  b_prop[p] + sat_μ : sat_μ
                μ = m < Mp1 ? b_method[m] + μ : μ
                t, ∂t∂μ, ∂t∂invσ = ∂normal_lcdf(ad, μ + l2n * β, invσ)
                thread_target += t
                thread_∂invσ += ∂t∂invσ
                thread_∂β += ∂t∂μ * l2n
                grad[sat] += ∂t∂μ
                if p < Pp1
                    gradp[p] += ∂t∂μ
                end
                if m < Mp1
                    gradm[m] += ∂t∂μ
                end
            end
            obsstart = obsend + 1
        end
        Threads.lock(logdensity_lock)
        common_target[] += thread_target
        common_∂invσ[] += thread_∂invσ
        common_∂β[] += thread_∂β
        for m ∈ 1:M
            grad[S+m] = gradm[m]
        end
        for p ∈ 1:P
            grad[S+M+p] = gradp[p]
        end
        Threads.unlock(logdensity_lock)
    end

This example is far from minimal. Given it took 13 hours to get here, it may take me some time to get something smaller to reproduce the problem.
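
For readers who want the shape of that loop without the model-specific parts, here is a much smaller sketch of the same pattern: each iteration reduces its own contiguous chunk, then adds the partial result to shared state while holding a SpinLock. The function threaded_sum is a hypothetical stand-in, not the code above, and it uses the lock(f, l) do-block form so the lock is released even if the body throws:

```julia
using Base.Threads

# Hypothetical stand-in for the real per-observation work: a chunked sum.
function threaded_sum(xs::Vector{Float64})
    total = Ref(0.0)
    l = SpinLock()
    n = nthreads()
    @threads for t in 1:n
        # Contiguous chunk for this iteration (same index split as the loop above).
        lo = 1 + (t - 1) * length(xs) ÷ n
        hi = t * length(xs) ÷ n
        partial = sum(@view xs[lo:hi])   # thread-local work, no lock needed
        lock(l) do
            total[] += partial           # shared update, done under the lock
        end
    end
    return total[]
end
```

This does not reproduce the crash by itself; it only isolates the locking pattern so it can be checked separately from the rest of the model.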

Commenting to say that the issue still seems to be here. That was with:

julia> versioninfo()
Julia Version 1.3.0-DEV.185
Commit c9777b0ab4* (2019-05-08 14:34 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-9940X CPU @ 3.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.0 (ORCJIT, skylake)

That's what I get for running slow code on master for no good reason, ha ha.

JeffBezanson commented 5 years ago

The error is from a FileWatching event. Are you using Revise?
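
Every backtrace in this thread reaches notify through assert_havelock in condition.jl, which suggests a condition being notified by a task that does not hold the associated lock. The sketch below shows that check with the public Threads.Condition type. It is an illustration of the assertion only, under the assumption that the same kind of check is what fires here; it does not go through the libuv callbacks seen in the traces, and the exact exception type and message may differ between Julia versions:

```julia
# Calling notify on a Threads.Condition without holding its lock throws.
function notify_without_lock_throws()
    c = Threads.Condition()
    try
        notify(c)          # lock not held, so the have-lock assertion fails
        return false
    catch
        return true
    end
end

# The supported pattern: hold the condition's lock around notify.
function notify_with_lock()
    c = Threads.Condition()
    lock(c)
    try
        return notify(c)   # number of tasks woken; nobody is waiting here
    finally
        unlock(c)
    end
end
```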

chriselrod commented 5 years ago

Yes. Next time I start a long-running threaded program, I'll start julia with --startup-file=no (using Revise is in my startup.jl).
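
An alternative to skipping the startup file entirely is to make the Revise load conditional. A minimal sketch for ~/.julia/config/startup.jl, assuming an opt-out environment variable; the name JULIA_SKIP_REVISE is hypothetical and is not something Julia or Revise defines:

```julia
# startup.jl fragment: load Revise unless the (hypothetical) opt-out variable is set.
if get(ENV, "JULIA_SKIP_REVISE", "0") != "1"
    try
        using Revise
    catch err
        # Revise missing or failing to load should not block the session.
        @warn "Could not load Revise" exception = err
    end
end
```

Starting a long threaded run with JULIA_SKIP_REVISE=1 set in the environment would then leave Revise, and its file watching, out of that session.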