JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

GC error (probable corruption) #43567

Open DatName opened 2 years ago

DatName commented 2 years ago

I have a relatively large multithreaded application that runs fine on 1.6.4 but segfaults on 1.7 and 1.7.1. I will try to create a minimal example that reproduces the segfault, but for now I only have the console log:

GC error (probable corruption) :
Allocations: 480045702 (Pool: 479950106; Big: 95596); GC: 244
Array{
!!! ERROR in jl_ -- ABORTING !!!
0x7f4734343100: Queued root: 0x7f46a8784010 :: 0x7f46d8f494b0 (bits: 3)
        of type 
!!! ERROR in jl_ -- ABORTING !!!
0x7f4734343118: Queued root: 0x7f46a861c010 :: 0x7f46d8f494b0 (bits: 3)
        of type 
!!! ERROR in jl_ -- ABORTING !!!
0x7f4734343130: Queued root: 0x7f46dbafa650 :: 0x7f46d7de01a0 (bits: 3)
        of type 
!!! ERROR in jl_ -- ABORTING !!!
0x7f4734343148: Queued root: 0x7f4677dd8ad0 :: 0x7f46d7de01a0 (bits: 3)
        of type 

....

!!! ERROR in jl_ -- ABORTING !!!
0x7f4734344660: Queued root: 0x7f46adf04e70 :: 0x7f476597cc40 (bits: 3)
        of type 
!!! ERROR in jl_ -- ABORTING !!!
0x7f4734344678:  r-- Stack frame 0x7f46c3676240 -- 1 of 6 (direct)
0x7f47343446a0:   `- Stack frame 0x7f4652fcf060 -- 124 of 298 (direct)

signal (6): Aborted
in expression starting at none:0
pthread_kill at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
raise at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
gc_assert_datatype_fail at /buildworker/worker/package_linux64/build/src/gc.c:1657
gc_mark_loop at /buildworker/worker/package_linux64/build/src/gc.c:2711
_jl_gc_collect at /buildworker/worker/package_linux64/build/src/gc.c:3039
jl_gc_collect at /buildworker/worker/package_linux64/build/src/gc.c:3248
maybe_collect at /buildworker/worker/package_linux64/build/src/gc.c:882 [inlined]
jl_gc_pool_alloc at /buildworker/worker/package_linux64/build/src/gc.c:1209
export_event at /path/src/events/process_events.jl:99
process_event at /path/src/events/process_events.jl:12
guarded_process_event at /path/src/server/state/start.jl:387
unknown function (ip: 0x7f46a42d1cd2)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
consume_output_events at /path/src/server/state/start.jl:381
unknown function (ip: 0x7f46c5e8facd)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
macro expansion at /path/src/task_utils/generic_handler.jl:64 [inlined]
#35 at /home/.julia/packages/ThreadPools/hwwUU/src/macros.jl:261
unknown function (ip: 0x7f46c5e8d2df)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1788 [inlined]
start_task at /buildworker/worker/package_linux64/build/src/task.c:877
Allocations: 480045702 (Pool: 479950106; Big: 95596); GC: 244
Aborted (core dumped)

julia> versioninfo()
Julia Version 1.7.1
Commit ac5cc99908 (2021-12-22 19:35 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-10870H CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)

Keno commented 2 years ago

Unfortunately, this will not be debuggable without a reproducer or an rr trace.

DatName commented 2 years ago

I see. When I run it with

export JULIA_NUM_THREADS=12
./julia --bug-report=rr-local

the program just stalls on a non-blocking call:

julia> start!(ctx)
[ Info: Listening on: 0.0.0.0:26000

^CERROR: InterruptException:
Stacktrace:
  [1] poptask(W::Base.InvasiveLinkedListSynchronized{Task})
    @ Base ./task.jl:827
  [2] wait()
    @ Base ./task.jl:836
  [3] wait(c::Base.GenericCondition{Base.Threads.SpinLock})
    @ Base ./condition.jl:123
  [4] wait(x::Base.Process)
    @ Base ./process.jl:627
  [5] success
    @ ./process.jl:489 [inlined]
  [6] run(::Cmd; wait::Bool)
    @ Base ./process.jl:446
  [7] run
    @ ./process.jl:444 [inlined]
  [8] (::BugReporting.var"#7#8"{Nothing, Tuple{Cmd, Vector{String}}})(rr_path::String)
    @ BugReporting ~/.julia/packages/BugReporting/7auqP/src/BugReporting.jl:132
  [9] (::JLLWrappers.var"#2#3"{BugReporting.var"#7#8"{Nothing, Tuple{Cmd, Vector{String}}}, String})()
    @ JLLWrappers ~/.julia/packages/JLLWrappers/bkwIo/src/runtime.jl:49
 [10] withenv(::JLLWrappers.var"#2#3"{BugReporting.var"#7#8"{Nothing, Tuple{Cmd, Vector{String}}}, String}, ::Pair{String, String}, ::Vararg{Pair{String, String}})
    @ Base ./env.jl:172
 [11] withenv_executable_wrapper(f::Function, executable_path::String, PATH::String, LIBPATH::String, adjust_PATH::Bool, adjust_LIBPATH::Bool)
    @ JLLWrappers ~/.julia/packages/JLLWrappers/bkwIo/src/runtime.jl:48
 [12] #invokelatest#2
    @ ./essentials.jl:716 [inlined]
 [13] invokelatest
    @ ./essentials.jl:714 [inlined]
 [14] #rr#7
    @ ~/.julia/packages/JLLWrappers/bkwIo/src/products/executable_generators.jl:7 [inlined]
 [15] rr
    @ ~/.julia/packages/JLLWrappers/bkwIo/src/products/executable_generators.jl:7 [inlined]
 [16] #rr_record#6
    @ ~/.julia/packages/BugReporting/7auqP/src/BugReporting.jl:122 [inlined]
 [17] rr_record
    @ ~/.julia/packages/BugReporting/7auqP/src/BugReporting.jl:119 [inlined]
 [18] make_interactive_report(report_type::String, ARGS::Vector{String})
    @ BugReporting ~/.julia/packages/BugReporting/7auqP/src/BugReporting.jl:208
 [19] #invokelatest#2
    @ ./essentials.jl:716 [inlined]
 [20] invokelatest
    @ ./essentials.jl:714 [inlined]
 [21] report_bug(kind::String)
    @ InteractiveUtils ~/code/julia/julia-1.7.1/share/julia/stdlib/v1.7/InteractiveUtils/src/InteractiveUtils.jl:397
 [22] exec_options(opts::Base.JLOptions)
    @ Base ./client.jl:233
 [23] _start()
    @ Base ./client.jl:495

Could this be by any chance related?

Keno commented 2 years ago

> Could this be by any chance related?

Perhaps, but the backtrace is from the outer process, not from where it's actually blocked. Also, rr can make things slow, so you may just need to let it run for a while.

JeffBezanson commented 2 years ago

You can also try running with --check-bounds=yes.
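For context on why that flag can help: `--check-bounds=yes` makes Julia ignore `@inbounds`, so an out-of-range index raises a `BoundsError` instead of possibly writing past the end of an array, which is one way a heap can end up corrupted before the GC notices. A minimal sketch follows; `fill_prefix!` is a hypothetical helper for illustration only, not code from the reported application.

```julia
# Hypothetical helper (not from the reported application): writes 1..n into v
# with bounds checks elided by @inbounds.
function fill_prefix!(v::Vector{Int}, n::Int)
    @inbounds for i in 1:n
        v[i] = i
    end
    return v
end

v = zeros(Int, 4)
fill_prefix!(v, 4)   # in range, so this is safe under any setting

# fill_prefix!(v, 8) would be out of range. With default settings it may
# silently write past the end of `v`; started as `julia --check-bounds=yes`,
# the same call throws a BoundsError at the faulty index instead.
```

If the application or one of its dependencies has such a bug, running under `--check-bounds=yes` should turn the delayed GC abort into an immediate error with a usable stack trace.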

aeisman commented 2 years ago

I've had a similar problem with 1.7.2. Downgrading to 1.6.6 LTS resolved it, so it does appear to be specific to the Julia version.

DilumAluthge commented 2 years ago

I talked with Aaron out-of-band, and here are some more details on the code he ran:

He has a function gwas_extract_snps defined as follows:

function gwas_extract_snps(gwas_fh,gwas_keep_fh,keep_snp_set,delim)
    # extract keep_snp_set of snps from a gwas file
    gwas_io = GZip.open(gwas_fh)
    gwas_keep_io = open(gwas_keep_fh,"w")
    i = 1
    for line in eachline(gwas_io)
        snp = split(line,delim)[2]
        if in(snp,keep_snp_set)
            write(gwas_keep_io,line*"\n")
        end
        i += 1
        if (i % 1000000) == 0
            #println(i)
        end
    end
    close(gwas_io)
    close(gwas_keep_io)
end

And then he has a Distributed for loop of the form:

Distributed.@distributed vcat for met in met_arr_keep
    #download file from google bucket
    #run gwas_extract_snps()
    #delete original file
end

This table shows whether or not he gets the segfault. ✅ means no segfault. ❌ means he encountered the segfault.

| Julia version | @distributed | -p | Result | Notes |
| --- | --- | --- | --- | --- |
| 1.6.6 | yes | 2 | ✅ | Command-line |
| 1.7.2 | yes | 2 | ❌ | Command-line |
| 1.7.2 | no | 1 | ✅ | REPL |

His data cannot be shared publicly, unfortunately, so we don't have an MWE.
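
Since the real data can't be shared, one possible starting point for an MWE is a synthetic job with the same shape: several input files, a per-file filter on the second column, and a `@distributed vcat` loop over the files. The sketch below is untested against the crash and rests on several assumptions: `extract_keep_lines` and `run_synthetic_job` are hypothetical names, the inputs are generated plain-text files rather than gzipped downloads (so `GZip.open` is replaced by `open`, which may matter for reproducing the problem), and `addprocs(2)` stands in for launching with `-p 2`.

```julia
using Distributed

# Assumption: two local workers stand in for launching with `julia -p 2`.
nprocs() == 1 && addprocs(2)

# Hypothetical stand-in for gwas_extract_snps: reads plain text instead of GZip
# and keeps the lines whose second field is in keep_set.
@everywhere function extract_keep_lines(in_path, out_path, keep_set, delim)
    open(out_path, "w") do out
        for line in eachline(in_path)
            fields = split(line, delim)
            if length(fields) >= 2 && fields[2] in keep_set
                println(out, line)
            end
        end
    end
    return out_path
end

function run_synthetic_job(nfiles::Int, nlines::Int)
    dir = mktempdir()
    # Generate synthetic inputs in place of the files downloaded from the bucket.
    inputs = map(1:nfiles) do k
        path = joinpath(dir, "gwas_$k.tsv")
        open(path, "w") do io
            for j in 1:nlines
                println(io, "chr1\trs$j\t0.5")
            end
        end
        path
    end
    keep = Set("rs$j" for j in 1:10:nlines)

    results = @distributed vcat for path in inputs
        out = path * ".keep"
        extract_keep_lines(path, out, keep, '\t')
        rm(path)   # mirrors the "delete original file" step
        [out]
    end
    return results
end

outputs = run_synthetic_job(4, 1000)
```

If this runs cleanly on 1.7.2, scaling up the file count and line count, or switching the reader back to `GZip.open`, would help narrow down which ingredient triggers the failure.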