JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

Segmentation fault in GraphPPL.jl tests on 1.11, works fine in debugger #56459

Open bvdmitri opened 1 week ago

bvdmitri commented 1 week ago

This test in GraphPPL.jl causes a segmentation fault. The segmentation fault can be reproduced by copy-pasting the content of the test (plus necessary imports) in REPL. Interestingly enough, the test passes normally while debugging. The notable thing is that this line

y = getorcreate!(model, ctx, :y, 1)

should return a fully initialized y, but on 1.11 it returns an array of #undef values. [image]

The code in the loop uses isassigned under the hood to initialize the elements of y, and the check works correctly both while debugging and in 1.10, e.g. in the VSCode debugger view I get [image]

The fact that debugging works normally does not really allow us to narrow down the scope of the issue. It also doesn't seem to happen in real code that relies on this functionality, only in tests. Julia shouldn't really segfault so it might indicate deeper problems somewhere else.

julia> versioninfo()
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 11 × Apple M3 Pro
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, apple-m3)
Threads: 1 default, 0 interactive, 1 GC (on 5 virtual cores)

The code that segfaults is on the main branch

commit c97718a10bcf035cff093acf52ee9fe30f225b35 (HEAD -> main, origin/main, origin/HEAD)
Author: Wouter Nuijten <wouternuijten@gmail.com>
Date:   Fri Oct 11 11:44:21 2024 +0200

    Update codecov action

(GraphPPL) pkg> st
Project GraphPPL v4.3.3 
Status `~/.julia/dev/GraphPPL.jl/Project.toml`
  [0f2f92aa] BitSetTuples v1.1.5
  [864edb3b] DataStructures v0.18.20
  [85a47980] Dictionaries v0.4.2
  [1914dd2f] MacroTools v0.5.13
  [fa8bd995] MetaGraphsNext v0.7.1
  [d9ec5142] NamedTupleTools v0.14.3
  [aedffcd0] Static v1.1.1
  [90137ffa] StaticArrays v1.9.8
  [9d95972d] TupleTools v1.6.0
  [9602ed7d] Unrolled v0.1.5
bvdmitri commented 1 week ago

The error

julia> GraphPPL.add_terminated_submodel!(model, ctx, options, hgf, (y = y,), static(1))

[54306] signal 11 (2): Segmentation fault: 11
in expression starting at REPL[28]:1
add_edge! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/graph_engine.jl:1748 [inlined]
add_edge! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/graph_engine.jl:1702
unknown function (ip: 0x327230123)
#93 at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/graph_engine.jl:2093
foreach at ./abstractarray.jl:3187 [inlined]
materialize_factor_node! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/graph_engine.jl:2092 [inlined]
make_node! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/graph_engine.jl:2078
make_node! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/graph_engine.jl:1976 [inlined]
make_node! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/graph_engine.jl:1905 [inlined]
make_node! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/graph_engine.jl:1901 [inlined]
make_node! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/graph_engine.jl:1887 [inlined]
macro expansion at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/model_macro.jl:594 [inlined]
macro expansion at /Users/bvdmitri/.julia/dev/GraphPPL.jl/test/testutils.jl:243 [inlined]
add_terminated_submodel! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/model_macro.jl:726
make_node! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/model_macro.jl:710
make_node! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/graph_engine.jl:2034 [inlined]
make_node! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/graph_engine.jl:1892 [inlined]
make_node! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/graph_engine.jl:1887 [inlined]
macro expansion at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/model_macro.jl:545 [inlined]
macro expansion at /Users/bvdmitri/.julia/dev/GraphPPL.jl/test/testutils.jl:248 [inlined]
add_terminated_submodel! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/model_macro.jl:726 [inlined]
make_node! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/model_macro.jl:710
make_node! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/graph_engine.jl:2034 [inlined]
make_node! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/graph_engine.jl:1892 [inlined]
make_node! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/graph_engine.jl:1887 [inlined]
macro expansion at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/model_macro.jl:548 [inlined]
macro expansion at /Users/bvdmitri/.julia/dev/GraphPPL.jl/test/testutils.jl:270 [inlined]
add_terminated_submodel! at /Users/bvdmitri/.julia/dev/GraphPPL.jl/src/model_macro.jl:726
unknown function (ip: 0x3271ed073)
jl_apply at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/./julia.h:2157 [inlined]
do_call at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/interpreter.c:126
eval_stmt_value at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/interpreter.c:174
eval_body at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/interpreter.c:663
jl_interpret_toplevel_thunk at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/interpreter.c:821
jl_toplevel_eval_flex at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/toplevel.c:943
jl_toplevel_eval_flex at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/toplevel.c:886
jl_toplevel_eval_flex at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/toplevel.c:886
jl_toplevel_eval_flex at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/toplevel.c:886
ijl_toplevel_eval at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/toplevel.c:952 [inlined]
ijl_toplevel_eval_in at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/toplevel.c:994
eval at ./boot.jl:430 [inlined]
eval_user_input at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:245
repl_backend_loop at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:342
#start_repl_backend#59 at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:327
start_repl_backend at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:324
#run_repl#72 at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:483
run_repl at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:469
jfptr_run_repl_10089 at /Users/bvdmitri/.julia/juliaup/julia-1.11.1+0.aarch64.apple.darwin14/share/julia/compiled/v1.11/REPL/u0gqU_pEq4i.dylib (unknown line)
#1139 at ./client.jl:446
jfptr_YY.1139_14579 at /Users/bvdmitri/.julia/juliaup/julia-1.11.1+0.aarch64.apple.darwin14/share/julia/compiled/v1.11/REPL/u0gqU_pEq4i.dylib (unknown line)
jl_apply at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/./julia.h:2157 [inlined]
jl_f__call_latest at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/builtins.c:875
#invokelatest#2 at ./essentials.jl:1055 [inlined]
invokelatest at ./essentials.jl:1052 [inlined]
run_main_repl at ./client.jl:430
repl_main at ./client.jl:567 [inlined]
_start at ./client.jl:541
jfptr__start_72559.1 at /Users/bvdmitri/.julia/juliaup/julia-1.11.1+0.aarch64.apple.darwin14/lib/julia/sys.dylib (unknown line)
jl_apply at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/./julia.h:2157 [inlined]
true_main at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/jlapi.c:900
jl_repl_entrypoint at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/jlapi.c:1059
Allocations: 30712733 (Pool: 30710467; Big: 2266); GC: 42
[1]    54306 segmentation fault  julia
giordano commented 1 week ago

Can we get a self-contained MWE instead of referencing code on another repository? I'm asking this also because

The segmentation fault can be reproduced by copy-pasting the content of the test (plus necessary imports) in REPL.

is very much not true as one needs also to copy a bunch of definitions in https://github.com/ReactiveBayes/GraphPPL.jl/blob/c97718a10bcf035cff093acf52ee9fe30f225b35/test/testutils.jl and tracking down all missing imports (which are a lot) isn't fun.

Side note: starting julia with --check-bounds=yes sometimes helps track down segfaults, if they are caused by indexing arrays out of bounds.

bvdmitri commented 1 week ago

Ok, the problem with #undef seems to be fully visual, it is initialized, but the generic code for show prints it as #undef for whatever reason. But the segmentation fault in 1.11 is still real.

instead of referencing code on another repository

Here is the most minimal example I could come up with @giordano, however, the issue appears in the repository and I cannot create an MWE without the package:

using GraphPPL, Distributions
import GraphPPL: @model

@model function gcv(κ, ω, z, x, y)
    log_σ := κ * z + ω
    y ~ Normal(x, exp(log_σ))
end

@model function gcv_lm(y, x_prev, x_next, z, ω, κ)
    x_next ~ gcv(x = x_prev, z = z, ω = ω, κ = κ)
    y ~ Normal(x_next, 1)
end

@model function hgf(y)

    # Specify priors

    ξ ~ Gamma(1, 1)
    ω_1 ~ Normal(0, 1)
    ω_2 ~ Normal(0, 1)
    κ_1 ~ Normal(0, 1)
    κ_2 ~ Normal(0, 1)
    x_1[1] ~ Normal(0, 1)
    x_2[1] ~ Normal(0, 1)
    x_3[1] ~ Normal(0, 1)

    # Specify generative model

    for i in 2:(length(y) + 1)
        x_3[i] ~ Normal(x_3[i - 1], ξ)
        x_2[i] ~ gcv(x = x_2[i - 1], z = x_3[i], ω = ω_2, κ = κ_2)
        x_1[i] ~ gcv_lm(x_prev = x_1[i - 1], z = x_2[i], ω = ω_1, κ = κ_1, y = y[i - 1])
    end
end

function mwe()
    model = GraphPPL.Model(identity, GraphPPL.PluginsCollection(), GraphPPL.DefaultBackend())
    ctx = GraphPPL.getcontext(model)
    y = nothing
    for i in 1:10
        y = GraphPPL.getorcreate!(model, ctx, :y, i)
    end
    GraphPPL.add_terminated_submodel!(model, ctx, GraphPPL.NodeCreationOptions(), hgf, (y = y,), GraphPPL.static(1))
    return model
end

mwe() isa GraphPPL.Model

This code segfaults in 1.11.

I also tried to debug it manually, with no success. I also dev-ed all the dependencies and removed every @inbounds from their code. It didn't help. Using --check-bounds=yes didn't help to identify the issue either. However, I noticed that changing the following code in GraphPPL from

for variable_node in variable_nodes
    add_edge!(model, factor_node_id, factor_node_propeties, variable_node, interface_name, index)
    index += increase_index(variable_node)
end

to

foreach(variable_nodes) do variable_node
    add_edge!(model, factor_node_id, factor_node_propeties, variable_node, interface_name, index)
    index += increase_index(variable_node)
end

fixes the problem and there is no segmentation fault. My CS expertise is not good enough to track down segmentation faults.

giordano commented 1 week ago

however, the issue appears in the repository and I cannot create an MWE without the package:

While a reproducer should preferably be as small as possible (crafting a minimal reproducer, for example by binary search if you have no other clue, is already a large chunk of the work of hunting down a bug), saying "go and copy some code from somewhere else" doesn't work very well. I tried for like 10 minutes to build the example by copying the code piece by piece from the tests but gave up out of frustration because I'm not familiar with the codebase and didn't know what to do exactly.

That said, the segfault doesn't seem to reproduce on master (at least not on ee09ae70d9f4a04ed8b745f36d3c5d9d578d2887, ~on some later versions JLD2.jl is broken~ Edit: JLD2 v0.5.8 fixed the issue) for me, so the bisection could be done to find the patch which fixed it.

bvdmitri commented 1 week ago

saying "go and copy some code from somewhere else" doesn't work very well. I tried for like 10 minutes

Point taken, indeed I thought it would be easier, sorry for not preparing a better MWE. Nice to hear that it is fixed on master. I can try to run the bisection, is there a script that simplifies this process?

giordano commented 1 week ago

is there a script that simplifies this process?

I usually use a variation of the following script with git bisect run, depending on what exactly is needed to reproduce the bug

#!/bin/bash

make cleanall || true
make -j60 USECCACHE=1 || exit 125

./usr/bin/julia --startup-file=no my_reproducer.jl

EXIT_CODE=$?
if [[ "${EXIT_CODE}" -eq 139 ]]; then
    # For git bisect we need to return an exit status less than 128, but if the
    # program segfaults it exits with 128+11=139, so we return 11.  Leave all
    # other cases unchanged.
    exit 11
else
    exit "${EXIT_CODE}"
fi
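
The same idea can be generalized beyond segfaults. The following is only a sketch: `bisect_status` is a hypothetical helper name, not something from git or Julia. It assumes the usual shell convention that a process killed by signal N reports status 128+N, and the documented `git bisect run` rules that 0 means good, 125 means skip, and anything of 128 or above aborts the bisection.

```shell
# Hypothetical helper (not part of git or Julia): print the status a bisect
# script should exit with, given the raw exit status of the reproducer.
bisect_status() {
    status="$1"
    if [ "$status" -ge 128 ]; then
        # Killed by a signal (or otherwise out of range): 128 + N becomes N.
        status=$((status - 128))
        # 0 would read as "good" and 125 as "skip", so report a plain failure.
        if [ "$status" -eq 0 ] || [ "$status" -eq 125 ]; then
            status=1
        fi
    fi
    echo "$status"
}
```

With this in place, the tail of a bisect script could be reduced to running the reproducer and then calling `exit "$(bisect_status $?)"`.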
bvdmitri commented 5 days ago

Well, I tried for quite some time to run git bisect (a couple of hours, given the compilation time), but it says either "Some good revs are not ancestors of the bad rev." or "Bisecting: a merge base must be tested". I tried bisecting from v1.11 to master. I think v1.11 and master have diverged? I'm not sure how I'm supposed to bisect this, so any help is appreciated. How do I identify a linear commit history so that I can just run git bisect run?

giordano commented 5 days ago

Releases are cut from branches, not from master. Find the first commit in the release 1.11 branch since the branching out, the parent will be in master. Also, check if you can reproduce the bug on 1.11 alpha 0, 1 or whatever that's called, that gives you an idea of what direction to look at

bvdmitri commented 5 days ago

Find the first commit in the release 1.11 branch since the branching out, the parent will be in master.

That's what I'm struggling with, I'm not sure how to do it

giordano commented 5 days ago

From the github web interface: go to https://github.com/JuliaLang/julia, choose the release-1.11 branch, you get to https://github.com/JuliaLang/julia/tree/release-1.11, click on 448 commits ahead of and get to https://github.com/JuliaLang/julia/compare/master...release-1.11. The top commit (https://github.com/JuliaLang/julia/commit/7dad444e7b73a5bc993ee3a3839c79acb8620fe4) is the first one since branching out, its parent https://github.com/JuliaLang/julia/commit/aecd8fd379a53afa780bc8a8404728b6aa22d6bc is on master

From the command line, you can probably do something like git log master...release-1.11, or something like that (I can't check it on the phone). Edit: you can use git log --reverse --oneline master..origin/release-1.11 to see what's the first commit.
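
For reference, a rough sketch of doing the same lookup with plain git commands. The function names `branch_point` and `first_branch_commit` are made up for illustration; `git merge-base` and `git rev-list` are standard git commands. It assumes the release branch has a simple linear history after the fork, in which case the parent of the first branch commit is the merge base.

```shell
# Hypothetical helpers, to be run inside a clone of the repository.

# Print the commit where BRANCH forked from MAINLINE (their merge base).
branch_point() {
    git merge-base "$1" "$2"
}

# Print the oldest commit that is reachable from BRANCH but not from MAINLINE.
first_branch_commit() {
    git rev-list --reverse "$1..$2" | head -n 1
}
```

In a Julia checkout this would be called as `branch_point master origin/release-1.11` and `first_branch_commit master origin/release-1.11`.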

giordano commented 4 days ago

Couple of comments:

I'd say this is the range to look into for the fix: https://github.com/JuliaLang/julia/compare/aecd8fd379a53afa780bc8a8404728b6aa22d6bc...ee09ae70d9f4a04ed8b745f36d3c5d9d578d2887 (first is bad, last is good). Edit: for the record, it reproduces also on a06a80162bb9bdf6f7e91dc18e7ccf5c12673ca4 but not 4b27a169bda6ac970fc677962c30af51a6a9ca74
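
Since this range goes from broken to fixed, the bisection is looking for the commit that fixed the crash rather than the one that introduced it. One possible way to set that up, as a sketch only, is with git's custom bisect terms. `fix_hunt_status`, `start_fix_hunt`, and `find_fix.sh` are hypothetical names; the `--term-old`/`--term-new` options and the two commit hashes are real, the latter taken from the range above.

```shell
# Hypothetical helper: map the reproducer's exit status to a verdict for a
# bisection that looks for the commit that FIXED the crash.
fix_hunt_status() {
    case "$1" in
        139) echo 0 ;;    # still segfaults: the "old" (broken) state
        0)   echo 1 ;;    # runs cleanly: the "new" (fixed) state
        *)   echo 125 ;;  # anything else is inconclusive: skip this commit
    esac
}

# Assumed driver, to be run from a checkout of JuliaLang/julia. `find_fix.sh`
# is a placeholder for a script that builds Julia, runs the reproducer, and
# exits with "$(fix_hunt_status $?)".
start_fix_hunt() {
    git bisect start --term-old=broken --term-new=fixed &&
    git bisect broken aecd8fd379a53afa780bc8a8404728b6aa22d6bc &&
    git bisect fixed ee09ae70d9f4a04ed8b745f36d3c5d9d578d2887 &&
    git bisect run ./find_fix.sh
}
```

The inversion matters because `git bisect run` treats exit status 0 as the old term and 1 to 127 (except 125) as the new term, so a script that simply forwards the reproducer's status would point the search in the wrong direction.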

giordano commented 4 days ago

Good news: the segfault disappeared on 25cbe006f3a610c204d8f2f67f1200a13a8ce349 (merge commit of #55767). Bad news: that looks a bit too large of a commit to backport it to v1.11. CC: @vtjnash in case he has a clue of how to solve this on v1.11.

vtjnash commented 4 days ago

That does at least give us a pretty good idea of what kind of issue it is likely to be. Somewhat hard to be sure if it is better just to backport that (lots of lines, but very low risk internal only change which only helps Enzyme support this version easier even though it also breaks Enzyme) or investigate whether a more specific fix is possible

giordano commented 4 days ago

Segfault first appeared in #52405 (corresponding change in our fork of llvm: https://github.com/JuliaLang/llvm-project/pull/23)

e5046b4579cf571931714abbe14a3a049ca6383b is the first bad commit
commit e5046b4579cf571931714abbe14a3a049ca6383b
Author: Gabriel Baraldi <baraldigabriel@gmail.com>
Date:   Thu Dec 7 11:21:38 2023 -0500

    Bump LLVM to 15.0.7+10 to fix GC issue (#52405)

 deps/checksums/clang            | 216 ++++++++++----------
 deps/checksums/lld              | 216 ++++++++++----------
 deps/checksums/llvm             | 436 ++++++++++++++++++++--------------------
 deps/clang.version              |   2 +-
 deps/lld.version                |   2 +-
 deps/llvm-tools.version         |   4 +-
 deps/llvm.version               |   6 +-
 stdlib/LLD_jll/Project.toml     |   2 +-
 stdlib/libLLVM_jll/Project.toml |   2 +-
 9 files changed, 443 insertions(+), 443 deletions(-)

but this looks unhelpful, since it was backported to julia v1.10 (https://github.com/JuliaLang/julia/commit/1e66ce2de71e1215f9021a300ad30ef95427a765) and llvm 15 isn't in julia v1.11