JuliaPy / Conda.jl

Conda managing Julia binary dependencies

Reproducible segmentation fault during build on linux/arm64 #228

Open MilesCranmer opened 1 year ago

MilesCranmer commented 1 year ago

I'm trying to build docker images for PySR (which is built on PyJulia), and the arm64 jobs fail consistently because of a segmentation fault when building Conda.jl. The amd64 jobs are fine.

Here's the traceback:

```
#15 69.87 Building Conda ─→ `~/.julia/scratchspaces/44cfe95a-1eb2-52ea-b672-e2afdf69b78f/6e47d11ea2776bc5627421d59cdcc1296c058071/build.log`
#15 84.11 ERROR: LoadError: Error building `Conda`:
#15 94.97
#15 94.97 signal (11): Segmentation fault
#15 94.97 in expression starting at /root/.julia/packages/Conda/x2UxR/deps/build.jl:106
#15 94.97 top-level scope at /root/.julia/packages/Conda/x2UxR/deps/build.jl:106
#15 94.97 jl_toplevel_eval_flex at /cache/build/default-armageddon-0/julialang/julia-release-1-dot-8/src/toplevel.c:897
#15 94.97 jl_toplevel_eval_flex at /cache/build/default-armageddon-0/julialang/julia-release-1-dot-8/src/toplevel.c:850
#15 94.97 ijl_toplevel_eval_in at /cache/build/default-armageddon-0/julialang/julia-release-1-dot-8/src/toplevel.c:965
#15 94.97 eval at ./boot.jl:368 [inlined]
#15 94.97 include_string at ./loading.jl:1428
#15 94.97 _jl_invoke at /cache/build/default-armageddon-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
#15 94.97 ijl_apply_generic at /cache/build/default-armageddon-0/julialang/julia-release-1-dot-8/src/gf.c:2549
#15 94.97 _include at ./loading.jl:1488
#15 94.97 include at ./client.jl:476
#15 94.97 unknown function (ip: 0x55170ff553)
#15 94.97 _jl_invoke at /cache/build/default-armageddon-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
#15 94.97 ijl_apply_generic at /cache/build/default-armageddon-0/julialang/julia-release-1-dot-8/src/gf.c:2549
#15 94.97 jl_apply at /cache/build/default-armageddon-0/julialang/julia-release-1-dot-8/src/julia.h:1839 [inlined]
#15 94.97 do_call at /cache/build/default-armageddon-0/julialang/julia-release-1-dot-8/src/interpreter.c:126
#15 94.97 eval_value at /cache/build/default-armageddon-0/julialang/julia-release-1-dot-8/src/interpreter.c:215
#15 94.97 eval_stmt_value at /cache/build/default-armageddon-0/julialang/julia-release-1-dot-8/src/interpreter.c:166 [inlined]
#15 94.97 eval_body at /cache/build/default-armageddon-0/julialang/julia-release-1-dot-8/src/interpreter.c:612
#15 94.97 jl_interpret_toplevel_thunk at /cache/build/default-armageddon-0/julialang/julia-release-1-dot-8/src/interpreter.c:750
#15 94.97 jl_toplevel_eval_flex at /cache/build/default-armageddon-0/julialang/julia-release-1-dot-8/src/toplevel.c:906
#15 94.97 jl_toplevel_eval_flex at /cache/build/default-armageddon-0/julialang/julia-release-1-dot-8/src/toplevel.c:850
#15 94.97 ijl_toplevel_eval_in at /cache/build/default-armageddon-0/julialang/julia-release-1-dot-8/src/toplevel.c:965
#15 94.97 eval at ./boot.jl:368 [inlined]
#15 94.97 exec_options at ./client.jl:276
#15 94.97 _start at ./client.jl:522
#15 94.97 jfptr__start_49479 at /opt/julia/lib/julia/sys.so (unknown line)
#15 94.97 _jl_invoke at /cache/build/default-armageddon-0/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
#15 94.97 ijl_apply_generic at /cache/build/default-armageddon-0/julialang/julia-release-1-dot-8/src/gf.c:2549
#15 94.97 jl_apply at /cache/build/default-armageddon-0/julialang/julia-release-1-dot-8/src/julia.h:1839 [inlined]
#15 94.97 true_main at /cache/build/default-armageddon-0/julialang/julia-release-1-dot-8/src/jlapi.c:575
#15 94.97 jl_repl_entrypoint at /cache/build/default-armageddon-0/julialang/julia-release-1-dot-8/src/jlapi.c:719
#15 94.97 main at /cache/build/default-armageddon-0/julialang/julia-release-1-dot-8/cli/loader_exe.c:59
#15 94.97 __libc_start_main at /lib/aarch64-linux-gnu/libc.so.6 (unknown line)
#15 94.97 _start at /opt/julia/bin/julia (unknown line)
#15 94.97 _start at /opt/julia/bin/julia (unknown line)
#15 94.97 Allocations: 873483 (Pool: 872903; Big: 580); GC: 1
#15 94.99 Stacktrace:
#15 94.99 [1] pkgerror(msg::String)
#15 95.36 @ Pkg.Types /opt/julia/share/julia/stdlib/v1.8/Pkg/src/Types.jl:67
#15 95.49 [2] (::Pkg.Operations.var"#66#73"{Bool, Pkg.Types.Context, String, Pkg.Types.PackageSpec, String})()
#15 95.67 @ Pkg.Operations /opt/julia/share/julia/stdlib/v1.8/Pkg/src/Operations.jl:1060
#15 95.67 [3] withenv(::Pkg.Operations.var"#66#73"{Bool, Pkg.Types.Context, String, Pkg.Types.PackageSpec, String}, ::Pair{String, String}, ::Vararg{Pair{String}})
#15 96.24 @ Base ./env.jl:172
#15 96.25 [4] (::Pkg.Operations.var"#107#112"{String, Bool, Bool, Bool, Pkg.Operations.var"#66#73"{Bool, Pkg.Types.Context, String, Pkg.Types.PackageSpec, String}, Pkg.Types.PackageSpec})()
#15 96.25 @ Pkg.Operations /opt/julia/share/julia/stdlib/v1.8/Pkg/src/Operations.jl:1619
#15 96.25 [5] with_temp_env(fn::Pkg.Operations.var"#107#112"{String, Bool, Bool, Bool, Pkg.Operations.var"#66#73"{Bool, Pkg.Types.Context, String, Pkg.Types.PackageSpec, String}, Pkg.Types.PackageSpec}, temp_env::String)
#15 96.25 @ Pkg.Operations /opt/julia/share/julia/stdlib/v1.8/Pkg/src/Operations.jl:1493
#15 96.25 [6] (::Pkg.Operations.var"#105#110"{Dict{String, Any}, Bool, Bool, Bool, Pkg.Operations.var"#66#73"{Bool, Pkg.Types.Context, String, Pkg.Types.PackageSpec, String}, Pkg.Types.Context, Pkg.Types.PackageSpec, String, Pkg.Types.Project, String})(tmp::String)
#15 96.25 @ Pkg.Operations /opt/julia/share/julia/stdlib/v1.8/Pkg/src/Operations.jl:1582
#15 96.25 [7] mktempdir(fn::Pkg.Operations.var"#105#110"{Dict{String, Any}, Bool, Bool, Bool, Pkg.Operations.var"#66#73"{Bool, Pkg.Types.Context, String, Pkg.Types.PackageSpec, String}, Pkg.Types.Context, Pkg.Types.PackageSpec, String, Pkg.Types.Project, String}, parent::String; prefix::String)
#15 96.26 @ Base.Filesystem ./file.jl:764
#15 96.26 [8] mktempdir(fn::Function, parent::String) (repeats 2 times)
#15 96.26 @ Base.Filesystem ./file.jl:760
#15 96.26 [9] sandbox(fn::Function, ctx::Pkg.Types.Context, target::Pkg.Types.PackageSpec, target_path::String, sandbox_path::String, sandbox_project_override::Pkg.Types.Project; preferences::Dict{String, Any}, force_latest_compatible_version::Bool, allow_earlier_backwards_compatible_versions::Bool, allow_reresolve::Bool)
#15 96.27 @ Pkg.Operations /opt/julia/share/julia/stdlib/v1.8/Pkg/src/Operations.jl:1540
#15 96.27 [10] build_versions(ctx::Pkg.Types.Context, uuids::Set{Base.UUID}; verbose::Bool)
#15 96.27 @ Pkg.Operations /opt/julia/share/julia/stdlib/v1.8/Pkg/src/Operations.jl:1041
#15 96.27 [11] build_versions
#15 96.27 @ /opt/julia/share/julia/stdlib/v1.8/Pkg/src/Operations.jl:956 [inlined]
#15 96.27 [12] add(ctx::Pkg.Types.Context, pkgs::Vector{Pkg.Types.PackageSpec}, new_git::Set{Base.UUID}; preserve::Pkg.Types.PreserveLevel, platform::Base.BinaryPlatforms.Platform)
#15 96.28 @ Pkg.Operations /opt/julia/share/julia/stdlib/v1.8/Pkg/src/Operations.jl:1286
#15 96.29 [13] add(ctx::Pkg.Types.Context, pkgs::Vector{Pkg.Types.PackageSpec}; preserve::Pkg.Types.PreserveLevel, platform::Base.BinaryPlatforms.Platform, kwargs::Base.Pairs{Symbol, Base.PipeEndpoint, Tuple{Symbol}, NamedTuple{(:io,), Tuple{Base.PipeEndpoint}}})
#15 96.58 @ Pkg.API /opt/julia/share/julia/stdlib/v1.8/Pkg/src/API.jl:275
#15 96.59 [14] add(pkgs::Vector{Pkg.Types.PackageSpec}; io::Base.PipeEndpoint, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
#15 96.74 @ Pkg.API /opt/julia/share/julia/stdlib/v1.8/Pkg/src/API.jl:156
#15 96.74 [15] add(pkgs::Vector{Pkg.Types.PackageSpec})
#15 96.75 @ Pkg.API /opt/julia/share/julia/stdlib/v1.8/Pkg/src/API.jl:145
#15 96.75 [16] #add#27
#15 96.75 @ /opt/julia/share/julia/stdlib/v1.8/Pkg/src/API.jl:144 [inlined]
#15 96.75 [17] add
#15 96.75 @ /opt/julia/share/julia/stdlib/v1.8/Pkg/src/API.jl:144 [inlined]
#15 96.75 [18] #add#26
#15 96.75 @ /opt/julia/share/julia/stdlib/v1.8/Pkg/src/API.jl:143 [inlined]
#15 96.75 [19] add(pkg::String)
#15 96.75 @ Pkg.API /opt/julia/share/julia/stdlib/v1.8/Pkg/src/API.jl:143
#15 96.75 [20] top-level scope
#15 96.75 @ /usr/local/lib/python3.10/site-packages/julia/install.jl:118
#15 96.75 in expression starting at /usr/local/lib/python3.10/site-packages/julia/install.jl:73
#15 96.81 Traceback (most recent call last):
#15 96.81 File "", line 1, in
#15 96.81 File "/pysr/pysr/julia_helpers.py", line 79, in install
#15 96.82 julia.install(quiet=quiet)
#15 96.82 File "/usr/local/lib/python3.10/site-packages/julia/tools.py", line 118, in install
#15 96.82 raise PyCallInstallError("Installing", output)
#15 96.82 julia.tools.PyCallInstallError: Installing PyCall failed.
#15 96.82
#15 96.82 ** Important information from Julia may be printed before Python's Traceback **
#15 96.82
#15 96.82 Some useful information may also be stored in the build log file
#15 96.82 `~/.julia/packages/PyCall/*/deps/build.log`.
```
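A log like this is easier to scan once the buildx step prefixes are stripped. The following is only a rough sketch: the helper names `strip_prefix` and `summarize` are made up for illustration, and the prefix pattern (`#<step> <seconds>`) is an assumption inferred from the lines above rather than a documented format. It pulls out the signal line and the first file Julia reports it was evaluating.

```python
import re

# Assumed prefix shape, inferred from the log above: "#15 94.97 ".
PREFIX = re.compile(r"^#\d+ \d+\.\d+ ?")

EXPR_MARKER = "in expression starting at "


def strip_prefix(line: str) -> str:
    """Remove a '#15 94.97 ' style prefix if the line has one."""
    return PREFIX.sub("", line.rstrip("\n"))


def summarize(log_text: str) -> dict:
    """Return the first signal line and the first 'in expression' location."""
    summary = {"signal": None, "expression": None}
    for raw in log_text.splitlines():
        line = strip_prefix(raw)
        if summary["signal"] is None and line.startswith("signal ("):
            summary["signal"] = line
        elif summary["expression"] is None and line.startswith(EXPR_MARKER):
            summary["expression"] = line[len(EXPR_MARKER):]
    return summary


if __name__ == "__main__":
    sample = (
        "#15 94.97 signal (11): Segmentation fault\n"
        "#15 94.97 in expression starting at "
        "/root/.julia/packages/Conda/x2UxR/deps/build.jl:106\n"
    )
    print(summarize(sample))
```

Only the first match of each kind is kept, so the later `install.jl:73` location from the PyJulia wrapper does not overwrite the `build.jl:106` location where the crash was reported.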

Here's the job result, the dockerfile, and the action file. This same error occurs every time I run the job.

The line it's getting a segfault on in build.jl: https://github.com/JuliaPy/Conda.jl/blob/8f7133206f3efb6308dff5a2b09393d10e6cc122/deps/build.jl#L106

Any idea what this is? @mkitti would you happen to know?

MilesCranmer commented 1 year ago

This is the C code where it crashes:

        size_t world = jl_atomic_load_acquire(&jl_world_counter);
        ct->world_age = world;
        if (!has_defs && jl_get_module_infer(m) != 0) {
            (void)jl_type_infer(mfunc, world, 0);
        }
        result = jl_invoke(/*func*/NULL, /*args*/NULL, /*nargs*/0, mfunc); // crashes
        ct->world_age = last_age;

https://github.com/JuliaLang/julia/blob/36034abf26062acad4af9dcec7c4fc53b260dbb4/src/toplevel.c#L897

MilesCranmer commented 1 year ago

The last PR to change this line where it segfaulted was https://github.com/JuliaLang/julia/pull/31984. @vtjnash @JeffBezanson any advice for how I could debug this? Or is this line unrelated?

vtjnash commented 1 year ago

We are trying to call into the JIT there, so perhaps LLVM is computing the jump address incorrectly? The stack trace is not precise enough to show which value it crashed on. LLVM is planning some fixes for that for AArch64 in JITLink in the upcoming release, though.

MilesCranmer commented 1 year ago

Thanks. Should I raise an issue on the main Julia repo or LLVM?

Here's a minimal dockerfile which gives the same error:

FROM julia:1.8.2
RUN julia -e 'using Pkg; Pkg.add("Conda"); Pkg.build("Conda")'

Another interesting clue is that I can actually build this just fine on my ARM-based laptop (M1). It's only when I try to build the arm64 architecture from an amd64 system (i.e., through docker/QEMU) that this error comes up. Does that offer any insight?

To reproduce this, you can build the image locally on an x86_64 system using `docker build --platform=linux/arm64 -t test .`

Alternatively, you can create a GitHub action. First, create a Dockerfile in the root directory containing the above. Then, create a workflow file:

name: Docker test
on:
  push:
    branches:
      - "**"
jobs:
  docker:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        arch: [linux/amd64, linux/arm64]
    steps:
      - name: Checkout
        uses: actions/checkout@v3
      - name: Set up QEMU
        uses: docker/setup-qemu-action@v2
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2
      - name: Build and push
        uses: docker/build-push-action@v3
        with:
          context: .
          platforms: ${{ matrix.arch }}
          push: false

(This could be combined with https://github.com/csexton/debugger-action to interact with it after failure.)
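For the local route, the amd64 and arm64 builds can be run back to back to confirm that only the emulated one fails. This is a minimal sketch, not part of the original report: `build_command` and `run_builds` are hypothetical helper names, the `docker build` flags are the ones from the command quoted in this comment, and it assumes Docker is installed with QEMU emulation for arm64 already registered on the host.

```python
import subprocess

# Same two platforms as the workflow matrix above.
PLATFORMS = ["linux/amd64", "linux/arm64"]


def build_command(platform: str, tag: str = "test", context: str = ".") -> list:
    """Build the argv for the `docker build` invocation quoted above."""
    return ["docker", "build", f"--platform={platform}", "-t", tag, context]


def run_builds(platforms=PLATFORMS, runner=subprocess.run) -> dict:
    """Run one build per platform and map each platform to its exit code."""
    results = {}
    for platform in platforms:
        completed = runner(build_command(platform))
        results[platform] = completed.returncode
    return results


if __name__ == "__main__":
    # Run from the directory that contains the minimal Dockerfile.
    for platform, code in run_builds().items():
        status = "ok" if code == 0 else f"failed (exit {code})"
        print(f"{platform}: {status}")
```

The `runner` parameter exists only so the loop can be exercised without Docker. Both builds reuse the tag `test`, so the second image replaces the first.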

vtjnash commented 1 year ago

The equivalent issue for M1 was fixed for arm64-darwin in the previous (old) release of LLVM, so that would make sense. You would likely need to get a version of LLVM master working with Julia master before reporting it.

schlichtanders commented 1 year ago

I see the same segfault when simply precompiling the TimeZones package. Same setup: a multi-architecture build from an amd64 host to an arm64 target using QEMU emulation.

@vtjnash can you point to further issues that could help solve this?

vtjnash commented 1 year ago

https://github.com/JuliaLang/julia/pull/45859