JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

Segfault during PackageCompiler.create_app() on v1.5.1 (not v1.5.0) #37288

Closed NHDaly closed 4 years ago

NHDaly commented 4 years ago

Our static compilation build started segfaulting consistently after we upgraded to Julia v1.5.1; it did not segfault on v1.5.0.

signal (11): Segmentation fault
in expression starting at none:0
jl_create_native at /buildworker/worker/package_linux64/build/src/aotcompile.cpp:310
jl_precompile at /buildworker/worker/package_linux64/build/src/precompile.c:408
jl_write_compiler_output at /buildworker/worker/package_linux64/build/src/precompile.c:33
jl_atexit_hook at /buildworker/worker/package_linux64/build/src/init.c:218
main at /buildworker/worker/package_linux64/build/ui/repl.c:228
__libc_start_main at /nix/store/xg6ilb9g9zhi2zg1dpi4zcp288rhnvns-glibc-2.30/lib/libc.so.6 (unknown line)
_start at /nix/store/10svy949354fyfpzzphiyv26v015cwwr-julia-1.5.1/bin/julia (unknown line)
Allocations: 2630782722 (Pool: 2596317092; Big: 34465630); GC: 1202
ERROR: LoadError: failed process: Process(`/nix/store/10svy949354fyfpzzphiyv26v015cwwr-julia-1.5.1/bin/julia --color=yes --startup-file=no '--cpu-target=generic;sandybridge,-xsaveopt,clone_all;haswell,-rdrnd,base(1)' --sysimage=/tmp/nix-build-delve-binary-linux.drv-0/jl_POqxB7/tmp_sys.so --project=/nix/store/34m8425v31562jzff7z23mplk6zzpk7y-delve-binary-linux/delve/ --output-o=/tmp/nix-build-delve-binary-linux.drv-0/jl_P6YXu8.o -e 'Base.reinit_stdio()

This is new, and wasn't happening on v1.5.0. It's consistently reproducible on macOS and Linux.

Here's a failure from my mac; looks similar:

[ Info: PackageCompiler: creating system image object file, this might take a while...

signal (11): Segmentation fault: 11
in expression starting at none:0
jl_create_native at /Users/julia/buildbot/worker/package_macos64/build/src/aotcompile.cpp:310
jl_precompile at /Users/julia/buildbot/worker/package_macos64/build/src/precompile.c:408
jl_write_compiler_output at /Users/julia/buildbot/worker/package_macos64/build/src/precompile.c:33
jl_atexit_hook at /Users/julia/buildbot/worker/package_macos64/build/src/init.c:218
main at /Applications/Julia-1.5.app/Contents/Resources/julia/bin/julia (unknown line)
Allocations: 2679683979 (Pool: 2645060366; Big: 34623613); GC: 1036

We'll work on trying to get an rr recording to share.

KristofferC commented 4 years ago

Bisecting should be fairly quick since there are so few commits that differ between the releases.

NHDaly commented 4 years ago

Great idea, @KristofferC, thanks

NHDaly commented 4 years ago

Update: it turns out it does segfault on v1.5.0 on my MacBook as well. :( It seemed that 1.5.0 passed on the build farm (Linux), but I guess I don't know how meaningful that is if it's failing on macOS. I'll try a bisect from 1.4.0 to 1.5.0 tonight.

Also, we tried recording with rr, but the rr replay of PackageCompiler fails. From @rbvermaa:

I had issues with temp files that a replay tried to read that are gone.

Maybe we can make replay work with some hacks to write them to a fixed location. In general, is this a known failure mode for using rr with PackageCompiler?

NHDaly commented 4 years ago

Ugh. I wanted to bisect from 1.4.0 to 1.5.0, but I forgot that our project didn't build on 1.4.0:

└ @ PackageCompiler ~/.julia/packages/PackageCompiler/vsMJE/src/PackageCompiler.jl:516
ERROR: LoadError: MethodError: no method matching source_path(::Pkg.Types.Context, ::Pkg.Types.PackageSpec)
Stacktrace:
 [1] source_path(::Pkg.Types.Context, ::Pkg.Types.PackageSpec) at /Users/daly/.julia/packages/PackageCompiler/vsMJE/src/PackageCompiler.jl:49
 [2] audit_app(::Pkg.Types.Context) at /Users/daly/.julia/packages/PackageCompiler/vsMJE/src/PackageCompiler.jl:520
 [3] create_app(::String, ::String; app_name::String, precompile_execution_file::String, precompile_statements_file::Array{String,1}, incremental::Bool, filter_stdlibs::Bool, audit::Bool, force::Bool, c_driver_program::String, cpu_target::String) at /Users/daly/.julia/packages/PackageCompiler/vsMJE/src/PackageCompiler.jl:612

It builds fine on 1.4.2, but I can't bisect from 1.4.2 to 1.5.0 because they're on different branches. Does this error look familiar to you? Do you remember if there's a commit I can cherry-pick for it? Thanks, and sorry this is harder to debug than I'd like.

Sacha0 commented 4 years ago

Perhaps you can match the last backport commit for 1.4.2 to the corresponding commit on 1.5-dev, and then bisect from that commit on 1.5-dev to 1.5.0?

rbvermaa commented 4 years ago

Also, we tried recording with rr, but the rr replay of PackageCompiler fails. From @rbvermaa:

I had issues with temp files that a replay tried to read that are gone.

Maybe we can make replay work with some hacks to write them to a fixed location. In general, is this a known failure mode for using rr with PackageCompiler?

I wonder if I did something wrong initially; I am now able to replay a newly created rr recording.

KristofferC commented 4 years ago

I don't really see how temp files should matter. Isn't the point of rr that it records those things?

rbvermaa commented 4 years ago

I don't really see how temp files should matter. Isn't the point of rr that it records those things?

@KristofferC Indeed, that was my understanding as well. The most likely scenario is that I made some mistake, given I was able to replay now successfully without the issue I had before. Will upload the rr recording as soon as I figure out the best way to share it.

NHDaly commented 4 years ago

@KristofferC - we've emailed you all the rr trace. Sorry we can't upload it here, since it contains sensitive info.

We'd super appreciate it if someone could take a look! ❤️

KristofferC commented 4 years ago

Since this is likely something in the guts of the compiler and not really related to PackageCompiler itself, @JeffBezanson and I thought it would probably be more time-efficient for him to look into it first and see if he can find something obvious.

NHDaly commented 4 years ago

Thanks @JeffBezanson for the quick fix! :)

For anyone following along on the internet, Jeff sent us this patch, which indeed fixed the segfault:

--- a/src/aotcompile.cpp
+++ b/src/aotcompile.cpp
@@ -307,9 +307,11 @@ void *jl_create_native(jl_array_t *methods, const jl_cgparams_t cgparams, int _p
                 }
                 if (src == NULL || !jl_is_code_info(src)) {
                     src = jl_type_infer(mi, params.world, 0);
-                    codeinst = jl_get_method_inferred(mi, src->rettype, src->min_world, src->max_world);
-                    if (src->inferred && !codeinst->inferred)
-                        codeinst->inferred = jl_nothing;
+                    if (src) {
+                        codeinst = jl_get_method_inferred(mi, src->rettype, src->min_world, src->max_world);
+                        if (src->inferred && !codeinst->inferred)
+                            codeinst->inferred = jl_nothing;
+                    }
                 }

Thanks.
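
For anyone unfamiliar with the pattern in the patch above, here is a minimal, self-contained sketch of the same kind of guard: only dereference the result of a step that may fail once it has been checked for null. Everything in it is hypothetical and for illustration only. `CodeInfo`, `CodeInstance`, `infer`, and `fill_code_instance` are stand-ins invented here, not the real Julia runtime types or functions, and the sketch assumes (as the patch suggests, but the thread does not state outright) that the crash came from using `src` when inference returned NULL.

```cpp
#include <cassert>

// Hypothetical stand-in for the inferred source object (not the real jl_code_info_t).
struct CodeInfo {
    int rettype;
    bool inferred;
};

// Hypothetical stand-in for the cached code instance (not the real jl_code_instance_t).
struct CodeInstance {
    int rettype;
    bool inferred_marked;
};

// Hypothetical inference step that may fail and return nullptr.
CodeInfo *infer(bool succeed) {
    static CodeInfo info{42, true};
    return succeed ? &info : nullptr;
}

// Returns true if `out` was filled in, false if inference produced nothing.
bool fill_code_instance(bool inference_succeeds, CodeInstance &out) {
    CodeInfo *src = infer(inference_succeeds);
    if (src) {
        // The guard: src is only dereferenced when inference returned a result.
        out.rettype = src->rettype;
        out.inferred_marked = src->inferred;
        return true;
    }
    // Inference failed: leave `out` untouched instead of dereferencing a null pointer.
    return false;
}
```

Calling `fill_code_instance(false, ci)` leaves `ci` unchanged and returns `false`; without the `if (src)` check, the same call would dereference a null pointer, which is undefined behavior and typically shows up as a segfault like the one reported here.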

Sacha0 commented 4 years ago

(Ref. #37386 for inclusion of this patch in 1.5.2.)

Sacha0 commented 4 years ago

(Ref. #37406 for an equivalent patch for 1.6-dev.)

Sacha0 commented 4 years ago

Resolved on both 1.5.2 and 1.6-dev via the pull requests linked just above. Perhaps close? :)