Open MilesCranmer opened 1 year ago
I will also add that I have been running the test suite with --check-bounds=yes, and there do not appear to be any detected out-of-bounds errors.
It sounds like you likely have a data race in the code. You either need to disable threading, or perhaps try ThreadSanitizer (on Linux) to see if it can catch it.
I never see an issue on Linux though, only Windows. Is there a tool to find data races on Windows? (Or maybe to see if the Linux run also experiences the data race, but simply does not segfault over it?)
I can't immediately think of anywhere there could be a race condition in the current code, but I will have a closer look. For the most part:
I suppose maybe there could be an issue of objects not being copied when passed to a thread, and so perhaps the head worker tries to access the same object before fetching...?
Is there any other thing you could think of other than a data race?
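To make the "objects not being copied when passed to a thread" concern concrete, here is a minimal sketch of the defensive pattern: copy on the spawning task, then hand each worker its own copy. The `Member` type and `evolve!` function are hypothetical stand-ins for illustration, not names from SymbolicRegression.jl.

```julia
# Hypothetical stand-in for an object handed to worker tasks.
mutable struct Member
    score::Float64
end

# Hypothetical worker that mutates its argument in place.
function evolve!(m::Member, steps::Int)
    for _ in 1:steps
        m.score += 1.0
    end
    return m
end

function run_copies(shared::Member, ntasks::Int, steps::Int)
    tasks = map(1:ntasks) do _
        # Copy before spawning, so no two tasks ever write to the same object.
        local_copy = deepcopy(shared)
        Threads.@spawn evolve!(local_copy, steps)
    end
    # The head worker only touches results after fetching them.
    return fetch.(tasks)
end

shared = Member(0.0)
results = run_copies(shared, 4, 1000)
```

If the workers received `shared` directly instead of a copy, every task would write to the same `score` field without synchronization, which is the kind of access ThreadSanitizer flags.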
Edit: I found what look like a couple chances for data races: https://github.com/MilesCranmer/SymbolicRegression.jl/commit/538c402d9315c7ef29c049a1211403f2ab4c7c22. Let's see if that helps!
Edit 2: Nope, still getting segfaults even after that fix: https://github.com/MilesCranmer/SymbolicRegression.jl/actions/runs/3759279230/jobs/6388650870#step:7:880.
Are there any binaries with ThreadSanitizer built-in? I'm building from source and it's taking quite a while compared to a normal build... nearly 24 hours building now.
@vtjnash I seem to be unable to build Julia with thread sanitizer. Would you happen to have any advice for using it? I can build with address sanitizer just fine (following this), but with thread sanitizer I encounter various problems. That page has much more detailed instructions for address sanitizer, so presumably I am missing some flags that are not mentioned.
gives me the following error:
...
LINK src/flisp/libflisp-debug.a
LINK src/flisp/flisp-debug
LINK usr/lib/libjulia-internal-debug.so.1.8
LINK usr/lib/libjulia-internal-debug.so.1
LINK usr/lib/libjulia-internal-debug.so
1 warning generated.
LINK usr/lib/libjulia-codegen-debug.so.1.8
LINK usr/lib/libjulia-codegen-debug.so.1
LINK usr/lib/libjulia-codegen-debug.so
JULIA usr/lib/julia/corecompiler.ji
/bin/sh: line 1: 112149 Segmentation fault (core dumped) /dev/shm/thread_sanitizer_v3/usr/bin/julia-debug -C "native" --output-ji /dev/shm/thread_sanitizer_v3/usr/lib/julia/corecompiler.ji.tmp --startup-file=no --warn-overwrite=yes -g0 -O0 compiler/compiler.jl
make[1]: *** [sysimage.mk:61: /dev/shm/thread_sanitizer_v3/usr/lib/julia/corecompiler.ji] Error 139
make: *** [Makefile:82: julia-sysimg-ji] Error 2
I also tried the following alternative env variables, with the same segfault:
JULIA_PRECOMPILE=1
override WITH_GC_DEBUG_ENV=1
override JULIA_BUILD_MODE=debug
export LBT_USE_RTLD_DEEPBIND=0
Each of these I commented out or left as is. Same error. I ran make cleanall each time and built from scratch.
This takes over 24 hours to complete a single build, with the same combinations as above. It gets a bit further, but in the end it segfaults while building sys.jl.
If I follow the tutorial on https://docs.julialang.org/en/v1/devdocs/sanitizers/#Example-setup exactly, for ASAN, I can actually build everything. Only when I attempt to build with TSAN do I get an error. Are there any flags not described in the docs which I am missing? Thanks.
Is it possible to build the current version of Julia with thread sanitizer? Here's a docker container which gives a reproducible segfault during the build:
FROM ubuntu:20.04
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    cmake \
    git \
    wget \
    ca-certificates \
    curl \
    gpg-agent \
    software-properties-common \
    python3 \
    python3-dev \
    tar \
    xz-utils \
    gfortran
RUN wget https://apt.llvm.org/llvm.sh && chmod +x llvm.sh && ./llvm.sh 13
WORKDIR /toolchain
ENV TOOLCHAIN_WORKTREE=/toolchain
ARG JLVERSION=1.8.3
ARG PROCS=4
RUN git clone https://github.com/JuliaLang/julia ${TOOLCHAIN_WORKTREE} && \
    cd ${TOOLCHAIN_WORKTREE} && \
    git checkout v${JLVERSION}
# Build the toolchain
RUN echo "USE_BINARYBUILDER_LLVM=1" > ${TOOLCHAIN_WORKTREE}/Make.user && \
    echo "BUILD_LLVM_CLANG=1" >> ${TOOLCHAIN_WORKTREE}/Make.user
RUN cd ${TOOLCHAIN_WORKTREE} && make -j ${PROCS} -C deps install-llvm install-clang install-llvm-tools
WORKDIR /julia
ENV BUILDDIR=/julia
RUN git clone https://github.com/JuliaLang/julia ${BUILDDIR} && \
    cd ${BUILDDIR} && \
    git checkout v${JLVERSION}
# Put the above commands into /julia/Make.user:
RUN echo "USECLANG=1" > ${BUILDDIR}/Make.user && \
    echo "TOOLCHAIN_WORKTREE=/toolchain" >> ${BUILDDIR}/Make.user && \
    echo "TOOLCHAIN=\$(TOOLCHAIN_WORKTREE)/usr/tools" >> ${BUILDDIR}/Make.user && \
    echo "override CC=\$(TOOLCHAIN)/clang" >> ${BUILDDIR}/Make.user && \
    echo "override CXX=\$(TOOLCHAIN)/clang++" >> ${BUILDDIR}/Make.user && \
    echo "export ASAN_SYMBOLIZER_PATH=\$(TOOLCHAIN)/llvm-symbolizer" >> ${BUILDDIR}/Make.user && \
    echo "USE_BINARYBUILDER_LLVM=1" >> ${BUILDDIR}/Make.user && \
    echo "override SANITIZE=1" >> ${BUILDDIR}/Make.user && \
    echo "override SANITIZE_THREAD=1" >> ${BUILDDIR}/Make.user && \
    echo "override JULIA_BUILD_MODE=debug" >> ${BUILDDIR}/Make.user && \
    echo "JULIA_PRECOMPILE=1" >> ${BUILDDIR}/Make.user && \
    echo "export LBT_USE_RTLD_DEEPBIND=0" >> ${BUILDDIR}/Make.user
# Build:
RUN make -j ${PROCS} debug
You can run with, e.g., docker build --build-arg JLVERSION=1.8.3 --build-arg PROCS=8 -t sanitizedjulia . and you will see the same segfault I am currently encountering:
1 warning generated.
LINK usr/lib/libjulia-codegen-debug.so.1.8
LINK usr/lib/libjulia-codegen-debug.so.1
LINK usr/lib/libjulia-codegen-debug.so
JULIA usr/lib/julia/corecompiler.ji
Segmentation fault (core dumped)
make[1]: *** [sysimage.mk:61: /julia/usr/lib/julia/corecompiler.ji] Error 139
make: *** [Makefile:82: julia-sysimg-ji] Error 2
The command '/bin/sh -c make -j ${PROCS} debug' returned a non-zero code: 2
Okay, I finally got it working with TSAN after a couple of weeks of trying to build it.
However, TSAN does not raise a single warning when running my code. So it seems there are no data races after all.
Do you have any other tips for trying to debug this?
Sorry, I take that back. I was running Julia with 1 thread!
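A race detector can only observe races when more than one thread is actually running, so a small guard at the top of the test script helps avoid this mistake. The following is a minimal sketch; `multithreaded` is a hypothetical helper name, and the thread count of 4 in the message is an arbitrary example.

```julia
# Hypothetical helper: true when at least two threads are available.
multithreaded(n::Integer = Threads.nthreads()) = n > 1

if !multithreaded()
    # Threads.nthreads() reflects the --threads flag or JULIA_NUM_THREADS.
    @warn "Julia is running with a single thread; start it with `julia --threads=4` (or set JULIA_NUM_THREADS) before looking for data races."
end
```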
Looks like there are indeed some data races. Here are the outputs from a run of SymbolicRegression.EquationSearch, the main parallel loop. Any advice on how I should interpret this information?
WARNING: ThreadSanitizer: data race (pid=75535)
Read of size 8 at 0x000146850e78 by thread T110:
#0 <null> <null> (0x000390e65198)
#1 <null> <null> (0x000390eac9a0)
#2 <null> <null> (0x000390eb8250)
#3 <null> <null> (0x000390ec40ec)
#4 <null> <null> (0x000390ec4200)
#5 _jl_invoke gf.c:2358 (libjulia-internal-debug.1.8.dylib:arm64+0x47bfc) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#6 ijl_apply_generic gf.c:2559 (libjulia-internal-debug.1.8.dylib:arm64+0x47e24) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#7 <null> <null> (0x00038e6f87f4)
#8 <null> <null> (0x00038e6f8e84)
#9 _jl_invoke gf.c:2358 (libjulia-internal-debug.1.8.dylib:arm64+0x47bfc) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#10 ijl_apply_generic gf.c:2559 (libjulia-internal-debug.1.8.dylib:arm64+0x47e24) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#11 jl_apply julia.h:1843 (libjulia-internal-debug.1.8.dylib:arm64+0x8f37c) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#12 start_task task.c:931 (libjulia-internal-debug.1.8.dylib:arm64+0x92d58) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
Previous write of size 8 at 0x000146850e78 by thread T116:
#0 <null> <null> (0x000387494390)
#1 <null> <null> (0x00038e6f8650)
#2 <null> <null> (0x00038e6f8e84)
#3 _jl_invoke gf.c:2358 (libjulia-internal-debug.1.8.dylib:arm64+0x47bfc) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#4 ijl_apply_generic gf.c:2559 (libjulia-internal-debug.1.8.dylib:arm64+0x47e24) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#5 jl_apply julia.h:1843 (libjulia-internal-debug.1.8.dylib:arm64+0x8f37c) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#6 start_task task.c:931 (libjulia-internal-debug.1.8.dylib:arm64+0x92d58) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
Thread T110 (tid=0, running) created by main thread at:
#0 ijl_new_task task.c:820 (libjulia-internal-debug.1.8.dylib:arm64+0x91b28) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#1 <null> <null> (0x0003876373cc)
#2 <null> <null> (0x0003876002a8)
#3 <null> <null> (0x00038762420c)
#4 <null> <null> (0x00014c4d0138)
#5 <null> <null> (0x00014c4d01fc)
#6 _jl_invoke gf.c:2377 (libjulia-internal-debug.1.8.dylib:arm64+0x47d34) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#7 ijl_apply_generic gf.c:2559 (libjulia-internal-debug.1.8.dylib:arm64+0x47e24) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#8 jl_apply julia.h:1843 (libjulia-internal-debug.1.8.dylib:arm64+0x86a1c) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#9 do_call interpreter.c:126 (libjulia-internal-debug.1.8.dylib:arm64+0x865a4) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#10 eval_value interpreter.c:215 (libjulia-internal-debug.1.8.dylib:arm64+0x83b8c) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#11 eval_stmt_value interpreter.c:166 (libjulia-internal-debug.1.8.dylib:arm64+0x85a54) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#12 eval_body interpreter.c:594 (libjulia-internal-debug.1.8.dylib:arm64+0x81de0) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#13 jl_interpret_toplevel_thunk interpreter.c:750 (libjulia-internal-debug.1.8.dylib:arm64+0x82ebc) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#14 jl_toplevel_eval_flex toplevel.c:906 (libjulia-internal-debug.1.8.dylib:arm64+0xd1034) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#15 jl_toplevel_eval_flex toplevel.c:850 (libjulia-internal-debug.1.8.dylib:arm64+0xd0798) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#16 ijl_toplevel_eval toplevel.c:915 (libjulia-internal-debug.1.8.dylib:arm64+0xd36e4) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#17 ijl_toplevel_eval_in toplevel.c:965 (libjulia-internal-debug.1.8.dylib:arm64+0xd3b48) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#18 <null> <null> (0x00015ba08628)
#19 <null> <null> (0x00015ba14440)
#20 <null> <null> (0x00015ba20100)
#21 <null> <null> (0x00015bbbd5f4)
#22 <null> <null> (0x00015bbc8034)
#23 <null> <null> (0x00015bbc8084)
#24 _jl_invoke gf.c:2377 (libjulia-internal-debug.1.8.dylib:arm64+0x47d34) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#25 ijl_apply_generic gf.c:2559 (libjulia-internal-debug.1.8.dylib:arm64+0x47e24) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#26 <null> <null> (0x0001233ac89c)
#27 <null> <null> (0x0001233ac9b8)
#28 _jl_invoke gf.c:2377 (libjulia-internal-debug.1.8.dylib:arm64+0x47d34) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#29 ijl_apply_generic gf.c:2559 (libjulia-internal-debug.1.8.dylib:arm64+0x47e24) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#30 jl_apply julia.h:1843 (libjulia-internal-debug.1.8.dylib:arm64+0x68118) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#31 jl_f__call_latest builtins.c:774 (libjulia-internal-debug.1.8.dylib:arm64+0x68074) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#32 <null> <null> (0x00012328483c)
#33 <null> <null> (0x000123395878)
#34 <null> <null> (0x0001233a0344)
#35 <null> <null> (0x0001233a04f8)
#36 _jl_invoke gf.c:2377 (libjulia-internal-debug.1.8.dylib:arm64+0x47d34) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#37 ijl_apply_generic gf.c:2559 (libjulia-internal-debug.1.8.dylib:arm64+0x47e24) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#38 jl_apply julia.h:1843 (libjulia-internal-debug.1.8.dylib:arm64+0x137548) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#39 true_main jlapi.c:575 (libjulia-internal-debug.1.8.dylib:arm64+0x139880) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#40 jl_repl_entrypoint jlapi.c:719 (libjulia-internal-debug.1.8.dylib:arm64+0x139674) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#41 jl_load_repl loader_lib.c:471 (libjulia-debug.1.8.dylib:arm64+0x3010) (BuildId: e0d74e9cb12b3345a6bb92ce3ab1dc7732000000200000000100000000000b00)
#42 main loader_exe.c:59 (julia-debug:arm64+0x100003eec) (BuildId: af341df0d9c53779904c6eb05a1a180b32000000200000000100000000000b00)
Thread T116 (tid=0, running) created by main thread at:
#0 ijl_new_task task.c:820 (libjulia-internal-debug.1.8.dylib:arm64+0x91b28) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#1 <null> <null> (0x0003876373cc)
#2 <null> <null> (0x0003876002a8)
#3 <null> <null> (0x00038762420c)
#4 <null> <null> (0x00014c4d0138)
#5 <null> <null> (0x00014c4d01fc)
#6 _jl_invoke gf.c:2377 (libjulia-internal-debug.1.8.dylib:arm64+0x47d34) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#7 ijl_apply_generic gf.c:2559 (libjulia-internal-debug.1.8.dylib:arm64+0x47e24) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#8 jl_apply julia.h:1843 (libjulia-internal-debug.1.8.dylib:arm64+0x86a1c) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#9 do_call interpreter.c:126 (libjulia-internal-debug.1.8.dylib:arm64+0x865a4) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#10 eval_value interpreter.c:215 (libjulia-internal-debug.1.8.dylib:arm64+0x83b8c) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#11 eval_stmt_value interpreter.c:166 (libjulia-internal-debug.1.8.dylib:arm64+0x85a54) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#12 eval_body interpreter.c:594 (libjulia-internal-debug.1.8.dylib:arm64+0x81de0) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#13 jl_interpret_toplevel_thunk interpreter.c:750 (libjulia-internal-debug.1.8.dylib:arm64+0x82ebc) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#14 jl_toplevel_eval_flex toplevel.c:906 (libjulia-internal-debug.1.8.dylib:arm64+0xd1034) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#15 jl_toplevel_eval_flex toplevel.c:850 (libjulia-internal-debug.1.8.dylib:arm64+0xd0798) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#16 ijl_toplevel_eval toplevel.c:915 (libjulia-internal-debug.1.8.dylib:arm64+0xd36e4) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#17 ijl_toplevel_eval_in toplevel.c:965 (libjulia-internal-debug.1.8.dylib:arm64+0xd3b48) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#18 <null> <null> (0x00015ba08628)
#19 <null> <null> (0x00015ba14440)
#20 <null> <null> (0x00015ba20100)
#21 <null> <null> (0x00015bbbd5f4)
#22 <null> <null> (0x00015bbc8034)
#23 <null> <null> (0x00015bbc8084)
#24 _jl_invoke gf.c:2377 (libjulia-internal-debug.1.8.dylib:arm64+0x47d34) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#25 ijl_apply_generic gf.c:2559 (libjulia-internal-debug.1.8.dylib:arm64+0x47e24) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#26 <null> <null> (0x0001233ac89c)
#27 <null> <null> (0x0001233ac9b8)
#28 _jl_invoke gf.c:2377 (libjulia-internal-debug.1.8.dylib:arm64+0x47d34) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#29 ijl_apply_generic gf.c:2559 (libjulia-internal-debug.1.8.dylib:arm64+0x47e24) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#30 jl_apply julia.h:1843 (libjulia-internal-debug.1.8.dylib:arm64+0x68118) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#31 jl_f__call_latest builtins.c:774 (libjulia-internal-debug.1.8.dylib:arm64+0x68074) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#32 <null> <null> (0x00012328483c)
#33 <null> <null> (0x000123395878)
#34 <null> <null> (0x0001233a0344)
#35 <null> <null> (0x0001233a04f8)
#36 _jl_invoke gf.c:2377 (libjulia-internal-debug.1.8.dylib:arm64+0x47d34) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#37 ijl_apply_generic gf.c:2559 (libjulia-internal-debug.1.8.dylib:arm64+0x47e24) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#38 jl_apply julia.h:1843 (libjulia-internal-debug.1.8.dylib:arm64+0x137548) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#39 true_main jlapi.c:575 (libjulia-internal-debug.1.8.dylib:arm64+0x139880) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#40 jl_repl_entrypoint jlapi.c:719 (libjulia-internal-debug.1.8.dylib:arm64+0x139674) (BuildId: df06725d5d78354dbf18495db0682ac632000000200000000100000000000b00)
#41 jl_load_repl loader_lib.c:471 (libjulia-debug.1.8.dylib:arm64+0x3010) (BuildId: e0d74e9cb12b3345a6bb92ce3ab1dc7732000000200000000100000000000b00)
#42 main loader_exe.c:59 (julia-debug:arm64+0x100003eec) (BuildId: af341df0d9c53779904c6eb05a1a180b32000000200000000100000000000b00)
SUMMARY: ThreadSanitizer: data race (<unknown module>)
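The report above says that one task (T110) read an 8-byte location that another task (T116) had previously written, with no synchronization between the two accesses. As a minimal sketch of that access pattern and one way to make it safe, the example below guards a shared field with a `ReentrantLock`. The `Counter` type and function names are hypothetical, and this is not claimed to be the actual race in SymbolicRegression.jl, since the Julia frames in the trace are unsymbolized.

```julia
# Hypothetical shared state, for illustration only.
mutable struct Counter
    value::Int
end

function increment_locked!(c::Counter, lk::ReentrantLock, n::Int)
    for _ in 1:n
        # Without the lock, this read-modify-write would race across tasks.
        lock(lk) do
            c.value += 1
        end
    end
    return nothing
end

function run_locked(ntasks::Int, n::Int)
    c = Counter(0)
    lk = ReentrantLock()
    # @sync waits for every spawned task before the result is read.
    @sync for _ in 1:ntasks
        Threads.@spawn increment_locked!(c, lk, n)
    end
    return c.value
end

total = run_locked(4, 10_000)
```

Removing the `lock` call reproduces the unsynchronized read and write pair that TSAN describes; whether a lock, an atomic, or a per-task copy is the right fix depends on how the shared object is actually used.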
I have been seeing segfaults on my Windows CI of SymbolicRegression.jl for maybe ~6 months now, and I am finally throwing in the towel and submitting a bug report.
What seems to be happening is that the test suite segfaults at a random point during these integration test sets, which run after the unit tests. The integration tests use multiprocessing, multithreading, and various other compute options, but they are not too strenuous, and the Ubuntu and macOS tests always seem to pass fine.
I cannot reproduce these segfaults on a local copy of Windows; I only see them on GitHub Actions' windows-latest machines. I usually see them on Julia 1.6.7 and Julia 1.7.3, although I have seen them on Julia 1.8.2 as well (but less frequently). If you have any recommendations for how I can get better traces of these segfaults, I would love to hear them. I know of the rr option on Linux, but it seems like there is no good equivalent for Windows.
Essentially, the Windows tests will randomly segfault partway through the integration tests. Here are a few examples:
windows-latest, Julia 1.6.7, commit 367d155. Segfaults at this test (multi-threading; with a few different search settings) - https://github.com/MilesCranmer/SymbolicRegression.jl/blob/367d155f26c5a7f0faf26bf529b95f097f1f7f22/test/test_mixed.jl#L39.
windows-latest, Julia 1.7.3, commit 367d155. Segfaults at the same test, but a little later on.
At this commit, the test passes for Julia 1.8.2. All other operating systems pass.
windows-latest, Julia 1.6.7, commit 81f9544. Same error as above.
windows-latest, Julia 1.7.3, commit 81f9544. This one lasts longer than before. I think that this one segfaults here, which is the suite after the above test.
Any help would be much appreciated.
These may or may not be related to these segfaults in the PyJulia frontend: https://github.com/MilesCranmer/PySR/issues/238.