DilumAluthge closed this issue 1 year ago.
Is it possible to find a few other builds where this might have happened? It's hard to tell which test it is happening in too.
@Gnimuc Might this be due to some of our recent `ccall` changes where we are generating Clang wrappers?
I'm not sure. This commit might be related: https://github.com/JuliaLang/SuiteSparse.jl/commit/bb068bbd9848a47e6527336231ded17a9cd1951d. But it should be fixed.
Is SuiteSparse.jl the only user of `detect_ambiguities`? I guess not. To me, this looks like `detect_ambiguities` itself is broken.
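For reference, a minimal sketch of how `detect_ambiguities` is typically used in a package test suite (the `recursive` keyword is optional; the module name here is just the one under discussion):

```julia
# Typical usage of Test.detect_ambiguities in a package's runtests.jl:
# collect all pairs of ambiguous methods defined by the module and assert
# that there are none.
using Test, SuiteSparse

ambiguities = detect_ambiguities(SuiteSparse; recursive = true)
@test isempty(ambiguities)   # fails if any ambiguous method pairs exist
```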
It looks like the SuiteSparse.jl stdlib was last bumped on August 24: https://github.com/JuliaLang/julia/commits/master/stdlib/SuiteSparse.version
So why are we just starting to see these failures now?
@Gnimuc Where are you seeing the issue with `detect_ambiguities`?
caused by: failed process: ... ambiguous compiler/inference compiler/validation compiler/ssair compiler/irpasses compiler/codegen compiler/inline compiler/contextual subarray strings/basic strings/search strings/util strings/io strings/types unicode/utf8 core worlds atomics keywordargs numbers subtype char triplequote intrinsics dict hashing iobuffer staged offsetarray arrayops tuple reduce reducedim abstractarray intfuncs simdloop vecelement rational bitarray copy math fastmath functional iterators operators ordering path ccall parse loading gmp sorting spawn backtrace exceptions file read version namedtuple mpfr broadcast complex floatapprox reflection regex float16 combinatorics sysinfo env rounding ranges mod2pi euler show client errorshow sets goto llvmcall llvmcall2 ryu some meta stacktraces docs misc threads stress binaryplatforms atexit enums cmdlineargs int interpreter checked bitset floatfuncs precompile boundscheck error cartesian osutils channels iostream secretbuffer specificity reinterpretarray syntax corelogging missing asyncmap smallarrayshrink opaque_closure filesystem download SparseArrays/higherorderfns SparseArrays/sparse SparseArrays/sparsevector LinearAlgebra/triangular LinearAlgebra/qr LinearAlgebra/dense LinearAlgebra/matmul LinearAlgebra/schur LinearAlgebra/special LinearAlgebra/eigen LinearAlgebra/bunchkaufman LinearAlgebra/svd LinearAlgebra/lapack LinearAlgebra/tridiag LinearAlgebra/bidiag LinearAlgebra/diagonal LinearAlgebra/cholesky LinearAlgebra/lu LinearAlgebra/symmetric LinearAlgebra/generic LinearAlgebra/uniformscaling LinearAlgebra/lq LinearAlgebra/hessenberg LinearAlgebra/blas LinearAlgebra/adjtrans LinearAlgebra/pinv LinearAlgebra/givens LinearAlgebra/structuredbroadcast LinearAlgebra/addmul LinearAlgebra/ldlt LinearAlgebra/factorization LibGit2/libgit2 Dates/accessors Dates/adjusters Dates/query Dates/periods Dates/ranges Dates/rounding Dates/types Dates/io Dates/arithmetic Dates/conversions ArgTools Artifacts Base64 CRC32c CompilerSupportLibraries_jll DelimitedFiles Distributed Downloads FileWatching Future GMP_jll InteractiveUtils LLVMLibUnwind_jll LazyArtifacts LibCURL LibCURL_jll LibGit2_jll LibSSH2_jll LibUV_jll LibUnwind_jll Libdl Logging MPFR_jll Markdown MbedTLS_jll Mmap MozillaCACerts_jll NetworkOptions OpenBLAS_jll OpenLibm_jll PCRE2_jll Printf Profile REPL Random SHA Serialization SharedArrays Sockets Statistics SuiteSparse SuiteSparse_jll TOML Tar Test UUIDs Unicode Zlib_jll dSFMT_jll libLLVM_jll libblastrampoline_jll nghttp2_jll p7zip_jll LibGit2/online download
This is probably caused by some compiler internal changes related to inference.
That line just lists all of the test sets that Buildbot ran. E.g. Buildbot ran the `ambiguous` test set, the `compiler/inference` test set, etc.
If you scroll up in the log, you can see which test sets passed and which test sets failed.
All of the test sets are passing except for SuiteSparse. For example, the `ambiguous` test set is passing, the `compiler/inference` test set is passing, etc.
The issue here is specifically that the Julia process is crashing sometime during the SuiteSparse test set.
Scroll up to e.g. line 525 of https://build.julialang.org/#/builders/65/builds/4081/steps/5/logs/stdio, where it says "Worker 7 terminated". That's where the Julia process is crashing.
Maybe these lines are no longer valid.
This cannot be the reason. If the size were wrong, the tests would fail every time.
How could I reproduce this locally?
Do you have a Windows machine locally?
If so, you could maybe try running the SuiteSparse tests in a `while` loop and waiting for it to crash?
Yes, I have a Windows machine. Should I build Julia in a Cygwin environment or just use the nightly?
I would build from source. Then, repeatedly run `Base.runtests(["SuiteSparse"])`.
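Not the exact recipe, but a rough sketch of what "repeatedly run" could look like; `Base.runtests` throws an error when the spawned test run fails, so the loop stops at the first failure or crash:

```julia
# Run the SuiteSparse test set over and over until a run fails.
for run_count in 1:1000
    @info "Starting SuiteSparse test run" run_count
    try
        Base.runtests(["SuiteSparse"])
    catch err
        @error "SuiteSparse test run failed" run_count exception = err
        break
    end
end
```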
Should I add any other configuration? I've been running the test suite with both `julia -t 12` and `julia -t 1` for about 20 minutes, and haven't gotten a crash yet.
It seems relatively rare. You may need to run it for a long time.
Should we disable the `detect_ambiguities` test for now? @KristofferC Do you know about this function?
This doesn't have anything to do with `detect_ambiguities`.
Sorry, I misread those logs.
It seems relatively rare. You may need to run it for a long time.
I'll try it again today.
OK, another question. If `Base.runtests(["SuiteSparse"])` could trigger a crash, how could I take a snapshot of the current stack for debugging later? Does rr support Windows now?
No
The crash happens in the first SuiteSparse test, which is `detect_ambiguities`.
The crash happens in the first SuiteSparse test, which is `detect_ambiguities`.
How do you know the crash is happening in the first SuiteSparse test?
Looks like the first failure is triggered by https://github.com/JuliaLang/julia/commit/a512f1a00fec3bbaa96f83f7eabfdfcba739e587. Could this `OPENBLAS_MAIN_FREE` env variable affect SuiteSparse or its test set?
It shows only 1 error and then crashes (in the tests table). So guessing it is the first or second test.
Looks like the first failure is triggered by JuliaLang/julia@a512f1a. Could this `OPENBLAS_MAIN_FREE` env variable affect SuiteSparse or its test set?
Just FYI, I'm not sure if that's the first occurrence. I stopped once I had a few examples.
Looks like the first failure is triggered by JuliaLang/julia@a512f1a. Could this `OPENBLAS_MAIN_FREE` env variable affect SuiteSparse or its test set?
I doubt it.
It shows only 1 error and then crashes (in the tests table). So guessing it is the first or second test.
I think that might be misleading though.
Each test set is run in a separate worker process. Once the worker process has run all of the tests, it reports the results (successes, failures, errors, and broken) back to the main process. At that point, the main process prints the test results to stdout.
The key point here (if I understand correctly) is that the successes/failures/errors/broken information is not communicated back to the main process until the worker process has finished running all of the tests. Therefore, if the worker process crashes before it has finished running all of the tests, all of the successes/failures/errors/broken information will be lost, and therefore the main process will not have any of that information. As a result, the main process only reports that there was "one error" during the SuiteSparse test suite. In reality, we don't know how many tests were run before the worker process crashed.
This is my understanding of it. It would be good if one of the experts (@staticfloat @Keno @vtjnash) can confirm that my understanding and explanation are correct.
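As a toy model of that explanation (my assumption about the harness, not the actual code in `test/runtests.jl`): the worker only sends its accumulated results back once the whole test set finishes, so a mid-run crash loses everything collected so far.

```julia
using Distributed
pid = first(addprocs(1))

results = remotecall_fetch(pid) do
    counts = Dict("passes" => 0, "failures" => 0)   # stand-in for the real result record
    for i in 1:100
        counts["passes"] += 1
        # If the worker hits an access violation here, the process dies and
        # `counts` never reaches the main process; it only sees a lost worker.
    end
    return counts   # sent back only after the whole loop completes
end
```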
https://build.julialang.org/#/builders/65/builds/3866 contains extra logs:
From worker 2:
From worker 2: Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
From worker 2: Exception: EXCEPTION_ACCESS_VIOLATION at 0x11c783caf -- .text at C:\buildbot\worker-tabularasa\tester_win64\build\bin\libumfpack.DLL (unknown line)
From worker 2: in expression starting at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\stdlib\v1.8\SuiteSparse\test\umfpack.jl:11
From worker 2: .text at C:\buildbot\worker-tabularasa\tester_win64\build\bin\libumfpack.DLL (unknown line)
From worker 2: .text at C:\buildbot\worker-tabularasa\tester_win64\build\bin\libumfpack.DLL (unknown line)
From worker 2: .text at C:\buildbot\worker-tabularasa\tester_win64\build\bin\libumfpack.DLL (unknown line)
From worker 2: .text at C:\buildbot\worker-tabularasa\tester_win64\build\bin\libumfpack.DLL (unknown line)
From worker 2: .text at C:\buildbot\worker-tabularasa\tester_win64\build\bin\libumfpack.DLL (unknown line)
From worker 2: umfpack_di_numeric at C:\buildbot\worker-tabularasa\tester_win64\build\bin\libumfpack.DLL (unknown line)
From worker 2: umfpack_di_numeric at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.8\SuiteSparse\lib\x86_64-w64-mingw32.jl:2176
From worker 2: #umfpack_numeric!#11 at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.8\SuiteSparse\src\umfpack.jl:382
From worker 2: umfpack_numeric! at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.8\SuiteSparse\src\umfpack.jl:379 [inlined]
From worker 2: #lu#1 at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.8\SuiteSparse\src\umfpack.jl:203
From worker 2: lu at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.8\SuiteSparse\src\umfpack.jl:197
From worker 2: unknown function (ip: 0000000121379370)
From worker 2: macro expansion at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\stdlib\v1.8\SuiteSparse\test\umfpack.jl:26 [inlined]
From worker 2: macro expansion at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.8\Test\src\Test.jl:1396 [inlined]
From worker 2: macro expansion at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\stdlib\v1.8\SuiteSparse\test\umfpack.jl:22 [inlined]
From worker 2: macro expansion at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.8\Test\src\Test.jl:1321 [inlined]
From worker 2: top-level scope at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\stdlib\v1.8\SuiteSparse\test\umfpack.jl:12
From worker 2: jl_toplevel_eval_flex at /cygdrive/c/buildbot/worker/package_win64/build/src\toplevel.c:880
From worker 2: jl_toplevel_eval_flex at /cygdrive/c/buildbot/worker/package_win64/build/src\toplevel.c:833
From worker 2: ijl_toplevel_eval at /cygdrive/c/buildbot/worker/package_win64/build/src\toplevel.c:898 [inlined]
From worker 2: ijl_toplevel_eval_in at /cygdrive/c/buildbot/worker/package_win64/build/src\toplevel.c:948
It points to this line which was modified by https://github.com/JuliaLang/SuiteSparse.jl/pull/40. cc @Wimmerer
Is there a way to search a build by its number in Buildbot?
@Gnimuc The only change there is moving from a manual function to a Clang-generated one. The signatures do differ slightly (a `Ptr{Cvoid}` in the old version while Clang chooses a `Ptr{Ptr{Cvoid}}`), but I doubt that's the issue? This is only happening on the threading test, right?
According to the log, the error occurs in `test\umfpack.jl:22`.
The signatures do differ slightly (a Ptr{Cvoid} in the old version while Clang chooses a Ptr{Ptr{Cvoid}}), but I doubt that's the issue?
This example shows `Ptr{Ptr{Cvoid}}` is correct.
BTW, the `tmp` variable `Vector{Ptr{Cvoid}}(undef, 1)` looks weird to me. I would rather use `Ref{Ptr{Cvoid}}(C_NULL)` instead.
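For illustration, here are the two patterns side by side. `some_c_function` and `"libexample"` are made-up stand-ins (the ccalls are left commented out for that reason); only the shape of the out-parameter matters:

```julia
# Pattern in the generated wrapper: a one-element vector as the out-parameter.
tmp = Vector{Ptr{Cvoid}}(undef, 1)
# @ccall "libexample".some_c_function(tmp::Ptr{Ptr{Cvoid}})::Cint
# handle = tmp[1]

# Suggested alternative: a Ref, which avoids allocating an array and is the
# idiomatic way to pass a single out-parameter through ccall.
out = Ref{Ptr{Cvoid}}(C_NULL)
# @ccall "libexample".some_c_function(out::Ptr{Ptr{Cvoid}})::Cint
# handle = out[]
```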
I don't think I modified that line, but once we've solved this issue I can go back and clean some things up.
UMFPACK and SPQR, the two solvers where we're seeing errors (this issue for UMFPACK; the KLU wrapper is blocked by a similar Windows x86_64-only error in the SPQR threading test), do use BLAS calls, and both now have errors that shouldn't be related to any recent changes (which were mostly superficial).
Note: this test is failing for `Int32` indices (this would naturally be the first failure, so that's the likely reason). I suppose it's possible that we're using an Int64 (long) `numeric` struct where we shouldn't be, but that should error on other platforms as well.
For future reference,
Win64 Failures | Caused by SuiteSparse |
---|---|
3866 | ☑️ |
3863 | ☑️ |
3862 | ❎ |
3861 | ❎ |
3849 | ❎ |
3845 | ❎ |
3844 | ❎ |
3841 | ❎ |
3831 | ❎ |
3826 | ❎ |
3815 | ☑️ |
3814 | ☑️ |
3808 | ❎ |
3797 | ❎ |
3788 | ❎ |
3780 | ❎ |
3779 | ❎ |
3778 | ❎ |
3777 | ❎ |
3774 | ❎ |
3772 | ❎ |
3771 | ☑️ |
3767 | ❎ |
3765 | ☑️ |
3764 | ❎ |
3761 | ❎ |
3760 | ❎ |
3757 | ❎ |
3749 | ❎ |
3748 | ❎ |
3743 | ❎ |
3734 | ❎ |
3726 | ❎ |
3724 | ❎ |
3723 | ☑️ |
3717 | ❎ |
3713 | ❎ |
3711 | ☑️ |
3706 | ❎ |
3699 | ☑️ |
3697 | ❎ |
3686 | ☑️ |
3677 | ❎ |
3675 | ❎ |
3670 | ❎ |
3668 | ❎ |
3665 | ❎ |
3660 | ❎ |
3659 | ❎ |
3644 | ❎ |
3640 | ❎ |
3635 | ❎ |
3631 | ❎ |
3630 | ❎ |
3625 | ❎ |
3622 | ❎ |
3615 | ❎ |
Note: this test is failing for Int32 indices (this would naturally be the first failure, so that's the likely reason). I suppose it's possible that we're using an Int64 (long) numeric struct where we shouldn't be, but that should error on other platforms as well.
@Wimmerer Could these lines be the reason?
I'm not familiar enough with the internals of SuiteSparse to identify that for sure, but it's a very probable culprit given that it's our only failing platform.
I'm not clear on what that's doing though. It's used to set `SuiteSparse_long` and `SuiteSparse_long_max` to the correct values. But we're not failing when it should be using `SuiteSparse_long`, which should be used when we call `dl_numeric`, not `di_numeric`.
E: Why are we doing that? `SuiteSparse_long = Clong` should be perfectly fine, right? Instead we're doing `SuiteSparse_long = __int64`.
E2: Oh, it's because on Windows, Clong is Int32? Weird.
Looking into it, that should be fine. And for this failing test it shouldn't matter; we should be using an Int32 numeric struct. Let me verify that this is actually happening.
`SuiteSparse_long` is set to `long long` in the BB2 script, so we do the same thing on the Julia side.
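Restating my reading of the comments above as a sketch (the actual definition in SuiteSparse.jl may be spelled differently): the binaries are built with `SuiteSparse_long = long long`, so the Julia side pins the type to `Int64` rather than `Clong`, because on Win64 (LLP64) a C `long` is only 32 bits.

```julia
# SuiteSparse is built with SuiteSparse_long = long long, i.e. 64 bits everywhere,
# so mirror that on the Julia side with a fixed Int64 rather than Clong.
const SuiteSparse_long = Int64
const SuiteSparse_long_max = typemax(SuiteSparse_long)

# For comparison, on Windows x86_64: Clong === Int32 (LLP64),
# while on 64-bit Linux/macOS: Clong === Int64, so the two choices coincide there.
```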
And I cannot reproduce the nondeterministic failure on my machine. I've run the tests about 200 times without a single crash, but the failure rate on Buildbot is roughly 1 in 25 runs on average.
One thing I wasn't aware of, looking at `umfpack.jl`, is that we really shouldn't be doing anything with the `umfpack_di` prefix. Our global control `umf_ctrl` is initialized by `umfpack_dl_defaults`. I'm surprised this works at all. Sometimes both methods work fine; it's not clear in this case.
I can't test whether this is the issue locally of course.
This is why I didn't use a global constant in the KLU wrapper. Each `UMFPackLU` should have its own `control` field. I have no idea if this is the actual issue in this case, but it is very possible.
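A hedged sketch of that design (illustrative names, not the actual SuiteSparse.jl types): each factorization carries its own control/info buffers instead of sharing a single global `umf_ctrl` vector.

```julia
const UMFPACK_CONTROL = 20   # length of the UMFPACK control array (umfpack.h)
const UMFPACK_INFO    = 90   # length of the UMFPACK info array (umfpack.h)

mutable struct MyUmfpackLU{Tv,Ti}
    symbolic::Ptr{Cvoid}
    numeric::Ptr{Cvoid}
    control::Vector{Float64}   # per-object, so concurrent factorizations don't race
    info::Vector{Float64}
end

MyUmfpackLU{Tv,Ti}() where {Tv,Ti} =
    MyUmfpackLU{Tv,Ti}(C_NULL, C_NULL, zeros(UMFPACK_CONTROL), zeros(UMFPACK_INFO))
```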
Yes, we should certainly avoid any global state, since it will prevent multi-threaded code from calling the solvers. I don't know whether SuiteSparse itself is thread-safe and can be called from multiple threads simultaneously. @Wimmerer Can you check with DrTimothyAldenDavis?
I will check with him. There are two different issues going on here that seem to have started at a similar time:

1. The threading (SPQR) test failure that is blocking the KLU wrapper (see above).
2. The crash in the `umfpack_di_numeric` call. This occurs in single-threaded usage. That is this issue.

Both of these occur only on CI, and only on x86_64.
@ViralBShah I have confirmed with Dr. Davis that everything should be thread-safe. We'll probably need to remove the global from umfpack.jl, which I can do in a bit. But both of the above tests worked perfectly fine until some change landed on nightly, IMHO. I'm somewhat hesitant to make changes until we have the original functionality smoothed out.
The trouble of course is that there is no way to test this one since we don't have a reproducer. But all of these suggest that there is some corruption happening somewhere.
If the KLU one is reproducible - maybe fixing that one first helps here? Although I hear what you are saying - that the crashes are in different places.
The main problem is I have no idea how to fix the "KLU issue". It's not an issue with KLU, it started happening with the SPQR threading test randomly on nightly. Code I didn't touch. And I'm not even sure how to diagnose it since it only happens on CI. At least it always happens unlike the UMFPACK issue...
Referring to https://github.com/JuliaLang/SuiteSparse.jl/issues/43#issuecomment-944861236
The SparseArray in the failing test has Int64 indices on Win64, so I would imagine that `umfpack_dl_numeric` would be what needs to be called. Either that, or the sparse array is getting converted to one with Int32 indices before umfpack is used.
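A quick way to check that locally (assuming a 64-bit Julia build); which UMFPACK entry point applies follows from the index element type of the sparse matrix:

```julia
using SparseArrays

A = sprandn(10, 10, 0.5)
Ti = eltype(rowvals(A))   # Int64 by default on 64-bit builds, including Win64
# Ti === Int64 -> the umfpack_dl_* routines; Ti === Int32 -> the umfpack_di_* routines
```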
Here's another log that has the extra details: https://build.julialang.org/#/builders/65/builds/4256
Worth noting that the failures are on AMD EPYC processors.
This is just a random shot in the dark, but I suspect the symptom might look like non-deterministic failures as described here. https://github.com/JuliaLang/SuiteSparse.jl/issues/47 https://github.com/JuliaLang/julia/issues/42673
Julia's `jl_calloc` overflows. It may not allocate the memory requested and may return a non-null pointer even though the requested memory was not allocated:
julia> ccall(:jl_calloc,Ptr{Nothing},(Csize_t, Csize_t), 0xffffffffffffffff, 0xffffffffffffffff)
Ptr{Nothing} @0x000000001ebd1180
It should return a `C_NULL`:
julia> Libc.calloc(0xffffffffffffffff, 0xffffffffffffffff)
Ptr{Nothing} @0x0000000000000000
Example log: https://build.julialang.org/#/builders/65/builds/4081
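For illustration, a Julia-level restatement of the overflow check a conforming `calloc` performs and which the report above says `jl_calloc` is missing (the real fix belongs in Julia's C allocator; `checked_calloc` is just a hypothetical helper):

```julia
function checked_calloc(nmemb::Csize_t, size::Csize_t)
    # If nmemb * size overflows, refuse the request instead of allocating a
    # too-small block and handing back a seemingly valid pointer.
    _, overflowed = Base.Checked.mul_with_overflow(nmemb, size)
    overflowed && return C_NULL
    return Libc.calloc(nmemb, size)   # Libc.calloc already returns C_NULL on failure
end

checked_calloc(typemax(Csize_t), typemax(Csize_t))   # returns C_NULL
```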