JuliaSparse / SparseArrays.jl

SparseArrays.jl is a Julia stdlib
https://sparsearrays.juliasparse.org/

Nondeterministic failures (Julia crashes) on Base Julia CI on `tester_win64` #147

Closed DilumAluthge closed 1 year ago

DilumAluthge commented 3 years ago

Example log: https://build.julialang.org/#/builders/65/builds/4081

ViralBShah commented 3 years ago

Is it possible to find a few other builds where this might have happened? It's hard to tell which test it is happening in too.

@Gnimuc Might this be due to some of our recent ccall changes where we are generating clang wrappers?

DilumAluthge commented 3 years ago

https://build.julialang.org/#/builders/65/builds/4040

https://build.julialang.org/#/builders/65/builds/4039

https://build.julialang.org/#/builders/65/builds/3866

Gnimuc commented 3 years ago

I'm not sure. This commit might be related: https://github.com/JuliaLang/SuiteSparse.jl/commit/bb068bbd9848a47e6527336231ded17a9cd1951d. But it should be fixed.

Is SuiteSparse.jl the only user of detect_ambiguities? I guess not. To me, this looks like detect_ambiguities itself is broken.

DilumAluthge commented 3 years ago

It looks like the SuiteSparse.jl stdlib was last bumped on August 24: https://github.com/JuliaLang/julia/commits/master/stdlib/SuiteSparse.version

So why are we just starting to see these failures now?

DilumAluthge commented 3 years ago

@Gnimuc Where are you seeing the issue with detect_ambiguities?

Gnimuc commented 3 years ago

caused by: failed process: ... ambiguous compiler/inference compiler/validation compiler/ssair compiler/irpasses compiler/codegen compiler/inline compiler/contextual subarray strings/basic strings/search strings/util strings/io strings/types unicode/utf8 core worlds atomics keywordargs numbers subtype char triplequote intrinsics dict hashing iobuffer staged offsetarray arrayops tuple reduce reducedim abstractarray intfuncs simdloop vecelement rational bitarray copy math fastmath functional iterators operators ordering path ccall parse loading gmp sorting spawn backtrace exceptions file read version namedtuple mpfr broadcast complex floatapprox reflection regex float16 combinatorics sysinfo env rounding ranges mod2pi euler show client errorshow sets goto llvmcall llvmcall2 ryu some meta stacktraces docs misc threads stress binaryplatforms atexit enums cmdlineargs int interpreter checked bitset floatfuncs precompile boundscheck error cartesian osutils channels iostream secretbuffer specificity reinterpretarray syntax corelogging missing asyncmap smallarrayshrink opaque_closure filesystem download SparseArrays/higherorderfns SparseArrays/sparse SparseArrays/sparsevector LinearAlgebra/triangular LinearAlgebra/qr LinearAlgebra/dense LinearAlgebra/matmul LinearAlgebra/schur LinearAlgebra/special LinearAlgebra/eigen LinearAlgebra/bunchkaufman LinearAlgebra/svd LinearAlgebra/lapack LinearAlgebra/tridiag LinearAlgebra/bidiag LinearAlgebra/diagonal LinearAlgebra/cholesky LinearAlgebra/lu LinearAlgebra/symmetric LinearAlgebra/generic LinearAlgebra/uniformscaling LinearAlgebra/lq LinearAlgebra/hessenberg LinearAlgebra/blas LinearAlgebra/adjtrans LinearAlgebra/pinv LinearAlgebra/givens LinearAlgebra/structuredbroadcast LinearAlgebra/addmul LinearAlgebra/ldlt LinearAlgebra/factorization LibGit2/libgit2 Dates/accessors Dates/adjusters Dates/query Dates/periods Dates/ranges Dates/rounding Dates/types Dates/io Dates/arithmetic Dates/conversions ArgTools Artifacts Base64 CRC32c 
CompilerSupportLibraries_jll DelimitedFiles Distributed Downloads FileWatching Future GMP_jll InteractiveUtils LLVMLibUnwind_jll LazyArtifacts LibCURL LibCURL_jll LibGit2_jll LibSSH2_jll LibUV_jll LibUnwind_jll Libdl Logging MPFR_jll Markdown MbedTLS_jll Mmap MozillaCACerts_jll NetworkOptions OpenBLAS_jll OpenLibm_jll PCRE2_jll Printf Profile REPL Random SHA Serialization SharedArrays Sockets Statistics SuiteSparse SuiteSparse_jll TOML Tar Test UUIDs Unicode Zlib_jll dSFMT_jll libLLVM_jll libblastrampoline_jll nghttp2_jll p7zip_jll LibGit2/online download

Gnimuc commented 3 years ago

This is probably caused by some compiler internal changes related to inference.

DilumAluthge commented 3 years ago

That line just lists all of the test sets that Buildbot ran. E.g. Buildbot ran the ambiguous test set, the compiler/inference test set, etc.

If you scroll up in the log, you can see which test sets passed and which test sets failed.

All of the test sets are passing except for SuiteSparse. For example, the ambiguous test set is passing, the compiler/inference test set is passing, etc.

DilumAluthge commented 3 years ago

The issue here is specifically that the Julia process is crashing sometime during the SuiteSparse test set.

Scroll up to e.g. line 525 of https://build.julialang.org/#/builders/65/builds/4081/steps/5/logs/stdio, where it says "Worker 7 terminated". That's where the Julia process is crashing.

Gnimuc commented 3 years ago

Maybe these lines are no longer valid.

This can not be the reason. If the size is wrong, then the tests should fail every time.

Gnimuc commented 3 years ago

How could I reproduce this locally?

DilumAluthge commented 3 years ago

Do you have a Windows machine locally?

If so, you could maybe try running the SuiteSparse tests in a while loop and waiting for it to crash?

Gnimuc commented 3 years ago

Yes, I have a Windows machine. Should I build Julia in a cygwin environment or just use the nightly?

DilumAluthge commented 3 years ago

I would build from source. Then, repeatedly run Base.runtests(["SuiteSparse"]).
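
A minimal sketch of such a loop, assuming a source build is on the path. The helper name `repeat_until_failure` and the attempt count are hypothetical, chosen only for illustration; the only call taken from this thread is `Base.runtests(["SuiteSparse"])`.

```julia
# Hypothetical helper: call `f` up to `max_attempts` times and return the
# number of the first attempt that throws, or `nothing` if none did.
function repeat_until_failure(f; max_attempts::Integer = 100)
    for attempt in 1:max_attempts
        try
            f()
        catch err
            @warn "attempt failed" attempt exception=err
            return attempt
        end
    end
    return nothing
end

# Intended use (slow, so it is left commented out here):
# repeat_until_failure(() -> Base.runtests(["SuiteSparse"]); max_attempts = 500)
```

Since the crash seems rare, a large `max_attempts` is probably needed, and the loop may still finish without ever hitting it.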

Gnimuc commented 3 years ago

Should I add any other configuration? I've been running the test suite with both julia -t 12 and julia -t 1 for about 20 minutes and haven't gotten a crash yet.

DilumAluthge commented 3 years ago

It seems relatively rare. You may need to run it for a long time.

ViralBShah commented 3 years ago

Should we disable the detect_ambiguities test for now? @KristofferC Do you know about this function?

DilumAluthge commented 3 years ago

This doesn't have anything to do with detect_ambiguities.

Gnimuc commented 3 years ago

Sorry, I misread those logs.

Gnimuc commented 3 years ago

It seems relatively rare. You may need to run it for a long time.

I'll try it again today.

Gnimuc commented 3 years ago

OK, another question. If Base.runtests(["SuiteSparse"]) could trigger a crash, how could I take a snapshot of the current stack for debugging later? Does rr support Windows now?

ViralBShah commented 3 years ago

No

ViralBShah commented 3 years ago

The crash happens in the first SuiteSparse test, which is detect_ambiguities.

DilumAluthge commented 3 years ago

The crash happens in the first SuiteSparse test, which is detect_ambiguities.

How do you know the crash is happening in the first SuiteSparse test?

Gnimuc commented 3 years ago

Looks like the first failure is triggered by https://github.com/JuliaLang/julia/commit/a512f1a00fec3bbaa96f83f7eabfdfcba739e587.

Could this OPENBLAS_MAIN_FREE env variable affect suitesparse or its testset?

ViralBShah commented 3 years ago

It shows only 1 error and then crashes (in the tests table). So guessing it is the first or second test.

DilumAluthge commented 3 years ago

Looks like the first failure is triggered by JuliaLang/julia@a512f1a. Could this OPENBLAS_MAIN_FREE env variable affect suitesparse or its testset?

Just FYI, I'm not sure if that's the first occurrence. I stopped once I had a few examples.

ViralBShah commented 3 years ago

Looks like the first failure is triggered by JuliaLang/julia@a512f1a.

Could this OPENBLAS_MAIN_FREE env variable affect suitesparse or its testset?

I doubt it.

DilumAluthge commented 3 years ago

It shows only 1 error and then crashes (in the tests table). So guessing it is the first or second test.

I think that might be misleading though.

Each test set is run in a separate worker process. Once the worker process has run all of the tests, it reports the results (successes, failures, errors, and broken) back to the main process. At that point, the main process prints the test results to stdout.

The key point here (if I understand correctly) is that the successes/failures/errors/broken information is not communicated back to the main process until the worker process has finished running all of the tests. Therefore, if the worker process crashes before it has finished running all of the tests, all of the successes/failures/errors/broken information will be lost, and therefore the main process will not have any of that information. As a result, the main process only reports that there was "one error" during the SuiteSparse test suite. In reality, we don't know how many tests were run before the worker process crashed.

This is my understanding of it. It would be good if one of the experts (@staticfloat @Keno @vtjnash) can confirm that my understanding and explanation are correct.
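
The reporting behaviour described above can be sketched with a toy model. This is not the real test harness: `run_worker` and `report` are hypothetical names, and an exception stands in for the worker process crashing.

```julia
# Toy model (not the real test harness): the "worker" only hands back its
# tally after every test has run, so a crash part-way through loses it.
function run_worker(tests)
    passed = 0
    for test in tests
        test()          # an exception here stands in for the process crashing
        passed += 1
    end
    return passed
end

# The "main process" sees either the full tally or a single error.
function report(tests)
    try
        return (passed = run_worker(tests), errors = 0)
    catch
        return (passed = 0, errors = 1)
    end
end

ok() = nothing
crash() = error("simulated worker crash")

report([ok, ok, ok])         # (passed = 3, errors = 0)
report([ok, ok, crash, ok])  # (passed = 0, errors = 1): the two passes are lost
```

In this model, a count of "1 error" says nothing about which test was running when the worker went down.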

Gnimuc commented 3 years ago

https://build.julialang.org/#/builders/65/builds/3866 contains extra logs:

  From worker 2:    
      From worker 2:    Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
      From worker 2:    Exception: EXCEPTION_ACCESS_VIOLATION at 0x11c783caf -- .text at C:\buildbot\worker-tabularasa\tester_win64\build\bin\libumfpack.DLL (unknown line)
      From worker 2:    in expression starting at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\stdlib\v1.8\SuiteSparse\test\umfpack.jl:11
      From worker 2:    .text at C:\buildbot\worker-tabularasa\tester_win64\build\bin\libumfpack.DLL (unknown line)
      From worker 2:    .text at C:\buildbot\worker-tabularasa\tester_win64\build\bin\libumfpack.DLL (unknown line)
      From worker 2:    .text at C:\buildbot\worker-tabularasa\tester_win64\build\bin\libumfpack.DLL (unknown line)
      From worker 2:    .text at C:\buildbot\worker-tabularasa\tester_win64\build\bin\libumfpack.DLL (unknown line)
      From worker 2:    .text at C:\buildbot\worker-tabularasa\tester_win64\build\bin\libumfpack.DLL (unknown line)
      From worker 2:    umfpack_di_numeric at C:\buildbot\worker-tabularasa\tester_win64\build\bin\libumfpack.DLL (unknown line)
      From worker 2:    umfpack_di_numeric at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.8\SuiteSparse\lib\x86_64-w64-mingw32.jl:2176
      From worker 2:    #umfpack_numeric!#11 at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.8\SuiteSparse\src\umfpack.jl:382
      From worker 2:    umfpack_numeric! at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.8\SuiteSparse\src\umfpack.jl:379 [inlined]
      From worker 2:    #lu#1 at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.8\SuiteSparse\src\umfpack.jl:203
      From worker 2:    lu at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.8\SuiteSparse\src\umfpack.jl:197
      From worker 2:    unknown function (ip: 0000000121379370)
      From worker 2:    macro expansion at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\stdlib\v1.8\SuiteSparse\test\umfpack.jl:26 [inlined]
      From worker 2:    macro expansion at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.8\Test\src\Test.jl:1396 [inlined]
      From worker 2:    macro expansion at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\stdlib\v1.8\SuiteSparse\test\umfpack.jl:22 [inlined]
      From worker 2:    macro expansion at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.8\Test\src\Test.jl:1321 [inlined]
      From worker 2:    top-level scope at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\stdlib\v1.8\SuiteSparse\test\umfpack.jl:12
      From worker 2:    jl_toplevel_eval_flex at /cygdrive/c/buildbot/worker/package_win64/build/src\toplevel.c:880
      From worker 2:    jl_toplevel_eval_flex at /cygdrive/c/buildbot/worker/package_win64/build/src\toplevel.c:833
      From worker 2:    ijl_toplevel_eval at /cygdrive/c/buildbot/worker/package_win64/build/src\toplevel.c:898 [inlined]
      From worker 2:    ijl_toplevel_eval_in at /cygdrive/c/buildbot/worker/package_win64/build/src\toplevel.c:948

It points to this line which was modified by https://github.com/JuliaLang/SuiteSparse.jl/pull/40. cc @Wimmerer

Gnimuc commented 3 years ago

Is there a way to search a build by its number in Buildbot?

rayegun commented 3 years ago

@Gnimuc The only change there is moving from a manual function -> Clang generated one. The signatures do differ slightly (a Ptr{Cvoid} in the old version while Clang chooses a Ptr{Ptr{Cvoid}}), but I doubt that's the issue? This is only happening on the threading test, right?

Gnimuc commented 3 years ago

According to the log, the error occurs in test\umfpack.jl:22.

The signatures do differ slightly (a Ptr{Cvoid} in the old version while Clang chooses a Ptr{Ptr{Cvoid}}), but I doubt that's the issue?

This example shows Ptr{Ptr{Cvoid}} is correct.

BTW, the tmp variable Vector{Ptr{Cvoid}}(undef, 1) looks weird to me. I would rather use Ref{Ptr{Cvoid}}(C_NULL) instead.
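
To make the comparison concrete, here is a sketch of both output-slot patterns. `fake_symbolic` is a hypothetical stand-in for a C routine taking a `void **` argument; it is not a UMFPACK function, and the handle value is arbitrary.

```julia
# Hypothetical stand-in for a C routine that returns an opaque handle
# through a `void **` output argument (not a real UMFPACK function).
function fake_symbolic(out::Ptr{Ptr{Cvoid}})
    unsafe_store!(out, Ptr{Cvoid}(UInt(0x1234)))
    return Cint(0)
end

# Call `fptr` with any argument that `ccall` can convert to `void **`.
call_with(fptr::Ptr{Cvoid}, out) = ccall(fptr, Cint, (Ptr{Ptr{Cvoid}},), out)

fptr = @cfunction(fake_symbolic, Cint, (Ptr{Ptr{Cvoid}},))

# Pattern currently used: a one-element vector as the output slot.
tmp = Vector{Ptr{Cvoid}}(undef, 1)
call_with(fptr, tmp)
handle_from_vector = tmp[1]

# Suggested alternative: a `Ref` initialized to C_NULL.
slot = Ref{Ptr{Cvoid}}(C_NULL)
call_with(fptr, slot)
handle_from_ref = slot[]
```

Both forms hand the callee a valid `Ptr{Ptr{Cvoid}}`. The `Ref` version starts from a known value rather than uninitialized memory, which is the main reason to prefer it.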

rayegun commented 3 years ago

I don't think I modified that line, but once we've solved this issue I can go back and clean some things up.

UMFPACK and SPQR, the two solvers where we're seeing errors (this one for UMFPACK, the KLU wrapper is blocked by a similar Windows x86_64 only error for threading SPQR), do use BLAS calls, and both now have errors that shouldn't be related to any recent changes (which were mostly superficial).

Note. This test is failing for Int32 indices (this would naturally be the first failure, so that's the likely reason). I suppose it's possible that we're using an Int64 (long) numeric struct where we shouldn't be, but that should error on other platforms as well.

Gnimuc commented 3 years ago

For future reference,

Win64 Failures Caused by SuiteSparse
3866 ☑️
3863 ☑️
3862
3861
3849
3845
3844
3841
3831
3826
3815 ☑️
3814 ☑️
3808
3797
3788
3780
3779
3778
3777
3774
3772
3771 ☑️
3767
3765 ☑️
3764
3761
3760
3757
3749
3748
3743
3734
3726
3724
3723 ☑️
3717
3713
3711 ☑️
3706
3699 ☑️
3697
3686 ☑️
3677
3675
3670
3668
3665
3660
3659
3644
3640
3635
3631
3630
3625
3622
3615

Gnimuc commented 3 years ago

Note. This test is failing for Int32 indices (this would naturally be the first failure, so that's the likely reason). I suppose it's possible that we're using an Int64 (long) numeric struct where we shouldn't be, but that should error on other platforms as well.

@Wimmerer Could these lines be the reason?

https://github.com/JuliaLang/SuiteSparse.jl/blob/b15c39be53f7823c721c1f8a7c036105e2baa04a/src/LibSuiteSparse.jl#L8-L13

rayegun commented 3 years ago

I'm not familiar enough with the internals of SuiteSparse to identify that for sure, but it's a very probable culprit given that's our only failing platform.

I'm not clear on what that's doing though. It's used to set SuiteSparse_long and SuiteSparse_long_max to the correct values. But we're not failing when it should be using SuiteSparse_long, which should be used when we call dl_numeric not di_numeric.

E: Why are we doing that? SuiteSparse_long = Clong should be perfectly fine right? Instead we're doing SuiteSparse_long = __int64.

E2: Oh, it's because on Windows, Clong is Int32? Weird.
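
A quick way to check this on any machine. The alias `SuiteSparseLongSketch` is a hypothetical name; it only mirrors the idea of using `long long` on Windows and `long` elsewhere, and is not copied from LibSuiteSparse.jl.

```julia
# C's `long` is 32 bits on 64-bit Windows but 64 bits on 64-bit Linux/macOS;
# `long long` is 64 bits on all of them.
println("Clong     = ", Clong, " (", sizeof(Clong), " bytes)")
println("Clonglong = ", Clonglong, " (", sizeof(Clonglong), " bytes)")

# Hypothetical alias: use `long long` on Windows and `long` elsewhere.
const SuiteSparseLongSketch = Sys.iswindows() ? Clonglong : Clong
```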

rayegun commented 3 years ago

Looking into it, that should be fine. And for this failing test it shouldn't matter: we should be using an Int32 numeric struct. Let me test that this is actually happening.

Gnimuc commented 3 years ago

SuiteSparse_long is set to long long in the BB2 script:

https://github.com/JuliaPackaging/Yggdrasil/blob/a491caf99504e08f9a9ee3878a54c59c0bb81d45/S/SuiteSparse/SuiteSparse/build_tarballs.jl#L30

so we do the same thing on the Julia side.

Gnimuc commented 3 years ago

Also, I cannot reproduce the nondeterministic failure on my machine. I've run it about 200 times without a single crash, but the rate on Buildbot is roughly 1 in 25 runs on average.

rayegun commented 3 years ago

One thing I wasn't aware of until looking at umfpack.jl is that we really shouldn't be doing anything with the umfpack_di prefix: our global control umf_ctrl is initialized by umfpack_dl_defaults. I'm surprised this works at all. Sometimes both methods work fine; it's not clear in this case.

I can't test whether this is the issue locally of course.

rayegun commented 3 years ago

This is why I didn't use a global constant in the KLU wrapper. Each UMFPackLU should have its own control field. I have no idea if this is the actual issue in this case, but it is very possible.
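
A sketch of what a per-object control field could look like. Everything here is hypothetical: the `SolverSketch` type is not the real UMFPackLU, and the control length of 20 is an assumption about UMFPACK's control array, not something checked against the wrapper.

```julia
# Length of UMFPACK's control array; assumed here to be 20.
const CONTROL_LENGTH = 20

# Hypothetical per-factorization object that owns its control settings
# instead of sharing one global vector.
struct SolverSketch
    control::Vector{Float64}
end
SolverSketch() = SolverSketch(zeros(Float64, CONTROL_LENGTH))

a = SolverSketch()
b = SolverSketch()
a.control[1] = 2.0   # change a setting on `a` only
b.control[1]         # still 0.0; `b` is unaffected
```

With a single global vector, the write through `a` would be visible to every other factorization, including ones running on other threads.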

ViralBShah commented 3 years ago

Yes, we should certainly avoid any global state, which will prevent multi-threaded code from calling the solvers. I don't know if SuiteSparse itself is thread-safe - and can be called from multiple threads simultaneously. @Wimmerer Can you check with DrTimothyAldenDavis?

rayegun commented 3 years ago

I will check with him. There are two different issues going on here, which seem to have appeared at a similar time:

  1. UMFPackLU fails on Windows x86_64 on the umfpack_di_numeric call. This occurs in single-threaded usage. That is this issue.
  2. Tests from the KLU PR fail on the unrelated threading test of SPQR on Windows x86_64

Both of these occur only on CI, and only on x86_64.

@ViralBShah I have confirmed with Dr. Davis that everything should be thread safe. We'll probably need to remove the global from umfpack.jl which I can do in a bit. But both of the above tests have worked perfectly fine until some sort of changes to nightly IMHO. I'm somewhat hesitant to make changes until we have the original functionality smoothed out.

ViralBShah commented 3 years ago

The trouble of course is that there is no way to test this one since we don't have a reproducer. But all of these suggest that there is some corruption happening somewhere.

If the KLU one is reproducible - maybe fixing that one first helps here? Although I hear what you are saying - that the crashes are in different places.

rayegun commented 3 years ago

The main problem is I have no idea how to fix the "KLU issue". It's not an issue with KLU, it started happening with the SPQR threading test randomly on nightly. Code I didn't touch. And I'm not even sure how to diagnose it since it only happens on CI. At least it always happens unlike the UMFPACK issue...

ViralBShah commented 3 years ago

Referring to https://github.com/JuliaLang/SuiteSparse.jl/issues/43#issuecomment-944861236

The SparseArray in the failing test has Int64 indices on Win64, so I would imagine that umfpack_dl_numeric would be what needs to be called. Either that, or the sparse array is getting converted to one with Int32 indices before umfpack is used.
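
One way to see which variant a given matrix would map to. `umfpack_prefix` is a hypothetical helper written for this comment, not part of the wrapper; it only encodes UMFPACK's naming convention.

```julia
using SparseArrays

# UMFPACK naming: `di` = Float64 values with `int` indices,
# `dl` = Float64 values with `SuiteSparse_long` indices.
# `umfpack_prefix` is a hypothetical helper, not part of the wrapper.
umfpack_prefix(::SparseMatrixCSC{Float64,Int32}) = "umfpack_di_"
umfpack_prefix(::SparseMatrixCSC{Float64,Int64}) = "umfpack_dl_"

A = sparse(Int64[1, 2], Int64[1, 2], [4.0, 5.0])  # Int64 indices
A32 = SparseMatrixCSC{Float64,Int32}(A)             # same matrix, Int32 indices

string(umfpack_prefix(A), "numeric")    # "umfpack_dl_numeric"
string(umfpack_prefix(A32), "numeric")  # "umfpack_di_numeric"
```

If the failing test really calls umfpack_di_numeric, then under this convention the matrix reaching UMFPACK should have Int32 indices.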

DilumAluthge commented 3 years ago

Here's another log that has the extra details: https://build.julialang.org/#/builders/65/builds/4256

ViralBShah commented 3 years ago

Worth noting that the failures are on AMD EPYC processors.

mkitti commented 3 years ago

This is just a random shot in the dark, but I suspect the symptom might look like non-deterministic failures as described here. https://github.com/JuliaLang/SuiteSparse.jl/issues/47 https://github.com/JuliaLang/julia/issues/42673

Julia's jl_calloc overflows. It may not allocate the memory requested, and may return a non-null pointer despite not having allocated that memory:

julia> ccall(:jl_calloc,Ptr{Nothing},(Csize_t, Csize_t), 0xffffffffffffffff, 0xffffffffffffffff)
Ptr{Nothing} @0x000000001ebd1180

It should return a C_NULL:

julia> Libc.calloc(0xffffffffffffffff, 0xffffffffffffffff)
Ptr{Nothing} @0x0000000000000000
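
The wraparound itself can be shown without allocating anything. This is a sketch that assumes the byte count is formed as nmemb * size in Csize_t arithmetic; I have not confirmed how jl_calloc computes it, and `checked_calloc_size` is a hypothetical guard, not an existing function.

```julia
# Unchecked multiplication of two Csize_t values wraps around silently.
wrapped = typemax(Csize_t) * typemax(Csize_t)   # == 1 after wraparound

# Hypothetical guard: return the byte count, or `nothing` on overflow.
function checked_calloc_size(nmemb::Csize_t, size::Csize_t)
    total, overflowed = Base.Checked.mul_with_overflow(nmemb, size)
    return overflowed ? nothing : total
end

checked_calloc_size(typemax(Csize_t), typemax(Csize_t))  # nothing
checked_calloc_size(Csize_t(4), Csize_t(8))              # 32 bytes
```

An allocator that multiplies without such a check could end up requesting a single byte here and returning a valid-looking pointer to it.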