JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
CI: Stack smash in SuiteSparse #46228

Closed (Keno closed this 2 years ago)

Keno commented 2 years ago

We frequently see the win64 builder crash in SuiteSparse. There's some discussion here: https://github.com/JuliaSparse/SparseArrays.jl/issues/147, but I figured I'd open a new issue with some investigation results.

I was able to reproduce this locally in a VM with 32GiB of memory, but not in one with 16GiB, which suggests that this may be GC-interval dependent or, at the very least, test-order dependent. I did eventually manage to catch this in the debugger, but by all appearances the stack was smashed. As a result, I would also not put too much credence in any of the stack traces produced by CI.

Keno commented 2 years ago

Next attempt: What happens if we build SuiteSparse with -fstack-protector?

SuiteSparse.v5.10.1.x86_64-w64-mingw32.tar.gz

Keno commented 2 years ago

Stack protector was a bust. I'm pursuing two options in parallel now:

  1. Try the Windows equivalent of rr, Time Travel Debugging (https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/time-travel-debugging-overview). Unfortunately, it's much slower than rr.
  2. Try building with MSan.

ViralBShah commented 2 years ago

cc @Wimmerer

DrTimothyAldenDavis commented 2 years ago

I haven't seen this, but I typically don't do any testing on Windows at all for SuiteSparse (UMFPACK, KLU, CHOLMOD, GraphBLAS, etc). What packages in SuiteSparse cause this?

rayegun commented 2 years ago

The ones tested as part of Julia (SPQR, UMFPACK, CHOLMOD) have all had some sort of random CI errors recently.

Hopefully Keno finds something solid. If you've never gotten a report of something like this, the cause is likely in the way we wrap or build SuiteSparse.

Keno commented 2 years ago

I have a handle on this. Will post results in a day or two.

ViralBShah commented 2 years ago

I am closing this, since it seems to have been resolved, but we should open a new issue (or reopen this one) if necessary.

Keno commented 2 years ago

It didn't get resolved; it just went away when we put it back into the sysimg. But yeah, we can close the issue, since it's not an active problem.