JuliaSparse / Pardiso.jl

Calling the PARDISO library from Julia

Large matrix solve: "julia-debug" received signal SIGSEGV, Segmentation fault. #31

Closed ianwilliamson closed 4 years ago

ianwilliamson commented 6 years ago

When solving a large matrix (~9e7 unknowns) on my cluster I am getting a segmentation fault. However, smaller matrices generated by the same code base (~2e7 unknowns) can be solved. I guess it's possible that I'm running out of memory, but I would expect an OutOfMemoryError to be thrown in that case.

In case it helps, this is the output of running my code through a debug build of Julia:

Thread 23 "julia-debug" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x2aaaf5203700 (LWP 50103)]
0x00002aaaabbb4864 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00002aaaabbb4864 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00002aaad5130643 in ?? ()
   from /home/------/.julia/v0.6/Pardiso/deps/libpardiso500-GNU481-X86-64.so
#2  0x00002aaad555743e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#3  0x00002aaaab9106fa in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#4  0x00002aaaabc2cb5d in clone () from /lib/x86_64-linux-gnu/libc.so.6

This is the output of versioninfo()

Julia Version 0.6.2
Commit d386e40c17 (2017-12-13 18:08 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Opteron(tm) Processor 6386 SE
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Piledriver)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, bdver1)

Sacha0 commented 6 years ago

An absolute delight to see you here Ian! :) And much thanks for the report!

KristofferC commented 6 years ago

Does setting MESSAGE_LEVEL_ON reveal anything? Does checkmatrix complain about anything (probably not)? I guess we would need a debug build of the pardiso library to get a better stacktrace since that is where the problem seems to happen.

Since it seems to work correctly for smaller matrices I guess that the way we call things from Pardiso.jl is okay...
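
For reference, a minimal sketch of turning on PARDISO's message output before the solve. The wrapper function `verbose_solve` is a hypothetical name used only for illustration, and the calls follow the README of a recent Pardiso.jl release, so the v0.6-era API in this thread may differ slightly:

```julia
using Pardiso

# Hypothetical wrapper: `ps` is an already constructed solver object,
# `A` the sparse matrix and `B` the right-hand side.
function verbose_solve(ps, A, B)
    # Ask PARDISO to print its own statistics and progress messages.
    set_msglvl!(ps, Pardiso.MESSAGE_LEVEL_ON)
    return solve(ps, A, B)
end
```

If the crash happens before PARDISO prints anything, as reported later in the thread, this only tells you that the failure comes early in the call.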

ianwilliamson commented 6 years ago

It seems like it doesn't even get far enough for MESSAGE_LEVEL_ON to have any effect. I will give checkmatrix() a try and report back.

ianwilliamson commented 6 years ago

Running the sequence below doesn't indicate any problems.

printstats(ps, A, B)
checkmatrix(ps, A)
checkvec(ps, B)

I recently encountered the segfault on a matrix with 4.94E+07 unknowns. If you can advise on how to enable debugging in Pardiso, I can continue to look into this. Thanks!

KristofferC commented 6 years ago

The "correct" way to go about it would be to write the whole call in C with the same input and see if it still fails. In that case, file a bug report upstream.

ianwilliamson commented 6 years ago

Rather than rewriting my code, would it make sense to save the matrix data and then load it into some simple C-based program? Or would that not be useful?
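
One way the export side of that could look, as a rough sketch using only the Julia standard library. `dump_system` and the file names are hypothetical, and the layout (1-based, 32-bit row-compressed indices) is an assumption about what a standalone C driver for this PARDISO build would want to read:

```julia
using SparseArrays

# Hypothetical helper (not part of Pardiso.jl): write A and b as raw binary
# files in a row-compressed (CSR) layout for a standalone C driver.
function dump_system(prefix::AbstractString, A::SparseMatrixCSC, b::AbstractVector)
    # The CSC arrays of transpose(A) are exactly the CSR arrays of A.
    At = copy(transpose(A))
    # Assumption: the C side uses 32-bit, 1-based indices. The conversion
    # throws an InexactError if any index does not fit in an Int32.
    ia = convert(Vector{Int32}, At.colptr)   # row pointers
    ja = convert(Vector{Int32}, At.rowval)   # column indices
    write(prefix * "_ia.bin", ia)
    write(prefix * "_ja.bin", ja)
    write(prefix * "_a.bin", At.nzval)       # complex values are stored as (re, im) pairs
    write(prefix * "_b.bin", collect(b))
    return size(A, 1), length(At.nzval)      # n and nnz, needed by the reader
end
```

The C program would then read the four files, using the returned `n` and `nnz` to size its buffers, and pass the arrays to the same PARDISO entry point with the same parameters.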

In case it helps, here's a more detailed output that I got on a recent segmentation fault for a complex matrix with 1E8 unknowns attempting to solve on 10 cores. I'm guessing more detail is needed on the first few lines that pertain to the internals of libpardiso500-GNU481-X86-64.so? Probably this would only make sense to the Pardiso developers anyway?

signal (11): Segmentation fault
while loading /home/---/projects/---/---/---.jl, in expression starting on line 47
unknown function (ip: 0x2b4fb989b3eb)
c_blklu_unsym_risc_pardiso_ at /home/---/.julia/v0.6/Pardiso/deps/libpardiso500-GNU481-X86-64.so (unknown line)
ssnfct_pardiso_ at /home/---/.julia/v0.6/Pardiso/deps/libpardiso500-GNU481-X86-64.so (unknown line)
factorize_pardiso_ at /home/---/.julia/v0.6/Pardiso/deps/libpardiso500-GNU481-X86-64.so (unknown line)
do_all_pardiso_fc_ at /home/---/.julia/v0.6/Pardiso/deps/libpardiso500-GNU481-X86-64.so (unknown line)
pardiso_ccc_ at /home/---/.julia/v0.6/Pardiso/deps/libpardiso500-GNU481-X86-64.so (unknown line)
pardiso_f_ at /home/---/.julia/v0.6/Pardiso/deps/libpardiso500-GNU481-X86-64.so (unknown line)
ccall_pardiso at /home/---/.julia/v0.6/Pardiso/src/project_pardiso.jl:100 [inlined]
pardiso at /home/---/.julia/v0.6/Pardiso/src/Pardiso.jl:263
solve! at /home/---/.julia/v0.6/Pardiso/src/Pardiso.jl:225
solve at /home/---/.julia/v0.6/Pardiso/src/Pardiso.jl:168 [inlined]
solve at /home/---/.julia/v0.6/Pardiso/src/Pardiso.jl:167 [inlined]
#dolinearsolve#49 at /home/---/.julia/v0.6/FDFD/src/./solver/solver.jl:13
...
Allocations: 1920378263 (Pool: 1919378492; Big: 999771); GC: 2761
srun: error: ---: task 0: Segmentation fault

KristofferC commented 6 years ago

I didn't mean that you should rewrite the whole code, just the specific call to the pardiso C function, with the same parameters and the matrix that causes the error.

Perhaps you can also try the more step-by-step approach shown in e.g. https://github.com/JuliaSparse/Pardiso.jl/blob/master/examples/exampleunsym.jl instead of calling solve. That might give some additional useful output.
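
A sketch of what the step-by-step version might look like, with a marker printed before each phase so the last line in the log shows which phase was active when the process died. `solve_by_phase` is a hypothetical name, and the calls follow the README of a recent Pardiso.jl release, so the v0.6-era API may differ:

```julia
using Pardiso

# Hypothetical helper: run each PARDISO phase as a separate call.
# `ps` is an already constructed solver whose matrix type has been set,
# `A` is the sparse matrix and `B` the right-hand side.
function solve_by_phase(ps, A, B)
    pardisoinit(ps)                      # default settings for the matrix type
    fix_iparm!(ps, :N)                   # account for Julia passing a CSC matrix
    A_pardiso = get_matrix(ps, A, :N)    # matrix in the form pardiso() expects

    println("phase: analysis"); flush(stdout)
    set_phase!(ps, Pardiso.ANALYSIS)
    pardiso(ps, A_pardiso, B)

    println("phase: numeric factorization"); flush(stdout)
    set_phase!(ps, Pardiso.NUM_FACT)
    pardiso(ps, A_pardiso, B)

    println("phase: solve"); flush(stdout)
    set_phase!(ps, Pardiso.SOLVE_ITERATIVE_REFINE)
    X = similar(B)
    pardiso(ps, X, A_pardiso, B)

    set_phase!(ps, Pardiso.RELEASE_ALL)  # free PARDISO's internal memory
    pardiso(ps)
    return X
end
```

The backtrace above already passes through factorize_pardiso_, so the expectation is that the analysis phase completes and the crash comes during numeric factorization, but running it this way would confirm that.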

KristofferC commented 4 years ago

I don't think this is a problem with the wrapper but something with Pardiso itself. Maybe the MKL version works better?
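
Switching backends should only need a different solver type. A minimal sketch on a tiny made-up system, assuming a recent Pardiso.jl with MKL available:

```julia
using Pardiso, SparseArrays

# Small illustrative system; the real matrix from this issue would go here.
A = sparse(ComplexF64[2 1im; 0 3])
B = ComplexF64[1, 2]

ps = MKLPardisoSolver()   # MKL's PARDISO instead of the libpardiso500 build
X = solve(ps, A, B)
```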