We don't do anything special, so I assume the underlying solver is not thread safe.
I did a quick run with GDB and got the segfault here. But maybe that's not the actual point of the segfault and Julia just happened to run GC at that moment. Here's the GDB backtrace:
Thread 3 "julia" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe2f3b700 (LWP 199218)]
0x00007ffff752c3e3 in jl_mutex_wait (lock=0x7ffff7888920 <codegen_lock>, safepoint=1)
at /home/lassepe/worktree/julia/src/julia_locks.h:34
34 jl_gc_safepoint_(self->ptls);
(gdb) bt
#0 0x00007ffff752c3e3 in jl_mutex_wait (lock=0x7ffff7888920 <codegen_lock>, safepoint=1)
at /home/lassepe/worktree/julia/src/julia_locks.h:34
#1 0x00007ffff752c536 in jl_mutex_lock (lock=0x7ffff7888920 <codegen_lock>)
at /home/lassepe/worktree/julia/src/julia_locks.h:85
#2 0x00007ffff752db6d in jl_generate_fptr (mi=0x7ffff0a17350, world=31344)
at /home/lassepe/worktree/julia/src/jitlayers.cpp:316
#3 0x00007ffff7481916 in jl_compile_method_internal (mi=0x7ffff0a17350, world=31344)
at /home/lassepe/worktree/julia/src/gf.c:1980
#4 0x00007ffff748275e in _jl_invoke (F=0x7ffff086f1b0, args=0x7fffef1f3408, nargs=0, mfunc=0x7ffff0a17350, world=31344)
at /home/lassepe/worktree/julia/src/gf.c:2239
#5 0x00007ffff7483185 in jl_apply_generic (F=0x7ffff086f1b0, args=0x7fffef1f3408, nargs=0)
at /home/lassepe/worktree/julia/src/gf.c:2429
#6 0x00007ffff74a8cdc in jl_apply (args=0x7fffef1f3400, nargs=1) at /home/lassepe/worktree/julia/src/julia.h:1788
#7 0x00007ffff74aac20 in start_task () at /home/lassepe/worktree/julia/src/task.c:877
That just means the issue is probably that we clean up one instance of Path while another is running, and that the two instances share some global state in C. If I recall correctly, there's a workspace feature of the C library that we aren't using at the moment, but I'm not sure whether that would help.
Thank you for your swift response. I dug a little deeper and it indeed seems like this is a global state issue inside PATH. At least the function signature of the solver suggests that there is only one global solver object.
FUN_DECL(Void) Path_Create(Int maxSize, Int maxNNZ)
PATHSolver.jl never calls that directly, but I guess that would be created as a result of the call to Path_Solve. So when one MCP is cleaned up, it probably destroys the workspace out from under the other solver running on the same workspace in parallel.
So when one MCP is cleaned up, it probably destroys the workspace out from under the other solver running on the same workspace in parallel.
Yes. We can probably fix this by manually creating and managing our own workspace, but I don't know what this entails. You'd have to look into the C code. I'll review a PR if you have time to work on it :smile:
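For illustration only, here is a rough sketch of what explicit workspace management could look like from the Julia side. It assumes the headers also declare a matching Path_Destroy and that the shared library is named libpath50; neither assumption has been checked against the actual headers or PATHSolver.jl internals.
# Hypothetical sketch: Path_Destroy and the library name are assumptions,
# not verified against the PATH headers or PATHSolver.jl.
const libpath = "libpath50"  # adjust to the actual PATH shared library

function with_path_workspace(f, max_size::Integer, max_nnz::Integer)
    # Pre-allocate the solver workspace, mirroring Path_Create(Int maxSize, Int maxNNZ).
    ccall((:Path_Create, libpath), Cvoid, (Cint, Cint), max_size, max_nnz)
    try
        return f()  # e.g. a call to PATHSolver.solve_mcp
    finally
        # Tear the workspace down explicitly once the solve has finished.
        ccall((:Path_Destroy, libpath), Cvoid, ())
    end
end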
You can find the C header files here: http://www.cs.wisc.edu/~ferris/path/path_5.0.05_Linux64.zip
A colleague of mine contacted the PATH team and Steven Dirkse replied:
Unfortunately, PATH is indeed known to NOT be thread-safe. There are multiple reasons for this that go deep into the way PATH is organized. We spent some time considering this issue some time ago and concluded that the fixes would require a substantial amount of effort, something we could not commit to then or since.
I cannot suggest any workaround I would feel comfortable and safe with, apart from avoiding multi-threaded use of PATH.
Therefore, I guess there's not much we can do on this end. A note about this issue in the README or somewhere in the docs may be the best solution for now.
Okay, I'll add a note. Thanks for digging into this further!
I'll also say that threading might be broken, but you can still use process-based parallelism:
import Distributed
Distributed.addprocs(4)
Distributed.@everywhere begin
    import Pkg
    Pkg.activate("/tmp/path")  # Or wherever your environment is
end
Distributed.@everywhere begin
    import PATHSolver
    import SparseArrays
    function solve_example(i)
        println("Solving i=$i on node $(Distributed.myid())")
        M = convert(
            SparseArrays.SparseMatrixCSC{Cdouble,Cint},
            SparseArrays.sparse([
                0  0 -1 -1
                0  0  1 -2
                1 -1  2 -2
                1  2 -2  4
            ]),
        )
        return PATHSolver.solve_mcp(
            M,
            Cdouble[2, 2, -2, -6],
            fill(0.0, 4),
            fill(10.0, 4),
            Cdouble[0.0, 0.0, 0.0, 0.0],
            output = "no",
        )
    end
end
results = Distributed.pmap(solve_example, 1:10)
Right, multi-processing works because each process loads the library separately. Unfortunately, in my current use case the problems are very small, so multi-processing adds too much overhead to provide an overall speedup. Still good to keep in mind for other settings, though.
Should I close this issue since it's not really actionable, or do you want to keep it around as an additional warning/reminder?
I'll close it when I merge #63
When I call the PATH solver from multiple threads at the same time, I reliably run into a segfault. Is there a workaround to avoid this?
Here is an MWE (make sure to start the Julia REPL with multiple threads):
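A minimal sketch of the kind of threaded call that triggers the segfault, reusing the 4x4 example problem from the process-based example above (the exact original MWE may differ):
import PATHSolver
import SparseArrays

function solve_example()
    M = convert(
        SparseArrays.SparseMatrixCSC{Cdouble,Cint},
        SparseArrays.sparse([
            0  0 -1 -1
            0  0  1 -2
            1 -1  2 -2
            1  2 -2  4
        ]),
    )
    return PATHSolver.solve_mcp(
        M,
        Cdouble[2, 2, -2, -6],
        fill(0.0, 4),
        fill(10.0, 4),
        Cdouble[0.0, 0.0, 0.0, 0.0],
        output = "no",
    )
end

# Calling the solver concurrently from multiple threads reliably segfaults.
Threads.@threads for i in 1:10
    solve_example()
end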