chkwon / PATHSolver.jl

provides a Julia wrapper for the PATH Solver for solving mixed complementarity problems
http://pages.cs.wisc.edu/%7Eferris/path.html
MIT License

Thread Safety #62

Closed lassepe closed 2 years ago

lassepe commented 2 years ago

When I call the PATH solver from multiple threads at the same time, I reliably run into a segfault. Is there a workaround to avoid this?

Here is an MWE (make sure to start Julia with multiple threads, e.g. julia --threads 4):

using PATHSolver: PATHSolver
using SparseArrays: SparseArrays

PATHSolver.c_api_License_SetString("2830898829&Courtesy&&&USR&45321&5_1_2021&1000&PATH&GEN&31_12_2025&0_0_0&6000&0_0")

function solve_example()
    M = convert(
        SparseArrays.SparseMatrixCSC{Cdouble,Cint},
        SparseArrays.sparse([
            0 0 -1 -1
            0 0 1 -2
            1 -1 2 -2
            1 2 -2 4
        ]),
    )
    status, z, info = PATHSolver.solve_mcp(
        M,
        Float64[2, 2, -2, -6],
        fill(0.0, 4),
        fill(10.0, 4),
        [0.0, 0.0, 0.0, 0.0];
        output = "yes"
    )
end

Threads.@threads for _ in 1:10
    solve_example()
end
odow commented 2 years ago

We don't do anything special, so I assume the underlying solver is not thread safe.

lassepe commented 2 years ago

I did a quick run with GDB and I get the segfault here. But maybe that's not the actual location of the segfault, and Julia just happened to run GC at that point. Here's the GDB backtrace:

Thread 3 "julia" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe2f3b700 (LWP 199218)]
0x00007ffff752c3e3 in jl_mutex_wait (lock=0x7ffff7888920 <codegen_lock>, safepoint=1)
    at /home/lassepe/worktree/julia/src/julia_locks.h:34
34              jl_gc_safepoint_(self->ptls);
(gdb) bt
#0  0x00007ffff752c3e3 in jl_mutex_wait (lock=0x7ffff7888920 <codegen_lock>, safepoint=1)
    at /home/lassepe/worktree/julia/src/julia_locks.h:34
#1  0x00007ffff752c536 in jl_mutex_lock (lock=0x7ffff7888920 <codegen_lock>)
    at /home/lassepe/worktree/julia/src/julia_locks.h:85
#2  0x00007ffff752db6d in jl_generate_fptr (mi=0x7ffff0a17350, world=31344)
    at /home/lassepe/worktree/julia/src/jitlayers.cpp:316
#3  0x00007ffff7481916 in jl_compile_method_internal (mi=0x7ffff0a17350, world=31344)
    at /home/lassepe/worktree/julia/src/gf.c:1980
#4  0x00007ffff748275e in _jl_invoke (F=0x7ffff086f1b0, args=0x7fffef1f3408, nargs=0, mfunc=0x7ffff0a17350, world=31344)
    at /home/lassepe/worktree/julia/src/gf.c:2239
#5  0x00007ffff7483185 in jl_apply_generic (F=0x7ffff086f1b0, args=0x7fffef1f3408, nargs=0)
    at /home/lassepe/worktree/julia/src/gf.c:2429
#6  0x00007ffff74a8cdc in jl_apply (args=0x7fffef1f3400, nargs=1) at /home/lassepe/worktree/julia/src/julia.h:1788
#7  0x00007ffff74aac20 in start_task () at /home/lassepe/worktree/julia/src/task.c:877
odow commented 2 years ago

That just means the issue is probably that we clean up one instance of Path while another is still running, and that the two instances share some global state in C. If I recall correctly, there's a workspace feature of the C library that we aren't using at the moment. But I'm not sure if that would help.

lassepe commented 2 years ago

Thank you for your swift response. I dug a little deeper and it indeed seems like this is a global state issue inside PATH. At least the function signature of the solver suggests that there is only one global solver object.

FUN_DECL(Void) Path_Create(Int maxSize, Int maxNNZ)

PATHSolver.jl never calls that directly, but I guess the workspace gets created as a result of the call to Path_Solve. So when one MCP is cleaned up, it probably destroys the workspace out from under another solver that is running on that same workspace in parallel.

odow commented 2 years ago

So when one MCP is cleaned up, it probably destroys the workspace out from under another solver that is running on that same workspace in parallel.

Yes. We can probably fix this by manually creating and managing our own workspace, but I don't know what this entails. You'd have to look into the C code. I'll review a PR if you have time to work on it :smile:

You can find the C header files here: http://www.cs.wisc.edu/~ferris/path/path_5.0.05_Linux64.zip

lassepe commented 2 years ago

A colleague of mine contacted the PATH team and Steven Dirkse replied:

Unfortunately, PATH is indeed known to NOT be thread-safe. There are multiple reasons for this that go deep into the way PATH is organized. We spent some time considering this issue some time ago and concluded that the fixes would require a substantial amount of effort, something we could not commit to then or since.

I cannot suggest any workaround I would feel comfortable and safe with, apart from avoiding multi-threaded use of PATH.

Therefore, I guess there's not much we can do on this end. A note about this issue in the README or somewhere in the docs may be the best solution for now.
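
For anyone who still wants the rest of their program to stay multithreaded, the only mitigation I can see is to make sure that at most one task ever calls into PATH at a time. Here is a minimal, unvetted sketch; it just follows the "avoid multi-threaded use" advice above, so there is no parallel speedup for the solves themselves, and the PATH_LOCK name and wrapper function are purely illustrative:

using PATHSolver: PATHSolver
using SparseArrays: SparseArrays

# Illustrative global lock: at most one task may be inside libpath at any time.
const PATH_LOCK = ReentrantLock()

function solve_example_serialized()
    M = convert(
        SparseArrays.SparseMatrixCSC{Cdouble,Cint},
        SparseArrays.sparse([
            0 0 -1 -1
            0 0 1 -2
            1 -1 2 -2
            1 2 -2 4
        ]),
    )
    # Serialize the actual call into PATH; other threads block here instead
    # of entering the solver concurrently.
    return lock(PATH_LOCK) do
        PATHSolver.solve_mcp(
            M,
            Cdouble[2, 2, -2, -6],
            fill(0.0, 4),
            fill(10.0, 4),
            fill(0.0, 4);
            output = "no",
        )
    end
end

Threads.@threads for _ in 1:10
    solve_example_serialized()
end

This only serializes the calls; it does not make PATH itself thread-safe, so it helps only if the crashes come from genuinely concurrent entry into the library.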

odow commented 2 years ago

Okay, I'll add a note. Thanks for digging into this further!

odow commented 2 years ago

I'll also say that threading might be broken, but you can still use process-based parallelism:

import Distributed
Distributed.addprocs(4)
Distributed.@everywhere begin
    import Pkg
    Pkg.activate("/tmp/path")  # Or wherever your environment is
end

Distributed.@everywhere begin
    import PATHSolver
    import SparseArrays
    function solve_example(i)
        println("Solving i=$i on node $(Distributed.myid())")
        M = convert(
            SparseArrays.SparseMatrixCSC{Cdouble,Cint},
            SparseArrays.sparse([
                0 0 -1 -1
                0 0 1 -2
                1 -1 2 -2
                1 2 -2 4
            ]),
        )
        return PATHSolver.solve_mcp(
            M,
            Cdouble[2, 2, -2, -6],
            fill(0.0, 4),
            fill(10.0, 4),
            Cdouble[0.0, 0.0, 0.0, 0.0],
            output = "no",
        )
    end
end

results = Distributed.pmap(solve_example, 1:10)
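
If the individual problems are small, one knob worth trying (untested here) is pmap's batch_size keyword, which groups work items per remote call and can amortize some of the per-task communication overhead:

# Hypothetical variation of the call above: ship problems to the workers in
# chunks of 50 instead of one at a time.
results = Distributed.pmap(solve_example, 1:1000; batch_size = 50)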
lassepe commented 2 years ago

Right, multi-processing works because each process loads the lib separately. Unfortunately, in my current use case the problems are very small, so multi-processing adds too much overhead to provide an overall speedup. Still good to keep in mind for other settings, though.

Should I close this issue since it's not really actionable, or do you want to keep it around as an additional warning/reminder?

odow commented 2 years ago

I'll close it when I merge #63