[BUG] Sire OpenMM context minimiser isn't thread safe

lohedges commented 2 weeks ago

I'm using threads to run multiple minimisations in parallel and see segmentation faults. The LBFGS code claims that it is thread safe, so maybe this is an issue with OpenMM, or the way that Sire is interfacing with things.

For the current replica exchange implementation I create contexts when a runner is instantiated and keep them alive for the duration of the simulation, with threads uses to execute them in parallel. When using processes, it's only possible to create the context within the process due to a CUDA initialisation issue, plus the fact that contexts can't be pickled. Running dynamics works absolutely fine with threads, i.e. I can use threads to manage different contexts on multiple GPUs, so this isn't any issue with running multiple contexts in parallel via threads. It's exclusively a problem with minimisation. Currently I'm minimising in serial, then running the dynamics in parallel.

lohedges commented 2 weeks ago

This is an issue with Sire since the segmentation fault doesn't happen if I minimise the context within the dynamics object using openmm.LocalEnergyMinimizer directly. I'll do that for now.

lohedges commented 2 weeks ago

I noticed that the vendored LBFGS source files are quite different to the current OpenMM ones here. However, syncing them makes no difference to the segmentation fault.

lohedges commented 2 weeks ago

This is a result of Sire releasing the GIL during the minimisation function. The following diff fixes things:

diff --git a/wrapper/Convert/SireOpenMM/openmmminimise.cpp b/wrapper/Convert/SireOpenMM/openmmminimise.cpp
index dd5a5f53..b8334ff5 100644
--- a/wrapper/Convert/SireOpenMM/openmmminimise.cpp
+++ b/wrapper/Convert/SireOpenMM/openmmminimise.cpp
@@ -629,8 +629,6 @@ namespace SireOpenMM
             timeout = std::numeric_limits<double>::max();
         }

-        auto gil = SireBase::release_gil();
-
         const OpenMM::System &system = context.getSystem();

         int num_particles = system.getNumParticles();

I've added this to my other fix branch and will use this for now.

lohedges commented 2 weeks ago

While this works locally, I'm seeing hangs when running on neogodzilla. I'll continue to use the OpenMM minimiser directly for now.

lohedges commented 2 weeks ago

Yes, it's just running in serial if the GIL isn't released. Will need to figure out why releasing it is causing a segmentation fault.

lohedges commented 2 weeks ago

This does the job:

diff --git a/wrapper/Convert/SireOpenMM/openmmminimise.cpp b/wrapper/Convert/SireOpenMM/openmmminimise.cpp
index dd5a5f53..2f08f450 100644
--- a/wrapper/Convert/SireOpenMM/openmmminimise.cpp
+++ b/wrapper/Convert/SireOpenMM/openmmminimise.cpp
@@ -50,6 +50,7 @@
 #include <limits.h> // CHAR_BIT
 #include <sstream>
 #include <stdint.h> // uint64_t
+#include <Python.h>

 inline auto is_ieee754_nan(double const x)
     -> bool
@@ -619,6 +620,8 @@ namespace SireOpenMM
                                     double starting_k, double ratchet_scale,
                                     double max_constraint_error, double timeout)
     {
+        PyThreadState *_save = PyEval_SaveThread();
+
         if (max_iterations < 0)
         {
             max_iterations = std::numeric_limits<int>::max();
@@ -629,8 +632,6 @@ namespace SireOpenMM
             timeout = std::numeric_limits<double>::max();
         }

-        auto gil = SireBase::release_gil();
-
         const OpenMM::System &system = context.getSystem();

         int num_particles = system.getNumParticles();
@@ -1105,6 +1106,8 @@ namespace SireOpenMM
                                            CODELOC);
         }

+        PyEval_RestoreThread(_save);
+
         return data.getLog().join("\n");
     }

chryswoods commented 2 weeks ago

Interesting that you needed to do that explicitly. I've seen something like this before when default arguments were passed to the function from Python. The SireBase GIL functionality would release the GIL on function exit (e.g. both return and when raise exception) but this happened after destruction of the default arguments. As these were connected to Python objects, their destruction caused some change in the Python interpreter state while the GIL was held, hence a segfault. This is why the auto-wrapped functions that had default arguments weren't wrapped with the SireBase GIL code.

My guess is that there is something here that links back to Python that is being deallocated on function return before the GIL is being re-acquired. The Python GIL is a dark art ;-)

lohedges commented 2 weeks ago

Yes, very puzzling. I'm just pleased that the solution was easy. I'll also make sure the thread state is being reset if an exception is thrown. I think the code already handles that via the data log anyway.

lohedges commented 1 week ago

Yes, it appears that everything is handled via try/catch within the main minimise_openmm_context function, so this should be okay.

OpenBioSim / sire

[BUG] Sire OpenMM context minimiser isn't thread safe #259