[BUG] Minimiser hangs - Githubissues

OpenBioSim / sire

Sire Molecular Simulations Framework

https://sire.openbiosim.org

GNU General Public License v3.0


[BUG] Minimiser hangs #230

Closed by lohedges 1 month ago

lohedges commented 2 months ago

In quite a few instances I am seeing hangs with the modified OpenMM minimiser. I've tried to mitigate this using the max_iterations kwarg, but the problem persists. It seems that the number of iterations might be set to zero for each ratchet. The available options are:

    Parameters:

    - max_iterations (int): The maximum number of iterations to run
    - tolerance (float): The tolerance to use for the minimisation
    - max_restarts (int): The maximum number of restarts before giving up
    - max_ratchets (int): The maximum number of ratchets before giving up
    - ratchet_frequency (int): The maximum number of steps between ratchets
    - starting_k (float): The starting value of k for the minimisation
    - ratchet_scale (float): The amount to scale k at each ratchet
    - max_constraint_error (float): The maximum error in the constraint in nm

If it's not possible to set these independently, e.g. to impose an absolute maximum on the total number of iterations, then maybe we also want to implement a timeout of some sort? I'll try to find a system that repeatedly hangs to post for debugging purposes. That's one of the other frustrating things about this: it doesn't happen all the time, e.g. you might see it for one window of one replica of an RBFE run, which causes the whole thing to fail.
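To make the suggestion concrete, here is a minimal sketch of a ratchet loop that enforces both an absolute iteration budget and a wall-clock limit. It is not the Sire or OpenMM implementation: `run_with_budget`, the `step` callable, `timeout_s` and the returned status strings are all hypothetical, and only the names `max_iterations`, `max_ratchets` and `ratchet_frequency` are borrowed from the option list above.

```python
import time


def run_with_budget(step, max_iterations, max_ratchets, ratchet_frequency,
                    timeout_s, clock=time.monotonic):
    """Drive a minimiser in chunks under a global iteration and time budget.

    `step(n)` is a hypothetical callable that runs at most `n` iterations and
    returns `(iterations_used, converged)`. Returns `(status, total_used)`.
    """
    if ratchet_frequency < 1:
        raise ValueError("ratchet_frequency must be at least 1")

    deadline = clock() + timeout_s
    remaining = max_iterations
    total_used = 0

    for _ in range(max_ratchets):
        # Stop once the absolute budget is spent, so a chunk size of zero
        # (which some minimisers treat as "no limit") is never passed on.
        if remaining <= 0:
            return "max_iterations", total_used
        # Cooperative wall-time check between chunks.
        if clock() >= deadline:
            return "timeout", total_used

        chunk = min(ratchet_frequency, remaining)
        used, converged = step(chunk)
        total_used += used
        remaining -= used

        if converged:
            return "converged", total_used

    return "max_ratchets", total_used


# A step that always uses its full chunk and never converges.
never_converges = lambda n: (n, False)
print(run_with_budget(never_converges, max_iterations=1000, max_ratchets=20,
                      ratchet_frequency=500, timeout_s=60.0))
# -> ('max_iterations', 1000)
```

The deadline is only checked between chunks, so this cannot interrupt a single call that hangs inside native code. Guarding against that would need the minimisation to run in a separate process that can be terminated.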

lohedges commented 2 months ago

I believe this is related to the change in LJ sigma values for perturbed atoms, i.e. they are no longer zero for ghost atoms. This is probably leading to local spikes in potential energy for specific lambda values, which is causing issues for the minimiser. I see this on the main branch of somd2 and have many issues re-running things that worked fine before, e.g. the hydration free-energy test set.
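As a rough illustration of why a non-zero sigma can matter, the sketch below evaluates the plain 12-6 Lennard-Jones energy at a short separation. This is the textbook form, not the softened potential that Sire applies to ghost atoms, and the function name and numerical values are purely illustrative.

```python
def lj_energy(r, sigma, epsilon):
    """Plain 12-6 Lennard-Jones energy (r and sigma in nm, epsilon in kJ/mol)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


r = 0.15  # a close contact, in nm (illustrative value)

# With sigma = 0 the pair contributes nothing, however close the atoms are.
print(lj_energy(r, sigma=0.0, epsilon=0.5))  # -> 0.0

# With a non-zero sigma the same contact sits high on the repulsive wall.
print(lj_energy(r, sigma=0.3, epsilon=0.5))  # -> 8064.0
```

A steep contribution like this at only some lambda values would be consistent with the minimiser struggling for individual windows rather than for the whole run.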

lohedges commented 2 months ago

Yes, this is definitely related to the LJ sigma problem. For somd2, I realised that we don't use dynamics.minimise(), instead calling .minimisation() on the system prior to creating the dynamics object. It looks like the code is old, and predates the setting of parameters such as shift_delta and coulomb_power, hence they take the default values for the minimisation. This means that the minimisation in somd2 seems to work a bit more reliably than calling dynamics.minimise() directly, which was using the optimised somd1 settings, not the defaults. This is clearly a bug in somd2, since it should use consistent settings, but highlights that it is indeed the LJ sigmas that are causing the minimiser to struggle.

lohedges commented 2 months ago

This seems to mostly be resolved via #237. When I get a chance I'll take a closer look at the code to see if it's easy to add logic for a walltime, or similar.