espressomd / espresso

The ESPResSo package
https://espressomd.org
GNU General Public License v3.0
225 stars 183 forks source link

Electrodes tutorial is unstable #4798

Closed jngrad closed 9 months ago

jngrad commented 11 months ago
schlaicha commented 11 months ago
  • steepest descent doesn't actually debump the particles, it only updates the ICC charges to minimize the Coulomb energy

This actually is not the expected behavior: espressomd.integrate.Integrator.set_steepest_descent() says: "This feature is used to propagate each particle by a small distance parallel to the force acting on it." Thus, the bug is not in the tutorial but in the behavior of steepest descent!

  • adding the electrostatics solver and ICC extension after the steepest descent causes strong deviations in the final plot

This strongly seems to be related to the previous point.

  • particles experience huge forces in the final integration loop, some of them can actually cross the ELC walls, example: pipeline 353771

The timestep in the tutorial is chosen at the upper limit of stability to obtain statistical meaningful results within a reasonable simulation time. Fixing the seed and minimization should not affect the results. Large forces are a result of not minimizing during steepest descent. Indeed, running about 5 times locally I obtain physically reasonable results in all cases.

IMHO not an issue in the tutorial but in steepest descent minimization. I absolutely don't have capabilities to look into this in the upcoming weeks.

schlaicha commented 11 months ago

Btw, I believe I fixed all random seeds, so why is the test failing occassionally?? The more I think about it it seems to me that to adress this issue the energy minimization needs to be fixed with electrostatics solver and ICC added!!

jngrad commented 11 months ago
  • steepest descent doesn't actually debump the particles, it only updates the ICC charges to minimize the Coulomb energy

This actually is not the expected behavior: espressomd.integrate.Integrator.set_steepest_descent() says: "This feature is used to propagate each particle by a small distance parallel to the force acting on it." Thus, the bug is not in the tutorial but in the behavior of steepest descent!

It's a bug in the tutorial, not steepest descent. The Coulomb energy is extremely high, because particles are overlapping, thus the ICC struggles to iteratively adapt the charges. This is why it's important to remove particle overlap before adding an electrostatic algorithm.

  • adding the electrostatics solver and ICC extension after the steepest descent causes strong deviations in the final plot

This strongly seems to be related to the previous point.

I don't think so. I also tried increasing the number of warmup cycles in the final sampling cell, and got a different curve that deviated significantly from the linear PB at low surface charges.

  • particles experience huge forces in the final integration loop, some of them can actually cross the ELC walls, example: pipeline 353771

The timestep in the tutorial is chosen at the upper limit of stability to obtain statistical meaningful results within a reasonable simulation time. Fixing the seed and minimization should not affect the results. Large forces are a result of not minimizing during steepest descent. Indeed, running about 5 times locally I obtain physically reasonable results in all cases.

The issue with large time steps is that particles can go through a wall in a single step, if their velocity is large enough. From what I saw, only a small number of particles get large velocities, however it's hard to reproduce.

IMHO not an issue in the tutorial but in steepest descent minimization. I absolutely don't have capabilities to look into this in the upcoming weeks.

I already tried to address these issues on Monday, but gave up when I saw the curve wasn't reproducible if I touched any of the steepest descent or warmup steps, or increased the Langevin friction coefficient. I merged the work because was too late to fix it anyway, with the summer school less than a week away. If you or @Frieda95 cannot look into this in the next few weeks, we'll have to consider removing the tutorial, as we cannot deliver ESPResSo with a broken feature.

Btw, I believe I fixed all random seeds, so why is the test failing occassionally?? The more I think about it it seems to me that to adress this issue the energy minimization needs to be fixed with electrostatics solver and ICC added!!

The P3M tuning algorithm is not deterministic, so small random deviations can be expected each time the tutorial runs, which are guaranteed to be within the requested accuracy. Since the tutorial samples the system for long times, these variations can accumulate. Why some particles get such larges forces is unclear to me.

schlaicha commented 11 months ago

Sorry I cannot reproduce this behavior.

I just ran 10 times locally the tutorial, also printing np.abs(system.part.all().f).max(), which always gives the exact numerical value of 467.1306731502407, which is a plausible (although still large, I agree) value and also does not change upon increasing the number of steps. Playing with the steepest descent parameters also does not really change this value except if the relaxation is too fast, which leads to particles bumping into each other.

Also, I checked that the particle positions are changed as intended.

All curves are perfectly reproduced when running locally.

Changing the friction coefficient is not trivial: The equiilibration is limited by diffusion, i.e. when increasing $\gamma$ you will not sample the equilibrium anymore. Contrary, when decreasing $\gamma$ you will need to reduce the time-step.

jngrad commented 11 months ago

Setting the steepest descent after setting the long-range actor means we have multiple local minima, and the algorithm doesn't converge (the number of integrated time steps is equal to the number requested):

>>> system.electrostatics.solver = setup_electrostatic_solver(POTENTIAL_DIFF)
>>> system.integrator.set_steepest_descent(f_max=10, gamma=50.0,
...                                        max_displacement=0.02)
>>> print(system.integrator.run(250))
250
>>> print(np.max(np.linalg.norm(system.part.all().f, axis=1)))
520.303

By debumping particles before switching electrostatics on, we converge almost immediately:

>>> system.integrator.set_steepest_descent(f_max=10, gamma=50.0,
...                                        max_displacement=0.02)
>>> print(system.integrator.run(250))
1
>>> system.electrostatics.solver = setup_electrostatic_solver(POTENTIAL_DIFF)
>>> print(system.integrator.run(250))
0
>>> print(np.max(np.linalg.norm(system.part.all().f, axis=1)))
5.432

But then, during integration with Langevin:

>>> system.integrator.run(STEPS_PER_SAMPLE)
Exception: while calling method integrate(): ERROR: Particle 390 entered ELC gap region by 5.51146

Tracking the system state reveals one particle went ballistic, moving 9 units of length along the z-direction per frame, and thus probably didn't even detect the ELC wall:

>>> for i in range(-3, 0):
...     with np.printoptions(precision=1, suppress=True):
...         print(f"\n{abs(i+1)} time steps before entering ELC gap region")
...         p_f, p_v, p_pos = state[i]
...         print(f"f = {p_f[pid]}")
...         print(f"v = {p_v[pid]}")
...         print(f"v*dt = {p_v[pid] * system.time_step}")
...         print(f"pos = {p_pos[pid]}")
2 time steps before entering ELC gap region
f = [ -1.7   4.4 -86.1]
v = [ -0.   -0.9 895.6]
v*dt = [-0. -0.  9.]
pos = [33.1 31.  18.1]

1 time steps before entering ELC gap region
f = [ -0.2   4.1 -86.9]
v = [ -0.   -0.9 894.7]
v*dt = [-0.  -0.   8.9]
pos = [33.1 31.  27.1]

0 time steps before entering ELC gap region
f = [ -3.3  -4.6 -90.9]
v = [ -0.1  -0.9 893.8]
v*dt = [-0.  -0.   8.9]
pos = [33.1 31.  36. ]

Not easily reproducible, but when it does, it's at time step 7645. See MWE in jngrad/electrodes-mwe. You may have a hard time reproducing it on your laptop if it has an ARM chip in it, where floating-point operations are not truncated the same way as on x86_64 chips.

jngrad commented 10 months ago

Found out the culprit: the WCA potential is too steep for the given time step.

When tracking the particle that enters the ELC gap, one finds that its velocity is almost completely unaffected by external forces, including the Langevin thermostat due to the small friction coefficient. The plot below shows the last 600 frames of the simulation. The particle gets too close to the floor wall (z=0.247) due to the large time step and experiences a force that dwarfs all other forces in the system. The particle follows a straight path to the ceiling and appears inside its WCA region (z=30.262), where it experiences a slightly larger force pointing in the opposite direction and flies straight back to the floor, where it gets even closer (z=0.183) and gets a force twice larger in magnitude than before. In the last bounce, the particle picks up a force one order of magnitude larger, which allows it to fly past the ceiling wall.

Plot of the position, velocity and force as a function of the simulation time for the ballistic particle. The particle bounces back and forth between the two walls, gaining more momentum each time. The position curve look like a triangular wave whose slope gets steeper at every bounce. The velocity curve looks like a rectangular wave that increases in magnitude at every bounce. The force curve looks like a Dirac at every bounce.

The branch mentioned in the previous post now contains the code to create that plot, as well as code to reset the system state to any point in time to reproduce any of these bounces, assuming you can get the initial simulation to fail. This bug is random and due to numerical instability for the given time step. Even when pinning the tuned parameters, it still fails at random. Hence it is not possible to hardcode tuned parameters in the test script to make it work reliably in CI, which is a bad idea anyway, since tutorial users will experience the random failure, and when using this tutorial as the starting point for a production simulation, one has to tune the P3M algorithm.

In addition, when changing the random seed from 42 to 41 in both the numpy RNG and Langevin RNG, strong deviations from the linear PB model can be observed at small $\sigma$ in the final plot:

mwe-rng-seed

This plot can be reproduced using the same branch, where there is an extra file electrodes_part2-seed.ipynb.

schlaicha commented 9 months ago

it seems that the steepest descent did not relax the system at all. I copied the parameters from the LJ tutorial which seems to have fixed all issues, IMHO the time-step should be okay and also the results do not depend on the seeds anymore...