Performance regressions since 4.0

RudolfWeeber commented 8 months ago

There seem to be a number of performance regression. Between 4.0 and the current python branch, the lj test deteriorates from 0.41ms to 0.6ms on gerenuk. I tried to keep myconfig etc constant.

The performance degradation occurs in (at least 2 steps). I tried to git-bisect (with mixed results) because the timings are somewhat noisy.

Suspects:

[ ] 337a74f207171c03f1cca06266b1a5be1ddd4c89 This is a thermostat clean-up. I'm surprised that htis has a 10% performance impact. Maybe this nees to be confirmed on an other machine. A possible cause is that the new code might call diagonal_matrix() more often.

RudolfWeeber commented 8 months ago

Further suspect

[ ] 0616a11a335385b8ca38a78ec425b029d6d426af This is a change in the pair force kernel. I have ~0.5ms with and ~0.43ms without this commit for the lj benchmakr on 8 cores on Gerenuk.

Probably the effect is two-fold:

the node-local short range force calculation is slowed down somewhat
As a consequence, the simulation tunes t oa shorter skin leading to more frequent communicaiton (resorts and verlet lsit rebuild)

The simulations tune to different significantly larger skins with this commit reverted.

There is probably a 3rd performance regression between 4.2 and the current dev version.

RudolfWeeber commented 8 months ago

third suspect: [ ] 15337d214a396c34b3da433dfcaa3d3ee65f2cc3 This splits the short rnage force kernels per signature. Probably same mechanism as the 2nd one: slower hotloop -> tuning to shorter skins -> more frequent resorts. It is not totally obvious, why this commit should hurt performnce so much. Maybe a few more ParticleForce objects are initialized and the compiler is not smart enough to understand that they are temporary.

This degrades performance from ~0.5ms to ~0.6ms (lj, 8 cores on Gerenuk)

RiccardoFrenner commented 4 months ago

With this change I got 0.09 ms back from https://github.com/espressomd/espresso/commit/0616a11a335385b8ca38a78ec425b029d6d426af

diff --git a/src/core/Particle.hpp b/src/core/Particle.hpp
index 8bf06fe32..cc336a6f9 100644
--- a/src/core/Particle.hpp
+++ b/src/core/Particle.hpp
@@ -270,17 +270,18 @@ struct ParticleForce {
       : f(f), torque(torque) {}
 #endif

-  friend ParticleForce operator+(ParticleForce const &lhs,
-                                 ParticleForce const &rhs) {
-#ifdef ROTATION
-    return {lhs.f + rhs.f, lhs.torque + rhs.torque};
-#else
-    return lhs.f + rhs.f;
-#endif
+  friend ParticleForce operator+(ParticleForce const &lhs, ParticleForce const &rhs) {
+    ParticleForce result = lhs;
+    result += rhs;
+    return result;
   }

   ParticleForce &operator+=(ParticleForce const &rhs) {
-    return *this = *this + rhs;
+    f += rhs.f;
+#ifdef ROTATION
+    torque += rhs.torque;
+#endif
+    return *this;
   }

As I understand it, the original code should be equally efficient when all optimizations occur. However, this does not seem to be the case, even though I have not verified it directly.

I measured a slightly stronger improvement (0.12 ms) with a very recent commit (fe94b6acc).

For 8 cores it's even a bit more.

Edit: With this change I cannot measure a significant performance difference for the other mentioned commits anymore.

RudolfWeeber commented 4 months ago

Thank you, Riccarod. Can you please open this as a pr?

RudolfWeeber commented 2 months ago

I guess, the most important one has been fixed.

RudolfWeeber commented 2 months ago

h

espressomd / espresso

Performance regressions since 4.0 #4866