Closed by espresso-ci 4 years ago.
The Wang-Landau python test (testsuite/python/wang_landau_reaction_ensemble.py) can take anywhere from 100s to 260s in CI jobs with coverage enabled. Sometimes it takes more than 300s and times out.
Unscientific impression: the particle serialization and fetching to the head node became really slow recently in debug builds. Most of the changes in the faster-tests PR address those situations as well (by avoiding them), e.g. in observable evaluation.
For the Wang-Landau RE, we could temporarily switch the test to one MPI rank. That does, however, result in less testing.
The reaction ensemble is front-end code, so there is no particular benefit for testing to run it in parallel. At some point the future of the RE code has to be decided: it should either be made a proper integrator, or the front-end interface has to be extended to give it proper performance. Also, I guess the bus factor for this is 0.
The test already runs on a single MPI rank. Looking at the C++ code, there are a few spots that looked surprising to me at first glance but turned out to be actually OK, although without proper profiling it's difficult to conclude anything. Right now, the hot loop in the test runs 100000 times before convergence is reached: https://github.com/espressomd/espresso/blob/25ca96d2bc9b6d400b7437c10373288f5ae7b04e/testsuite/python/wang_landau_reaction_ensemble.py#L101-L107
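A rough paraphrase of that loop (not a copy of the linked lines; the particle type is a placeholder and the system/Wang-Landau setup is elided): each iteration proposes one reaction move and one displacement move, and the algorithm signals convergence by raising an exception.

```python
import espressomd.reaction_ensemble

# ... system, reactions and Wang-Landau setup elided ...

while True:
    try:
        WL.reaction()                                      # one MC reaction attempt
        WL.displacement_mc_move_for_particles_of_type(3)   # one MC displacement attempt
    except espressomd.reaction_ensemble.WangLandauHasConverged:
        break  # convergence is signaled via this exception (if memory serves)
```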
The hot loop in the core is hit 100000 times (or 200000 times? The coverage report seems inaccurate; the loop body should run twice). The convergence criterion is tested here: https://github.com/espressomd/espresso/blob/25ca96d2bc9b6d400b7437c10373288f5ae7b04e/src/core/reaction_ensemble.cpp#L1246-L1250 L1246 is hit 100000 times, L1248 is hit 9 times.
Then there is this check that's hit 100000 times, but the body is hit once. The std::max seems to be correct; replacing it with std::min increased the runtime 5-fold.
https://github.com/espressomd/espresso/blob/25ca96d2bc9b6d400b7437c10373288f5ae7b04e/src/core/reaction_ensemble.cpp#L1256-L1257
I think the actual performance problems lie deeper. The hot loop of the MC does trigger an update of the head node particle cache in each iteration, which is probably what made it slow. I'll see if I can find a quick fix for this. Are you sure that you have time to jump into this rabbit hole?
@fweik I found a solution by accident: there is a performance bottleneck in a for loop that runs 500'000 times:
diff --git a/src/core/reaction_ensemble.cpp b/src/core/reaction_ensemble.cpp
index c3e8fe9b2..aec4e13b7 100644
--- a/src/core/reaction_ensemble.cpp
+++ b/src/core/reaction_ensemble.cpp
@@ -310,8 +310,8 @@ void ReactionAlgorithm::make_reaction_attempt(
// create or hide particles of types with noncorresponding replacement types
for (int i = std::min(current_reaction.product_types.size(),
current_reaction.reactant_types.size());
- i < std::max(current_reaction.product_types.size(),
- current_reaction.reactant_types.size());
+ i < current_reaction.product_types.size() and i <
+ current_reaction.reactant_types.size();
i++) {
if (current_reaction.product_types.size() <
current_reaction.reactant_types.size()) {
Now the test runs in 10 s with coverage enabled. Not sure why. With coverage enabled and no optimization, the original code assembly is 40% longer and contains a call to the std::max function, but I don't understand why the decrease would be so dramatic (25-fold). According to coverage, we are no longer hitting delete_particle() and hide_particle()...
nevermind, this patch is just plain wrong :)
Yes, I was just about to ask if this is really equivalent :D
the test passed, though :smile:
Well, high level integration tests are just how we roll... ^^
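For reference, a tiny standalone sketch (hypothetical sizes, not the actual reaction data) of why the patched bound never executes the loop body, consistent with delete_particle() and hide_particle() dropping out of the coverage:

```python
# Hypothetical sizes standing in for current_reaction.product_types.size()
# and current_reaction.reactant_types.size().
n_products, n_reactants = 1, 3

lo = min(n_products, n_reactants)   # 1
hi = max(n_products, n_reactants)   # 3

# original bound: min(...) to max(...), i.e. the |1 - 3| = 2 types without a
# corresponding replacement type
original_iterations = list(range(lo, hi))          # [1, 2]

# patched bound: i starts at min(...), so "i < n_products and i < n_reactants"
# is already false on the first check and the body (which creates, hides or
# deletes particles) never runs
patched_iterations = []
i = lo
while i < n_products and i < n_reactants:
    patched_iterations.append(i)
    i += 1

print(len(original_iterations), len(patched_iterations))  # prints: 2 0
```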
I looked at the code; I think the problem is the invalidation of the particle-node mapping during the move attempts. I could not find an easy way to fix this, because I could not understand the logic of the code. Unfortunately I don't have time to look into this any further.
@fweik I've also looked at the code in more detail with valgrind and found another lead. According to valgrind, 46% of the self time is spent in obs_energy and 19% in the obs_energy MPI reduction free function. Now that the observable is stateless, each call to calculate_current_potential_energy_of_system() recalculates everything. In fact this function is hit almost 1 million times and has an inclusive time of 96%. I think that many MC moves do not change the state of the system. According to valgrind on 4.1.3rc, in 60% of the 1 million hits the conditional jump skipped the re-calculation of the energy kernels and the MPI communication, and calculate_current_potential_energy_of_system() had an inclusive time of 82%.

Overall the wang_landau test is 15 seconds slower with the stateless observables, which brings the runtime slightly closer to the 300s limit. This might explain why the test has been timing out more frequently over the past few weeks. It doesn't make much of a difference for MC users, who likely won't notice a change in a release build. What we could do for now is choose a random seed that makes the simulation converge faster without changing line coverage, although this is not a long-term solution. Whoever maintains MC in the future will have to improve the code performance as their first task, or write unit tests so that we can disable wang_landau in the coverage CI jobs, or choose a simpler system that converges faster at the expense of less coverage.
Looking at the code, I think the long-term survival prognosis for this is rather dire, so we should invest the least amount of effort possible...
Despite the 30% speed-up from #3775, the test still times out on coyote2 and coyote4 when the runner runs 2 jobs simultaneously. When a single job runs, the test passes within ~276s. More recent runners pass this test in ~220s. @jonaslandsgesell do you have any suggestion on how we could speed up that test further?
I would expect that the slow part is the energy calculation and/or the update of the particle cache if ESPResSo is compiled without optimization. So either the particle cache update is made efficient or the energy update is made more efficient.
Maybe an option could also be to run certain tests only against a build with optimizations turned on.
@jonaslandsgesell it's probably the caches, because that is the part that changed. Probably this can be helped by reordering stuff, if somebody invests the time to look into this.
So either the particle cache update is made efficient or the energy update is made more efficient.
Based on valgrind on 4.1.3rc vs 4.2.0-dev, I'd say that caching the potential energy in the RE classes and re-introducing the Observable_stat callbacks as invalidate_mc_cache() to invalidate the RE cache (on particle addition/removal, charge change) would buy us between 15 s and 30 s. I'm afraid this is not enough, and it would re-introduce too much coupling between components of ESPResSo. Right now the RE module is on the right path to be transferred to its own pitchfork subfolder and split into multiple files for #3284, with e.g. a utility functions header file (see 9620127f7, which allows 25b4b30ac7) and one RE class per header file.
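A rough sketch of the kind of cache/invalidate pattern meant here (class and method names are hypothetical, not actual ESPResSo API):

```python
# Rough sketch of the cache/invalidate idea; names are hypothetical.
class CachedPotentialEnergy:
    def __init__(self, compute):
        self._compute = compute   # the expensive path: energy kernels + MPI reduction
        self._value = None

    def invalidate_mc_cache(self):
        # would be triggered on particle addition/removal and charge changes
        self._value = None

    def get(self):
        if self._value is None:
            self._value = self._compute()
        return self._value


n_full_recalculations = 0

def expensive_energy():
    global n_full_recalculations
    n_full_recalculations += 1
    return 42.0   # placeholder value

energy = CachedPotentialEnergy(expensive_energy)
for move in range(1000):
    if move % 10 == 0:               # pretend only 10% of MC moves modify the system
        energy.invalidate_mc_cache()
    energy.get()                     # value the Metropolis criterion would use
print(n_full_recalculations)         # 100 full recalculations instead of 1000
```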
I can't really tell regarding particle cache efficiency. I was just wondering why the core hot loop in the Wang-Landau class had to run 1 million times for such a small system with 4 particles to reach convergence. This test feels more like a sample than an integration test, but if we convert it to a sample, code coverage is gone for RE.
The script calculates the potential energy density of states of a reacting 3D harmonic oscillator (see https://en.wikipedia.org/wiki/Wang_and_Landau_algorithm#Test_system) and then calculates the heat capacity of the system. In principle one could reduce the dimensions from 3 to 1 and look at a harmonic oscillator in 1D. This would speed up the process a lot, because the energy states are then not degenerate with E^0.5 but with E^-0.5. This would for sure reduce computational effort, but would require introducing MC moves which only change positions in 1D (i.e. not propose arbitrary 3D position moves; for this, only the function displacement_mc_move_for_particles_of_type() would need to be altered so that it moves particles along a line instead of within a sphere). This is not hard, but the analytical calculation for the heat capacity also needs to be redone (also not hard).
Actually I did the calculations for this already in the appendix of my master's thesis, and it is C_V(potential energy)/(k_B beta^2) = 0.5.
If I had to guess, this could reduce the runtime by a factor of ~5.
I will prepare a PR for this.
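For reference, a short derivation consistent with these numbers (a sketch assuming a classical harmonic oscillator with E = k|r|^2/2 in d dimensions; not taken from the thread):

```latex
% heat capacity from the Wang-Landau density of states (fluctuation formula):
C_V = k_B \beta^2 \left( \langle E^2 \rangle - \langle E \rangle^2 \right)

% configurational density of states of a d-dimensional harmonic oscillator,
% E(\vec r) = \tfrac{k}{2} |\vec r|^2, with r(E) = \sqrt{2E/k}:
g(E) \propto \frac{\mathrm{d}}{\mathrm{d}E}\, r(E)^{d} \propto E^{d/2 - 1}
\quad\Rightarrow\quad
g_{3\mathrm{D}}(E) \propto E^{1/2}, \qquad g_{1\mathrm{D}}(E) \propto E^{-1/2}

% potential-energy contribution to the heat capacity (equipartition):
\langle E_\mathrm{pot} \rangle = \tfrac{d}{2}\, k_B T
\quad\Rightarrow\quad
C_V^\mathrm{pot} = \tfrac{d}{2}\, k_B = 0.5\, k_B \ (\mathrm{1D}), \qquad 1.5\, k_B \ (\mathrm{3D})
```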
This would still be a physical test instead of a proper integration test. According to valgrind, the RE methods account for 0.5% of the total runtime. What we really need here is an integration test with a system that converges at the 90'000th iteration of the for loop. Otherwise the test will time out again in the future when we refactor the energy calculation or cell system.
We also cannot recompile parts of the code with a higher level of optimization for a single test without implementing a new CI framework to cache .gcda files.
I just tested the idea of using a 1D harmonic oscillator to speed up the test. Unfortunately, the test took the same amount of time for a 1D harmonic oscillator as for a 3D harmonic oscillator.
Seems like you need to drop the Wang-Landau reaction ensemble if the tests take too long. (I would keep the reaction ensemble and the cpH ensemble though, as these are also used in Peter's group.)
Do you know why that did not help?
Speculating, and looking at the timings, I would say the cost is the same because in both cases the potential energy density of states g(E(r)) needs to be computed over the same range of energies, regardless of whether E is a function of r=x (1D), r=(x,y) (2D) or r=(x,y,z) (3D). Intuitively, though, I would have expected the computational cost to increase with the number of dimensions.
https://gitlab.icp.uni-stuttgart.de/espressomd/espresso/pipelines/12325