espressomd / espresso

The ESPResSo package
https://espressomd.org
GNU General Public License v3.0
225 stars 183 forks source link

LB particle coupling thermostat seed doesn't fully decorrelate independent simulations #4848

Closed jngrad closed 7 months ago

jngrad commented 7 months ago

TL;DR: the seed argument of system.thermostat.set_lb() doesn't actually set the random number generator (RNG) seed of the LB particle coupling code. Instead, it sets the lag time in the RNG sequence. Thus two independent simulations with a different seed will exhibit correlated dynamics. The LB fluid seed is not affected by this issue.

Background

We use Philox counters to generate reproducible and cross-platform random number sequences. These are used to calculate the friction noise during thermalization. The RNG sequence is fully determined by 4 input arguments:

Running the same simulation twice with the same RNG seed produces the exact same trajectory, up to machine precision. Running the same simulation twice with a different RNG seed produces a different trajectory. For the latter, the magnitude of the difference between the two seeds is irrelevant; the RNG seed "decorrelates" RNG sequences, just like the RNG salt does. This is achieved by internally concatenating the salt and seed into a single 64bit integer.

The Philox RNG has an internal counter which is incremented at each time step. This counter essentially controls the current position in the RNG sequence. We do not provide a convenient way for the user to initialize this value. This counter value is saved during checkpointing.

Problem statement

For historical reasons, thermostats in ESPResSo were all implemented differently, and didn't share a common base class. Yet they shared a unified Python interface, which could give users the false impression that they were related. This lead to bad user experience in the 4.1 line of ESPResSo where one thermostat was using the Mersenne Twister RNG instead of the Philox RNG.

The LB particle coupling thermostat in particular was implemented with a hardcoded seed=0, regardless of which seed the user gives as argument to system.thermostat.set_lb(). Yet the particle forces are different when giving a different seed, because the value of the seed argument is stored as the internal counter of the Philox RNG. Thus the LB particle coupling RNG sequence is always the same for all simulations, and the seed argument actually controls the initial time lag in that RNG sequence. This issue is specific to the LB particle coupling code, and does not affect the LB fluid code.

As a result, two independent simulations that differ only in the LB particle coupling seed with not truly be independent: the particle forces from simulation A will be correlated to the particle forces from simulation B with a time lag. This correlation is weak, since it's "diluted" by the LB fluid noise and all other interactions in the system.

It would be useful to know how weak this correlation truly is, and whether it can affect some observables more than other, for example Green-Kubo relations. To that end, I adapted the polymers tutorial to investigate how the correlation between independent simulations that differ only by the LB particle coupling seed affects simulation results. I measured the diffusion coefficient of a polymer with 4 beads connected by FENE bonds, repelled by a WCA potential, and coupled to a thermalized fluid in a 8x8x8 grid in units of the WCA sigma. The simulation was repeated 48 times for two scenarios: one where the Philox seed is 0 and the Philox counter initial value is controlled by the user ("current behavior"), and one where the Philox seed is controlled by the user and the Philox counter initial value is always 0 ("proposed behavior").

Figure 1 plots the mean and standard error of the mean of the diffusion coefficient as a function of the simulation time for both scenarios. We can see in the proposed behavior scenario that the error band tapers for longer simulation times, as expected. In the current behavior scenario, the error band is extremely narrow right at the beginning, and tapers more slowly. Figure 2 plots the standard error of the mean as a function of time to help better visualize how the correlation between independent measurements affects the magnitude of the standard error of the mean.

In my opinion, this is a design flaw. Data production is affected in two ways:

  1. error bars in plots are artificially shrunk by a factor 2.5, because the underlying assumption of independent samples in the formula of the standard of the mean is not satisfied
  2. users might decide to simulate for shorter times, because these independent simulations converge to the same value much faster than they should

Point 1 doesn't affect the samples mean, thus the calculated data point (or the shape of the curve for time-dependent observables) converges to the real value. Point 2 can affect the sample mean due to insufficient sampling time.

In the light of the data produced by this experiment, I'm not convinced we can simply simulate for longer to get the LB fluid RNG to introduce enough entropy to fully decorrelate independent simulations. The current behavior scenario was run again with a hardcoded seed of 1 (data not shown) to make sure my observations were not due to a statistical fluke with the seed of 0, and the same trend emerged.


Plot of the diffusion coefficient as a function of the simulation time, for both the current behavior of ESPResSo, and the new proposed behavior. In the current behavior, the standard error of the mean is extremely narrow and much smaller in magnitude than the magnitude of the fluctuations of the mean between two at short sampling times. In the proposed behavior, the standard error of the marge is more reasonable and contains the natural fluctuation of the mean at short simulation times, and tapers progressively towards longer simulation times.

Figure 1: Convergence analysis of the polymer diffusion coefficient.


Plot of the standard error of the mean of the diffusion coefficient as a function of the simulation time. The proposed behavior is initially 3 times larger in magnitude compared to the current behavior, and 2.2 times larger towards large simulation times.

Figure 2: Comparison of the standard error of the mean in both scenarios.

Proposed solution

Make the seed parameter set the Philox seed, instead of the Philox counter. This was implemented by commit 9550aa1f24b215c090717a6b870b293293e9f535 in #4845. This fix could be backported to the 4.2 line of ESPResSo with an extra MPI callback and 3 lines of Cython code.

Supporting information

Simulation script ```python import sys import numpy as np import scipy.optimize import espressomd import espressomd.analyze import espressomd.accumulators import espressomd.observables import espressomd.polymer import espressomd.lb espressomd.assert_features(['WCA', 'WALBERLA']) def build_polymer(system, n_monomers, polymer_params, fene): positions = espressomd.polymer.linear_polymer_positions( beads_per_chain=n_monomers, **polymer_params) p_previous = None for i, pos in enumerate(positions[0]): p = system.part.add(pos=pos) if p_previous is not None: p.add_bond((fene, p_previous)) p_previous = p def correlator_gk(pids_monomers, tau_max): com_vel = espressomd.observables.ComVelocity(ids=pids_monomers) com_vel_cor = espressomd.accumulators.Correlator( obs1=com_vel, tau_lin=16, tau_max=tau_max, delta_N=10, corr_operation="scalar_product", compress1="discard1") return com_vel_cor def gk_integration(N, tau, acf): x = np.arange(2, 28) y = [1 / 3 * np.trapz(acf[:j], tau[:j]) for j in x] diffusion_coef = np.mean(y[10:]) return diffusion_coef, tau[x], y # Setup constants BOX_L = 8.0 TIME_STEP = 0.01 LOOPS = 100 STEPS = 100 KT = 1.0 GAMMA = 5.0 POLYMER_PARAMS = {'n_polymers': 1, 'bond_length': 1, 'seed': 42, 'min_distance': 0.9} # System setup system = espressomd.System(box_l=3 * [BOX_L]) system.cell_system.skin = 0.4 # Lennard-Jones interaction system.non_bonded_inter[0, 0].wca.set_params(epsilon=1.0, sigma=1.0) # Fene interaction fene = espressomd.interactions.FeneBond(k=7, r_0=1, d_r_max=2) system.bonded_inter.add(fene) com_vel_tau_results = [] com_vel_acf_results = [] N = 4 seed = int(sys.argv[-2]) counter = int(sys.argv[-1]) build_polymer(system, N, POLYMER_PARAMS, fene) # energy minimization system.time_step = 0.002 system.integrator.set_steepest_descent( f_max=1.0, gamma=10, max_displacement=0.01) system.integrator.run(2000) # thermalization with Langevin system.integrator.set_vv() system.time_step = TIME_STEP system.thermostat.set_langevin(kT=1.0, gamma=50, seed=42) system.integrator.run(2000) system.thermostat.turn_off() # thermalization with LB lbf = espressomd.lb.LBFluidWalberla(kT=KT, seed=42, agrid=1, density=1, kinematic_viscosity=5, tau=system.time_step, single_precision=True) system.lb = lbf system.thermostat.set_lb(LB_fluid=lbf, gamma=GAMMA, seed=seed, philox_counter=counter) system.integrator.run(1000) # configure Green-Kubo correlator com_vel_cors = [] n_loops = [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 256, 384, 512] n_steps = [] for i in n_loops: com_vel_cor = correlator_gk(np.arange(N), LOOPS * STEPS) system.auto_update_accumulators.add(com_vel_cor) com_vel_cors.append(com_vel_cor) def save_data(n_steps, com_vel_tau_results, com_vel_acf_results): com_vel_tau_results = np.array(com_vel_tau_results) com_vel_acf_results = np.reshape(com_vel_acf_results, [len(n_steps), -1]) diffusion_gk = [] diffusion_time = [] for length, tau, acf in zip(n_steps, com_vel_tau_results, com_vel_acf_results): diffusion_coef, xdata, ydata = gk_integration(N, tau, acf) diffusion_gk.append(diffusion_coef) diffusion_time.append(length * TIME_STEP) np.save(f"lb_seed_{seed}_counter_{counter}.npy", [diffusion_gk, diffusion_time]) last_sim_steps = 0 for i, n in enumerate(n_loops): com_vel_cor = com_vel_cors[i] sim_total = n * LOOPS * STEPS system.integrator.run(sim_total - last_sim_steps) system.auto_update_accumulators.remove(com_vel_cor) last_sim_steps = sim_total com_vel_cor.finalize() com_vel_tau_results.append(com_vel_cor.lag_times()) com_vel_acf_results.append(com_vel_cor.result()) n_steps.append(sim_total) save_data(n_steps, com_vel_tau_results, com_vel_acf_results) ```
Launch script ```sh for counter in {1..16}; do ./pypresso mwe.py 0 $counter & done for seed in {1..16}; do ./pypresso mwe.py $seed 0 & done ```
Data analysis ```python import numpy as np import matplotlib.pyplot as plt plt.rcParams.update({'font.size': 16}) def parse(n, pat, burn=0): root = "./" matrix = [] for i in range(n): diffusion_gk, diffusion_time = np.load(f"{root}/{pat.format(i+1)}") matrix.append(diffusion_gk) n = min(len(x) for x in matrix) return diffusion_time[burn:n], np.array([x[burn:n] for x in matrix]) df_old_time, df_old_coef = parse(48, "lb_seed_0_counter_{}.npy", 2) df_new_time, df_new_coef = parse(48, "lb_seed_{}_counter_0.npy", 2) df_old_mean = np.mean(df_old_coef, axis=0) df_old_stderr = np.std(df_old_coef, axis=0) / np.sqrt(df_old_coef.shape[1]) df_new_mean = np.mean(df_new_coef, axis=0) df_new_stderr = np.std(df_new_coef, axis=0) / np.sqrt(df_new_coef.shape[1]) fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(11, 4)) ax2.fill_between([], [], []) ax1.fill_between(df_old_time, df_old_mean - df_old_stderr, df_old_mean + df_old_stderr, alpha=0.5) ax2.fill_between(df_new_time, df_new_mean - df_new_stderr, df_new_mean + df_new_stderr, alpha=0.5) ax1.plot(df_old_time, df_old_mean, label="Current behavior") ax2.plot([], []) ax2.plot(df_new_time, df_new_mean, label="Proposed behavior") ax1.set_xlabel("Number of time steps") ax2.set_xlabel("Number of time steps") ax1.set_ylabel(r"Diffusion coefficient [$\sigma^2/t$]") ax1.legend(loc="lower right") ax2.legend(loc="lower right") plt.tight_layout() plt.show() fig = plt.figure(figsize=(8, 4)) plt.plot(df_old_time, df_old_stderr, label="Current behavior") plt.plot(df_new_time, df_new_stderr, label="Proposed behavior") #plt.yscale('log') #plt.xscale('log') plt.legend() plt.xlabel("Number of time steps") plt.ylabel("Standard error of\n" r"the mean [$\sigma^2/t$]") plt.tight_layout() plt.show() ```
jngrad commented 7 months ago

This topic was actually already brought up for all other thermostats in #3585. In fact, the solution implemented in #4845 is exactly the same as the solution we implemented back in 2020. I don't remember why we chose to leave the LB particle coupling code out of the bugfix PR back then.

jngrad commented 7 months ago

Offline discussion with @RudolfWeeber: this is indeed a bug, albeit not a serious one, since it only affects ensemble runs.

I looked into the possibility of backporting the bugfix to 4.2, but the LB CPU and LB GPU use different Philox kernels while sharing the same RNG struct, thus significant alterations to the LB GPU code would be required. We will not backport it to 4.2.