espressomd / espresso

The ESPResSo package
https://espressomd.org
GNU General Public License v3.0
226 stars 183 forks source link

Broken: Correlator & Langevin Thermostat? #904

Closed jdegraaf closed 7 years ago

jdegraaf commented 7 years ago

Dear All,

I've been running some long-time simulations with ESPResSo for a system of passive charged colloids both in a Langevin and a LB fluid with explicit ions to study their velocity / angular velocity auto correlation functions, as well as mean squared displacement. I found several instances of reproducible errors that I think might interfere with anyone using any of the above items (Correlator & Langevin Thermostat). I am currently working on: c466e5b66b5870931466de6a83f791b8df9ccd26

using the following features

ELECTROSTATICS, ROTATION, ROTATIONAL_INERTIA, EXTERNAL_FORCES, CONSTRAINTS, MASS, CATALYTIC_REACTIONS, LB_GPU, LB_BOUNDARIES_GPU, LENNARD_JONES, VIRTUAL_SITES_RELATIVE, LANGEVIN_PER_PARTICLE

where the CATALYTIC_REACTIONS feature is compiled in, but not called for passive spheres (i.e., where no reaction takes place). I use LANGEVIN_PER_PARTICLE because the colloid I simulate is much bigger than the ions, which means it needs to have a different friction coefficient, etc.

1 ##### The correlator seems to be broken

LV_vacf_curves_problem.pdf LB_vacf_curves_problem.pdf

Both the Langevin (LV) and lattice-Boltzmann (LB) velocity auto correlation functions (VACFs) have strange outliers. The graphs show ~80 runs, with the thick blue curve the mean and the cluster of runs above this mean being "good" data, and odd curves at the bottom being the "bad" data. By contrasting the LB and LV, one sees that there are way more "bad" ones in the LV, further indicating that these are an error. Since these effects happen both for the LB and LV, it would seem to me that this should be attributed to the correlator, rather than the LV/LB. The simulations are for a single core and are checkpointed with the usual repair to make this checkpointing work in the blockfile_support.tcl file to get the checkpointing to even work in the case of LB:

github.com/espressomd/espresso/issues/577

This may be related to the previously opened bug report

github.com/espressomd/espresso/issues/509

but now extend the issue to 1 node computing.

2 ##### The Langevin thermostat is probably broken

Disregarding the "bad" VACFs, the remaining ones can be integrated to obtain the time-dependent translational diffusivity see below

LV_DT_time_curves_problem.pdf LB_DT_time_curves_no_problem.pdf

These data is as follows: Dots are the measured data, curves fits to this data to extrapolate to the final value. Red dots give the mean of the ~60 remaining runs, while blue give the mean - 1 standard error, and green the mean + 1 standard error. The LV thermostat clearly shows a long time bias, which prevents the onset of a constant long-time diffusivity and this effect is significant within the error bar. The fitted lines show what the effect should be. The LB thermostat does not seem to have this problem, but here it is more difficult to obtain long-time data (as evidenced by the larger error bar), so it is unclear whether this issue is not also present there. That this is indeed an error and not just something that will level out at longer times is shown by the angular velocity autocorrelation function (AVACF), which is given below.

LV_DR_time_curves_no_problem.pdf LB_DR_time_curves_no_problem.pdf

These show the desired plateau value for the rotational diffusion coefficient within the error bar. Interestingly, while the effect is almost not appreciable in the AVACF, including the "bad" data sets for the VACF, also mess up the average extracted value of the AVACF derived rotational diffusion coefficient. So whatever error there is in the Langevin thermostat that does this, seems to propagate down to the rotational component as well.

Any help/suggestions would be most appreciated.

Best Wishes,

Joost

RudolfWeeber commented 7 years ago

Did you use correlators checkpointing? Because the release notes state that that is broken "with certtttain combinations of features" If so, does the problem also occur in simulations without checkpointing?

jdegraaf commented 7 years ago

Hi Rudolf,

Indeed I did. "certain combinations of features" is a bit vague though, and I might have one of these combinations, but how can one tell? The simulations are so long I cannot realistically run them on any machine without the checkpointing. Although I could be a jerk and just take a few of the cip machines and run it in the background, but since the error files happen 2 to 3 times in 40 runs, it will take a long time to 'verify' that a non-checkpointed simulation does not have errors.

Henri and I are currently converting the script to Python, so that I can simulate with the latest version of the software, as some issues might be due to the thermostat. Patrick suggested that Flo had worked on this over the past few months, and I was running a rather old version of the code, so that I should try the simulations with the latest version. Axel pointed a finger to electrostatics, so my attempt to solve this problem will focus on pure Langevin with the latest Python version of ESPResSo. Then add electrostatics, and then add activity back in. The weird thing is that my earliest simulations did not have the odd runs, but I no longer have the specific git hash with which they were performed.

By the way, I haven't seen a good example of the Python checkpointing (or for the correlators in Python). Do you have an example of this, or is it as of yet a non-existent feature? The h5md part in the UG was not particularly insightful. Is it in one of the tutorials? I don't recall ever having seen it there.

Best Wishes,

Joost

On 15 November 2016 at 13:57, RudolfWeeber notifications@github.com wrote:

Did you use correlators checkpointing? Because the release notes state that that is broken "with certtttain combinations of features" If so, does the problem also occur in simulations without checkpointing?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/espressomd/espresso/issues/904#issuecomment-260647532, or mute the thread https://github.com/notifications/unsubscribe-auth/ACuLn-re6BaO--bH8Em7naT_yxT4xu98ks5q-bozgaJpZM4KuinR .

RudolfWeeber commented 7 years ago

Hi Joost,

Indeed I did. "certain combinations of features" is a bit vague though, and I might have one of these combinations, but how can one tell? I don't know. There is, afaik, no github issue describing it in detail.

The simulations are so long I cannot realistically run them on any machine without the checkpointing. Although I could be a jerk and just take a few of the cip machines and run it in the background, but since the error files happen 2 to 3 times in 40 runs, it will take a long time to 'verify' that a non-checkpointed simulation does not have errors. What does "long" mean? A few days? Can you please provide the respective tcl scripts?

Henri and I are currently converting the script to Python, so that I can simulate with the latest version of the software, as some issues might be due to the thermostat. The Python branch has the same Langevin and LB code as the Tcl branch, to my knowledge.

By the way, I haven't seen a good example of the Python checkpointing (or for the correlators in Python). Checkpointing in general: samples/python/save_checkpoint.py and load_checkpoitn.py Correlators: tutorials/python/01... We'll hold a documentation day soon. Then, there should be some more docss

Correlator checkpointing is currently not supported in Python, as the current code is considered to be broken in an unknown fashion. We are going to reimplmement it via bosst-serialization which whould be much more rubst.

The h5md part in the UG was not particularly insightful. Is it in one of the tutorials? I don't recall ever having seen it there. H5md support for python is not yet merged. See the open pull request. Partilces, the system class and interactions have pickle support though and can be checkpointed with that. There are also amples in samples/python.

As I said, docs will improv soon.

Rudolf

jdegraaf commented 7 years ago

Hi Rudolf,

Thanks for this. Long indeed means a few days, depending on the machine; I want to get sufficient statistics to extract the long-time (angular) diffusion coefficient from the (angular) velocity autocorrelation function. I'll check out the samples in the python directory for the checkpointing, thanks again. Please find attached the original Tcl scripts. LBCH is the charged LB variant, LVCH is the Langevin variant

They are typically called as:

swimmer_measure_lbch.tcl 72.0 3.45 0.01 0.03 1.0 15 0.0 raspberry_R_3.0_Reff_3.45_hollow.dat 200 0 swimmer_measure_lvch.tcl 72.0 3.45 0.01 0.03 1.0 15 0.0 raspberry_R_3.0_Reff_3.45_hollow.dat 200 0

for a system without activity and my standard parameters. Change to

swimmer_measure_lbch.tcl 72.0 3.45 0.01 0.03 1.0 15 10.0 raspberry_R_3.0_Reff_3.45_hollow.dat 201 0 swimmer_measure_lvch.tcl 72.0 3.45 0.01 0.03 1.0 15 10.0 raspberry_R_3.0_Reff_3.45_hollow.dat 201 0

for moderate activity (200/201) are the run numbers.

Best Wishes,

Joost

uschille commented 7 years ago

Re: 2 ##### The Langevin thermostat is probably broken

Did you single out potential drift/center of mass motion?

My second guess would be the RNG.

jonaslandsgesell commented 7 years ago

I checked the RNG at least for the uniform distribution [0,1) and it was ok.

rbardak commented 7 years ago

2 ##### The Langevin thermostat is probably broken

I just ran a parameter sweep over kT and gamma for the langevin thermostat and stokes-einstein is wrong by a factor of sqrt(3). I will open an issue after I made the data more presentable.

fweik commented 7 years ago

Robin can you describe exactly what you did?

rbardak commented 7 years ago

I measured the componentwise msd of a single particle's position for different sets of kT and gamma of the langevin thermostat. Added them up and divided by 6 (this is <x^2+y^2+z^2> / 6 = Dt). I then used a linear fit to determine the diffusion constant outside of the ballistic regime. The diffusion constant is supposed to be kT/gamma if I'm not mistaken, but that's not what I measured.

jdegraaf commented 7 years ago

Dear All,

Thanks for the tests. I would be surprised if the test Robin performed failed or gave an inconsistent result. This was extensively double checked over the years and is not an issue for me. I get the right result to within 5% if I use the integral over the velocity autocorrelation function, but I have also checked the MSD and that gives identical results (within the error bar). Preliminary testing, and I hope to have the final result tomorrow, seems to point the blame squarely at the checkpointing of the correlator. But I need some more time to collect data and make a definitive statement.

Best Wishes,

Joost

On 21 November 2016 at 14:55, Robin Bardakcioglu notifications@github.com wrote:

I measured the componentwise msd of a single particle's position for different sets of kT and gamma of the langevin thermostat. Added them up and divided by 6 (this is <x^2+y^2+z^2> / 6 = D). I then used a linear fit to determine the diffusion constant outside of the ballistic regime. The diffusion constant is supposed to be kT/gamma if I'm not mistaken, but that's not what I measured.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/espressomd/espresso/issues/904#issuecomment-261959727, or mute the thread https://github.com/notifications/unsubscribe-auth/ACuLnzxF9qwjjG8P3lHsyPHdggKsRY8bks5rAbD2gaJpZM4KuinR .

uschille commented 7 years ago

Just a wild guess: Checkpointing of the correlator may require something similar to thermo_heat_up and thermo_cool_down to get the variances right?

jdegraaf commented 7 years ago

Dear All,

Testing further with the Langevin (per Particle) thermostat and my generic setup, I found that without checkpointing under the latest version of the master (f5d217c1968667217026e6c3370a66024fc41744), the velocity autocorrelation function (VACF) and angular velocity autocorrelation function (AVACF) are better behaved. That is, as long as the correlator is not using a checkpoint, the trends look better. However there are still some issues left.

VACF

This image shows that there is a lot of spread in the time=0 values of the VACF (time on x-axis, VACF on y-axis):

passive_langevin_vacf_linear

This really should not be the case for the number of samples used. That is, equipartition states very clearly what the average is supposed to be. Each curve represents an average of 2.5*10^6 samples, and still there is a spread of close to 10% in the individual averages (about 200 runs).

The translational diffusion coefficient that follows from this (time on x-axis, diffusion coefficient on y-axis):

passive_langevin_tdc_loglin_fit

The plateau value is far more convincing without relying on correlator checkpointing, but there is still some downward trend towards high times. Although, within the error bar and considering just how long the time scales are which are being sampled, this seems okay to me. It could also be related to the way I integrate up the VACF to obtain the time-dependent diffusivity.

AVACF

This image shows that there is also a lot of spread in the time=0 values of the AVACF (time on x-axis, AVACF on y-axis):

passive_langevin_avacf_linear

Again, that should not be the case, but the massively deviating ones, i.e., those curves which clearly did not follow the trend of the average are not present in the uncheckpointed system.

The rotational diffusion coefficient that follows from this (time on x-axis, diffusion coefficient on y-axis):

passive_langevin_adc_loglin_fit

The plateau value is still there, as it was before. However, there appears to be some long-time oscillatory structure to the data. Again, I presume this is either due to the integration of the AVACF (in this case) with logarithmic distribution of time samples, or acceptable within the error bar.

My suggestion is to put a warning message in the correlator checkpointing, similar to the one that is output for the reuse forces option in the LB. That is, just something that spits out, "Correlator checkpointing is broken in ESPResSo for an unknown combination of features, which may include yours!"

If anyone has any ideas on the massive spread in the zero-time (A)VACF, I would love to hear it.

Best Wishes,

Joost

jdegraaf commented 7 years ago

Hi Rudolf,

I've checked once more the more complicated system. Unfortunately, the strange long-time decay in the VACF is still there in the current python master for which I use the correlator to obtain samples, but I do not use checkpointing:

latest_python_master_vacf

and the mean value is off from the expected result as well. But the AVACF is fine:

latest_python_master_avacf

Similarly, the ion distribution around the charged colloid is not quite right:

current_rdf_negative

whereas it was correct in the previous version

previous_rdf_negative

This is is slightly annoying. However, the VACF situation could be a sampling issue still, as it takes a very long time to collect data for this system, but for the RDF I am less inclined to believe it is only a sampling issue. I will turn my attention to Henri's trajectory recorder next and see if I can get better statistics for the VACF and MSD without using the correlator.

Best Wishes,

Joost

fweik commented 7 years ago

See other thread.