GOMC-WSU / GOMC

GOMC - GPU Optimized Monte Carlo is a parallel molecular simulation code designed for high-performance simulation of large systems
https://gomc-wsu.org
MIT License

GPU and CPU trajectories with multiparticle still are not matching #182

Closed: jpotoff closed this issue 3 years ago

jpotoff commented 4 years ago

I pulled the development branch, compiled it, and ran the multiparticle move on the GPU. It seems the GPU code is still not following the same trajectory as the CPU code: input.zip

gpu_cpu_nb_e

GregorySchwing commented 4 years ago

@jpotoff I opened a pull request to fix this.

msoroush commented 4 years ago

@GregorySchwing Please mention the issue number in your commit message or in the pull request conversation, so that later we know which commit or pull request fixed this issue.

jpotoff commented 4 years ago

Results from the current development branch are looking encouraging: gpu_cpu_test_v2

jpotoff commented 4 years ago

Running the latest development branch for 2,2,4-trimethylpentane: while the energies are about the same, they do not track exactly.

energy_comp move_comp

jpotoff commented 4 years ago

Additional data from some longer simulations. CPU multiparticle and CPU single-molecule moves are sampling the same phase space and giving the same average energy. The GPU multiparticle code is giving incorrect results (not producing the same average energy). The spikes in the non-bonded energy are strange and suggest a bug. eng_comp_mb_sp eng_comp_cpu_gpu_mb_sp cpu_gpu_224TMP

jpotoff commented 4 years ago

Ran argon with the latest development version from 8/29. The GPU version has strange spikes in it that suggest we are occasionally accepting configurations that we should not.
ecomp_ar_cpu_gpu

jpotoff commented 4 years ago

Ok, this is going to make tracking down the GPU/CPU differences in the multiparticle move even more difficult. I ran two CPU multiparticle move simulations with the same random number seeds. While both produce correct averages, they do not follow the same trajectory! ttraj_comp

I checked the single molecule moves, and with the same random number seed those trajectories do track exactly. traj_comp_single

LSchwiebert commented 4 years ago

I'm also seeing some cases of the GPU accepting when the CPU rejects. I'll try to track it down; it happens for the first time on move 1066, so it shouldn't take that long to run a test case. How many CPU cores are you using? I can't think of anything that could cause this problem other than a bug related to OpenMP, although I thought all the race conditions were fixed. Or maybe some OOB memory access. Any idea where the results start to deviate? If it isn't too many moves, I could try running some tests. Please upload the system files if you want me to try that.

jpotoff commented 4 years ago

Running multiparticle on one core gives consistent results with the same random number seed. traj_comp_one_core

Input files attached input.zip

jpotoff commented 4 years ago

Regarding the energy spikes produced by the GPU code: I'm having a hard time reproducing them consistently in the same place from restart files. But I have noticed that I only see them when simulating an odd number of argon molecules, such as 73 or 273.

GregorySchwing commented 4 years ago

> Ok, this is going to make tracking down the GPU/CPU differences in the multiparticle move even more difficult. I ran two CPU multiparticle move simulations with the same random number seeds. While both produce correct averages, they do not follow the same trajectory! ttraj_comp
>
> I checked the single molecule moves, and with the same random number seed those trajectories do track exactly. traj_comp_single

I'll run regression tests to find the commit in which the CPU multiparticle move lost seed reproducibility. This is a recent bug.

LSchwiebert commented 4 years ago

That suggests some threads are probably accessing invalid pair interactions via OOB memory accesses. It might be worth running cuda-memcheck to see if there is a problem somewhere; it should flag an OOB memory access even in cases where no erroneous results are produced.

jpotoff commented 4 years ago

Looks like we may have a similar issue in the OpenMP code. Here are four trajectories started from the same random number seed. They track perfectly for the first 30K steps or so, then deviate, which is not unexpected. But what I did not expect was the spike in the energy in one of the runs, which is clearly incorrect. Unfortunately, it's very difficult to reproduce this error consistently. traj_test

jpotoff commented 4 years ago

These are the input files for the 73-atom argon system that occasionally throws a spike in the energy. The tricky thing is that in 9 runs, I only got it to show an energy spike once. The spike happens both with OpenMP (4-16 CPU cores) and on the GPU. I suspect we are getting an occasional overflow error. I will try rerunning on just one CPU to see if we can reproduce the energy spike.

input_73atoms.zip

jpotoff commented 4 years ago

More tests. I ran 20 single CPU jobs with the same random number seed. 19 of them followed exactly the same trajectory, but one split off around 32,000 steps.

I ran 20 four-CPU jobs, and they stuck together until 32,000 steps, then split apart (the averages look OK, though). But 8 of the runs show random large energy spikes.

YounesN commented 4 years ago

I got the same result on GPU as well. Trying to find the problem.

LSchwiebert commented 4 years ago

It is strange that things are happening right around 32,000 steps in all cases. The single-CPU jobs should not be doing anything strange. I'm starting to think there could be a bug in the random number generator, or else a memory leak. But how do we track that down when 95% of the single-CPU jobs appear to be running fine?

jpotoff commented 4 years ago

I commented out the #pragma directive on lines 318-321 in CalculateEnergy.cpp, which has to do with the force calculation, and the multicore results are now showing exactly the same behavior as the single core calculations. 19 identical trajectories, and one that deviates. No large spikes in the energy.
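For reference, a minimal hypothetical sketch (not GOMC's actual CalculateEnergy.cpp) of the kind of OpenMP pair loop being discussed; the real loop also accumulates per-particle forces, and a race or missed reduction on any shared accumulator is exactly the kind of bug that would make multicore results diverge from single-core runs:

```cpp
// Hypothetical sketch of an OpenMP non-bonded pair loop; the names and the
// Lennard-Jones form are illustrative, not GOMC's actual implementation.
#include <vector>

struct XYZ { double x, y, z; };

double NonbondedEnergy(const std::vector<XYZ> &pos,
                       double sigma, double epsilon, double rCutSq) {
  const int n = static_cast<int>(pos.size());
  double energy = 0.0;
  // Each thread keeps a private partial sum; OpenMP combines them at the end.
#pragma omp parallel for reduction(+ : energy)
  for (int i = 0; i < n - 1; ++i) {
    for (int j = i + 1; j < n; ++j) {
      const double dx = pos[i].x - pos[j].x;
      const double dy = pos[i].y - pos[j].y;
      const double dz = pos[i].z - pos[j].z;
      const double rSq = dx * dx + dy * dy + dz * dz;
      if (rSq < rCutSq) {
        const double sr2 = sigma * sigma / rSq;
        const double sr6 = sr2 * sr2 * sr2;
        energy += 4.0 * epsilon * (sr6 * sr6 - sr6);
      }
    }
  }
  return energy;
}
```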

LSchwiebert commented 4 years ago

OK. I looked at that loop carefully just now. All of the shared variables are ones that should be shared. I.e., vectors that are read only or read-only input parameters. There are no obvious problems with the reductions. There could be a bug, but at least it isn't something obvious. I will start a long run with Address Sanitizer to see if any memory access issues are found. The fact that it is happening at the same step number indicates something strange is happening then. Maybe a special code path or something. It will take at least a day for that to run.

YounesN commented 4 years ago

Do you think we should add particleMol, box, forcefield, num::qqFact, and particleKind to shared as well?

LSchwiebert commented 4 years ago

We can, but it is compiling without those. I think it must automatically share the data members of the class or something. You can try adding them as shared and see; I'm guessing it won't build and will complain about trying to share something that is already shared, but there isn't any harm in trying. There are no private variables and the default is none, so they aren't private.
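As a small hypothetical illustration of that point (not GOMC code): non-static data members are not "variables" in OpenMP's sense, so they cannot be listed in a data-sharing clause and are instead reached through the this pointer of the (shared) object:

```cpp
// Hypothetical example; the class and member names are illustrative.
#include <vector>

class EnergyCalculator {
  std::vector<double> charge;  // class data member
public:
  double TotalCharge() const {
    double sum = 0.0;
    // Listing the member directly, e.g. shared(charge), is typically rejected
    // with an error along the lines of "'charge' is not a variable in clause
    // 'shared'"; the member is reached through 'this' instead.
#pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < static_cast<int>(charge.size()); ++i)
      sum += charge[i];
    return sum;
  }
};
```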

jpotoff commented 4 years ago

BTW, where one of the trajectories splits off seems to depend on the number of molecules in the system. For 43 atoms, one trajectory splits off at 12,117 steps.

jpotoff commented 4 years ago

For 13 atoms, the 20 trajectories all match exactly when the #pragma at line 318 in CalculateEnergy.cpp is commented out. Put the pragma back in and the trajectories start splitting at 3000 steps.

jpotoff commented 4 years ago

More bizarre behavior. With the 13 atom system, I could get it to consistently split at configuration 2618. So I dropped a checkpoint file at 2600. Restarting from the checkpoint, the 20 trajectories stay together until step 8730.

LSchwiebert commented 4 years ago

Hmm. Not sure what to make of that. We should still be getting the same random stream, so it should be reproducible if it is based on the moves. I ran at least 25 simulations with 1 CPU for 35,000 steps and didn't see any cases where it was giving strange results. So, I set up a run for 50M steps. I figure that gives a reasonable chance of running into a strange configuration, unless it is a compiler optimization bug. I'm using gcc 7.3.0. What compiler are the two of you using?

jpotoff commented 4 years ago

I'm running on 4 CPUs. Intel 19.1 compiler. Very surprised at what happened when restarting from the checkpoint file.

YounesN commented 4 years ago

I ran 200 times for 1 CPU as well and everything matched. Re-running it for another 200 with 4 CPUs.

YounesN commented 4 years ago

It finished running, and it seems like the divergence happens at different steps; however, it always seems to happen after 32,000 steps.

jpotoff commented 4 years ago

If you change the number of molecules in the system, you can get the divergence to happen sooner, sometimes much sooner.

LSchwiebert commented 4 years ago

Can one of you post a snippet of output that shows the problem? I will use that to check my long run when it finishes. I hope 50M steps is enough to replicate the problem. If it doesn't throw an OOB memory error but does produce wrong output at some point, that at least narrows things down. The fact that it is happening with the GPU code, the OpenMP code, and code without either suggests it is a bug in the code. Jeff, you had some cases where it happened with a single CPU with the OpenMP pragma commented out, right? It could be that a data structure is getting corrupted and taking a checkpoint resets things?? I'm sure once we find the bug, it will all make sense.

LSchwiebert commented 4 years ago

I have an idea. Since we know it is happening in one particular function, we can keep the OpenMP code and then repeat the computation in that loop without the OpenMP directive. Put the results in different variables, of course, and compare the results at the end of both. We need to compare allowing some round-off, such as 1e-12 or something. We could also do the same thing with the GPU code, of course, but it might be easier with and without OpenMP, as we can easily dump some data if we want. We could, as a first step, print which value is different, which might help narrow down the problem.
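As a concrete sketch of that cross-check (a hypothetical helper, not part of GOMC), something along these lines could compare the OpenMP and serial results element by element with a round-off tolerance:

```cpp
// Hypothetical helper: report the first element where the OpenMP and serial
// force arrays differ by more than a round-off tolerance.
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

bool ForcesMatch(const std::vector<double> &ompForce,
                 const std::vector<double> &serialForce,
                 double tol = 1e-12) {
  for (std::size_t i = 0; i < ompForce.size(); ++i) {
    if (std::fabs(ompForce[i] - serialForce[i]) > tol) {
      std::printf("Mismatch at index %zu: %.17g vs %.17g\n",
                  i, ompForce[i], serialForce[i]);
      return false;
    }
  }
  return true;
}
```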

jpotoff commented 4 years ago

Here is test 1: v2.60 development without changes (all pragmas active). I'm including the log files, input files, and a plot of the non-bonded energy. 20 separate runs with the same random number seed. input_13_openmp.zip logfiles.zip traj_test_13

jpotoff commented 4 years ago

Here are the data from when the pragma on line 318 of CalculateEnergy.cpp is commented out. Same input files as previous example. One of the 20 runs pursues an alternate path. logfiles_nopragma.zip traj_test13b

jpotoff commented 4 years ago

And here are the runs started from a checkpoint file from step 2600. The trajectories do not follow the same path as the "nopragma" test, but they should have. checkpoint.zip logfile_checkpoint.zip traj_checkpoint

GregorySchwing commented 4 years ago

Assume an energy spike occurs due to a should-be-rejected move being accepted. Given that moves are accepted according to

rand() < uBoltz * MPCoefficient,

the evaluation of the right-hand side must be incorrect. Since any comparison with NaN evaluates to false (a NaN right-hand side would therefore reject the move), while a +INF right-hand side always accepts, one of the terms must be evaluating to +INF.

The rules for multiplying by +INF are as follows:

  1. x * +INF = +INF, if x > 0
  2. x * +INF = -INF, if x < 0
  3. x * +INF = NaN, if x = 0

Since uBoltz = e^(-x) is strictly positive for any finite x, we can eliminate all but case 1.
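These comparison and multiplication rules can be verified directly with a small standalone snippet (not GOMC code):

```cpp
// Standalone check of the IEEE-754 behavior the argument above relies on.
#include <cassert>
#include <cmath>
#include <limits>

int main() {
  const double inf = std::numeric_limits<double>::infinity();
  const double qnan = std::numeric_limits<double>::quiet_NaN();

  assert(!(0.5 < qnan));         // any comparison with NaN is false -> reject
  assert(0.5 < inf);             // +INF on the right-hand side always accepts
  assert(2.0 * inf == inf);      // case 1: positive * +INF = +INF
  assert(-2.0 * inf == -inf);    // case 2: negative * +INF = -INF
  assert(std::isnan(0.0 * inf)); // case 3: zero * +INF = NaN
  return 0;
}
```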

In the calculation of MPCoeff, there are 2 unbounded terms

  1. lb_new
  2. t_k

and 2 bounded terms

  1. lb_old
  2. max

I have written GoogleTests to evaluate edge cases for these terms, and I haven't found exact values that produce +INF. I am encouraged by my 5 runs with the 13-argon system, https://github.com/GOMC-WSU/GOMC/files/5165691/input_13_openmp.zip, with the isfinite() check re-enabled in the code. I also extended the simulation length by a factor of 10.

MultiParticle.h:402

    if(!std::isfinite(w_ratio)) {
      // This error can be removed later on once we know this part actually works.
      std::cout << "w_ratio is not a finite number. Auto-rejecting move.\n";
      return 0.0;
    }


I anticipate that tomorrow I will be able to find a range of values for our unbounded terms that produces +INF while still being considered valid by the simulation as is. I intend to continue developing the Google Test unit tests and to deploy a Travis CI server that runs them automatically on every push to the repository.
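A minimal sketch of the kind of Google Test edge-case check described above (the test names and values are illustrative, not GOMC's actual unit tests); the point is that an exp()-based weight term overflows to +INF once its argument gets large enough:

```cpp
// Hypothetical Google Test sketch probing where exp()-based terms overflow.
#include <cmath>
#include <limits>
#include <gtest/gtest.h>

TEST(MultiParticleEdgeCases, ExpOverflowsToInf) {
  // For doubles, exp(x) overflows to +INF once x exceeds roughly 709.78.
  EXPECT_TRUE(std::isfinite(std::exp(709.0)));
  EXPECT_TRUE(std::isinf(std::exp(711.0)));
}

TEST(MultiParticleEdgeCases, InfTimesPositiveIsInf) {
  const double inf = std::numeric_limits<double>::infinity();
  // A +INF right-hand side in the acceptance test would accept every move.
  EXPECT_TRUE(std::isinf(0.5 * inf));
}
```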

LSchwiebert commented 4 years ago

This is a promising way to track down the problem. Are you actually seeing this message? If so, can you trace backward and see where/why it is happening? Given the size of the system, we can't possibly have enough terms to be getting infinity just from multiplication of reasonable numbers. If we are sure it is coming from a specific function in CalculateEnergy.cpp, you could see which terms are returning infinity and then we can try to see why. But, you could also run in a debugger and use backtrace when the message occurs. This shouldn't be happening, as the computation should be the same, so we need to trace back and see what the root cause is.

jpotoff commented 4 years ago

@GregorySchwing Ok, good to see that you can reproduce the energy spikes with the small system and that they show up quickly (< 2000 steps).

jpotoff commented 4 years ago

I have finally been able to produce a checkpoint file that reproducibly gives an energy spike at step 5093. Input files are in the attached zip file. The checkpoint file starts the simulation at step 5092. This is running with OpenMP on 4 cores.

spike.zip

GregorySchwing commented 4 years ago

PAIR_4_10.pdf

Pair (4, 10) on move 5093 has a distanceSq of ~10 [0.1, 3, 0.1]. This results in a dominant term added to the energy (277.56875223729924) and force (135.53110412987087, -2480.7820281362601, -87.440916078568691). Tomorrow I will look at the translation on move 5092 (the previous move) to see how exactly we place an atom in such an energetically unfavorable position relative to the other atoms.

jpotoff commented 4 years ago

Back to the GPU. I simply cannot create a checkpoint that will reliably reproduce the spike seen near step 1,700,000. These are two separate runs started from the same random number seed. input.zip

tgpu_repo

jpotoff commented 4 years ago

and the latest development code doesn't run (compiling with Intel 19.1):

    Info: Using Device Number: 0
    Info: GOMC COMPILED TO RUN CANONICAL (NVT) ENSEMBLE.
    terminate called after throwing an instance of 'std::out_of_range'
      what():  _Map_base::at
    Abort (core dumped)

LSchwiebert commented 4 years ago

This problem is due to the patch I pushed yesterday. I attached a patched code that should build. Please let me know what value you are seeing for OpenMP version and I will patch the code appropriately. It seems there is a version of OpenMP that I didn't include.

GitHub won't allow me to attach a .cpp file. Please check your email.

GregorySchwing commented 4 years ago

5.4.0-42-generic - this was related to my Ubuntu hardware.

jpotoff commented 4 years ago

> This problem is due to the patch I pushed yesterday. I attached a patched code that should build. Please let me know what value you are seeing for OpenMP version and I will patch the code appropriately. It seems there is a version of OpenMP that I didn't include.
>
> GitHub won't allow me to attach a .cpp file. Please check your email.

New version of Main.cpp works.

This is the output I'm getting:

    Info: Host Name: potoff32.eng.wayne.edu
    CPU information:
    Info: Total number of CPUs: 6
    Info: Total number of CPUs available: 6
    Info: Model name: Intel(R) Core(TM) i5-8600K CPU @ 3.60GHz
    Info: System name: Linux
    Info: Release: 3.10.0-1062.1.1.el7.x86_64
    Info: Version: #1 SMP Tue Sep 3 10:41:00 CDT 2019
    Info: Kernel Architecture: x86_64
    Info: Total Ram: 15840.3MB
    Info: Used Ram: 15083.9MB
    Info: Working in the current directory: /home4/jpotoff/GOMC/AR/MP/NVT/GPU-t2
    GPU information:
    Info: Device Number: 0
    Info: Device name: GeForce GTX 1060 6GB
    Info: Memory Clock Rate (KHz): 4513000
    Info: Memory Bus Width (bits): 192
    Info: Peak Memory Bandwidth (GB/s): 216.624000
    Info: Using Device Number: 0
    Info: GOMC COMPILED TO RUN CANONICAL (NVT) ENSEMBLE.
    Info: Compiled with OpenMP Version 201611
    Info: Number of threads 1

LSchwiebert commented 4 years ago

Thanks. I pushed a patch to the development branch to handle the OpenMP version more robustly, so it shouldn't crash again. I'm not sure what version 201611 maps to; it is somewhere between 4.5 and 5.0, and I'm guessing a draft of the OpenMP 5.0 standard. gcc 7.3.0 reports 201511, which is the 4.5 spec.
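For reference, the value being reported comes from the standard _OPENMP date macro (yyyymm); a minimal sketch of printing it, with the version mapping taken from this discussion (201511 = 4.5, 201611 = a 5.0 preview):

```cpp
// Minimal sketch: print the _OPENMP date macro used in the log output above.
#include <cstdio>

int main() {
#ifdef _OPENMP
  std::printf("Info: Compiled with OpenMP Version %d\n", _OPENMP);
#else
  std::printf("Info: Compiled without OpenMP\n");
#endif
  return 0;
}
```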

YounesN commented 4 years ago

https://www.openmp.org//wp-content/uploads/openmp-tr4.pdf

I believe 201611 is 5.0 rev 1, since its release date is Nov 2016.

LSchwiebert commented 4 years ago

Thanks. Updated issue #219 with a patch that now outputs the version information for 201611.

LSchwiebert commented 4 years ago

Today I tried printing out various box energy component values: recip, real, and inter. I noticed small differences (6th or 7th decimal place) in these values from the start of the simulation. Running without electrostatics resolved the CPU/GPU difference in this case, at least for a small number of steps. Interestingly, commenting out the BoxReciprocalSetup and BoxReciprocal GPU kernels and using the CPU versions removed the differences I was seeing across multiple GPU runs for the real and inter terms. This is surprising, since those terms should not depend on that part of the code. I have not yet been able to track down the reason. I don't think this should be a problem for systems without electrostatics, but I thought I would mention it in case it is helpful.
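One well-known way such last-digit differences can arise (whether or not it is the cause here) is floating-point summation order, which generally differs between CPU and GPU reductions; a standalone illustration (not GOMC code):

```cpp
// Summing identical terms in a different order typically changes the result
// in the last few decimal places.
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
  std::vector<double> terms;
  for (int k = 1; k <= 100000; ++k)
    terms.push_back(1.0 / (static_cast<double>(k) * k));  // stand-in terms

  double forward = 0.0, backward = 0.0;
  for (std::size_t i = 0; i < terms.size(); ++i) forward += terms[i];
  for (std::size_t i = terms.size(); i-- > 0;) backward += terms[i];

  std::printf("forward  = %.17g\nbackward = %.17g\n", forward, backward);
  return 0;
}
```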

jpotoff commented 4 years ago

Here are my latest data for SPC water. It seems that the latest development code (pulled on 9/4) is now producing a trajectory that gives the correct average energy in an NVT simulation with electrostatics.

water_mp_test

YounesN commented 4 years ago

After the fixes, the CPU and GPU seem to match for NVT with 1000 isobutane molecules. However, the 73-atom argon system still differs. I will continue working on it tomorrow.

gpu