Closed jpotoff closed 3 years ago
@jpotoff I opened a pull request to fix this.
@GregorySchwing mention the issue number in your commit or conversation for your pull request, so later we know which commit or pull request fixes this issue.
Results from the current development branch are looking encouraging:
Running the latest development branch for 2,2,4-trimethylpentane. While the energies are about the same, they do not track exactly.
Additional data from some longer simulations. CPU multiparticle and CPU single molecule moves are sampling the same phase space and giving the same average energy. The GPU multiparticle code is giving incorrect results (not producing the same average energy). The spikes in the non-bonded energy are strange and suggest a bug.
Ran AR with the latest version of development from 8/29. The GPU version has strange spikes in it that suggest we are occasionally accepting configurations that we should not.
Ok, this is going to make tracking down the GPU/CPU differences in the multiparticle move even more difficult. I ran two CPU multiparticle move simulations with the same random number seeds. While both produce correct averages, they do not follow the same trajectory!
I checked the single molecule moves, and with the same random number seed those trajectories do track exactly.
I'm also seeing some cases of the GPU accepting when the CPU rejects. I'll try to track it down; it first happens on move 1066, so it shouldn't take that long to run a test case.

How many CPU cores? I can't think of anything that could cause this problem other than a bug related to OpenMP, although I thought all the race conditions were fixed. Or maybe some OOB memory access. Any idea where the results start to deviate? If it isn't too many moves, I could try running some tests. Please upload the system files if you want me to try that.
Running multiparticle on one core gives consistent results with the same random number seed.
Input files attached input.zip
Regarding the energy spikes produced by the GPU code. I'm having a hard time reproducing them consistently in the same place from restart files. But I have noticed that I only see them when simulating an odd number of argon molecules, such as 73, or 273.
I'll run regression tests to find the commit where CPU multiparticle moves lost seed reproducibility. This is a recent bug.
That suggests we are probably running threads that are accessing invalid pair interactions with OOB memory accesses. It might be worth running cuda-memcheck to see if there is a problem somewhere. It should show up as an OOB memory access even in cases where no erroneous results are produced.
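For reference, a typical invocation might look like the following (the binary and config file names are placeholders; on newer CUDA toolkits, compute-sanitizer provides the same checks):

```shell
# Run the memcheck tool to flag out-of-bounds and misaligned device
# memory accesses, even in cases where the results happen to look correct.
cuda-memcheck --tool memcheck ./GOMC_GPU_NVT in.conf
```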
Looks like we may have a similar issue in the OpenMP code. Here are four trajectories started from the same random number seed. They track perfectly for the first 30K steps or so. Then deviate, which is not unexpected. But what I did not expect was to see this spike in the energy in one of the runs, which is clearly incorrect. Unfortunately, it's very difficult to reproduce this error consistently.
These are the input files for the 73-atom argon system that occasionally throws a spike in the energy. The tricky thing is that in 9 runs, I only got it to show an energy spike once. The spike happens in both OpenMP (4-16 CPU cores) and on the GPU. I suspect we are getting an occasional overflow error. I will try rerunning on just one CPU to see if we can reproduce the energy spike.
More tests. I ran 20 single CPU jobs with the same random number seed. 19 of them followed exactly the same trajectory, but one split off around 32,000 steps.
I ran 20 four-CPU jobs, and they stuck together until 32,000 steps, then split apart (averages look OK, though). But 8 of the runs show random large energy spikes.
I got the same result on GPU as well. Trying to find the problem.
It is strange that things are happening right around 32,000 steps in all cases. The single-CPU jobs should not be doing anything strange. I'm starting to think there could be a bug in the random number generator. Either that or a memory leak. But how do we track that down when 95% of single-CPU jobs appear to be running fine?
I commented out the #pragma directive on lines 318-321 in CalculateEnergy.cpp, which has to do with the force calculation, and the multicore results are now showing exactly the same behavior as the single core calculations. 19 identical trajectories, and one that deviates. No large spikes in the energy.
OK. I looked at that loop carefully just now. All of the shared variables are ones that should be shared. I.e., vectors that are read only or read-only input parameters. There are no obvious problems with the reductions. There could be a bug, but at least it isn't something obvious. I will start a long run with Address Sanitizer to see if any memory access issues are found. The fact that it is happening at the same step number indicates something strange is happening then. Maybe a special code path or something. It will take at least a day for that to run.
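For context, the pattern under discussion looks roughly like the following sketch (names are illustrative, not the actual GOMC loop): with default(none), every variable must be listed explicitly, read-only inputs go in shared, and the accumulator is combined through a reduction so no two threads write the same scalar.

```cpp
#include <cstddef>
#include <vector>

// Illustrative energy-accumulation loop in the style of CalculateEnergy.cpp.
// With default(none), nothing is implicitly shared; the read-only input is
// listed in shared, the loop index is private by rule, and the scalar
// accumulator is combined via a reduction.
double totalEnergy(const std::vector<double> &pairEnergy) {
  double energy = 0.0;
#ifdef _OPENMP
#pragma omp parallel for default(none) shared(pairEnergy) reduction(+:energy)
#endif
  for (std::ptrdiff_t i = 0; i < (std::ptrdiff_t)pairEnergy.size(); ++i) {
    energy += pairEnergy[i];
  }
  return energy;
}
```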
Do you think we should add particleMol, box, forcefield, num::qqFact, and particleKind to shared as well?
We can, but it is compiling without those. I think it automatically must share the data members of the class or something. You can try adding them as shared and see. I'm guessing it won't build and will complain about trying to share something that is already shared. But, there isn't any harm in adding them and seeing what happens. There are no private variables and default is none, so they aren't private.
BTW, where one of the trajectories splits off seems to be dependent on the number of molecules in the system. For 43 atoms, there is one trajectory splitting off at 12117 steps.
For 13 atoms, the 20 trajectories all match exactly when the #pragma at line 318 in CalculateEnergy.cpp is commented out. Put the pragma back in and the trajectories start splitting at 3000 steps.
More bizarre behavior. With the 13 atom system, I could get it to consistently split at configuration 2618. So I dropped a checkpoint file at 2600. Restarting from the checkpoint, the 20 trajectories stay together until step 8730.
Hmm. Not sure what to make of that. We should still be getting the same random stream, so it should be reproducible if it is based on the moves. I ran at least 25 simulations with 1 CPU for 35,000 steps and didn't see any cases where it was giving strange results. So, I set up a run for 50M steps. I figure that gives a reasonable chance of running into a strange configuration, unless it is a compiler optimization bug. I'm using gcc 7.3.0. What compiler are the two of you using?
I'm running on 4 CPUs. Intel 19.1 compiler. Very surprised at what happened when restarting from the checkpoint file.
I ran 200 times for 1 CPU as well and everything matched. Re-running it for another 200 with 4 CPUs.
It finished running, and it seems the divergence happens at different steps; however, it always seems to happen after 32,000 steps.
If you change the number of molecules in the system, you can get the divergence to happen sooner, sometimes much sooner.
Can one of you post a snippet of output that shows the problem? I will use that to check my long run when it finishes. I hope 50M steps is enough to replicate the problem. If it doesn't throw an OOB memory error, but does produce wrong output at some point, that at least narrows things down. The fact that it is happening with the GPU code, the OpenMP code, and code without either suggests it is a bug in the code. Jeff, you had some cases where it happened with a single CPU with the OpenMP pragma commented out, right? Could it be that a data structure is getting corrupted and taking a checkpoint resets things? I'm sure once we find the bug, it will all make sense.
Have an idea. Since we know it is happening in one particular function, we can keep the OpenMP code, then repeat the computation in that loop without the OpenMP directive. Put the results in different variables, of course, and compare the results at the end of both. We need to compare allowing some round-off, such as 1e-12 or so. We could also do the same thing with the GPU code, of course, but it might be easier with and without OpenMP, as we can easily dump some data if we want. We could, as a first step, print which value is different, which might help narrow down the problem.
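A minimal sketch of that cross-check (function and variable names are hypothetical): run the loop through both paths, store the results in separate variables, and flag the first component that differs beyond round-off.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Compare a parallel-path result against a serial reference, allowing
// round-off. Returns true if every component agrees to within tol;
// otherwise prints the first mismatching value to narrow down the problem.
bool resultsAgree(const std::vector<double> &ompResult,
                  const std::vector<double> &serialResult,
                  double tol = 1e-12) {
  if (ompResult.size() != serialResult.size())
    return false;
  for (std::size_t i = 0; i < ompResult.size(); ++i) {
    if (std::fabs(ompResult[i] - serialResult[i]) > tol) {
      std::printf("mismatch at %zu: %.17g vs %.17g\n",
                  i, ompResult[i], serialResult[i]);
      return false;
    }
  }
  return true;
}
```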
Here is test 1. v2.60 development without changes (all pragmas active). I'm including the logfiles, inputfiles, and a plot of the non-bonded energy. 20 separate runs w/same random number seed. input_13_openmp.zip logfiles.zip
Here are the data from when the pragma on line 318 of CalculateEnergy.cpp is commented out. Same input files as previous example. One of the 20 runs pursues an alternate path. logfiles_nopragma.zip
And here are the runs started from a checkpoint file from step 2600. The trajectories do not follow the same path as the "nopragma" test, but they should have. checkpoint.zip logfile_checkpoint.zip
Assume an energy spike occurs due to a should-be-rejected move being accepted. Given that moves are accepted according to
rand() < uBoltz * MPCoefficient,
the evaluation of the right-hand side must be incorrect. Since any comparison against NaN evaluates to false, and multiplication by INF is sign-correct, one of the terms must be evaluating to +INF.
The rules for multiplying by +INF are as follows: (+INF) * y = +INF for y > 0 (case 1), (+INF) * y = -INF for y < 0 (case 2), and (+INF) * 0 = NaN (case 3).
Since uBoltz = e^(-x) and the range of e^(-x) is y > 0, uBoltz is always positive, so we can eliminate all but case 1.
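The IEEE-754 comparison semantics being relied on here can be demonstrated directly (a standalone sketch, not GOMC code):

```cpp
#include <cmath>
#include <limits>

// Mirrors the acceptance test rand() < uBoltz * MPCoefficient.
// If the right-hand side is NaN, the comparison is always false (move
// silently rejected); if it is +inf, every finite rand() value passes
// (move always accepted) -- which is exactly the spike scenario.
bool accept(double randVal, double uBoltz, double mpCoeff) {
  return randVal < uBoltz * mpCoeff;
}
```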
In the calculation of MPCoefficient, there are 2 unbounded terms and 2 bounded terms.
I have written GoogleTests to evaluate edge cases for these terms, and I haven't found exact values that produce +INF. I am encouraged by my 5 runs with the 13-atom argon system, https://github.com/GOMC-WSU/GOMC/files/5165691/input_13_openmp.zip, with the isfinite() condition uncommented back into the code. I also extended the simulation length by a factor of 10.
MultiParticle.h:402
if(!std::isfinite(w_ratio)) {
  // This error can be removed later on once we know this part actually works.
  std::cout << "w_ratio is not a finite number. Auto-rejecting move.\n";
  return 0.0;
}
I anticipate that tomorrow I will be able to find ranges of values for our unbounded terms that produce +INF while still being considered valid by the simulation as-is. I intend to continue developing GoogleTest unit tests and to set up Travis CI to run them automatically on every push.
This is a promising way to track down the problem. Are you actually seeing this message? If so, can you trace backward and see where/why it is happening? Given the size of the system, we can't possibly have enough terms to be getting infinity just from multiplication of reasonable numbers. If we are sure it is coming from a specific function in CalculateEnergy.cpp, you could see which terms are returning infinity and then we can try to see why. But, you could also run in a debugger and use backtrace when the message occurs. This shouldn't be happening, as the computation should be the same, so we need to trace back and see what the root cause is.
@GregorySchwing Ok, good to see that you can reproduce the energy spikes with the small system and that they show up quickly (< 2000 steps).
I have finally been able to produce a checkpoint file that reproducibly gives an energy spike at step 5093. Input files are in the attached zip file. Checkpoint file starts the simulation at step 5092. This is running with openmp on 4 cores.
Pair (4, 10) on move 5093 has a distanceSq of ~10 [0.1, 3, 0.1]. This results in a dominant term added to the energy (277.56875223729924) and force (135.53110412987087, -2480.7820281362601, -87.440916078568691). Tomorrow I will look at the translation on move 5092 (the previous move) to see exactly how we place an atom in such an energetically unfavorable position relative to the other atoms.
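To put a number on how repulsive that configuration is: at distanceSq around 9-10, an argon pair sits inside the Lennard-Jones core. A quick sketch (epsilon and sigma here are typical argon values, not necessarily those in the attached input files):

```cpp
#include <cmath>

// Lennard-Jones pair energy, U(r) = 4*eps*((sigma/r)^12 - (sigma/r)^6),
// computed from the squared separation. The defaults are common
// argon-like parameters: eps/kB ~ 119.8 K, sigma ~ 3.405 Angstrom.
double ljEnergy(double rSq, double eps = 119.8, double sigma = 3.405) {
  double s2 = sigma * sigma / rSq;  // (sigma/r)^2
  double s6 = s2 * s2 * s2;         // (sigma/r)^6
  return 4.0 * eps * (s6 * s6 - s6);
}
```

At rSq ≈ 9 the separation is inside sigma, so the r^-12 term dominates and the pair contributes a large positive (repulsive) energy, which is consistent with a single bad pair producing a visible spike in the non-bonded energy.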
Back to the GPU. I simply cannot get reproducible results to create a checkpoint that will reliably produce the spike seen near step 1700000. These are two separate runs started from the same random number seed. input.zip
and the latest development code doesn't run (compiling with Intel 19.1):
Info: Using Device Number: 0
Info: GOMC COMPILED TO RUN CANONICAL (NVT) ENSEMBLE.
terminate called after throwing an instance of 'std::out_of_range'
  what():  _Map_base::at
Abort (core dumped)
This problem is due to the patch I pushed yesterday. I attached a patched code that should build. Please let me know what value you are seeing for OpenMP version and I will patch the code appropriately. It seems there is a version of OpenMP that I didn't include.
GitHub won't allow me to attach a .cpp file. Please check your email.
5.4.0-42-generic - this was related to my Ubuntu hardware.
New version of Main.cpp works.
This is the output I'm getting:
Info: Host Name: potoff32.eng.wayne.edu
CPU information:
Info: Total number of CPUs: 6
Info: Total number of CPUs available: 6
Info: Model name: Intel(R) Core(TM) i5-8600K CPU @ 3.60GHz
Info: System name: Linux
Info: Release: 3.10.0-1062.1.1.el7.x86_64
Info: Version: #1 SMP Tue Sep 3 10:41:00 CDT 2019
Info: Kernel Architecture: x86_64
Info: Total Ram: 15840.3MB
Info: Used Ram: 15083.9MB
Info: Working in the current directory: /home4/jpotoff/GOMC/AR/MP/NVT/GPU-t2
GPU information:
Info: Device Number: 0
Info: Device name: GeForce GTX 1060 6GB
Info: Memory Clock Rate (KHz): 4513000
Info: Memory Bus Width (bits): 192
Info: Peak Memory Bandwidth (GB/s): 216.624000
Info: Using Device Number: 0
Info: GOMC COMPILED TO RUN CANONICAL (NVT) ENSEMBLE.
Info: Compiled with OpenMP Version 201611
Info: Number of threads 1
Thanks. Pushed a patch to the development branch to handle the OpenMP version more robustly; it shouldn't crash again. Not sure what version 201611 maps to. It is somewhere between 4.5 and 5.0; I'm guessing a draft of the OpenMP 5.0 standard. gcc 7.3.0 supports 201511, which is the 4.5 spec.
https://www.openmp.org//wp-content/uploads/openmp-tr4.pdf
I believe 201611 is 5.0 rev 1, since its release date is November 2016.
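For reference, the _OPENMP macro encodes the spec release date as yyyymm, so a robust report can map the common values and fall back to printing the raw date for drafts such as 201611 instead of crashing (a sketch, not the actual patch):

```cpp
#include <string>

// Map the _OPENMP date macro (yyyymm) to a human-readable spec version.
// Unrecognized values, such as draft specs like 201611, are reported
// verbatim rather than treated as an error.
std::string openmpVersion(long date) {
  switch (date) {
    case 200505: return "2.5";
    case 200805: return "3.0";
    case 201107: return "3.1";
    case 201307: return "4.0";
    case 201511: return "4.5";
    case 201811: return "5.0";
    default:     return "unknown (" + std::to_string(date) + ")";
  }
}
```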
Thanks. Updated issue #219 with a patch that now outputs the version information for 201611.
Today I tried printing out various box energy component values: recip, real, and inter. I noticed small differences (6th or 7th decimal place) in these values from the start of the simulation. Running without electrostatics resolved the CPU/GPU difference in this case, at least for a small number of steps. Interestingly, commenting out the BoxReciprocalSetup and BoxReciprocal GPU kernels and using the CPU versions removed the differences I was seeing across multiple GPU runs for the real and inter terms. This is surprising, since those terms should not depend on that part of the code. I have not yet been able to track down the reason. I don't think this should be a problem for systems without electrostatics, but I thought I would mention it in case it is helpful.
Here are my latest data for SPC water. It seems that the latest development code (pulled on 9/4) is now producing a trajectory that gives the correct average energy in an NVT simulation with electrostatics.
After the fixes, CPU and GPU seem to match for NVT with 1000 isobutane molecules. However, the 73-atom argon system still differs. I will continue working on it tomorrow.
I pulled the development branch, compiled and ran the multiparticle move on the GPU. It seems the GPU code is still not following the same trajectory as the CPU code: input.zip