GOMC-WSU / GOMC

GOMC - GPU Optimized Monte Carlo is a parallel molecular simulation code designed for high-performance simulation of large systems
https://gomc-wsu.org
MIT License

NaN produced by Molecular Transfer near edge of box #467

Open GregorySchwing opened 2 years ago

GregorySchwing commented 2 years ago

Describe the bug
The molecular transfer move, when calculating the weights of the CBMC trials, sums the weights of all trials. The problem is that one of the trials has a NaN inter energy and an infinite real-space energy, producing a NaN weight that causes GOMC to crash.

I was able to prevent this from happening by adding only the finite weights to the sum:

double stepWeight = 0;
uint numFiniteLJTrials = calculateStepWeight(nLJTrials, stepWeight, ljWeights, data->ff.beta, inter, real);
uint winner = prng.PickWeighted(ljWeights, nLJTrials, stepWeight);
newMol.UpdateOverlap(overlap[winner]);
newMol.MultWeight(stepWeight / numFiniteLJTrials);

That is, normalizedStepWeight = sum / numberOfFiniteWeights.
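
For reference, here is a minimal sketch of what a calculateStepWeight helper along these lines could look like. The signature just mirrors the call above; the per-trial inter/real arrays and the way GOMC actually fills ljWeights are assumptions, not the real patch.

```cpp
#include <cmath>

typedef unsigned int uint;  // stand-in for GOMC's uint typedef

// Sketch: compute Boltzmann weights for the CBMC trials, but add only the
// finite ones to the step weight; non-finite trials get weight 0 so
// PickWeighted can never select them. Signature and array layout are assumed.
uint calculateStepWeight(uint nLJTrials, double &stepWeight, double *ljWeights,
                         double beta, const double *inter, const double *real) {
  uint numFinite = 0;
  stepWeight = 0.0;
  for (uint t = 0; t < nLJTrials; ++t) {
    ljWeights[t] = std::exp(-beta * (inter[t] + real[t]));
    if (std::isfinite(ljWeights[t])) {
      stepWeight += ljWeights[t];
      ++numFinite;
    } else {
      ljWeights[t] = 0.0;  // NaN/Inf weight: treat this trial as impossible
    }
  }
  return numFinite;
}
```

Note that if every trial came out non-finite, numFinite would be 0 and the MultWeight(stepWeight / numFiniteLJTrials) call above would itself divide by zero, so that edge case needs its own guard.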

I only did this in the DCSingle file; unfortunately, these kinds of summations are peppered throughout the DC files.

Also, I am concerned that I may need to track the numberOfFiniteWeights to maintain detailed balance.

To Reproduce
Run the attached GCMC simulation of the water shell with TIP3 water: bugFix_bad_CBMC_trial_Shouldnt_Kill_GOMC.zip

Expected behavior
If a trial produces an extremely poor energy, give it a weight of 0 instead of crashing GOMC.

Screenshots
The green sphere indicates the position of the growing molecule.

[Screenshot from 2022-10-06 13-35-40 and Screenshot from 2022-10-06 13-50-23]

Input files
Build the dev version and run with gomc_production....conf


GOMC-WSU commented 2 years ago

Ok, so why are we getting infinite energy for one of the trials?

> Also, I am concerned that I may need to track the numberOfFiniteWeights to maintain detailed balance.

Yes, you will have to do this. But first, we need to figure out why we are even getting an infinite energy. That should not happen. It would be better to set the weight of a very bad configuration to zero and keep it in the sum.

GregorySchwing commented 2 years ago

> Ok, so why are we getting infinite energy for one of the trials?
>
> > Also, I am concerned that I may need to track the numberOfFiniteWeights to maintain detailed balance.
>
> Yes, you will have to do this. But first, we need to figure out why we are even getting an infinite energy. That should not happen. It would be better to set the weight of a very bad configuration to zero and keep it in the sum.

One of the randomly generated trial coordinates is identical to a system coordinate, resulting in a distSq of 0.

Then, when we calculate the LJ energy:

inline double FFParticle::CalcEn(const double distSq, const uint index) const {
  double rRat2 = sigmaSq[index] / distSq;

rRat2 is then the result of a division by zero, and the inter energy comes out as NaN.
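
A standalone illustration (not GOMC code) of how distSq == 0 poisons a standard 12-6 LJ evaluation; GOMC's actual functional form (e.g. a Mie potential) may differ, but the arithmetic is the same:

```cpp
#include <cmath>
#include <cstdio>

// distSq == 0 gives rRat2 = Inf (if sigmaSq > 0) or 0/0 = NaN (if sigmaSq == 0),
// and either way the final energy is NaN. Parameters are illustrative only.
static double lj(double sigmaSq, double epsilon, double distSq) {
  double rRat2 = sigmaSq / distSq;
  double rRat6 = rRat2 * rRat2 * rRat2;
  return 4.0 * epsilon * (rRat6 * rRat6 - rRat6);  // Inf - Inf (or NaN arithmetic) == NaN
}

int main() {
  std::printf("sigma > 0 site:  %g\n", lj(3.15 * 3.15, 0.152, 0.0));
  std::printf("sigma == 0 site: %g\n", lj(0.0, 0.0, 0.0));
  return 0;
}
```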

This could be prevented with a simpler change, from this:

if (forcefield.rCutSq < distSq) return 0.0;

to this

if (forcefield.rCutSq < distSq) return 0.0; else if (distSq == 0.0) return INFINITY;

The exact same coordinate (to double precision) is being generated at random, which makes me think the RNG isn't being advanced.

Since this is a restarted GOMC simulation, but not a restart-from-checkpoint simulation, and an intseed is provided, GOMC reuses the same random numbers because both runs start from step 0.

Basically on GOMC run 1, step 12 successfully transferred a molecule into position xyz (which is a function of step # and seed).

Then on GOMC run 2, it attempted to transfer a molecule into position xyz (which is a function of step # and seed), which is the identical position as the molecule already there.

Hence distSq == 0, and we divide by distSq in the LJ Particle inter calculation.
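
A tiny demonstration of the effect (std::mt19937_64 stands in here for GOMC's actual PRNG, so this is only an analogy): two runs given the same integer seed and both starting at step 0 draw exactly the same stream, so they propose bit-for-bit identical trial coordinates.

```cpp
#include <cstdio>
#include <random>

int main() {
  // Same "intseed", both runs starting from step 0.
  std::mt19937_64 run1(12345), run2(12345);
  std::uniform_real_distribution<double> coord(0.0, 30.0);  // fake box length

  for (int step = 0; step < 3; ++step) {
    double x1 = coord(run1), x2 = coord(run2);
    std::printf("step %d: run1 x = %.17g, run2 x = %.17g, identical = %d\n",
                step, x1, x2, x1 == x2);
  }
  return 0;
}
```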

I recommend we do a find/replace of

if (forcefield.rCutSq < distSq) return 0.0;

with

if (forcefield.rCutSq < distSq) return 0.0; else if (distSq == 0.0) return INFINITY;

Also

if (forcefield.rCutCoulombSq[b] < distSq) return 0.0;

with

if (forcefield.rCutCoulombSq[b] < distSq) return 0.0; else if (distSq == 0.0) return INFINITY;

For IEEE floats, division of a finite nonzero float by 0 is well-defined and results in +infinity (if the value was greater than zero) or -infinity (if the value was less than zero). The result of 0.0/0.0 is NaN. If you use integers, the behaviour is undefined.

GOMC uses the C++ exp() function, which doesn't mind INF values; it's only NaN values that cause GOMC to crash.
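
A quick check of that claim (the beta value is arbitrary, for illustration only):

```cpp
#include <cmath>
#include <cstdio>

int main() {
  const double beta = 1.0 / (0.0019872041 * 300.0);  // roughly 1/kT at 300 K, illustrative

  // exp(-beta * Inf) == 0: an infinitely bad trial simply gets weight 0.
  double weightInf = std::exp(-beta * INFINITY);

  // exp of NaN is NaN: it propagates into the weight sum and PickWeighted.
  double weightNaN = std::exp(-beta * NAN);

  std::printf("weight for Inf energy: %g\n", weightInf);
  std::printf("weight for NaN energy: %g\n", weightNaN);
  return 0;
}
```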

GregorySchwing commented 2 years ago

Alternatively, we could print out an error informing the user that they are reusing random numbers and that their simulation will be correlated with the previous GOMC run they are restarting from.

The error message would include instructions to pass in

"InitStep lastStepPreviousRunGeneratedARestartFile"

or

"Checkpoint true checkpoint.file"

in the conf file.
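
A rough sketch of what that startup check could look like; the flag names (isRestart, useCheckpoint, seedIsFixed, initStep) are placeholders, not GOMC's actual config-reader variables:

```cpp
#include <cstdio>
#include <cstdlib>

// Placeholder names throughout; only the shape of the check is meant here.
void CheckForReusedRandomStream(bool isRestart, bool useCheckpoint,
                                bool seedIsFixed, unsigned long initStep) {
  if (isRestart && !useCheckpoint && seedIsFixed && initStep == 0) {
    std::fprintf(stderr,
        "Error: this restarted run uses the same intseed and starts from step 0,\n"
        "so it will reuse the previous run's random numbers.\n"
        "Either add\n"
        "  InitStep <last step of the run that wrote the restart files>\n"
        "or\n"
        "  Checkpoint true checkpoint.file\n"
        "to the .conf file.\n");
    std::exit(EXIT_FAILURE);
  }
}
```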

LSchwiebert commented 2 years ago

I think it's a bad idea to allow someone to restart the simulation with the same random sequence/seed that was used at step 0. I can't think of a good reason for wanting to do this, so it would be done only by mistake. I think it would be better to generate an error in this case or just handle it in the code without user-intervention if possible.

In any case, it would be a good idea to handle the degenerate case of distSq == 0.0. It seems, though, that any value that results in rejecting the move would be fine, so Infinity works.

GOMC-WSU commented 2 years ago

I agree with @LSchwiebert. Restarting with the same random number sequence as a prior simulation is bad practice, unless the goal is to debug something. But in that case, one would be using the checkpoint functionality, and expecting that behavior.

BTW, for production runs for publications, we should be starting each simulation with a unique random number seed (e.g. seeded by the system time, or by a well-defined, unique seed).
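
For what it's worth, a one-liner sketch of seeding from the system clock (illustrative only; in practice this would go through GOMC's own seed handling in the conf file, and the seed should be logged so the run stays reproducible):

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <random>

int main() {
  // Derive a per-run seed from the wall clock and record it in the log.
  const auto seed = static_cast<std::uint64_t>(
      std::chrono::system_clock::now().time_since_epoch().count());
  std::printf("PRNG seed for this run: %llu\n",
              static_cast<unsigned long long>(seed));
  std::mt19937_64 rng(seed);
  (void)rng;  // the simulation would draw from rng
  return 0;
}
```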

LSchwiebert commented 2 years ago

I think we should generate an error message and terminate if the user is restarting with the same random seed as described above. Alternatively, we could generate a warning if the user is running in expert mode, but we should be able to work around this for debugging purposes, such as using a checkpoint file or specifying the step as above, so that we get consistent random numbers for testing.

We should include the information on how to fix the config file as mentioned above.

LSchwiebert commented 2 years ago

In other words, I'm fine with generating an error in all cases, even expert mode.