GOMC-WSU / GOMC

GOMC - GPU Optimized Monte Carlo is a parallel molecular simulation code designed for high-performance simulation of large systems
https://gomc-wsu.org
MIT License
76 stars 36 forks source link

Segfault when running MEMC move #143

Open jpotoff opened 5 years ago

jpotoff commented 5 years ago

@msoroush Running MEMC move V2 with single atom molecules (methane and argon) causes a segfault.

GOMC_input.zip

jpotoff commented 5 years ago

It seems that V3 works ok, but V2 does not. Running V3 requires faking the "backbone" by putting the same atom name twice.

jpotoff commented 5 years ago

It seems that V3 works ok, but V2 does not. Running V3 requires faking the "backbone" by putting the same atom name twice.

I take that back. V3 is also dumping core.

jpotoff commented 5 years ago

This is getting bizarre. If I set the console frequency to 1000 steps, it runs. Set to 10,000 or 100,000 steps and I get a segfault. Set console frequency to 10,001 (or 100,001) and it runs.

jpotoff commented 5 years ago

This is getting bizarre. If I set the console frequency to 1000 steps, it runs. Set to 10,000 or 100,000 steps and I get a segfault. Set console frequency to 10,001 (or 100,001) and it runs.

And now it doesn't follow that trend. Maybe something is going on with how the code is seeding the random number generator?

jpotoff commented 5 years ago

This is getting bizarre. If I set the console frequency to 1000 steps, it runs. Set to 10,000 or 100,000 steps and I get a segfault. Set console frequency to 10,001 (or 100,001) and it runs.

And now it doesn't follow that trend. Maybe something is going on with how the code is seeding the random number generator?

Turns out this is coming from a very bad initial configuration. Instead of the code identifying the error and gracefully stopping (with an error message), it just barfs.

msoroush commented 5 years ago

@jpotoff would you please share the output console file?

msoroush commented 5 years ago

@jpotoff @YounesN The segfault error is not related to MEMC move. The start pdb files has not been generated for the defined simulation box. You can see that from console output.

Info: Box  0:  Periodic Cell Basis 1     20.000   0.000   0.000 
Info: Box  0:  Periodic Cell Basis 2      0.000  20.000   0.000 
Info: Box  0:  Periodic Cell Basis 3      0.000   0.000  20.000 

Info: Box  1:  Periodic Cell Basis 1     50.000   0.000   0.000 
Info: Box  1:  Periodic Cell Basis 2      0.000  50.000   0.000 
Info: Box  1:  Periodic Cell Basis 3      0.000   0.000  50.000 

Minimum coordinates in box 0: x =    1.999, y =    1.987, z =    2.000
Maximum coordinates in box 0: x =   20.001, y =   20.001, z =   20.004
Wrapping molecules inside the simulation box 0:
Minimum coordinates in box 1: x =   -5.708, y =   -5.726, z =   -5.437
Maximum coordinates in box 1: x =   49.837, y =   47.627, z =   49.865
Wrapping molecules inside the simulation box 1:

GOMC will wrap the molecule using defined simulation box which leads to large nonbonded energy:

Warning: Large energy detected due to the overlap in initial configuration.
         The total energy will be recalculated at EqStep to ensure the accuracy 
         of the computed running energies.

Non of these problems should give us segfault. When I was debugging, I notice that we have similar problem as issue #123. Some coordinates after wrapping would be placed right on the box axis and we get segfault in CellList (out of range cell number). This is due precision problem.

CellList.h:172: box 0, pos: [2.0000e+01, 3.1560e+00, 9.9050e+00]
AxisDimensions: [2.0000e+01, 2.0000e+01, 2.0000e+01]
CellList.h:162: box 0, Out of cell
AxisDimensions: [2.0000e+01, 2.0000e+01, 2.0000e+01]
Segmentation fault: 11

@jpotoff this is consistence with the behavior that you saw (segfault at random steps).

@YounesN What do you think if check the cell position? We can change this https://github.com/GOMC-WSU/GOMC/blob/96bfa733c3fbf81061aaaa1887f271abe8f80542/src/CellList.h#L81-L89

to

inline int CellList::PositionToCell(const XYZ& posRef, int box) const
{
  //Transfer to unslant coordinate to find the neighbor
  XYZ pos = dimensions->TransformUnSlant(posRef, box);
  int x = (int)(pos.x / cellSize[box].x);
  int y = (int)(pos.y / cellSize[box].y);
  int z = (int)(pos.z / cellSize[box].z);
  int cell = x * edgeCells[box][1] * edgeCells[box][2] + y * edgeCells[box][2] + z;
  //Check the cell position to avoid segfault
  if(cell == head[box].size()) {
    --cell;
  }
  return cell;
}
jpotoff commented 5 years ago

This issue seems to be coming up a lot lately. I know it's not because of the MEMC move, but it's also happening with restart files that were produced from GCMC simulations with the MEMC move.

msoroush commented 5 years ago

//Check the cell position to avoid segfault if(cell == head[box].size()) { --cell; }

@YounesN This logic would not work because it returns the cell index in 1D array. We check check the cell index for each axis, like here:

inline int CellList::PositionToCell(const XYZ& posRef, int box) const
{
  //Transfer to unslant coordinate to find the neighbor
  XYZ pos = dimensions->TransformUnSlant(posRef, box);
  int x = (int)(pos.x / cellSize[box].x);
  int y = (int)(pos.y / cellSize[box].y);
  int z = (int)(pos.z / cellSize[box].z);
  //Check the cell number to avoid segfult for coordinates close to axis
 // x, y, and z should never be equal or greater than number of cells in x, y, and z axis, respectively. 
  x -= (x == edgeCells[box][0] ?  1 : 0);
  y -= (y == edgeCells[box][1] ?  1 : 0);
  z -= (z == edgeCells[box][2] ?  1 : 0);
  return x * edgeCells[box][1] * edgeCells[box][2] + y * edgeCells[box][2] + z;
}
msoroush commented 5 years ago

@jpotoff I commit the suggested fix to the development branch. Would you please pull and recompile the development branch. Please let me know if you saw similar problem. I reran your system multiple times but I did not get any segfault.

msoroush commented 5 years ago

This issue seems to be coming up a lot lately. I know it's not because of the MEMC move, but it's also happening with restart files that were produced from GCMC simulations with the MEMC move.

@jpotoff it's a random behavior. For example in @niloofartf system (issue #123) she is only using displacement and rotation move:

################################
# MOVE FREQUENCY              
################################
DisFreq              0.80  
RotFreq              0.20
RegrowthFreq         0.00
CrankShaftFreq       0.00

################################
# BOX DIMENSION #, X, Y, Z
################################
CellBasisVector1  0  60.0 0.0  0.0
CellBasisVector2  0  0.0  60.0 0.0
CellBasisVector3  0  0.0  0.0  180.0

In MEMC, we call wrap function to make sure the molecule would be in box. Any small round up error in the positionToCell function would give us segfault error. Let's see if this fix would prevent the segfault error.

jpotoff commented 5 years ago

I get that it's random. But why is the simulation sometimes throwing a molecule slightly outside the box? I'm getting restart files that result in a "molecule outside of the box" error.

msoroush commented 5 years ago

@jpotoff Can you send me your file. In my Test, I restart the simulation from Outputted Restart pdb file and I didn’t see any error. Because of the precision error in wrapping function, it puts the atom of they molecule slightly over the edge of the simulation box.

jpotoff commented 5 years ago

example.zip

msoroush commented 5 years ago

@jpotoff Is there any specific reason that you specify 2 atom for argon without defining a bond in parameter file?

    2549 C1A  2549 C1A  C1   CH4    0.000000       16.0430           0
    2550 C1A  2550 C1A  C1   CH4    0.000000       16.0430           0
    2551 AR   1    AR   AR   AR     0.000000       39.9480           0
    2552 AR   1    AR   AR   AR     0.000000       39.9480           0
    2553 AR   2    AR   AR   AR     0.000000       39.9480           0
    2554 AR   2    AR   AR   AR     0.000000       39.9480           0
    2555 AR   3    AR   AR   AR     0.000000       39.9480           0
    2556 AR   3    AR   AR   AR     0.000000       39.9480           0
    2557 AR   4    AR   AR   AR     0.000000       39.9480           0
    2558 AR   4    AR   AR   AR     0.000000       39.9480           0

The segfault is caused by CBMC, when it is trying to generate the second AR atom but there is no bond. In FreeEnergy branch we generate a warning:

Reading from box 0 PDB coordinate file:     ./C1A_BOX_0_restart.pdb
Reading from box 1 PDB coordinate file:     ./C1A_BOX_1_restart.pdb
Random number seed: 1964306007
Warning: Bond is missing for AR !

If you use the setup_box_0 and setup_box_1 PDB file, you should be fine. Maybe I can find a place to generate Error Message for this case.

jpotoff commented 5 years ago

That file was output from GOMC. Perhaps there is a bug in the output of the restart file? The original input does not look like that.

msoroush commented 5 years ago

@jpotoff I thought about it and check the restart output pdb and psf file using setup_box pdb and psf as an input file and everything was fine. Would you please build your system again and run it?

jpotoff commented 5 years ago

I know if I use setup_box_0, etc things are fine. The issue is with the restart file being produced by GOMC.