SSAGESproject / SSAGES

Software Suite for Advanced General Ensemble Simulations
GNU General Public License v3.0

Cannot use all CPUs for one-walker ABF method #8

Open sathishdasari opened 5 years ago

sathishdasari commented 5 years ago

Dear Sir, 1) I am trying to run a one-walker ABF simulation on an ADP system. My system consists of 1 socket, 4 cores per socket, and 2 threads per core (1x4x2 = 8 CPUs). But when I run the job using "ssages 1walker.json", it uses only 3 CPUs (from the %CPU column of the top command). How can I use all CPUs to get good performance?

2) When I run the same job on a system consisting of 2 sockets, 6 cores per socket, and 2 threads per core (2x6x2 = 24 CPUs), it gives the following error:

Fatal error:
Your choice of 1 MPI rank and the use of 24 total threads leads to the use of
24 OpenMP threads, whereas we expect the optimum to be with more MPI ranks with
1 to 6 OpenMP threads. If you want to run with this many OpenMP threads, specify
the -ntomp option. But we suggest to increase the number of MPI ranks.

3) A 2-walker job runs perfectly fine with full efficiency with the command mpirun -np 24 ssages 2walker.json on this system.

4) How do I know when the ABF method has converged using this software? Does the simulation terminate automatically after it converges? If not, how do I extend the simulation using this software?

5) How do I extract the structures of the free-energy minima from the trajectory? We do not have any file that prints the collective variable values along the simulation time, which would help identify the frame numbers needed to extract the structures corresponding to a particular minimum (like the COLVAR file in the PLUMED software).

mquevill commented 5 years ago
  1. When you call SSAGES without mpirun/mpiexec, you are only spawning one process. GROMACS sets OpenMP threads internally; there is code within GROMACS that chooses the number of threads if unspecified. By default, it will try to use all available threads (8 in your case). However, each thread may not use 100% of a core, depending on GROMACS's optimizations. For example, on my workstation each core is only at ~82%, which is why the %CPU value from top may make it look like only 3 CPUs are in use. If you call top -H, each process will show its threads separately, so this should show 8 lines of ssages.

  2. GROMACS attempts to optimize the threads and ranks for your simulation; this error comes from GROMACS's attempt to optimize the running parameters. One rank with 24 threads is often less efficient in GROMACS. For this, I would suggest using multiple ranks. To specify this, change ssages 1walker.json to mpirun -np 4 ssages 1walker.json to use 4 ranks, for example. This will use the MPI capabilities of GROMACS natively. [If, however, you would actually like to use 24 OpenMP threads, you can specify "-ntomp","24" within the "args" member of the .json file.]

  3. This is good to hear. In this case, you are specifying 24 MPI ranks, so GROMACS only assigns 1 OpenMP thread per rank.

  4. Currently, there is no criterion or indicator of convergence built into SSAGES. The development team has discussed various ways to do this, and work is currently in progress. To extend the simulation, add a JSON member to the method: "restart": true, which will read the files from the last run and continue from there. (If "restart" is false or unspecified, then the old files will be backed up once the new files are written.)

  5. You can set up a Logger that will print the CVs as the simulation proceeds. (Manual > Input Files > Simulation Properties > Logger) This can be helpful to track other CVs, while only sampling over a few. See below for the syntax:

"logger": {
        "frequency": 100,
        "output_file": "cvs.dat",
        "cvs": [0, 3]
}
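Once the Logger is producing a file like cvs.dat, question 5 above (finding the frames that sit in a free-energy minimum) can be handled with a short post-processing script. The sketch below is hypothetical and not part of SSAGES itself; it assumes each data line is whitespace-separated, with the iteration number first and one column per logged CV. Check the header of your own cvs.dat, since the exact column layout may differ.

```python
# Hypothetical post-processing sketch: scan Logger output for steps whose CV
# value lies near a chosen free-energy minimum, so the matching trajectory
# frames can be extracted afterwards with your usual trajectory tools.
def frames_near_minimum(lines, cv_index, center, width):
    """Return iteration numbers whose CV #cv_index lies within `width` of `center`."""
    hits = []
    for line in lines:
        cols = line.split()
        if not cols or cols[0].startswith("#"):
            continue  # skip comment/header lines
        iteration = int(cols[0])
        cv_value = float(cols[1 + cv_index])
        if abs(cv_value - center) <= width:
            hits.append(iteration)
    return hits

# Typical use, assuming a minimum near CV = 0.0:
# frames = frames_near_minimum(open("cvs.dat"), cv_index=0, center=0.0, width=0.2)
```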

If you have any further questions, please let us know!

sathishdasari commented 5 years ago

Thank you very much for your suggestions.

sathishdasari commented 5 years ago

Dear Sir,

  1. How do I specify the logger for a 2-walker simulation to print CVs along with the simulation time?
  2. When I try to restart a 2-walker simulation which crashed in between, it gives the following error:
[mm3:06753] *** Process received signal ***
[mm3:06753] Signal: Segmentation fault (11)
[mm3:06753] Signal code: Address not mapped (1)
[mm3:06753] Failing at address: 0x428
  3. I tested restarting a 1-walker job which crashed in between, and it restarts perfectly.
mquevill commented 5 years ago
  1. The JSON member "output_file" can take an array of strings. For two walkers, for example, you can use this:

    "output_file": ["cvs_w0.dat", "cvs_w1.dat"]
  2. Do you get this error right at the beginning of the simulation? Or does the simulation start and the error occurs somewhere in the middle of the simulation? I have been able to restart the included 2 walker ADP example without a segmentation fault. Make sure that you are restarting a simulation with the same details (method parameters, number of walkers, etc.). If you have changed something about the method, then the software might have incorrect data when trying to read the files in.
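For reference, the "restart" member discussed earlier goes inside the method definition of the JSON input. A minimal sketch follows; the "type" key and the surrounding structure here follow the bundled ADP example and may differ in your input file, and every method parameter must stay identical to the original run:

```json
"methods": [{
    "type": "ABF",
    "restart": true
}]
```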

sathishdasari commented 5 years ago

Thank you.

sathishdasari commented 5 years ago

I was trying a 2-walker simulation of ADP in solvent. After some time the job was killed, displaying the following error.

*** Error in `ssages': free(): invalid pointer: 0x00000000012dc3e0 ***
sathishdasari commented 5 years ago

Dear Sir, When I was trying to extend a 2-walker simulation, it displayed the following error.

*** error in `ssages': corrupted size vs. prev_size: 0x00000000025c24d0 ***
mquevill commented 5 years ago

I'm afraid that these error messages aren't enough to help diagnose your problem. If there is more output surrounding these error messages, please copy as much as is relevant.

Or if your issue is reproducible, you can attach the files needed to run your simulation so that the development team can try to reproduce your issue. This way, we can try to debug whatever is happening in this system.

sathishdasari commented 5 years ago

Dear Sir, I could not share the files, as the file size is more than 10 MB. I just changed the args in the 2walker.json file from

"args" : ["-s","-deffnm","adp"],

to

"args" : ["-s","-deffnm","adp","-cpi", "adp", "-append"],

and added

"restart" : true,

to the .json file. I used the following command to run the simulation on a system consisting of 2 sockets, 6 cores per socket, and 2 threads per core (2x6x2 = 24 CPUs):

mpirun -np 24 ssages 2walker.json &

I am getting the following error:

*** Error in `ssages': corrupted size vs. prev_size: 0x0000000001e824a0 ***
[ccl2:22785] *** Process received signal ***
[ccl2:22785] Signal: Aborted (6)
[ccl2:22785] Signal code:  (-6)