Amber-MD / cpptraj

Biomolecular simulation trajectory/data analysis.

cpptraj GIST gets stuck after 90% of results are written #1023

Closed EgorBulavkoSk closed 5 months ago

EgorBulavkoSk commented 1 year ago

Hello! I am trying to study the solvation thermodynamics of my protein. I have the latest CPPTRAJ version compiled with OpenMP. I am using a cluster node with 24 cores and 300 GB of RAM. The command is (skipping parm, trajin, and the stripping of CLA and SOD):

gist doeij gridcntr 38.155 35.63 35.175 griddim 54 54 61 gridspacn 0.75 out gist_1000-50000ww.out refdens 0.0329

It processes the trajectory successfully and computes both entropy terms, but while writing the GIST results for each voxel it gets stuck at 90% (the output file contains the info for all but a few dozen voxels), and then nothing happens. RAM usage is below 16%, and the "top" command shows a CPU load of around 100% rather than >1000% as in the previous calculation steps. The Eww_ij.dat file remains empty, and not a single .dx file is written.

Interestingly, if the doeij keyword is not specified, the calculation finishes successfully. But I need the Eww_ij.dat file, so I cannot skip it. I attach a log file just in case; there I use a 49000-frame trajectory, but nothing changes even if I use only 500 frames. gist.log

I would be happy if someone could suggest a solution!

EgorBulavkoSk commented 1 year ago

Update:

I experimented with the grid size and found that the program succeeds if the grid dimensions are 40 40 40 but gets stuck if they are 50 50 50. I have no idea whether it is a bug or a problem with my cluster node, since I have no other machine with this much RAM.

drroe commented 1 year ago

Hi, very sorry for the delay on this - I've been swamped lately.

The fact that it succeeds with smaller grid dimensions sort of makes sense in that there are fewer voxels to print out (and fewer Eww values to print). For your grid that worked you've got 64000 voxels, so the upper-triangle water-water interaction matrix size is (N*(N-1))/2 = 2047968000 values (already ~8 GB). The larger grid of 54x54x61 is 15819846750 values (a whopping ~60 GB). I'm guessing what is happening is that it's just taking a very long time to write all those values in ASCII format. Does that make sense?
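
The arithmetic above can be reproduced with a short sketch. It assumes 4 bytes per stored value (single precision), an assumption chosen only because it matches the ~8 GB and ~60 GB figures quoted; the function names are illustrative and are not part of cpptraj.

```python
def upper_triangle_count(n_voxels):
    """Number of unique voxel pairs, N*(N-1)/2."""
    return n_voxels * (n_voxels - 1) // 2


def matrix_size_gb(n_voxels, bytes_per_value=4):
    """Approximate size in GB (10**9 bytes); 4 bytes per value is an assumption."""
    return upper_triangle_count(n_voxels) * bytes_per_value / 1e9


# Grid dimensions taken from the thread: the one that worked and the one that hung.
for dims in [(40, 40, 40), (54, 54, 61)]:
    n = dims[0] * dims[1] * dims[2]
    print(dims, n, upper_triangle_count(n), round(matrix_size_gb(n), 1))
```

Under that assumption the two grids come out at roughly 8.2 GB and 63.3 GB of raw values. An ASCII dump would be larger still, since each value is written as several characters rather than 4 bytes.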

I need to add an option to GIST to skip ASCII and write the data in netcdf format or something. Thanks for bringing this to my attention.

EgorBulavkoSk commented 1 year ago

Thanks for the reply! Actually, with small grids there is no delay between the end of the calculation and the start of writing the Eww_ij.dat file. Of course it takes plenty of time, and the file size increases gradually. But for larger grids, the data writing does not even start. I am not a specialist: is converting the data to ASCII format a step that precedes writing it to the file? If so, why does the RAM usage not change at all during it?

EgorBulavkoSk commented 1 year ago

And I want to mention one more time that the program gets stuck while writing the main output file. For instance, with 170000 voxels, it writes the info for about 169980 of them, and for the voxel with index 169981 it starts writing the parameters but does not finish, stopping at a random point in the middle of the line.

drroe commented 1 year ago

> And I want to mention one more time that the program gets stuck while writing the main output file. For instance, with 170000 voxels, it writes the info for about 169980 of them, and for the voxel with index 169981 it starts writing the parameters but does not finish, stopping at a random point in the middle of the line.

But remember you mentioned that it does finish if you don't specify doeij, which indicates it's the water-water interaction calculation that is really the issue here. What you're seeing is likely a buffering issue; since the EIJ matrix write doesn't have a chance to finish (and hence the GIST output phase is not finished), STDOUT and the GIST output file don't have a chance for their buffers to be "flushed". I would bet that if you were to attach a debugger to your code where it appears stuck, it will be in the EIJ matrix write loop.
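
The buffering effect described above can be illustrated with a minimal sketch. This is plain Python file I/O, not cpptraj's actual output code, and the helper name and the sample line are hypothetical; it only shows that buffered data does not reach the file on disk until the buffer is flushed, which is why an output file can appear cut off mid-line while the program is busy elsewhere.

```python
import os
import tempfile


def sizes_before_and_after_flush(payload):
    """Write payload through a large buffer; return on-disk size before and after flush()."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # 1 MiB buffer: a small write stays in memory until flushed or closed.
        with open(path, "wb", buffering=1 << 20) as f:
            f.write(payload)
            before = os.path.getsize(path)  # nothing has reached the file yet
            f.flush()
            after = os.path.getsize(path)   # buffer contents are now in the file
        return before, after
    finally:
        os.remove(path)


# Hypothetical voxel line, standing in for one row of the GIST output file.
before, after = sizes_before_and_after_flush(b"voxel 169981 0.123 0.456\n")
print(before, after)
```

If the program never reaches the point where the buffer is flushed or the file is closed, the tail of the output stays in memory, which matches the truncated last line reported above.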

EgorBulavkoSk commented 1 year ago

Thanks for the clarification! It is unfortunate that this problem is not something that can be fixed quickly :(

In that case, should I close this issue now?

drroe commented 5 months ago

Note that there have been some improvements to the GIST code in the past year. If you're still encountering issues please feel free to reopen this issue or post a new one.