Valdes-Tresanco-MS / gmx_MMPBSA

gmx_MMPBSA is a new tool based on AMBER's MMPBSA.py aiming to perform end-state free energy calculations with GROMACS files.
https://valdes-tresanco-ms.github.io/gmx_MMPBSA/
GNU General Public License v3.0

Allow resume of calculations #106

Closed: pacho2 closed this issue 2 years ago

pacho2 commented 2 years ago

I see that Slurm is about to kill a job that has been running for 15 days, and I wondered how to resume the PBSA calculations afterwards. Looking at --help, I wondered if maybe -rewrite-output could do it... but I think it is not exactly the same :/ I didn't find anything related in https://valdes-tresanco-ms.github.io/gmx_MMPBSA/howworks/

Thanks a lot for your help!

marioernestovaldes commented 2 years ago

At the moment it is not possible to resume the calculations, since those would be rare cases; sorry about that... usually these calculations don't take that long... if this is your case, I recommend running the calculation on chunks of the trajectory, and then you can process everything together at the end

cheers!

Mario E.
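One way to set up the chunked workflow described above is to split the frame range into contiguous sub-ranges and run each as a separate gmx_MMPBSA job, each with its own startframe/endframe in the &general section (a minimal sketch; the frame numbers and chunk count are illustrative, not values from this thread):

```python
# Sketch: split a frame range into chunks that can be run as
# independent gmx_MMPBSA jobs (frame numbers are illustrative).
def frame_chunks(start, end, n_chunks):
    """Split the inclusive range [start, end] into n_chunks
    contiguous (startframe, endframe) pairs."""
    size = (end - start + 1) // n_chunks
    chunks = []
    for i in range(n_chunks):
        s = start + i * size
        e = end if i == n_chunks - 1 else s + size - 1
        chunks.append((s, e))
    return chunks

# e.g. frames 9000-10000 split into 4 jobs:
for s, e in frame_chunks(9000, 10000, 4):
    print(f"startframe={s}, endframe={e}")
```

Each chunk produces its own output, and the per-chunk results can then be combined at the end as suggested.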

pacho2 commented 2 years ago

In my case, with 100 frames it takes only a few hours to run... but with 1000 it takes more than 15 days... I don't know exactly why :/

Is there any documentation for running in chunks, saving the data, and finally processing the files together at the end?

Thanks a lot :)

Valdes-Tresanco-MS commented 2 years ago

@pacho2 It's not clear to me why it takes so long. Are you using the same configuration in both cases, i.e. the same number of CPUs? It makes me think you are running out of resources and your system is overloaded. In the tests we have done, the PB model, depending on the size of the system and the parameter settings, can take quite a lot of resources (1.9 GB per thread for the Protein-Ligand example, so 10 CPUs ~= 19 GB of RAM). This means that if you use many threads, you may run out of RAM, which would cause each thread's data to be swapped out and the threads to compete with each other to carry out the calculations. This can lead to system inconsistencies, memory errors, and overall inefficiency. Please check whether this is the problem.
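The per-thread RAM figure quoted above can be turned into a quick estimate before submitting a job (a sketch; the 1.9 GB/thread value is the one reported for the Protein-Ligand example and will vary with system size and PB settings):

```python
# Rough RAM estimate for parallel PB runs, using the ~1.9 GB/thread
# figure mentioned above (Protein-Ligand example; your system may differ).
GB_PER_THREAD = 1.9

def estimated_ram_gb(n_threads):
    """Estimated total RAM (GB) for a run with n_threads threads."""
    return n_threads * GB_PER_THREAD

print(round(estimated_ram_gb(10)))  # the 10-CPU figure quoted above
print(round(estimated_ram_gb(56)))  # a 56-thread run, as in this thread
```

If the estimate approaches the node's memory limit, reducing the thread count may be safer than risking swapping.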

Regarding resuming the calculations: we have it in our plans; however, it would take a long time to implement and we are not entirely convinced that it is feasible. Therefore, we believe it is more practical to dedicate the little time we have to other, more important features.

Mario S.

pacho2 commented 2 years ago

I am running the same setup -> 56 threads (two nodes of 28 cores) with Slurm configured to limit memory to 128 GB. The job does not get killed for trying to exceed that amount of RAM or for anything else :/ If I log into the nodes while it is running, I see there are still about 66 GB of RAM free, and I cannot see anything abnormal in the dmesg output :/

The only change I made is modifying the input file to get ~1000 frames instead of ~100. For example, I am now trying 500 frames (running for 19 hours so far):

&general
sys_name="Prot-Lig-CHARMM",
# Some papers use only 100 snapshots, but we will try to do 1000 as https://doi.org/10.1021/acs.jcim.9b00245
#startframe=8000, endframe=10000, interval=20, verbose=2,
startframe=9000, endframe=10000, interval=2, verbose=2,
# https://valdes-tresanco-ms.github.io/gmx_MMPBSA/examples/Entropy_calculations/Interaction_Entropy/
#entropy=2, entropy_seg=25, 
temperature=310
/
&pb
# radiopt=0 is recommended, which means using radii from the prmtop file for both the PB calculation and the NP
# calculation
# linit value from https://github.com/Valdes-Tresanco-MS/gmx_MMPBSA/issues/96
istrng=0.0, fillratio=4.0, radiopt=0, linit=10000
/

Valdes-Tresanco-MS commented 2 years ago

I see that linit=10000. If I remember correctly, @marioernestovaldes suggested this change because the PB calculations were not converging. This value is 10 times higher than usual, so it is to be expected that the run takes longer than normal. Although the trajectory pre-processing and the structure and topology generation are done on a single core, they shouldn't take that long. So that big difference, 500 frames/19 hours vs. 1000 frames/15 days, seems strange to me. Also note that the trajectory must not have PBC artifacts, because these can considerably increase the computation time. For our article we carried out the calculation on 200,000 frames, although with GB, and gmx_MMPBSA showed no inconsistencies, so the problem could be in the configuration, system size, or trajectory.

pacho2 commented 2 years ago

Hi,

I was using linit=10000 also for the case with only 100 frames that was taking half an hour to finish ( https://github.com/Valdes-Tresanco-MS/gmx_MMPBSA/issues/96#issuecomment-982987192 )

Also, it's exactly the same trajectory, just calculated with more frames than 100...

Regarding the PBC: in all cases I am running it with a trajectory processed by echo 0 | gmx trjconv -f step7.xtc -o step7_mol.xtc -pbc mol -s step7.tpr

Should I process it in a different way?

Valdes-Tresanco-MS commented 2 years ago

The fitting of the trajectory depends on the characteristics of your system. I am not entirely familiar with your system, so I will discuss it with @marioernestovaldes, who is the one who has been helping you. However, the easiest way to check whether your system has PBC artifacts is to visualize the trajectory in VMD or to run an RMSD analysis. Although it doesn't scale linearly, you should be able to extrapolate from the 100-frame time to the 1000 frames you want to run now. I would expect about 6-8 hours for 1000 frames (30 min * 1000/100, plus a 20-30% performance loss). Since you use the same trajectory at different intervals and the settings are the same, I think the problem may be the trajectory. Check it and make sure it is consistent across the entire frame range.
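The extrapolation above can be checked with simple arithmetic (a sketch; the 20-30% performance-loss factor is the assumption stated in the comment, not a measured value):

```python
# Back-of-envelope extrapolation from the measured 100-frame timing,
# as described above; the loss factor is an assumed overhead.
def extrapolate_minutes(base_min, base_frames, target_frames, loss):
    """Scale a measured runtime linearly in frame count,
    inflated by a fractional performance loss."""
    return base_min * (target_frames / base_frames) * (1 + loss)

low = extrapolate_minutes(30, 100, 1000, 0.20) / 60   # hours
high = extrapolate_minutes(30, 100, 1000, 0.30) / 60  # hours
print(round(low, 1), round(high, 1))  # roughly the 6-8 h range quoted
```

Anything far beyond that range (like the 15 days reported) points to a problem other than simple scaling.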

If the problem persists, send us your files via Google Drive and we will try to find the source of the problem.

Best! Mario S.

pacho2 commented 2 years ago

Thanks a lot

I just confirmed it still takes 30 min with 100 frames and the same input settings... I am now trying 200 to see where the problem starts (with 500 it had been running for 19 hours and was still not finished)

Valdes-Tresanco-MS commented 2 years ago

Try running the calculations for 100 frames from different parts of the trajectory, for example, 100 frames from the beginning, the middle, the end, or spread out with an interval, and see if it still takes only 30 minutes.
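A sketch of such a sweep, generating 100-frame &general windows to time separately (the frame numbers are illustrative, not values from this thread; adjust them to your trajectory):

```python
# Candidate 100-frame windows to time separately, as suggested above
# (frame numbers are illustrative; adjust to your trajectory length).
total_start, total_end = 8000, 10000
mid = (total_start + total_end) // 2
windows = [
    (total_start, total_start + 99, 1),                         # beginning
    (mid - 50, mid + 49, 1),                                    # middle
    (total_end - 99, total_end, 1),                             # end
    (total_start, total_end, (total_end - total_start) // 99),  # spread out
]
for start, end, interval in windows:
    print(f"startframe={start}, endframe={end}, interval={interval}")
```

If one window is much slower than the others, the problematic frames lie in that part of the trajectory.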

pacho2 commented 2 years ago

I have always extracted the frames from the last 100 ns of the simulation. 100 frames work fine; I have tried from 400 to 500 ns, also 400 to 450 and 450 to 500. But as soon as I try to get, for example, 200 frames (by changing the "interval" value), it goes from taking ~30 min to not finishing in days :/

You can access the files here: https://drive.google.com/drive/folders/1Rg70uIV6lp3OJ6rUXzmhy3nm6gbQsiXq?usp=sharing

I think I included everything needed... but feel free to tell me if something is missing

Thanks a lot :)

Valdes-Tresanco-MS commented 2 years ago

I sent you a request to access the files

pacho2 commented 2 years ago

I just granted it. Also, launch_job_matrics-gmx_MMPBSA.sh is the slurm script

Best regards

Valdes-Tresanco-MS commented 2 years ago

I am going to do some tests and I will let you know as soon as I have the results

Valdes-Tresanco-MS commented 2 years ago

After doing many tests we have not been able to identify the problem. After refitting the trajectory, the calculations work fine up to a point where they stop converging. I'll let the calculations run for a while to identify the frame at which they start to fail. Sorry for the delay, but debugging this case has been quite complex. The problem is in the convergence of PB, because it does not happen with GB. We increased the value of linit to 20,000 but the problem persists. We will do more tests. So far we have noticed that the lipid patch is quite small, and the conformational changes of the peptide can bring it quite close to the box border, which could have a marked effect on the interaction energy.

pacho2 commented 2 years ago

Thanks a lot to you for the huge effort and help :)