ACEsuit / mace

MACE - Fast and accurate machine learning interatomic potentials with higher order equivariant message passing.

ASE-MD terminates with "CUDA out of memory" #174

Closed bapatist closed 7 months ago

bapatist commented 12 months ago

Hello all, my ASE MD in the NPT ensemble for a 2124-atom simulation crashes after running for 17.5 picoseconds. The system is a solid-water interface containing three atomic species. It ran for ~2 h 45 min before crashing with the "CUDA out of memory" error. The system does not explode; velocities are stable until the very end. I used a single GPU on a shared node for this task:

#SBATCH --gres=gpu:a100:1
#SBATCH --cpus-per-task=18
#SBATCH --mem=125000
export OMP_NUM_THREADS=18
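
For context, the NPT setup looks roughly like the sketch below (file names, temperatures, pressure, and timings here are placeholders, not the exact settings in the attached ase_npt.py):

```python
# Minimal sketch of an ASE NPT run driven by a MACE calculator.
# All values and file names below are placeholders.
from ase import units
from ase.io import read
from ase.md.npt import NPT
from mace.calculators import MACECalculator

atoms = read("interface.xyz")  # hypothetical starting structure (cell must be upper-triangular for ASE NPT)
atoms.calc = MACECalculator(model_paths="mace_model.model", device="cuda")

dyn = NPT(
    atoms,
    timestep=0.5 * units.fs,
    temperature_K=300,
    externalstress=1.01325 * units.bar,
    ttime=25 * units.fs,
    pfactor=(75 * units.fs) ** 2 * units.GPa,
)
dyn.run(50000)
```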

I tried running an NVT simulation on the final structure, this time on a full node with 4 GPUs and 500 GB of memory, which also resulted in the same out-of-memory crash after a stable run of 135 picoseconds (taking 23 hours).

For more details, attached are the relevant files for the single-GPU run: ase_npt.py.txt err.txt slurm_out.txt run.txt

Any help/discussion will be much appreciated. Thank you!

ilyes319 commented 12 months ago

Thank you for flagging that! The only thing I can think of right now is a sudden increase in density, which would create more edges and therefore a larger memory footprint. Could you look at the average number of neighbours near the end of the run to see if there is a big spike?
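
If useful, a rough sketch of how to check this with ASE's neighbour list (trajectory name, frame stride, and cutoff below are placeholders):

```python
# Sketch: average number of neighbours per atom along a trajectory.
from ase.io import read
from ase.neighborlist import neighbor_list

r_max = 6.0  # same cutoff as the model
for frame_index, atoms in enumerate(read("md.traj", index="::100")):
    i = neighbor_list("i", atoms, r_max)       # one entry per directed edge
    avg_neighbors = len(i) / len(atoms)
    print(frame_index, avg_neighbors)
```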

bapatist commented 12 months ago

I see, but if it also appears in NVT, the average neighbour density should stay the same, right? (No vacuum region appears in the box.) I did use a fairly large r_max value (=6) for this. Maybe going down to something like 4.5 could help reduce the number of edges.

Edit: I just plotted RDFs and I don't see a spike towards the end of the simulation.
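
(Roughly computed along these lines, in case anyone wants to reproduce; the trajectory name, cutoff, binning, and frame selection below are placeholders:)

```python
# Sketch: g(r) from the last frames of a trajectory via a neighbour-list histogram.
import numpy as np
from ase.io import read
from ase.neighborlist import neighbor_list

r_max, nbins = 6.0, 120
frames = read("md.traj", index="-50:")  # last 50 frames
hist = np.zeros(nbins)
for atoms in frames:
    d = neighbor_list("d", atoms, r_max)  # all pair distances within the cutoff
    h, edges = np.histogram(d, bins=nbins, range=(0.0, r_max))
    hist += h

# Normalise by the ideal-gas shell counts to obtain g(r)
n_atoms = len(frames[0])
rho = n_atoms / frames[0].get_volume()
shell_vol = 4.0 / 3.0 * np.pi * (edges[1:] ** 3 - edges[:-1] ** 3)
g_r = hist / (len(frames) * n_atoms * rho * shell_vol)
```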

ilyes319 commented 11 months ago

Can you tell me which branch you are using?

bapatist commented 11 months ago

The develop branch

ilyes319 commented 11 months ago

Can you try to run the same MD with the calculator from the main branch? No need to retrain; just pull the main branch and change model_paths to model_path.
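
Concretely, the only change in the script should be the calculator keyword (the model file name below is a placeholder):

```python
from mace.calculators import MACECalculator

# develop branch (what the script currently uses):
calc = MACECalculator(model_paths="mace_model.model", device="cuda")

# main branch:
calc = MACECalculator(model_path="mace_model.model", device="cuda")
```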

bapatist commented 11 months ago

Okay, so I observe interestingly different behaviour on the main branch. The simulation didn't crash, but it was 40% slower than the run using the develop branch. I couldn't test whether there was still a memory leak because I hit the wall time. I will rerun with the maximum wall time and report back if I ever reach an "out-of-memory" crash. Is the speed difference expected?

davkovacs commented 11 months ago

The speed difference comes from the neighbour list, I think.

ilyes319 commented 11 months ago

Yes, it is the effect of the matscipy neighbour list.
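
(For anyone comparing the two backends: they expose essentially the same call, so it is easy to time them side by side. A rough sketch, with a placeholder structure and cutoff; it does not assign either backend to a specific branch:)

```python
# Rough side-by-side of the two neighbour-list implementations mentioned here.
from ase.build import bulk
from ase.neighborlist import neighbor_list       # ASE's implementation
from matscipy.neighbours import neighbour_list   # matscipy's implementation

atoms = bulk("Cu", cubic=True) * (4, 4, 4)       # placeholder periodic structure
i_ase, j_ase = neighbor_list("ij", atoms, 6.0)
i_ms, j_ms = neighbour_list("ij", atoms, 6.0)
assert len(i_ase) == len(i_ms)  # same edges, different implementations and speed
```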

ilyes319 commented 11 months ago

Do you have the same memory problem on main, @bapatist?

bapatist commented 11 months ago

I haven't encountered it on the main branch yet, but I did not test it on the big, long simulation. We have moved to using LAMMPS with the develop branch; I will report back if I see any memory problems.

ilyes319 commented 10 months ago

@bapatist Do you have any update on this issue? Did you experience any memory leak again?

bapatist commented 10 months ago

No new updates, since I switched to LAMMPS for all MD tasks. @jungsdao experienced this most recently in our group; I'll check with him as well.