ACEsuit / mace

MACE - Fast and accurate machine learning interatomic potentials with higher order equivariant message passing.

CUDA out of memory from MD with MP-0 pre-trained model using 'big' geometry. #487

Open · turbosonics opened this issue 1 week ago

turbosonics commented 1 week ago

Describe the bug
I tried to run a LAMMPS MACE MD simulation with the MACE-MP-0 pre-trained model (2024-01-07-mace-128-L2_epoch-199.model-lammps.pt, more precisely) on a local GPU cluster, using a single GPU node (64 cores per node & 7.75 GB memory per core).

I tested a CaMg3O4 system. A 4x4x4 supercell with 520 atoms is fine: LAMMPS MD runs well with the pre-trained model. But when I enlarge the supercell further, to more than 10k atoms, and run MD with the same model and the same 300 K NVT settings, a CUDA OOM crash occurs: RuntimeError: CUDA out of memory. Tried to allocate 16.88 GiB. GPU

This happens immediately after I submit the job; LAMMPS cannot even complete the CG minimization, let alone reach the MD stage. Here is my LAMMPS script:

# Setting
units           metal
atom_style      atomic
atom_modify     map yes
newton          on
dimension       3
#boundary        p p p
read_data       CaMg3O4_VASP_cellOPT_geoOPT_k666_c600_12x12x12_shifted.lmpdata
#read_restart

pair_style      mace no_domain_decomposition
pair_coeff      * * 2024-01-07-mace-128-L2_epoch-199.model-lammps.pt Ca Mg O
neighbor        1 bin
neigh_modify    every 10 delay 0 check no
timestep        0.001

# Minimization
log             log.01_opt
thermo_style    custom step cpu temp fmax fnorm pe ke density press pxx pyy pzz lx ly lz xlo xhi ylo yhi zlo zhi xy xz yz cella cellb cellc cellalpha cellbeta cellgamma
thermo          1
thermo_modify   norm no flush yes
dump            d01_opt all custom 1 dump_01_opt.lammps id type x y z fx fy fz
dump_modify     d01_opt sort id
min_style       cg
minimize        1.0e-7 1.0e-7 10000 10000
undump          d01_opt
write_data      data.after01_opt
write_restart   after01_opt.restart
reset_timestep  0

# NVTMD
log             log.02_relax
thermo_style    custom step cpu temp pe ke density press pxx pyy pzz lx ly lz xlo xhi ylo yhi zlo zhi
thermo          100
thermo_modify   norm no flush yes
restart         1000 02_relax.restart
velocity        all create ${temp1} 500000 mom yes rot yes dist gaussian
dump            d02_relax all custom 1 dump_02_relax.lammps id type x y z fx fy fz
dump_modify     d02_relax sort id
fix             nvt1 all nvt temp ${temp1} ${temp1} ${Tdamp}
run             ${iter01}
undump          d02_relax
unfix           nvt1
write_data      data.after02_relax
write_restart   after02_relax.restart

I followed the instructions at https://mace-docs.readthedocs.io/en/latest/guide/lammps.html to compile the LAMMPS-MACE module, except that I used CUDA 11.8, GCC 11.2.0, and cuDNN 8.1.1 for our local environment, and I load those modules when I submit MACE LAMMPS jobs. As noted above, test simulations with hundreds of atoms run well with the MP-0 pre-trained model; the CUDA OOM crash only appears at around 10k atoms.

My questions are:

1) Can this CUDA OOM be avoided by using more GPUs? Or is it fixed by the training set and training conditions (in this case those of MACE-MP-0), so that no choice of LAMMPS MD job settings would avoid the crash?

2) If I fine-tune the MACE-MP-0 model on "big system" geometries, would that prevent the CUDA OOM crash?

3) In general, what are the tips or hints for avoiding CUDA OOM crashes?

Thanks

stargolike commented 1 week ago

At first, I think this is due to insufficient CUDA memory; sometimes you need multi-GPU LAMMPS. I have met this problem too, and I don't think you can estimate in advance the maximum number of atoms that will run. I use the same parameters to train different systems; however, models trained on sets with more atoms and more complex structures can actually run larger systems in LAMMPS.

turbosonics commented 1 week ago

> At first, I think this is due to insufficient CUDA memory; sometimes you need multi-GPU LAMMPS. I have met this problem too, and I don't think you can estimate in advance the maximum number of atoms that will run. I use the same parameters to train different systems; however, models trained on sets with more atoms and more complex structures can actually run larger systems in LAMMPS.

Thanks. Are there any training hyperparameters that can help avoid the OOM crash for larger systems? A higher or lower r_max or l_max, for example?

wcwitt commented 1 week ago

I think 10k atoms is probably too many for your machine with this model. The long-term answer is multi-GPU. In the short term, you could try things like making the model less complex or reducing rmax.
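A quick way to see whether a less complex model would fit on a single GPU is to run one energy/force evaluation on the problematic supercell through the ASE interface and read off the peak GPU memory. The sketch below is only a rough check, not a substitute for the LAMMPS run: the structure file name is a placeholder, and mace_mp(model="small") selects the smaller MACE-MP-0 foundation model shipped with the mace Python package.

import torch
from ase.io import read
from mace.calculators import mace_mp

# Placeholder file name: export the >10k-atom supercell to (extended) XYZ first.
atoms = read("CaMg3O4_12x12x12.xyz")

# "small" selects the smaller MACE-MP-0 foundation model; try it before "medium"/"large".
atoms.calc = mace_mp(model="small", device="cuda")

torch.cuda.reset_peak_memory_stats()
energy = atoms.get_potential_energy()  # single point; if this OOMs, LAMMPS will too
forces = atoms.get_forces()
print(f"E = {energy:.3f} eV, peak GPU memory = "
      f"{torch.cuda.max_memory_allocated() / 1e9:.2f} GB")

At fixed density and cutoff, the memory of one evaluation grows roughly linearly with the number of atoms, so repeating the check at a few supercell sizes gives an estimate of the largest cell a single GPU can handle with each model size.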

turbosonics commented 1 week ago

> In the short term, you could try things like making the model less complex or reducing rmax.

Thanks!

Does this mean that, to run MD with 10k or more atoms, I need a model fine-tuned from the MP-0 pre-trained model with a less complicated training set and an r_max of 4 or 5 (as far as I know, the MP-0 pre-trained model used r_max = 6)? Am I understanding this correctly?

stargolike commented 6 days ago

> In the short term, you could try things like making the model less complex or reducing rmax. Thanks!
>
> Does this mean that, to run MD with 10k or more atoms, I need a model fine-tuned from the MP-0 pre-trained model with a less complicated training set and an r_max of 4 or 5 (as far as I know, the MP-0 pre-trained model used r_max = 6)? Am I understanding this correctly?

Using foundation_model=small or a smaller r_max can do it.
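For LAMMPS specifically, a smaller foundation model still has to go through the same create_lammps_model.py conversion step described in the LAMMPS guide linked earlier in the thread. The sketch below assumes the small MACE-MP-0 model has already been downloaded (its file name here is a placeholder), that the script lives at its usual location inside a MACE repository checkout, and that it keeps its usual behaviour of writing <model>-lammps.pt next to the input; the resulting file is what pair_coeff should point at instead of the 128-L2 model.

import subprocess
from pathlib import Path

model_file = Path("mace-mp-0_small.model")  # placeholder: the downloaded small foundation model
script = Path("mace/mace/cli/create_lammps_model.py")  # path inside your MACE repository checkout

assert model_file.exists(), "download the small MACE-MP-0 model first"
subprocess.run(["python", str(script), str(model_file)], check=True)

# The converted model can then replace the L2 model in the LAMMPS input, e.g.
#   pair_coeff  * *  mace-mp-0_small.model-lammps.pt  Ca Mg O
print(f"LAMMPS model written to {model_file}-lammps.pt")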