turbosonics opened this issue 1 week ago
At first, I thought this was due to insufficient CUDA memory, and that sometimes you just need multi-GPU LAMMPS. Then I ran into the problem too, and I don't think there is a way to estimate the maximum number of atoms that can run. I used the same parameters to train on different systems, yet training sets with more atoms and more complex structures can actually run larger systems in LAMMPS.
Thanks. Are there any training hyperparameters that can help escape the OOM crash for larger systems? A higher or lower r_max or L_max, for example?
I think 10k atoms is probably too many for your machine with this model. The long-term answer is multi-GPU. In the short term, you could try things like making the model less complex or reducing rmax.
> In the short term, you could try things like making the model less complex or reducing rmax.

Thanks!
Does this mean that, to run MD simulations with 10k or more atoms, I need a model fine-tuned from the MP-0 pretrained model with a less complicated training set and an r_max of 4 or 5 (as far as I know, the MP-0 pretrained model used r_max = 6)? Am I understanding this correctly?
Using foundation_model=small or a smaller r_max should do it.
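For what it's worth, a fine-tuning run along those lines might look like the sketch below. This is only an illustration, not a verified recipe: the file names and hyperparameter values are hypothetical, and the flags are the mace_run_train options as I understand them.

```bash
# Fine-tune from the small foundation model with a reduced cutoff.
# File names and values below are illustrative, not a tested configuration.
mace_run_train \
    --name="CaMgO_finetune" \
    --foundation_model="small" \
    --train_file="train.xyz" \
    --r_max=4.5 \
    --batch_size=4 \
    --max_num_epochs=100 \
    --device=cuda

# Convert the fine-tuned model for use with the LAMMPS pair style
mace_create_lammps_model CaMgO_finetune.model
```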
Describe the bug
I tried to run a LAMMPS MACE MD simulation with the MACE-MP-0 pre-trained model (2024-01-07-mace-128-L2_epoch-199.model-lammps.pt, more precisely) on a local GPU cluster, using a single GPU node (64 cores per node, 7.75 GB memory per core).
I tested a CaMg3O4 system. A 4x4x4 supercell with 520 atoms is fine: LAMMPS MD runs well using the pre-trained model. But when I replicate the system further to more than 10k atoms and run the MD simulation with the same model and the same 300 K NVT settings, a CUDA OOM crash occurs:
```
RuntimeError: CUDA out of memory. Tried to allocate 16.88 GiB. GPU
```
This happens immediately after I submit the job; LAMMPS can't even complete the CG minimization, let alone reach the MD simulation stage. Here's my LAMMPS script:
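(The original input script was not captured in this thread. Below is a minimal sketch of a MACE-LAMMPS input of the kind described, assuming the pair_style mace interface from the MACE docs; the data file name, replication factors, and thermostat settings are hypothetical placeholders.)

```
# Minimal sketch of a MACE-LAMMPS input of the kind described above.
# Not the original script: file names, replication, and MD settings
# are illustrative only.
units           metal
atom_style      atomic
boundary        p p p

read_data       CaMg3O4.data                 # hypothetical structure file
replicate       4 4 4                        # enlarge to reach the >10k-atom case

pair_style      mace no_domain_decomposition
pair_coeff      * * 2024-01-07-mace-128-L2_epoch-199.model-lammps.pt Ca Mg O

minimize        1.0e-6 1.0e-8 1000 10000     # the CG minimization that fails first

velocity        all create 300.0 4928459
fix             1 all nvt temp 300.0 300.0 0.1
timestep        0.001
thermo          100
run             10000
```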
I followed the instructions at https://mace-docs.readthedocs.io/en/latest/guide/lammps.html to compile the LAMMPS-MACE module, except that I used CUDA 11.8, GCC 11.2.0, and cuDNN 8.1.1 to match our local environment, and I load the same modules when I submit MACE LAMMPS jobs. As I wrote, test simulations run well with hundreds of atoms using the MP-0 pretrained model; the CUDA OOM crash only appears from around 10k atoms.
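(For context, the job environment amounts to something like the following; the module and executable names are specific to our cluster and therefore illustrative.)

```bash
# Hypothetical module names matching the toolchain described above
module load cuda/11.8 gcc/11.2.0 cudnn/8.1.1

# Single-GPU MACE-LAMMPS run; executable and input file names are placeholders
./lmp -in in.mace_camgo
```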
My questions are:
1) Can this CUDA OOM be escaped with more GPUs? Or is it limited by the training set and training conditions (in this case, those of MACE-MP-0), so that no LAMMPS MD job settings would avoid the crash?
2) If I fine-tune the MACE-MP-0 model on "big system" geometries, would that prevent the CUDA OOM crash?
3) In general, what tips or hints are there for escaping a CUDA OOM crash?
Thanks