ACEsuit / mace

MACE - Fast and accurate machine learning interatomic potentials with higher order equivariant message passing.

Problems with parallelization on CPU (Using LAMMPS) #520

Open MarcoRavalli opened 1 month ago

MarcoRavalli commented 1 month ago

Greetings, I'm Marco Ravalli, a PhD student working on molecular dynamics with classical interatomic potentials. I'm currently trying to use one of the pretrained MACE models on my systems, with LAMMPS as the main software for the MD calculations.

At the moment I'm having trouble with the ML potentials: I followed the instructions on the MACE site to build MACE into LAMMPS, but when I run a simulation through OpenMPI (to parallelize) I get a segmentation fault. My simulation facility runs on CPUs, and I do know that the CPU implementation is still experimental. When I run on just one node, it works just fine but (obviously) painfully slowly (I'm already using the smallest pretrained model available). I've rebuilt everything several times during the installation process, but I still get the same problem.

I can attach my inputs and the MACE model I used, but since the simulation starts and runs perfectly well on a single core, I don't see how they could be the issue, and I don't really know what the real problem is. If useful information is lacking, please let me know. Thank you in advance for your kind help. Best regards,

turbosonics commented 1 month ago

https://github.com/ACEsuit/mace/issues/487 I'm not a MACE dev, just a user like you. I've seen segfaults, OOM errors, device-side assert triggered errors, and even runs that produced no output files at all even though the simulation was running.

From what I learned, stock libtorch is not happy with MPI, so you need to compile a libtorch that works with MPI, then use that libtorch to build LAMMPS. I talked to our system administrator, and he told me he simply ran CMake for libtorch with OpenMPI's mpicc and mpicxx as the C and C++ compilers.
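Not from the thread, just a rough sketch of what that two-stage build might look like. Paths, job sizes, and flag names are assumptions on my part; check them against the PyTorch build docs and the MACE-LAMMPS instructions for your checkout.

```bash
# Sketch only: build libtorch from the PyTorch source tree using
# OpenMPI's compiler wrappers, then build the MACE-patched LAMMPS
# against it. All paths are placeholders.
git clone --depth 1 https://github.com/pytorch/pytorch
cd pytorch && mkdir build && cd build
cmake .. \
  -DCMAKE_C_COMPILER=mpicc \
  -DCMAKE_CXX_COMPILER=mpicxx \
  -DBUILD_PYTHON=OFF \
  -DUSE_CUDA=OFF \
  -DCMAKE_INSTALL_PREFIX=$HOME/libtorch-mpi
cmake --build . --target install -j 8

# Then point the LAMMPS build at that libtorch:
cd $HOME/lammps && mkdir build && cd build
cmake ../cmake \
  -DCMAKE_CXX_COMPILER=mpicxx \
  -DBUILD_MPI=ON \
  -DPKG_ML-MACE=ON \
  -DCMAKE_PREFIX_PATH=$HOME/libtorch-mpi
cmake --build . -j 8
```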

For multiple CPU cores or multiple GPUs, you should use pair_style mace, not pair_style mace no_domain_decomposition. The no_domain_decomposition variant, on the other hand, makes a single CPU core or a single GPU faster and more efficient.
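For reference, the two variants in a LAMMPS input deck look something like this (the model filename and element list are placeholders for your own):

```
# Multi-rank runs (MPI domain decomposition): let LAMMPS split the box.
pair_style mace
pair_coeff * * mace-small.model-lammps.pt O H

# Serial / single-GPU runs: keep the whole box on one rank.
# pair_style mace no_domain_decomposition
# pair_coeff * * mace-small.model-lammps.pt O H
```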

Also, I found that memory per CPU core matters, for both multi-core and single-core runs, at least under Slurm. For my system I had to allocate at least 35 GB of memory per CPU core for the multi-core tests, though this depends on the MD geometry (a bigger system needs more memory) and on the model's hyperparameters.
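Again not from the thread: a hypothetical Slurm header for an 8-core run with that kind of per-core memory request. Numbers and paths are assumptions to adjust for your cluster.

```bash
#!/bin/bash
#SBATCH --job-name=mace-lammps
#SBATCH --nodes=1
#SBATCH --ntasks=8           # one MPI rank per core
#SBATCH --mem-per-cpu=35G    # ~35 GB per core, as above; tune for your system
#SBATCH --time=24:00:00

# binary path and input filename are placeholders
srun ./lmp -in in.mace
```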

But even then, in our environment, 100 iterations of a 512-atom system took 50,000 seconds on 8 cores of the same node. That's not practical. I'd need a huge number of CPU cores for a 10k-atom system in our environment, and that would take forever.

The only practical option (in our environment) is a single CPU core with pair_style mace no_domain_decomposition, but that limits the system size because of memory, and it is still slower than a GPU. So I recommend using a single GPU with pair_style mace no_domain_decomposition. I can't run the huge systems I want (20k atoms) due to the limits of our cluster, but at least I can run something below about 8k atoms. I've heard that better clusters with multi-GPU nodes can run big systems in a practical time frame using multiple GPUs, but I don't know whether that's true or how efficient it is.

But this is all for our server cluster; everything depends on the environment, so maybe yours will be faster. I hope they publish a parallel-computing-friendly version of LAMMPS-MACE soon, really soon. :)