Closed: gullbrekken closed this issue 2 months ago
Ok, after some more testing, I found out how to fix this: I had to set #SBATCH --ntasks-per-node=1 in the Slurm script so that nprocs = 1. However, this means that only the GPU is used. I assume this version of LAMMPS is optimized for running only on GPUs, then, and that adding CPUs into the mix would not be beneficial?
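For anyone hitting the same problem, a minimal sketch of the relevant Slurm lines, assuming one A100 per node (the binary path is the one from my script; adjust as needed):

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1   # one MPI rank, matching the single GPU
#SBATCH --gres=gpu:a100:1     # request one A100

srun /cluster/home/oystegul/bamboo/pair/lammps/output/lmp -k on g 1 -sf kk -in in.lammps
```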
Thank you for raising this issue. The primary bottleneck during Molecular Dynamics (MD) simulations is the model inference step, which consumes the majority of the runtime.
Given this, increasing the number of CPUs does not yield significant performance improvements. You might observe a minor speedup with additional CPUs, but the overall impact is minimal because inference dominates the computational workload.
Thank you for the clarification. It seems the current code can only run on one GPU. Would it be possible to add support for several GPUs, as that could further increase simulation performance? (Perhaps I should have filed this as a separate feature request, but here goes.)
Multiple GPUs have been tested, but no significant speedup was observed for the BAMBOO model. The primary cause is that the Graph Neural Network (GNN) structure is not well suited to parallel inference. The current version of BAMBOO can handle 10,000+ atoms, which is sufficient for most research purposes.
Ok, thank you for the answer.
Hello
I want to simulate the included dataset using the BAMBOO force field. I compile the BAMBOO LAMMPS version with CUDA 12.1.1 and PyTorch 2.1.2. I am on a cluster with a variety of GPUs, but there are several nodes with NVIDIA A100 GPUs, so I choose the Ampere architecture and change the build.sh file accordingly.
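For context, a hypothetical sketch of the kind of architecture setting this involves, assuming a Kokkos/CMake-based LAMMPS build (the actual build.sh in the BAMBOO repository may organize this differently):

```bash
# Hypothetical illustration, not the actual build.sh: standard LAMMPS
# CMake/Kokkos flags for targeting NVIDIA A100 (Ampere) GPUs.
cmake ../cmake \
  -D PKG_KOKKOS=ON \
  -D Kokkos_ENABLE_CUDA=ON \
  -D Kokkos_ARCH_AMPERE80=ON
```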
The compilation completes with some warnings, but no errors.
I try to run the included in.lammps file with this line in the Slurm script:
srun /cluster/home/oystegul/bamboo/pair/lammps/output/lmp -k on g 1 -sf kk -in in.lammps
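For reference, the Kokkos-related options in that command do the following (as documented for LAMMPS):

```bash
# -k on g 1     : enable Kokkos and use 1 GPU per node
# -sf kk        : apply the /kk (Kokkos) suffix to all supported styles
# -in in.lammps : read the input script in.lammps
```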
I assign one A100 GPU on one node in my Slurm script:
#SBATCH --gres=gpu:a100:1
The node also has 64 CPUs, and I use all of them:
#SBATCH --ntasks-per-node=64
I get an error when I try to run in.lammps.
I have checked the device count with
print(torch.cuda.device_count())
and it is 1. This is the GPU used: NVIDIA A100-SXM4-80GB. Why am I getting this error?
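A slightly fuller device check, using standard PyTorch calls, can confirm what the process actually sees:

```python
import torch

print(torch.cuda.is_available())      # should be True if the GPU is visible
print(torch.cuda.device_count())      # 1 in this case
print(torch.cuda.get_device_name(0))  # NVIDIA A100-SXM4-80GB
```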
Also, it would be great if the parameters of the pair_style bamboo command could be explained in the documentation.