bytedance / bamboo

BAMBOO (Bytedance AI Molecular BOOster) is an AI-driven machine learning force field designed for precise and efficient electrolyte simulations.
GNU General Public License v2.0

ERROR: mismatch between number of ranks and number of available GPUs #8

Closed · gullbrekken closed this 2 weeks ago

gullbrekken commented 3 weeks ago

Hello

I want to simulate the included dataset using the BAMBOO force field. I compiled the BAMBOO LAMMPS version with CUDA 12.1.1 and PyTorch 2.1.2. I am on a cluster with a variety of GPUs, but several nodes have NVIDIA A100 GPUs, so I target the Ampere architecture. I changed the build.sh file to:

-D CMAKE_CUDA_ARCHITECTURE=80 \
-D Kokkos_ARCH_PASCAL60=no \
-D Kokkos_ARCH_AMPERE80=yes \
-D GPU_ARCH=sm80 \

The compilation completes with some warnings, but no errors.

I try to run the included in.lammps file with this line in the slurm script:

srun /cluster/home/oystegul/bamboo/pair/lammps/output/lmp -k on g 1 -sf kk -in in.lammps

I assign one A100 GPU on one node in my slurm script:

#SBATCH --gres=gpu:a100:1

The node also has 64 CPUs, and I use all of them:

#SBATCH --ntasks-per-node=64

I get this error when I try to run in.lammps:

ERROR: pair_bamboo: mismatch between number of ranks and number of available GPUs (src/pair_bamboo.cpp:67)
Last command: pair_style      bamboo 5.0 5.0 10.0 1

I have checked the device count with print(torch.cuda.device_count()) and it returns 1. This is the GPU used: NVIDIA A100-SXM4-80GB
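As a sanity check, here is how I understand the failing test (an assumption based only on the error message, not the actual pair_bamboo source); mpi4py and a CUDA-enabled PyTorch are assumed:

import torch
from mpi4py import MPI

# Compare the MPI world size against the GPUs visible to this process,
# mirroring what the pair_bamboo error message appears to test.
nranks = MPI.COMM_WORLD.Get_size()
ngpus = torch.cuda.device_count()
print(f"rank count: {nranks}, visible GPUs: {ngpus}")
if nranks != ngpus:
    raise RuntimeError(
        f"mismatch between number of ranks ({nranks}) and available GPUs ({ngpus})"
    )

Run under srun with 64 tasks and one GPU, this raises on every rank, which matches the LAMMPS error.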

Why am I getting this error?

Also, it would be great if you could explain the parameters used in the pair_style bamboo command in the documentation.

gullbrekken commented 3 weeks ago

Ok, after some more testing, I found out how to fix this: I had to set #SBATCH --ntasks-per-node=1 in the slurm script so that nprocs = 1. However, this means that essentially only the GPU is doing the work. I assume this version of LAMMPS is optimized to run only on GPUs, and that adding CPUs into the mix would not be beneficial?
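For reference, a minimal slurm script that works for me looks roughly like this (the job name and module lines are placeholders for whatever your cluster provides; the binary path is from my run above):

#!/bin/bash
#SBATCH --job-name=bamboo-md
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:a100:1

# load the same CUDA/PyTorch stack the bamboo LAMMPS binary was built against
# (module names are cluster-specific placeholders)
# module load CUDA/12.1.1

srun /cluster/home/oystegul/bamboo/pair/lammps/output/lmp -k on g 1 -sf kk -in in.lammps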

muzhenliang commented 2 weeks ago

Thank you for raising this issue. The primary bottleneck during Molecular Dynamics (MD) simulations is the model inference step, which consumes the majority of the runtime.

Given this, increasing the number of CPUs does not yield significant performance improvements. While you might observe a minor speedup with additional CPUs, the overall impact is minimal due to the inference step's dominance in the computational workload.
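If you want a feel for the numbers on your own hardware, timing a GPU forward pass of any PyTorch module is a reasonable proxy; a minimal sketch (the network and sizes below are arbitrary stand-ins, not the BAMBOO architecture):

import time
import torch

# Arbitrary stand-in network; only meant to show how to time GPU inference.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 128), torch.nn.SiLU(), torch.nn.Linear(128, 128)
).cuda()
x = torch.randn(10000, 128, device="cuda")  # ~10k "atoms" worth of features

with torch.no_grad():
    model(x)                     # warm-up
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(100):
        model(x)
    torch.cuda.synchronize()

print(f"mean forward time: {(time.perf_counter() - t0) / 100 * 1e3:.2f} ms")

The pattern (warm-up, synchronize, average over many calls) is the standard way to time CUDA work, since kernel launches are asynchronous.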

gullbrekken commented 2 weeks ago

Thank you for the clarification. It seems the current code can only run on one GPU. Would it be possible to add support for several GPUs, as that could further increase simulation performance? (I could have opened this as a separate feature request, but here goes..)

muzhenliang commented 2 weeks ago

We have tested multiple GPUs, but observed no significant speedup for the BAMBOO model. The primary cause is that the Graph Neural Network (GNN) structure is not well suited to parallel inference. The current version of BAMBOO can handle 10,000+ atoms, which is sufficient for most research purposes.

gullbrekken commented 2 weeks ago

Ok, thank you for the answer.