Entropy-Enthalpy opened 3 weeks ago
I found a similar issue with the PyTorch backend, but only GPU_0's VRAM was "wasted". For an 8-GPU job, like this:
source: v3.0.0b4-17-g8174cf11
source branch: devel
source commit: 8174cf11
source commit at: 2024-10-11 03:20:55 +0000
Lammps 29Aug2024 update1
PyTorch 2.4.1 cuDNN 9.3.0 NVHPC 24.5 (nompi) OpenMPI 5.0.5 (CUDA-Aware) UCX 1.17.0 (CUDA + GDRCopy)
For PyTorch, I guess c10::cuda::set_device should work. This API is not documented, though.
Related discussion: https://discuss.pytorch.org/t/cuda-extension-with-multiple-gpus/160053/6
As a user, I just know that source/api_cc/src/DeepPotPT.cc might need to be modified, but I don't know how... 🥺
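If it helps, this is the kind of call I have in mind. It is only a rough, untested sketch; the `gpu_rank` argument and where it would be set are my own assumptions, not the actual DeepPotPT.cc code:

```cpp
// Sketch only: bind the current MPI rank to a single GPU before any tensors
// are allocated, so the other GPUs are never touched by this process.
#include <c10/cuda/CUDAFunctions.h>
#include <torch/torch.h>

// Hypothetical helper; gpu_rank would be the local MPI rank on this node.
void bind_rank_to_gpu(int gpu_rank) {
  if (torch::cuda::is_available()) {
    const int gpu_num = static_cast<int>(torch::cuda::device_count());
    const int gpu_id = gpu_rank % gpu_num;  // map rank -> local device index
    c10::cuda::set_device(static_cast<c10::DeviceIndex>(gpu_id));
    // From here on, allocations on the "current" CUDA device should land on
    // gpu_id instead of GPU 0.
  }
}
```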
Bug summary
I have been using DP for a long time, and in every version I have used, I have encountered this issue: when running a Lammps MD simulation using multiple GPUs via mpirun, each MPI Rank consumes VRAM on all GPUs, even though the computation of each MPI Rank actually runs on only one GPU.

For example, in the picture below, I requested 4 V100-SXM2-16GB GPUs for a single MD job and started 4 MPI Ranks. In reality, each GPU has (4-1)×0.3 = 0.9 GiB of VRAM "wasted". For an 8-GPU job, this would "waste" (8-1)×0.3 = 2.1 GiB of VRAM on each GPU. If MPS is used, the "wasted" VRAM would be doubled.
On the surface, it seems that this issue arises because the TensorFlow gpu_device runtime executes a "create device" operation for each GPU in every MPI Rank (as can be seen in the logs), but I don't know how to avoid this problem. It is noteworthy that TensorFlow "can't see" the GPUs on other nodes, so when running Lammps MD across multiple nodes with each node using only one GPU, there is no such issue.
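For reference, the TensorFlow C++ API does expose a per-session visible_device_list option that might be related. The sketch below is only my guess at what limiting each Rank to one GPU could look like; the rank-to-device mapping and where this would be wired into DeePMD-kit's api_cc are assumptions on my part:

```cpp
// Sketch only: make a single GPU visible to this process's TF session, so the
// gpu_device runtime only creates a device (and its VRAM pool) for that GPU.
#include <string>
#include <tensorflow/core/public/session_options.h>

tensorflow::SessionOptions make_session_options(int local_rank, int gpus_per_node) {
  tensorflow::SessionOptions options;
  // visible_device_list takes a comma-separated list of GPU ids; listing only
  // one id, picked from the (hypothetical) local MPI rank, hides the rest.
  options.config.mutable_gpu_options()->set_visible_device_list(
      std::to_string(local_rank % gpus_per_node));
  return options;
}
```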
DeePMD-kit Version
3.0.0b4
Backend and its version
TensorFlow v2.15.2, Lammps 29Aug2024
How did you download the software?
Offline packages
Input Files, Running Commands, Error Log, etc.
Running Commands:
mpirun -np 4 lmp_mpi -in input.lammps
Part of Log:
Steps to Reproduce
N/A
Further Information, Files, and Links
No response