ACEsuit / mace

MACE - Fast and accurate machine learning interatomic potentials with higher order equivariant message passing.

Multi-GPU LAMMPS issue #322

Open hwsheng opened 4 months ago

hwsheng commented 4 months ago

I'm having trouble running a multi-GPU LAMMPS simulation with the MACE model. In a preliminary test on two GPUs, I launched the simulation with the following command:

mpirun -np 2 ~/lammps-mace-gpu/lammps/build-kokkos-cuda/lmp -in lmp.in -k on g 2 -sf kk

However, it fails with the error:

cudaFree(arg_alloc_ptr) error(cudaErrorAssert): device-side assert triggered

Would you have any advice on how to address this problem? Thank you in advance.
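As a general debugging step for device-side asserts (a sketch, not specific to MACE): CUDA reports asserts asynchronously, so forcing synchronous kernel launches with CUDA_LAUNCH_BLOCKING=1 usually pins the failure to its actual source. With OpenMPI, -x forwards the variable to all ranks:

# Sketch: synchronous CUDA launches to localize the device-side assert.
# -x is OpenMPI-specific; other MPI launchers export environment variables differently.
mpirun -np 2 -x CUDA_LAUNCH_BLOCKING=1 \
    ~/lammps-mace-gpu/lammps/build-kokkos-cuda/lmp -in lmp.in -k on g 2 -sf kk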

wcwitt commented 4 months ago

Can you paste your input file?

hwsheng commented 4 months ago

Thanks for your attention. Here is the input file for my LAMMPS-MACE simulation, which runs fine on a single GPU.

# Test of MACE potential for C system

units           metal
boundary        p p p

atom_style      atomic
atom_modify map yes
newton on

read_data       C.dat

mass            1 12.011

pair_style mace no_domain_decomposition
pair_coeff * * ../carbon_swa.model-lammps.pt C

velocity all create 10000 4928459 rot yes dist gaussian

fix             1 all npt temp 6300 300 0.2  iso 10000 10000 0.5
thermo          100
timestep        0.002
dump            dump all custom 10000 dump.dat id type xu yu zu
run             600000
unfix 1
fix             1 all npt temp 300 300 0.2  iso 0 0 0.5
run             100000

wcwitt commented 4 months ago

The no_domain_decomposition option only works on a single GPU, so you need

pair_style mace

instead.

This isn't very well documented, sorry. Please note that, right now, a single-GPU no_domain_decomposition simulation will almost certainly be faster than a multi-GPU simulation. I don't recommend using multi-GPU unless you absolutely need it (e.g., for memory). We are working on this.
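Putting the pieces together, a minimal sketch of the multi-GPU setup described above (model path and executable taken from the earlier comments; domain decomposition is the default when the option is omitted):

# Multi-GPU input: default domain decomposition, no single-GPU-only option
pair_style      mace
pair_coeff      * * ../carbon_swa.model-lammps.pt C

launched with one MPI rank per GPU, as in the original command:

mpirun -np 2 ~/lammps-mace-gpu/lammps/build-kokkos-cuda/lmp -in lmp.in -k on g 2 -sf kk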

hwsheng commented 4 months ago

Thanks for the heads-up. Indeed, I was trying to work around the out-of-memory error that appears in single-GPU simulations when I increase the number of atoms in the system.

Now, for a test run on two GPUs, after switching to

pair_style mace

I got an out-of-memory error:

RuntimeError: CUDA out of memory. Tried to allocate 7.39 GiB (GPU 1; 79.15 GiB total capacity; 65.36 GiB already allocated; 5.00 GiB free; 72.67 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

This error did not occur in a single-GPU simulation of the same system size (4086 atoms).

I guess I'll have to stick to a smaller system size for now?

Thanks in advance.
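One mitigation suggested by the error text itself (a sketch; it helps only with fragmentation, i.e. when reserved memory far exceeds allocated memory, not with a genuinely too-large system): set max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF before launching and forward it to all ranks:

# Sketch: cap the PyTorch allocator's split size to reduce fragmentation.
# The 128 MB value is illustrative, not tuned.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
mpirun -np 2 -x PYTORCH_CUDA_ALLOC_CONF \
    ~/lammps-mace-gpu/lammps/build-kokkos-cuda/lmp -in lmp.in -k on g 2 -sf kk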

wcwitt commented 4 months ago

For a single species, on our A100 (80 GB memory), I'd normally expect to reach system sizes of 5000-10000 atoms before seeing memory problems, depending on how expressive the model is (L=0, L=1, L=2, etc.). So you may be able to reach larger systems on a single GPU by reducing your model size.

It's also possible, but not guaranteed, that increasing to four GPUs (say) would be enough. But this wouldn't be my first choice if you can avoid it.
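If retraining is an option, model size is set at training time. A hedged illustration using the mace training script (the script path is illustrative, flag names follow the repository's run_train.py and may differ across versions, and carbon_L0 and train.xyz are placeholder names):

# Sketch: an L=0 model ("64x0e", invariant messages) needs far less memory
# than, e.g., "128x0e + 128x1o" (L=1); other training flags omitted.
python ~/mace/scripts/run_train.py \
    --name="carbon_L0" \
    --train_file="train.xyz" \
    --hidden_irreps="64x0e" \
    --r_max=5.0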

hwsheng commented 4 months ago

OK, many thanks for your advice. I will try that.