LLNL / UnifyFS

UnifyFS: A file system for burst buffers

LAMMPS hangs at large domain sizes #762

Closed wangvsa closed 1 year ago

wangvsa commented 1 year ago

System information

Type               Version/Name
Operating System   Catalyst
Architecture       intel/19.1.0; impi/2018.0; Lustre
UnifyFS Version    dev branch

Describe the problem you're observing

I am trying a 2D LAMMPS example with UnifyFS. I started with a single node and 4 processes, configured to dump a checkpoint every 20 iterations. LAMMPS supports a variety of I/O libraries; for now I'm trying POSIX I/O and MPI-IO. With MPI-IO, it uses collective I/O (MPI_File_write_at_all) and writes each checkpoint to a single shared file (dump.0.mpiio, dump.20.mpiio, dump.40.mpiio, ...).
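For readers unfamiliar with that pattern, here is a minimal sketch of a collective shared-file write with MPI_File_write_at_all; the per-rank sizes and contiguous layout are illustrative, not LAMMPS's actual dump format:

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank contributes one contiguous block of the shared file
     * (size is illustrative). */
    const size_t nbytes = 1 << 20;
    char* buf = malloc(nbytes);
    memset(buf, 'x', nbytes);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "dump.0.mpiio",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Offset offset = (MPI_Offset)rank * (MPI_Offset)nbytes;
    /* Collective write: every rank in the communicator must call this. */
    MPI_File_write_at_all(fh, offset, buf, (int)nbytes, MPI_CHAR,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}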

It hangs at the 40th iteration (3rd checkpoint) when using MPI-IO with a slightly larger domain size (each checkpoint is ~600 MB). Similarly, if I reduce the checkpoint interval to 10 iterations, it hangs at the 30th iteration (still the 3rd checkpoint), which suggests the hang depends on the cumulative amount of data written rather than on the iteration count. It works fine when using POSIX I/O with the same domain size.

I also traced it with Recorder; here's the list of functions it recorded (from a 32-process, 100-iteration run):

Describe how to reproduce the problem

Here's the LAMMPS input configuration. Note that the region box line is where we set the domain size; at sizes larger than 4096x4096 it hangs.

# 2-d LJ flow simulation
dimension   2
boundary    p s p

atom_style  atomic
neighbor    0.3 bin
neigh_modify    delay 5

# create geometry
lattice     hex 0.7
region box block 0 4096 0 4096 -0.25 0.25
create_box  3 box
create_atoms    1 box

mass        1 1.0
mass        2 1.0
mass        3 1.0

# LJ potentials
pair_style  lj/cut 1.12246
pair_coeff  * * 1.0 1.0 1.12246

# define groups
region       1 block INF INF INF 1.25 INF INF
group        lower region 1
region       2 block INF INF 8.75 INF INF INF
group        upper region 2
group        boundary union lower upper
group        flow subtract all boundary

set      group lower type 2
set      group upper type 3

# initial velocities
compute      mobile flow temp
velocity     flow create 1.0 482748 temp mobile
fix      1 all nve
fix      2 flow temp/rescale 200 1.0 1.0 0.02 1.0
fix_modify   2 temp mobile

# Couette flow

#velocity     lower set 0.0 0.0 0.0
#velocity     upper set 3.0 0.0 0.0
#fix         3 boundary setforce 0.0 0.0 0.0
#fix         4 all enforce2d

# Poiseuille flow
velocity     boundary set 0.0 0.0 0.0
fix      3 lower setforce 0.0 0.0 0.0
fix      4 upper setforce 0.0 NULL 0.0
fix      5 upper aveforce 0.0 -1.0 0.0
fix      6 flow addforce 0.5 0.0 0.0
fix      7 all enforce2d

# Run
timestep    0.003
thermo      500
thermo_modify   temp mobile

# only dump coordinates to make sure it dumps the same amount of data as hdf5
#dump 2 all custom/adios 20 dump.*.bp x y z
dump 3 all custom/mpiio 20 dump.*.mpiio x y z   
#dump 4 all netcdf 20 dump.*.nc x y z
#dump 5 all h5md 20 dump_h5md.h5 position
#dump 5 all custom 20 dump.*.txt id type x y z
# Same as the above, just gzipped after dump
#dump 5 all custom 20 dump.*.gz id type x y z
#restart 20 poly.restart.mpiio
run 60

Include any warnings, errors, or relevant debugging output

I have set the output verbosity to 5. From the output, it seems all clients are still alive, as they are still responding to the heartbeat RPC.
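For anyone reproducing this, a sketch of raising that verbosity, assuming the standard UnifyFS mapping of the [log] configuration section to UNIFYFS_LOG_* environment variables:

export UNIFYFS_LOG_VERBOSITY=5   # server/client debug verbosity (assumed env var name)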

unifyfs_server_log.txt lammps_output_log.txt

wangvsa commented 1 year ago

Increasing the spill size has no effect, but increasing the shmem size or using more processes makes the hang go away. This looks like a shared-memory shortage problem.
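For context, a sketch of how those two sizes are tuned, assuming the standard mapping of UnifyFS's [logio] configuration section to UNIFYFS_LOGIO_* environment variables (values illustrative):

export UNIFYFS_LOGIO_SHMEM_SIZE=$((512 * 1024 * 1024))      # per-client shared-memory log region
export UNIFYFS_LOGIO_SPILL_SIZE=$((8 * 1024 * 1024 * 1024)) # spill-over file capacity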

wangvsa commented 1 year ago

Found the bug. Here's the call chain:

wrap_write() --> unifyfs_fd_write() --> unifyfs_fid_write() --> fid_logio_write() --> unifyfs_logio_alloc()

The application hangs in unifyfs_logio_alloc() while trying to acquire the log-header lock (LOCK_LOG_HEADER(shmem_hdr)).

When shmem does not have enough space for a write, UnifyFS writes partial data using whatever shmem chunks are available and spills the rest. However, the shmem lock is not released after reserving those available chunks, so the next call to LOCK_LOG_HEADER blocks forever.

Fix: Add UNLOCK_LOG_HEADER(shmem_hdr); at line 605. https://github.com/LLNL/UnifyFS/blob/165772106e1b78a8960a95b940dbee14adfba956/common/src/unifyfs_logio.c#L593-L608
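A simplified sketch of that allocation path; the LOCK_LOG_HEADER/UNLOCK_LOG_HEADER macros are from the linked source, but the helper names and structure here are illustrative, not the actual code:

/* unifyfs_logio_alloc(), simplified: reserve shmem chunks, spilling the
 * remainder when shmem is short. Helper names are hypothetical. */
LOCK_LOG_HEADER(shmem_hdr);
if (shmem_has_space(shmem_hdr, nbytes)) {
    /* common case: the whole allocation fits in shmem */
    reserve_shmem_chunks(shmem_hdr, nbytes);
    UNLOCK_LOG_HEADER(shmem_hdr);
} else {
    /* partial case: take the chunks shmem still has, spill the rest */
    size_t avail = shmem_avail_bytes(shmem_hdr);
    reserve_shmem_chunks(shmem_hdr, avail);
    UNLOCK_LOG_HEADER(shmem_hdr); /* <-- the unlock added by the fix; without
                                   * it, this path kept the lock and the next
                                   * LOCK_LOG_HEADER deadlocked */
    spill_remaining(nbytes - avail);
}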

wangvsa commented 1 year ago

I tried a few different runs; the PR fixed the issue.