Closed by wangvsa 1 year ago
Increasing the spill size has no effect, but increasing the shmem size or using more processes solves the issue. This appears to be caused by insufficient memory.
Found the bug. Here's the call chain:
wrap_write() --> unifyfs_fd_write() --> unifyfs_fid_write() --> fid_logio_write() --> unifyfs_logio_alloc()
The application hangs in unifyfs_logio_alloc() while trying to acquire the lock (LOCK_LOG_HEADER(shmem_hdr)).
When shmem does not have enough space for a write, UnifyFS writes part of the data into whatever shmem chunks are available and spills the rest over. However, the shmem lock is not released after the available shmem chunks are reserved, so the next attempt to acquire the lock never returns.
Fix: Add UNLOCK_LOG_HEADER(shmem_hdr); at line 605.
https://github.com/LLNL/UnifyFS/blob/165772106e1b78a8960a95b940dbee14adfba956/common/src/unifyfs_logio.c#L593-L608
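To illustrate the pattern described above, here is a minimal sketch of an allocator that unlocks on both the full-allocation path and the partial-allocation (spill) path. It is not the UnifyFS code: demo_log_header, demo_header_init, demo_alloc, and demo_lock_is_free are hypothetical names, and a process-local pthread mutex stands in for the lock behind LOCK_LOG_HEADER / UNLOCK_LOG_HEADER as an assumption.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the shmem log header; NOT the UnifyFS struct. */
typedef struct {
    pthread_mutex_t lock;   /* assumed to play the role of the header lock */
    size_t free_chunks;     /* chunks still available in "shmem" */
} demo_log_header;

void demo_header_init(demo_log_header *hdr, size_t chunks)
{
    assert(hdr != NULL);
    pthread_mutex_init(&hdr->lock, NULL);
    hdr->free_chunks = chunks;
}

/* Reserve up to `wanted` chunks. Returns the number reserved in shmem and
 * stores the number that must go to the spill file in *spill_chunks. */
size_t demo_alloc(demo_log_header *hdr, size_t wanted, size_t *spill_chunks)
{
    assert(hdr != NULL && spill_chunks != NULL);

    pthread_mutex_lock(&hdr->lock);

    if (hdr->free_chunks >= wanted) {
        /* Whole request fits in shmem. */
        hdr->free_chunks -= wanted;
        pthread_mutex_unlock(&hdr->lock);
        *spill_chunks = 0;
        return wanted;
    }

    /* Partial path: take whatever is left, spill the remainder. */
    size_t got = hdr->free_chunks;
    hdr->free_chunks = 0;

    /* This is the unlock the issue reports as missing on the partial path.
     * Without it, the next demo_alloc() call blocks in pthread_mutex_lock(). */
    pthread_mutex_unlock(&hdr->lock);

    *spill_chunks = wanted - got;
    return got;
}

/* Returns 1 if the lock can be taken right now, 0 if it is still held. */
int demo_lock_is_free(demo_log_header *hdr)
{
    if (pthread_mutex_trylock(&hdr->lock) != 0) {
        return 0;
    }
    pthread_mutex_unlock(&hdr->lock);
    return 1;
}
```

With the unlock on the partial path removed, the first request that exceeds the free shmem chunks would leave the lock held, and the following write would block, which matches a hang that only appears once shmem fills up.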
Tried a few different runs; the fix resolved the issue.
System information
Describe the problem you're observing
I am trying a 2D LAMMPS example with UnifyFS. I started with a single node and 4 processes and configured it to dump a checkpoint every 20 iterations. LAMMPS supports a variety of I/O libraries. For now I'm trying POSIX I/O and MPI-IO. For MPI-IO, it uses collective I/O (MPI_File_write_at_all) and writes to a single file (dump.0.mpiio, dump.20.mpiio, dump.40.mpiio, ...)
It hangs at the 40th iteration (the 3rd checkpoint) when using MPI-IO on a slightly larger domain size (~600 MB per checkpoint). Similarly, if I change the checkpoint interval to 10 iterations, it hangs at the 30th iteration (still the 3rd checkpoint). It works fine when using POSIX I/O with the same domain size.
I also traced it with Recorder; here's the list of function calls it made (from a 32-process, 100-iteration run):
Describe how to reproduce the problem
Here's the LAMMPS input configuration. Note that the region box is where the domain size is set; with anything larger than 4096x4096, it hangs.
Include any warning or errors or relevant debugging
I have set the output verbosity to 5. From the output, it seems all clients are still alive as they are still responding to the heartbeat RPC.
unifyfs_server_log.txt lammps_output_log.txt