LLNL / UnifyFS

UnifyFS: A file system for burst buffers

MPI collective I/O and UnifyFS #781

Open adammoody opened 1 year ago

adammoody commented 1 year ago

With the collective write calls in MPI I/O, the MPI library may rearrange data among processes to write to the underlying file more efficiently, as is done in ROMIO's collective buffering. The user does not know which process actually writes to the file, even if they know which process provides the source data and file offset to be written.

An application may be written such that a given process writes twice to the same file offset using collective write calls. Since the same process writes to the same offset, the MPI standard does not require the application to call MPI_File_sync() between those writes. However, depending on the MPI implementation, those actual writes may happen from two different processes.

As an example taken from PnetCDF, it is common to set default values for variables in a file using fill calls and then later write actual data to those variables. The fill calls use collective I/O, whereas the later write call may not. In this case, two different processes can write to the same file offset, one process with the fill value, and a second process with the actual data. In UnifyFS, these two writes need to be separated with a sync-barrier-sync to establish an order between them.
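To make that concrete, here is a minimal sketch (my own example, not taken from the thread or from any PnetCDF test; the file name, sizes, and buffers are invented) of the pattern: a collective "fill" pass where every rank writes default values to its block, followed by a collective "data" pass where only rank 0 provides the real data. Depending on how the MPI library aggregates the collective writes, the process that physically writes a given byte range can differ between the two calls:

```c
/* Sketch (invented example): a collective fill pass followed by a collective
 * data pass that overwrites the same byte ranges. Under UnifyFS, a
 * sync-barrier-sync would be needed between the two collective writes. */
#include <mpi.h>
#include <stdlib.h>

#define COUNT 1024

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

    /* fill pass: each rank collectively writes default values to its block */
    int fill[COUNT];
    for (int i = 0; i < COUNT; i++) fill[i] = -1;
    MPI_Offset offset = (MPI_Offset)rank * COUNT * sizeof(int);
    MPI_File_write_at_all(fh, offset, fill, COUNT, MPI_INT, MPI_STATUS_IGNORE);

    /* Without a sync-barrier-sync here, the next collective write conflicts
     * with the fill pass under UnifyFS, since the rank that physically writes
     * a given range may change between the two collective calls. */

    /* data pass: only rank 0 provides data for the whole range, but the call
     * is still collective, so another rank may end up doing the actual write */
    if (rank == 0) {
        int *data = malloc((size_t)nranks * COUNT * sizeof(int));
        for (int i = 0; i < nranks * COUNT; i++) data[i] = i;
        MPI_File_write_at_all(fh, 0, data, nranks * COUNT, MPI_INT,
                              MPI_STATUS_IGNORE);
        free(data);
    } else {
        /* participates in the collective call but contributes no data */
        MPI_File_write_at_all(fh, 0, NULL, 0, MPI_INT, MPI_STATUS_IGNORE);
    }

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```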

It may be necessary to ask users to do at least one of the following:

Need to review the MPI standard:

  1. I don't recall off the top of my head what the standard says about MPI_File_sync in the case that the application knowingly writes to the same file offset from two different ranks using two collective write calls. Is MPI_File_sync needed in between or not?
  2. I'm pretty sure that MPI_File_sync is not required when the same process writes to the same offset in two different write calls.

Regardless, I suspect very few applications currently call MPI_File_sync in either situation. Even if the standard requires it, we need to call this out.

The UnifyFS-enabled ROMIO could sync extents and then call barrier on its collective write calls. This would ensure all writes are visible upon returning from the collective write.

wangvsa commented 1 year ago

I happen to have this information since my current paper talks about the MPI consistency model.

The MPI standard provides three levels of consistency:

  1. sequential consistency among all accesses using a single file handle (e.g., when only one process accesses the file)
  2. sequential consistency among all accesses using file handles created from a single collective open with atomic mode enabled
  3. user-imposed consistency among accesses other than the above.

So here we should only be worrying about the third case. In this case, MPI requires a sync-barrier-sync construct between the conflicting writes (from different processes). The construct can take more than one form.
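One common form, sketched here (not quoted from the MPI standard or the paper; fh, comm, offset, and the buffers are assumed to come from the surrounding code), brackets a barrier with two MPI_File_sync calls between the conflicting accesses:

```c
/* first conflicting access (e.g., a collective write of fill values) */
MPI_File_write_at_all(fh, offset, fill_buf, count, MPI_INT, MPI_STATUS_IGNORE);

MPI_File_sync(fh);   /* flush this process's writes to the file */
MPI_Barrier(comm);   /* ensure every process has completed its sync */
MPI_File_sync(fh);   /* make the other processes' writes visible here */

/* second conflicting access (e.g., a collective write of the real data) */
MPI_File_write_at_all(fh, offset, data_buf, count, MPI_INT, MPI_STATUS_IGNORE);
```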

adammoody commented 1 year ago

Thanks @wangvsa. So then the app should have the sync-barrier-sync for situation (1) above (two different procs), but it's not required in (2) (same proc). I'm guessing most apps don't have it in either case, and UnifyFS might actually need it for both to work properly.

wangvsa commented 1 year ago

The apps themselves rarely overwrite the same offset (they rarely perform two collective calls on the same range). It is more likely the high-level libraries that do this. For example, HDF5 uses collective I/O to update its metadata. Even that is not common, though: I have tested several apps that use HDF5 and they don't seem to have a consistency issue. I remember checking HDF5's source code a while ago, and it seems to have adequate MPI_File_sync calls.

adammoody commented 1 year ago

Right, hopefully it's not too common, and based on your earlier research we have some confidence in that. A few of the PnetCDF tests I've been running do encounter this kind of condition.

The fill call here conflicts with the put (write) calls later in the program: https://github.com/Parallel-NetCDF/PnetCDF/blob/e47596438326bfa7b9ed0b3857800d3a0d09ff1a/test/largefile/high_dim_var.c#L95

The test case reports data corruption under UnifyFS, because on read back, it finds the fill value rather than the expected data. When running with 2 processes, one process writes the fill data and the other writes the actual data.

The fill call here doesn't specify any kind of offset, so in this case, we could argue the PnetCDF user probably should call ncmpi_sync() between the fill call and the later write calls in order to be compliant with the MPI standard. Alternatively, the PnetCDF library itself could be modified to call MPI_File_sync() before it returns from the fill call so that the user doesn't have to worry about it. Subsequent writes might conflict, and it's hard for the PnetCDF user to know, since they often don't deal with file offsets directly.

However, this got me thinking about potential problems with MPI collective I/O more generally.


Edit: Actually, on closer inspection, only rank 0 issues put (write) calls in this particular test case. I think the actual problem is that ranks try to read from the file before any earlier writes have been sync'd. The file should have been closed or sync'd before trying to read back data, I think even by PnetCDF semantics. So perhaps this test case is not really valid.

adammoody commented 1 year ago

A second example from PnetCDF is the ncmpi_enddef call here, which writes fill values to the file: https://github.com/Parallel-NetCDF/PnetCDF/blob/e47596438326bfa7b9ed0b3857800d3a0d09ff1a/test/testcases/tst_def_var_fill.c#L62

Later put calls conflict with that fill operation, and the test reports data corruption when using 2 ranks.

A workaround is to call ncmpi_sync() after the ncmpi_enddef() call and before the put calls.
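A minimal sketch of that workaround (my own example, not the tst_def_var_fill.c test itself; names and sizes are invented, error checks omitted):

```c
/* Sketch of the workaround: call ncmpi_sync() after ncmpi_enddef(), which
 * writes the fill values, and before the puts that overwrite them. */
#include <mpi.h>
#include <pnetcdf.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int ncid, dimid, varid;
    int fillv = -1;

    ncmpi_create(MPI_COMM_WORLD, "out.nc", NC_CLOBBER, MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "x", (MPI_Offset)nprocs, &dimid);
    ncmpi_def_var(ncid, "var", NC_INT, 1, &dimid, &varid);
    ncmpi_def_var_fill(ncid, varid, 0, &fillv);   /* no_fill = 0: fill mode on */

    /* enddef collectively writes the fill values into the file */
    ncmpi_enddef(ncid);

    /* workaround: sync (MPI_File_sync + MPI_Barrier inside PnetCDF) so the
     * fill writes are ordered before the conflicting puts below */
    ncmpi_sync(ncid);

    MPI_Offset start = rank, count = 1;
    int val = rank;
    ncmpi_put_vara_int_all(ncid, varid, &start, &count, &val);

    ncmpi_close(ncid);
    MPI_Finalize();
    return 0;
}
```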

adammoody commented 1 year ago

While I'm at it, here are two other test cases I've found so far:

fill calls conflict with later puts: https://github.com/Parallel-NetCDF/PnetCDF/blob/e47596438326bfa7b9ed0b3857800d3a0d09ff1a/test/testcases/ivarn.c#L211-L218

implicit fill during enddef and later explicit fill call conflict with later put calls: https://github.com/Parallel-NetCDF/PnetCDF/blob/e47596438326bfa7b9ed0b3857800d3a0d09ff1a/test/nonblocking/mcoll_perf.c#L512 https://github.com/Parallel-NetCDF/PnetCDF/blob/e47596438326bfa7b9ed0b3857800d3a0d09ff1a/test/nonblocking/mcoll_perf.c#L521

wangvsa commented 1 year ago

According to the PnetCDF documentation, "PnetCDF follows the same parallel I/O data consistency as MPI-IO standard". If that is the case, they should either set atomic mode when opening an MPI file or insert enough sync-barrier-sync constructs. Otherwise, I would argue they have consistency issues in their implementation, not just invalid test cases.

adammoody commented 1 year ago

The default mode of PnetCDF intentionally does not call MPI_File_sync everywhere, since it can be expensive and is not needed on all file systems. I think the NC_SHARE mode is meant to help force synchronization, but it doesn't always work. PnetCDF notes that NC_SHARE causes MPI_File_sync to be called in more cases, but the documentation is not clear about which cases are covered.

https://github.com/Parallel-NetCDF/PnetCDF/blob/master/doc/README.consistency.md#note-on-parallel-io-data-consistency

PnetCDF follows the same parallel I/O data consistency as MPI-IO standard. Refer the URL below for more information. http://www.mpi-forum.org/docs/mpi-2.2/mpi22-report/node296.htm#Node296

Readers are also referred to the following paper. Rajeev Thakur, William Gropp, and Ewing Lusk, On Implementing MPI-IO Portably and with High Performance, in the Proceedings of the 6th Workshop on I/O in Parallel and Distributed Systems, pp. 23-32, May 1999.

If users would like PnetCDF to enforce a stronger consistency, they should add NC_SHARE flag when open/create the file. By doing so, PnetCDF adds MPI_File_sync() after each MPI I/O calls.

  • For PnetCDF collective APIs, an MPI_Barrier() will also be called right after MPI_File_sync().
  • For independent APIs, there is no need for calling MPI_Barrier(). Users are warned that the I/O performance when using NC_SHARE flag could become significantly slower than not using it.

If NC_SHARE is not set, then users are responsible for their desired data consistency. To enforce a stronger consistency, users can explicitly call ncmpi_sync(). In ncmpi_sync(), MPI_File_sync() and MPI_Barrier() are called.

I did find this in the release notes for v1.2.0:

https://parallel-netcdf.github.io/wiki/NewsArchive.html

  • Data consistency control has been revised. A more strict consistency can be enforced by using NC_SHARE mode at the file open/create time. In this mode, the file header is synchronized to the file if its contents have changed. Such file synchronization of calling MPI_File_sync() happens in many places, including ncmpi_enddef(), ncmpi_redef(), all APIs that change global or variable attributes, dimensions, and number of records.
  • As calling MPI_File_sync() is very expensive on many file systems, users can choose more relaxed data consistency, i.e. by not using NC_SHARE. In this case, file header is synchronized among all processes in memories. No MPI_File_sync() will be called if header contents have changed. MPI_File_sync() will only be called when switching data mode, i.e ncmpi_begin_indep_data() and ncmpi_end_indep_data().

Setting NC_SHARE helps in some of the test cases that are currently failing, but ivarn.c still fails with 2 ranks on one node, in this case due to the fill calls and subsequent put calls. It seems like it would be helpful to call MPI_File_sync after fill calls when NC_SHARE is set. I think that would fix the failing ivarn.c test case.
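A sketch of the user-level equivalent (assuming the explicit fills are issued with ncmpi_fill_var_rec, PnetCDF's explicit record-fill call; ncid, varid, start, count, and buf are assumed to come from surrounding code):

```c
/* Sketch: request PnetCDF's stricter consistency with NC_SHARE at create
 * time, and sync after an explicit fill before the puts that overwrite it. */
ncmpi_create(MPI_COMM_WORLD, "out.nc", NC_CLOBBER | NC_SHARE, MPI_INFO_NULL, &ncid);
/* ... define dimensions and variables, then ncmpi_enddef(ncid) ... */

ncmpi_fill_var_rec(ncid, varid, 0);  /* collectively write fill values for record 0 */
ncmpi_sync(ncid);                    /* order the fill before the conflicting puts */

ncmpi_put_vara_int_all(ncid, varid, start, count, buf);
```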

This does not directly apply, but I'll just stash this URL about NC_SHARE and nc_sync() from NetCDF (not PnetCDF) for future reference.

https://docs.unidata.ucar.edu/netcdf-c/current/group__datasets.html#gaf2d184214ce7a55b0a50514749221245

adammoody commented 1 year ago

I opened a PR for a discussion with the PnetCDF team about calling MPI_File_sync after fill calls when NC_SHARE is set.

https://github.com/Parallel-NetCDF/PnetCDF/pull/107

wangvsa commented 11 months ago

@adammoody I'm trying to reproduce these conflicts. Which system and MPI implementation were you using?

adammoody commented 11 months ago

I did most of the work on quartz, which uses MVAPICH2 as the system MPI library. Actually, I was using a debug build of MVAPICH so that I could trace into the MPI code. I'll send you an email with the steps for how I set things up.

wangvsa commented 11 months ago

I just tried ivarn and tst_def_var_fill using Open MPI and MPICH. They don't show any conflict on my side; all I/O calls are done internally using MPI_File_write_at_all (eventually only rank 0 does the pwrite()).