Parallel-NetCDF / PnetCDF

Source code repository of PnetCDF library and utilities
https://parallel-netcdf.github.io

NC_SHARE: call sync after fill operations #107

Open adammoody opened 1 year ago

adammoody commented 1 year ago

While running PnetCDF tests on a file system that requires stricter adherence to the MPI-I/O consistency semantics (UnifyFS), I found that several test cases involving fill operations were not calling MPI_File_sync after the writes associated with the fill calls. This led to data inconsistency when other ranks later wrote different data to the same region of the file. This was true even after modifying the test to create the file with NC_SHARE set.

One example shows up when running test/testcases/flexible.c with 4 ranks on 1 node due to the fill calls here:

https://github.com/Parallel-NetCDF/PnetCDF/blob/e47596438326bfa7b9ed0b3857800d3a0d09ff1a/test/testcases/flexible.c#L91-L97

Adding NC_SHARE to the ncmpi_create call does not help:

https://github.com/Parallel-NetCDF/PnetCDF/blob/e47596438326bfa7b9ed0b3857800d3a0d09ff1a/test/testcases/flexible.c#L79

To get this test to pass, I had to add a call to ncmpi_sync after the fill calls.

The intent with this PR is to call MPI_File_sync/MPI_Barrier after writing data during a fill call if the NC_SHARE mode bit was set when creating the file.
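Roughly, the shape of the change I have in mind is something like the following sketch (the helper name and flag parameter are only illustrative and do not correspond to actual PnetCDF internals; the real change would sit in the fill path, right after the MPI write calls that perform the fill):

    #include <mpi.h>

    /* Illustrative sketch only (not actual PnetCDF internals): after the
     * MPI writes that implement a fill, flush and synchronize the file,
     * but only when it was created with NC_SHARE. */
    static int sync_after_fill(MPI_File fh, MPI_Comm comm, int nc_share_set)
    {
        int err = MPI_SUCCESS;
        if (nc_share_set) {
            err = MPI_File_sync(fh);   /* flush this rank's fill writes */
            if (err != MPI_SUCCESS) return err;
            MPI_Barrier(comm);         /* wait until every rank has flushed */
            err = MPI_File_sync(fh);   /* then pick up data flushed by other ranks */
        }
        return err;
    }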

I wanted to open the PR for discussion.

Are you open to a change like this?

If so, are there other locations where MPI_File_sync should be called during fills?

adammoody commented 1 year ago

I can confirm that this PR allows the test/testcases/flexible.c test to pass if one adds NC_SHARE to the create call.

Without that, the test produces errors like the following:

    + srun --overlap -n 4 -N 1 ./flexible /unifyfs/testfile.nc
    Error at flexible.c line 201: expect buf[0][0]=11 but got -2147483647
    Error at flexible.c line 201: expect buf[0][1]=11 but got -2147483647
    Error at flexible.c line 201: expect buf[0][2]=11 but got -2147483647
    Error at flexible.c line 201: expect buf[0][3]=11 but got -2147483647
    Error at flexible.c line 201: expect buf[0][4]=11 but got -2147483647
    Error at flexible.c line 201: expect buf[1][0]=10 but got -2147483647
    Error at flexible.c line 201: expect buf[1][1]=10 but got -2147483647
    Error at flexible.c line 201: expect buf[1][2]=10 but got -2147483647
    Error at flexible.c line 201: expect buf[1][3]=10 but got -2147483647
    Error at flexible.c line 201: expect buf[1][4]=10 but got -2147483647
    Error at line 209 in flexible.c: (NC_ERANGE)
    Error at flexible.c line 217: expect buf[0][0]=11 but got -2147483647
    Error at flexible.c line 217: expect buf[0][1]=11 but got -2147483647
    Error at flexible.c line 217: expect buf[0][2]=11 but got -2147483647
    Error at flexible.c line 217: expect buf[0][3]=11 but got -2147483647
    Error at flexible.c line 217: expect buf[0][4]=11 but got -2147483647
    Error at flexible.c line 217: expect buf[1][0]=10 but got -2147483647
    Error at flexible.c line 217: expect buf[1][1]=10 but got -2147483647
    Error at flexible.c line 217: expect buf[1][2]=10 but got -2147483647
    Error at flexible.c line 217: expect buf[1][3]=10 but got -2147483647
    Error at flexible.c line 217: expect buf[1][4]=10 but got -2147483647
    Error at line 246 in flexible.c: (NC_ERANGE)
    Error at line 275 in flexible.c: (NC_ERANGE)
    Error at line 276 in flexible.c: (NC_ERANGE)
    *** TESTING C   flexible for flexible put and get                  ------ fail with 72 mismatches
wkliao commented 1 year ago

> To get this test to pass, I had to add a call to ncmpi_sync after the fill calls.

I assume adding a single ncmpi_sync call after all of the ncmpi_fill_var_rec calls, rather than one after each, will let flexible.c pass on UnifyFS. Is that the case?

adammoody commented 1 year ago

> > To get this test to pass, I had to add a call to ncmpi_sync after the fill calls.
>
> I assume adding a single ncmpi_sync call after all of the ncmpi_fill_var_rec calls, rather than one after each, will let flexible.c pass on UnifyFS. Is that the case?

Thanks @wkliao. Yes, that's right. A single ncmpi_sync call after all of the fill calls in this particular test case is sufficient.

There are a number of test cases I've come across so far that need a similar change, e.g., https://github.com/Parallel-NetCDF/PnetCDF/commit/e47759a79ab7178329e300dd745260b2b2c1b0e7

wkliao commented 1 year ago

Thanks. That is very useful information.

PnetCDF is built on top of MPI-IO, so it follows the MPI-IO consistency semantics. When using a file system that is not POSIX/UNIX compliant, users are required to call MPI_File_sync, MPI_Barrier, and MPI_File_sync (in that order) after each I/O call. For PnetCDF users, I therefore suggest adding ncmpi_sync, MPI_Barrier, and ncmpi_sync after the last call to ncmpi_fill_var_rec.
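In code, the suggested placement looks like the fragment below (the loop, variable names, and error handling are illustrative; the point is a single sync-barrier-sync after the last fill rather than one per fill call):

    /* fill all record variables first (ncid, varids, nvars as in the application) */
    for (i = 0; i < nvars; i++) {
        err = ncmpi_fill_var_rec(ncid, varids[i], 0);       /* fill record 0 */
        if (err != NC_NOERR) MPI_Abort(MPI_COMM_WORLD, 1);  /* or the test's error macro */
    }

    /* one sync-barrier-sync after the last fill call */
    ncmpi_sync(ncid);
    MPI_Barrier(MPI_COMM_WORLD);
    ncmpi_sync(ncid);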

Your patch will affect all applications that use POSIX-compliant file systems, and ncmpi_sync is a very expensive call. I wonder if there is an alternative solution, or one on the UnifyFS side.

roblatham00 commented 1 year ago

How common is NC_SHARE? It could be a nice shorthand for "try really hard to keep this in sync across processes"?

Another option would be to consult the romio_visibility_immediate hint. See https://github.com/pmodels/mpich/pull/6148 and https://github.com/pmodels/mpich/issues/5902 for some recent Unify-inspired enhancements to ROMIO.
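As a rough sketch of what consulting that hint could look like (the key comes from the ROMIO work linked above, but the value strings and the decision logic here are assumptions, and a non-ROMIO MPI-IO library simply won't report the key):

    /* needs <mpi.h> and <string.h>; fh is the file's MPI_File handle */
    MPI_Info info;
    char val[MPI_MAX_INFO_VAL];
    int flag, need_sync = 0;

    MPI_File_get_info(fh, &info);
    MPI_Info_get(info, "romio_visibility_immediate",
                 MPI_MAX_INFO_VAL - 1, val, &flag);
    if (flag && strcmp(val, "false") == 0)
        need_sync = 1;   /* writes not immediately visible: do sync-barrier-sync */
    MPI_Info_free(&info);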

ROMIO is not the only MPI-IO implementation, so it does feel wrong to have PnetCDF rely on a non-standard hint like this.

@adammoody where did we leave https://github.com/roblatham00/mpich/tree/dev-unify ? To Wei-keng's point, trying to use a POSIX-assuming ROMIO driver on a non-POSIX storage system is going to get you into trouble, but the goal with the Unify ROMIO driver was that Unify can (cheaply) sync and flush if needed, when needed.

adammoody commented 1 year ago

I figured the change in this PR aligns well with the PnetCDF data consistency doc:

https://github.com/Parallel-NetCDF/PnetCDF/blob/master/doc/README.consistency.md#note-on-parallel-io-data-consistency

> If users would like PnetCDF to enforce a stronger consistency, they should add NC_SHARE flag when open/create the file. By doing so, PnetCDF adds MPI_File_sync() after each MPI I/O calls.
>
> • For PnetCDF collective APIs, an MPI_Barrier() will also be called right after MPI_File_sync().
> • For independent APIs, there is no need for calling MPI_Barrier(). Users are warned that the I/O performance when using NC_SHARE flag could become significantly slower than not using it.

It seems like most calls that lead to an MPI_File_write are followed by MPI_File_sync when NC_SHARE is set, so I wondered whether the fill calls could be included in that set.
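For reference, opting into that stronger mode is just a matter of OR-ing NC_SHARE into the create/open mode, e.g. something along these lines (the other mode bits here are only an example):

    /* create the file with NC_SHARE so PnetCDF issues MPI_File_sync
     * (plus MPI_Barrier for collective APIs) after its MPI-IO writes */
    err = ncmpi_create(MPI_COMM_WORLD, filename,
                       NC_CLOBBER | NC_64BIT_DATA | NC_SHARE,
                       MPI_INFO_NULL, &ncid);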

adammoody commented 1 year ago

By the way, I think it would help to add an empty line on that page to fix the GitHub markdown rendering:

    diff --git a/doc/README.consistency.md b/doc/README.consistency.md
    index cdbf42f8..56a44587 100644
    --- a/doc/README.consistency.md
    +++ b/doc/README.consistency.md
    @@ -15,6 +15,7 @@ MPI_File_sync() after each MPI I/O calls.
       * For PnetCDF collective APIs, an MPI_Barrier() will also be called right
         after MPI_File_sync().
       * For independent APIs, there is no need for calling MPI_Barrier().
    +
     Users are warned that the I/O performance when using NC_SHARE flag could become
     significantly slower than not using it.

As it is, the "Users are warned" statement gets attached to the second bullet rather than standing on its own.

adammoody commented 1 year ago

> @adammoody where did we leave https://github.com/roblatham00/mpich/tree/dev-unify ? To Wei-keng's point, trying to use a POSIX-assuming ROMIO driver on a non-POSIX storage system is going to get you into trouble, but the goal with the Unify ROMIO driver was that Unify can (cheaply) sync and flush if needed, when needed.

Thanks @roblatham00. Right, the MPICH dev-unify branch would also resolve this, though I haven't tried it yet. I have been testing against our default system MPI libraries to get a baseline. For that, I'm taking extra steps like disabling ROMIO data sieving and working around other behaviors that are known to be problematic for UnifyFS.
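For the data-sieving piece, the usual route is the standard ROMIO hints passed through the MPI_Info object handed to ncmpi_create/ncmpi_open; a minimal sketch of that pattern:

    /* disable ROMIO data sieving, which performs read-modify-write on
     * shared file regions and is known to be problematic on UnifyFS */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_ds_read",  "disable");
    MPI_Info_set(info, "romio_ds_write", "disable");
    /* ... pass 'info' to ncmpi_create()/ncmpi_open(), then MPI_Info_free(&info) */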

wkliao commented 1 year ago

NC_SHARE is a flag inherited from NetCDF. Before version 4, NetCDF supported only serial I/O, and this flag was designed for the situation where two or more (serial) applications access the same file. When PnetCDF added parallel I/O, we chose to follow the MPI-IO consistency semantics, which are defined only within a single MPI application. Therefore, NC_SHARE is no longer meaningful in the MPI-IO realm, and keeping it can be misleading.

My plan is to retire this flag in the future and leave it to users to enforce the desired consistency, e.g., by calling sync-barrier-sync or by setting an MPI-IO hint to turn a feature on or off underneath PnetCDF.

Requiring PnetCDF and UnifyFS users to always set NC_SHARE (and hence incur a sync and barrier in every write operation) is not an ideal solution, especially when processes do not overwrite each other's data after the fill. A better solution would be for the application itself to add a sync-barrier-sync after all of the fill calls, so the remaining writes can still enjoy high performance.

wkliao commented 1 year ago

Hi @adammoody, can you please provide a complete list of the test programs that failed? I plan to add a sync-barrier-sync after the fill calls.

adammoody commented 1 year ago

> Hi @adammoody, can you please provide a complete list of the test programs that failed? I plan to add a sync-barrier-sync after the fill calls.

Sure. I'm still working through the different test cases. I'll let you know which tests I had to modify when I'm done.

adammoody commented 1 year ago

Here is the latest list of changes to the test cases:

https://github.com/Parallel-NetCDF/PnetCDF/compare/master...adammoody:PnetCDF:unifyfs

In some tests, I may have added more ncmpi_sync() calls than strictly necessary. Also, my branch mixes in some unrelated changes that I was working on, so please consider only the changes involving ncmpi_sync().

@wangvsa is using Recorder and VerifyIO to develop a more comprehensive list of test cases.

wkliao commented 6 months ago

PR #120 has been created to add the following code fragment to the test programs that were reported to fail on UnifyFS.

    ncmpi_sync(ncid);
    MPI_Barrier(MPI_COMM_WORLD);
    ncmpi_sync(ncid);

Please give it a try and let me know if you find any more test programs that need a similar fix.

FYI, the git commands to fetch PR #120 are:

    git fetch origin pull/120/head:unifyfs
    git checkout unifyfs