LLNL / UnifyFS

UnifyFS: A file system for burst buffers

Problem in Unify/MPI-IO reading data across a "boundary" between two previously written ranges in a file #560

Open clmendes opened 3 years ago

clmendes commented 3 years ago

@adammoody , @kathrynmohror : After doing extensive testing with one of the failing parallel HDF5 codes from their test suite (cchunk5), I think I finally understand why it is failing, although I haven't yet looked at how to fix it in Unify. I used Recorder (without Unify) to see exactly what that HDF5 program was doing for I/O, and then, by correlating those traces with the logs from a Unify-based execution, I can see where it goes wrong with Unify (my Unify version is based on Mike's new margotree branch). Based on that, I created the reproducer program below, which only uses MPI-IO (i.e. no HDF5), as I think this will make it easier to look for a fix.

The example below must be run on 4 processes. Running without Unify, file "datafile-m" will be created, with 400 bytes. Running under Unify, file "datafile-u", also with 400 bytes, will be created. A byte-by-byte comparison will show differences between the two files in bytes 101:107. This is an error similar to what I saw with that HDF5 test.

In this code, each rank initially writes 100 bytes to disjoint locations in the file. Then rank-0 tries to read 12 bytes starting at offset=95. This would have to read across two ranges, one range [0:99] written previously by rank-0, and the range [100:199] previously written by rank-1. The read is done with function MPI_File_read_at.

After the MPI_File_read_at call, the number of bytes actually read is retrieved from status.count_lo (a field of the MPI_Status structure). Without Unify, that number is 12. With Unify, however, it is 5 (i.e. the number of bytes between 95 and 99). Those values (12 and 5) are printed to stdout by the two executions of the program.

This is the test code (to be compiled for Unify, it must use -DUNIFY):

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#ifdef UNIFY
#include <unifyfs.h>
char *filename="ufs:/unifyfs/datafile-u" ;
#else
char *filename="datafile-m";
#endif

#define SIZE 100

int main(int argc, char **argv)
{
    int i, myrank, nranks, ret, count, nread;
    char buf[SIZE];
    MPI_File fh;
    MPI_Status status;
    MPI_Info info=MPI_INFO_NULL;
    MPI_Offset offset;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
#ifdef UNIFY
    ret = unifyfs_mount("/unifyfs", myrank, nranks, 0);
    if (ret) {
        printf("[%d] unifyfs_mount failed (return = %d)\n", myrank, ret);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
#endif

    if (nranks != 4) {
        fprintf(stderr, "Run this program on 4 processes\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    for (i=0; i<SIZE; i++) buf[i]=(char) (10+myrank);

    MPI_Barrier(MPI_COMM_WORLD);

    MPI_File_open(MPI_COMM_WORLD, filename, MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh);

    if (myrank==0) offset=0;
    if (myrank==1) offset=100;
    if (myrank==2) offset=200;
    if (myrank==3) offset=300;
    count = 100;
    MPI_File_write_at_all (fh, offset, buf, count, MPI_BYTE, &status);

    MPI_Barrier(MPI_COMM_WORLD);

    if(myrank==0) {
        for (i=0; i<SIZE; i++) buf[i] = 0;
        offset=95; count=12;
        MPI_File_read_at(fh, offset, buf, count, MPI_BYTE, &status);
        nread = status.count_lo;
        for (i=0; i<nread; i++) buf[i] += 50;
        MPI_File_write_at(fh, offset, buf, count, MPI_BYTE, &status);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_File_close(&fh);

    if (myrank==0)  printf("All done, number_read=%d\n",nread);

#ifdef UNIFY
    unifyfs_unmount();
#endif
    MPI_Finalize();
    return 0;
}

Running it with Unify, on 4 processors in 1node, I see the following log records related to rank-0:

. . .
unifyfs_gfid_read_reqs() [unifyfs.c:1090] read: offset:95, len:12
invoke_client_read_rpc() [margo_client.c:595] invoking the read rpc function in client
invoke_client_read_rpc() [margo_client.c:604] Got response ret=0
process_read_data() [unifyfs.c:669] processing data response from server: [0] (gfid=1953131809, offset=95, length=5, errcode=0)
process_read_data() [unifyfs.c:753] copied data to application buffer (5 bytes)
delegator_signal() [unifyfs.c:599] receive buffer now empty
unifyfs_gfid_read_reqs() [unifyfs.c:1150] fetched all data from server for 1 requests
invoke_client_filesize_rpc() [margo_client.c:415] invoking the filesize rpc function in client
invoke_client_filesize_rpc() [margo_client.c:424] Got response ret=0
. . .
unifyfs_gfid_read_reqs() [unifyfs.c:1090] read: offset:100, len:7
invoke_client_read_rpc() [margo_client.c:595] invoking the read rpc function in client
invoke_client_read_rpc() [margo_client.c:604] Got response ret=61
invoke_client_filesize_rpc() [margo_client.c:415] invoking the filesize rpc function in client
. . .

Thus, I can see that indeed only 5 bytes were "copied to application buffer", which must be why status.count_lo in our program returned 5 instead of the 12 expected from the non-Unify execution!

clmendes commented 3 years ago

A few additional comments on this problem, based on more tests done today. I got the same error using the dev version of Unify from about two months ago, so this error is not exclusive to the new margotree branch by @MichaelBrim.

Today I also slightly modified the example so that the same rank (rank-0) does all the writes to the file, still writing 100 bytes at a time (see the sketch below). In this case, the example works correctly under Unify, and the number of bytes read by the call to MPI_File_read_at is indeed 12, as in the non-Unify version. Thus, the problem seems to arise only when different processes write the neighboring ranges of the file; when those ranges are written by the same process, it works fine. In both cases, I am running on one node (i.e. one Unify server).
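
For reference, a minimal sketch of that variant, replacing the collective write section of the reproducer (using the independent MPI_File_write_at, since only rank 0 writes; that choice and the buffer contents are my assumption, and the rest of the program is unchanged):

    if (myrank == 0) {
        /* rank 0 writes all four 100-byte ranges itself */
        for (i = 0; i < 4; i++) {
            offset = (MPI_Offset)(i * SIZE);
            MPI_File_write_at(fh, offset, buf, SIZE, MPI_BYTE, &status);
        }
    }

    MPI_Barrier(MPI_COMM_WORLD);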

MichaelBrim commented 3 years ago

Without an intervening fsync(), a client cannot see the bytes written by another client. So in this example, Unify is behaving exactly as I would expect it to.

adammoody commented 3 years ago

@clmendes , you may need to set CLIENT_LOCAL_EXTENTS=0 after all. I see rank 0 will overwrite bytes written by another rank, and this optimization isn't valid in that case.

For the fsync(), try adding the sync-barrier-sync construct between the write_at_all and read_at:

MPI_File_sync(fh);
MPI_Barrier(MPI_COMM_WORLD);
MPI_File_sync(fh);

You may not really need both sync calls there, but that's what MPI suggests in one of its consistency modes.

adammoody commented 3 years ago

On a related topic, I know I need to re-read section 13.6 here: https://www.mpi-forum.org/docs/drafts/mpi-2019-draft-report.pdf

We'll need to figure out which MPI data consistency modes we support and which we do not.

adammoody commented 3 years ago

@clmendes , nice work to debug this with Recorder and the logs so far and then to create a simple reproducer.

clmendes commented 3 years ago

@kathrynmohror , as noted by @adammoody and @MichaelBrim above, I added a call to MPI_File_sync() in the example, between the calls to MPI_File_write_at_all and MPI_Barrier. With that, the code under Unify now works correctly as expected: the number of bytes read from the file is 12.
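
For reference, this is the relevant excerpt of the reproducer with that change (the rest of the program is unchanged):

    MPI_File_write_at_all (fh, offset, buf, count, MPI_BYTE, &status);

    /* added: push this rank's writes to the server so that the
       MPI_File_read_at by rank 0 below can see rank 1's bytes */
    MPI_File_sync(fh);

    MPI_Barrier(MPI_COMM_WORLD);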

I also worked with the HDF5 test that was producing differences in the produced file when running under Unify. However, in that example, applying this type of syncing had two issues:

1) The test is an HDF5 program, so it has no direct calls to MPI-IO, nor an "fh" file handle that I could pass as an argument to MPI_File_sync. Instead, I inserted a call to H5Fflush(file, H5F_SCOPE_LOCAL).

2) The point in the code where to insert this call is also tricky, since the "writes" are not clearly visible in the code. By looking at the traces from Recorder, I found that the call to H5Dcreate2 does trigger some critical writes, so I inserted the H5Fflush just after it and prior to the calls to other HDF5 functions, including H5Dwrite (which also triggers writes); see the sketch below.
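
For illustration, a minimal parallel HDF5 sketch of that placement. The file name, dataset name, sizes, and chunk layout below are placeholders of mine, not taken from the cchunk5 test; the point is only where the H5Fflush lands relative to H5Dcreate2 and H5Dwrite:

#include "hdf5.h"
#include "mpi.h"

#define NELEM 4   /* elements written per rank (placeholder size) */

int main(int argc, char **argv)
{
    int i, myrank, nranks;
    int data[NELEM];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* create the file for parallel access */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("datafile.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* create a chunked dataset; H5Dcreate2 already triggers writes to the file */
    hsize_t dims[1]  = { (hsize_t)(NELEM * nranks) };
    hsize_t chunk[1] = { NELEM };
    hid_t filespace = H5Screate_simple(1, dims, NULL);
    hid_t crp_plist = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(crp_plist, 1, chunk);
    hid_t dataset = H5Dcreate2(file, "dset", H5T_NATIVE_INT, filespace,
                               H5P_DEFAULT, crp_plist, H5P_DEFAULT);

    /* flush right after dataset creation and before other HDF5 calls,
       so those writes reach the server (per the discussion above,
       H5Fflush leads to a call to MPI_File_sync) */
    H5Fflush(file, H5F_SCOPE_LOCAL);

    /* each rank then writes its own disjoint slab */
    for (i = 0; i < NELEM; i++) data[i] = 10 + myrank;
    hsize_t start[1] = { (hsize_t)(NELEM * myrank) };
    hsize_t count[1] = { NELEM };
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t memspace = H5Screate_simple(1, count, NULL);
    H5Dwrite(dataset, H5T_NATIVE_INT, memspace, filespace, H5P_DEFAULT, data);

    H5Sclose(memspace);
    H5Dclose(dataset);
    H5Pclose(crp_plist);
    H5Sclose(filespace);
    H5Pclose(fapl);
    H5Fclose(file);
    MPI_Finalize();
    return 0;
}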

With these two considerations above, so far 5 of the failing tests are passing: cchunk5/cchunk6/cchunk7/cchunk8/cchunk9! I still need to see whether this makes other tests pass too.

adammoody commented 3 years ago

@clmendes , nice work finding a workaround! It sounds like we should raise point 2 with HDF.

Also, I have only quickly skimmed the MPI I/O text, but I think the standard may say that it requires those calls to MPI_File_sync unless MPI_FILE_SET_ATOMICITY has been set to true. The text I'm looking at is here:

Assume that A1 and A2 are two data access operations. Let D1 (D2) be the set of absolute byte displacements of every byte accessed in A1 (A2). The two data accesses overlap if D1∩D2 != ∅. The two data accesses conflict if they overlap and at least one is a write access.

Let SEQfh be a sequence of file operations on a single file handle, bracketed by MPI_FILE_SYNCs on that file handle. (Both opening and closing a file implicitly perform an MPI_FILE_SYNC.) SEQfh is a “write sequence” if any of the data access operations in the sequence are writes or if any of the file manipulation operations in the sequence change the state of the file (e.g., MPI_FILE_SET_SIZE or MPI_FILE_PREALLOCATE). Given two sequences, SEQ1 and SEQ2, we say they are not concurrent if one sequence is guaranteed to completely precede the other (temporally).

Case 2: fh1a ∈ FH1 and fh1b ∈ FH1
Assume A1 is a data access operation using fh1a, and A2 is a data access operation using fh1b. If for any access A1, there is no access A2 that conflicts with A1, then MPI guarantees sequential consistency. However, unlike POSIX semantics, the default MPI semantics for conflicting accesses do not guarantee sequential consistency. If A1 and A2 conflict, sequential consistency can be guaranteed by either enabling atomic mode via the MPI_FILE_SET_ATOMICITY routine, or meeting the condition described in Case 3 below.

Sequential consistency is guaranteed among accesses to a single file if for any write sequence SEQ1 to the file, there is no sequence SEQ2 to the file which is concurrent with SEQ1. To guarantee sequential consistency when there are write sequences, MPI_FILE_SYNC must be used together with a mechanism that guarantees nonconcurrency of the sequences.

So we should keep chasing this down to verify whether it is valid MPI I/O according to the standard. If HDF is using that pattern without the sync calls, let's see if the HDF code has enabled atomicity or otherwise get their interpretation of what the MPI standard says here.

adammoody commented 3 years ago

To enable atomicity from HDF: https://support.hdfgroup.org/HDF5/doc1.8/RM/RM_H5F.html#File-SetMpiAtomicity

Also this is the MPI call: https://www.open-mpi.org/doc/v4.0/man3/MPI_File_set_atomicity.3.php
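
For reference, a quick sketch of how those two calls are used (assuming "fh" is an open MPI file handle and "file" is an open parallel HDF5 file id):

    /* MPI-IO: enable atomic mode on an open file handle */
    MPI_File_set_atomicity(fh, 1);

    /* HDF5 equivalent, on a file opened with an MPI-IO file access property list */
    H5Fset_mpi_atomicity(file, 1);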

clmendes commented 3 years ago

Quincey, from HDF5, recommends that we try their atomicity call (https://support.hdfgroup.org/HDF5/doc1.8/RM/RM_H5F.html#File-SetMpiAtomicity). This could make the code slower, but it would be equivalent to a sync/barrier/sync, would make operations safer, and would replace having to insert the H5Fflush calls.

adammoody commented 3 years ago

It looks like ROMIO normally wants to lock a file region when writing with atomicity enabled, though it is skipped for some file systems.

This code is from mvapich2-2.3rc2/src/mpi/romio/mpi-io/write.c:

        /* if atomic mode requested, lock (exclusive) the region, because
           there could be a concurrent noncontiguous request. Locking doesn't
           work on PIOFS and PVFS, and on NFS it is done in the
           ADIO_WriteContig.
         */

        if ((adio_fh->atomicity) && ADIO_Feature(adio_fh, ADIO_LOCKS))
        {
            ADIOI_WRITE_LOCK(adio_fh, off, SEEK_SET, bufsize);
        }

        ADIO_WriteContig(adio_fh, xbuf, count, datatype, file_ptr_type,
                     off, status, &error_code);

        if ((adio_fh->atomicity) && ADIO_Feature(adio_fh, ADIO_LOCKS))
        {
            ADIOI_UNLOCK(adio_fh, off, SEEK_SET, bufsize);
        }

We'll want to dig into this kind of code on the ROMIO side in more detail to figure out whether it actually requires locking and in what cases.

clmendes commented 3 years ago

I tested the example above today with the insertion of this call after the MPI_File_open:

MPI_File_set_atomicity(fh, 1);

The behavior did not change: it still reads only 5 bytes from the file (i.e. from 95 to 99). Thus, the setting of atomicity did not help.

Likewise, inserting the corresponding HDF5 call (H5Fset_mpi_atomicity) did not fix the problem for the cchunk5 test; it still produces wrong data in the output file. That problem had been fixed by inserting an H5Fflush call (which produces a call to MPI_File_sync).

adammoody commented 3 years ago

Thanks @clmendes . Yes, based on the code snippet above, I think ROMIO is expecting locking to work in order to support correct behavior for atomicity. I suspect we have two options here:

1) Add support for file locking in UnifyFS. This is probably non-trivial, especially when dealing with clients that can fail/exit while still holding a lock, and locking will significantly reduce I/O performance. However, this is at least the second time we've seen a code trying to use file locking, and it won't be the last. We probably should have something that people can use, even if performance is bad.

2) Work with the HDF group to get MPI_File_sync calls added in all of the necessary spots within the HDF code base. This also sounds like a lot of work, but maybe something they have a lead on already.

We probably should aim for both. Supporting 1) in UnifyFS will help us offer something to codes that need locking. Having 2) would provide correct behavior and much higher performance for HDF users on UnifyFS than file locking.

adammoody commented 3 years ago

This PR provides a workaround by having UnifyFS internally sync to the server after every write: https://github.com/LLNL/UnifyFS/pull/569

It helps with this particular reproducer.

adammoody commented 3 years ago

I started to skim the pnetcdf code with an eye to this same kind of issue.

https://github.com/Parallel-NetCDF/PnetCDF

First, to force MPI_File_sync to be called after MPI_File_write operations, it looks like one should open the file with the NC_SHARE option. Here are some pointers in case they're helpful later.

The call that eventually calls MPI_File_write_at_all: https://github.com/Parallel-NetCDF/PnetCDF/blob/bdfe3479ab5711aa4a1e0c0e3d8d73c559f15c18/src/drivers/ncmpio/ncmpio_vard.c#L314

The call to MPI_File_write_at_all: https://github.com/Parallel-NetCDF/PnetCDF/blob/bdfe3479ab5711aa4a1e0c0e3d8d73c559f15c18/src/drivers/ncmpio/ncmpio_file_io.c#L165

The check for the NC_SHARE property on an open file that guards the call to MPI_File_sync https://github.com/Parallel-NetCDF/PnetCDF/blob/bdfe3479ab5711aa4a1e0c0e3d8d73c559f15c18/src/drivers/ncmpio/ncmpio_vard.c#L375

A document describing data consistency in pnetcdf: https://svn.mcs.anl.gov/repos/parallel-netcdf/tags/v1-7-0pre1/README.consistency

This notes that opening a file with NC_SHARE is expensive on most file systems, and as an alternative, it suggests that users instead add explicit calls to ncmpi_sync() in places where needed. Given that advice, hopefully well-behaved pnetcdf applications will include all necessary calls to ncmpi_sync which would minimize calls to MPI_File_sync.

Given this, I think a "well-behaved" pnetcdf application, meaning one that has all the necessary ncmpi_sync calls, should work on UnifyFS. However, there could also be pnetcdf apps that are missing some calls to ncmpi_sync and just happen to work today on file systems that implement POSIX data consistency semantics. In that case, we should be able to get those "bad" pnetcdf apps working by modifying their file open to include the NC_SHARE option. That will slow performance, since it may force MPI_File_sync calls in places where they're not really needed.
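
For reference, a minimal pnetcdf sketch showing the two options described above (the file and variable names here are placeholders of mine): open with NC_SHARE, or call ncmpi_sync() explicitly after the writes that other ranks need to see.

#include <mpi.h>
#include <pnetcdf.h>

#define N_PER_RANK 100

int main(int argc, char **argv)
{
    int i, myrank, nranks, ncid, dimid, varid;
    int buf[N_PER_RANK];
    MPI_Offset start, count;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* NC_SHARE asks pnetcdf to sync after write operations */
    ncmpi_create(MPI_COMM_WORLD, "datafile.nc", NC_CLOBBER | NC_SHARE,
                 MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "n", (MPI_Offset)N_PER_RANK * nranks, &dimid);
    ncmpi_def_var(ncid, "v", NC_INT, 1, &dimid, &varid);
    ncmpi_enddef(ncid);

    /* each rank writes its own disjoint range */
    for (i = 0; i < N_PER_RANK; i++) buf[i] = 10 + myrank;
    start = (MPI_Offset)myrank * N_PER_RANK;
    count = N_PER_RANK;
    ncmpi_put_vara_int_all(ncid, varid, &start, &count, buf);

    /* explicit alternative to NC_SHARE: sync at the points where
       another rank needs to see this data */
    ncmpi_sync(ncid);
    MPI_Barrier(MPI_COMM_WORLD);

    ncmpi_close(ncid);
    MPI_Finalize();
    return 0;
}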

adammoody commented 3 years ago

There is also a burst buffer mode which could be interesting to investigate further: https://github.com/Parallel-NetCDF/PnetCDF/blob/master/doc/README.burst_buffering

clmendes commented 3 years ago

@adammoody , I tested this today, after getting a fresh pull of dev as you suggested, which includes PR-569. I tested the original mpi-io code above (i.e. without any MPI_File_sync call) under two scenarios:

a) With UNIFYFS_CLIENT_WRITE_SYNC=1 set in my job script. Everything worked as expected: the read gets 12 bytes, and the file (datafile-u) has exactly the same contents as the non-Unify version (datafile-m).

b) Without UNIFYFS_CLIENT_WRITE_SYNC=1 in my job script. In this case the behavior is rather strange: the read obtains 100 bytes! The output printed is "All done, number_read=100". This is different from what I used to obtain before (it would get 5 bytes).

Since scenario (b) above should not contain your "sync", I believe its execution path is whatever is produced by the recently merged branch "new-margotree-abtsync-filehash", from PR-566, right? This is odd, because when I employed Mike's margotree branch in the past, the read used to obtain 5 bytes ("All done, number_read=5"). So, the behavior under the "new-margotree-abtsync-filehash" branch is different from the behavior under the previous "margotree" branch, which I had been using in my HDF5 tests. Your changes would not cause this change, right?

I'll go ahead this Friday and test this Unify dev version (which includes new-margotree-abtsync-filehash), both with and without UNIFYFS_CLIENT_WRITE_SYNC=1, on some of the critical HDF5 examples with the H5Fflush call removed, and will see what happens.

clmendes commented 3 years ago

@adammoody , @kathrynmohror : I'm definitely having no luck with the "new-margotree-abtsync-filehash" branch. I've been testing it for the last three days, beyond the test above that I had reported to Adam, but haven't had any concrete success.

First, I tried one of the HDF5 tests where I had to insert H5Fflush in the past. For that test, running on two nodes, the code fails when trying to create a dataset. When the four processes are on the same node, the execution runs to the end, and the produced datafile is correct if I set UNIFYFS_CLIENT_WRITE_SYNC=1. This setting doesn't help in the multi-node execution, where I get the same error.

Second, I went back to one of the very simple HDF5 examples, which simply creates a dataset. This doesn't depend on UNIFYFS_CLIENT_WRITE_SYNC at all. If I run it on one node, it works, but it does not work on multiple nodes (and this used to work fine with the margotree branch in the past).

Third, I tried one of the MPI-IO tests, which uses collective I/O to write to a file. Again, it works fine in one node, but fails when running on multiple nodes.

@CamStan : do the tests exercise an execution on multiple nodes? If so, do they pass when running the current dev version (based on the new-margotree-abtsync-filehash branch, plus Adam's 'sync' from PR-569)?

I don't really know what changed between the "margotree" and the "new-margotree-abtsync-filehash" branches, but even the behavior of the reproducer code above is different under those two branches when I don't use UNIFYFS_CLIENT_WRITE_SYNC, which makes me suspect they have non-negligible differences. Maybe @MichaelBrim could help us clarify what those differences are?

Sorry for not being more positive, but I'm running out of options on how to make this new version work under multiple nodes.

MichaelBrim commented 3 years ago

The main difference between new-margotree and dev is that only one server (the "owner") maintains a given file's full metadata until it is laminated, at which time the metadata is broadcast to the other servers. There is no laminate in this example code, so that isn't relevant. What is relevant is the owner server, which is calculated as (gfid % #servers). In a lot of cases, the owner will not be server 0, so the read by rank 0 might be remotely accessing the metadata. Are you still using the example where the MPI_File_sync() calls have been added after the MPI_File_write_at_all()?

Another question: is it safe to access the non-standard field (count_lo) of MPI_Status, especially when the MPI_File_read_at() could be returning an error? You should really be checking the MPI return codes in the example.
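
For what it's worth, a sketch of how the read in the reproducer could do both (variable names are those of the reproducer above; MPI_Get_count replaces the non-standard status.count_lo access):

    int rc;

    rc = MPI_File_read_at(fh, offset, buf, count, MPI_BYTE, &status);
    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(rc, msg, &len);
        printf("[%d] MPI_File_read_at failed: %s\n", myrank, msg);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    /* portable way to get the number of bytes actually read */
    MPI_Get_count(&status, MPI_BYTE, &nread);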

CamStan commented 3 years ago

Yes, the CI tests' current default is to execute on 4 nodes with 2 processes per node, and the most recent dev branch is passing.

That said, the CI suite currently only runs the examples with POSIX I/O and has yet to be expanded to run with MPI-IO or HDF5.

clmendes commented 3 years ago

@MichaelBrim , I am not using the MPI_File_sync anymore; I was just counting on the sync that Adam enabled via the setting of UNIFYFS_CLIENT_WRITE_SYNC=1.

But you raised a good point: I was not checking the return value of the MPI_File_read_at() call. I have added a check for that now, but the behavior is still different when running on one and two nodes (4 processes in both cases, and no setting of UNIFYFS_CLIENT_WRITE_SYNC):

a) Execution on two nodes:
Output --> All done, ret=872480032 number_read=100
Datafile has errors in positions [101:107]

b) Execution on one node:
Output --> All done, ret=0 number_read=5
Datafile has errors in positions [101:107]

If I set UNIFYFS_CLIENT_WRITE_SYNC=1, it will run fine with either one or two nodes, producing the correct datafile and the expected output in both cases: All done, ret=0 number_read=12

Thus, for this example, UNIFYFS_CLIENT_WRITE_SYNC=1 brings a correct execution, regardless of the number of nodes in use. Without UNIFYFS_CLIENT_WRITE_SYNC=1, the behavior changes between the one- and two-node executions.

For the HDF5 test, however, even when I set UNIFYFS_CLIENT_WRITE_SYNC=1, the behavior is different. With one node, it works fine and produces the correct hdf5 file. But with two nodes, I get either a server crash on the first node or a partial/bad hdf5 file, coupled with a consistent client execution error just after the function that creates a dataset:

Proc 0: Parallel ERROR VRFY (dataset created succeeded) failed at line 683 in t_coll_chunk.c aborting MPI processes

This is the application source code in that region of t_coll_chunk.c:

681: dataset = H5Dcreate2(file, DSET_COLLECTIVE_CHUNK_NAME, H5T_NATIVE_INT,
682:                      file_dataspace, H5P_DEFAULT, crp_plist, H5P_DEFAULT);
683: VRFY((dataset >= 0),"dataset created succeeded");

I believe this function actually writes data to the file, according to my previous tests with Recorder. And this is precisely the location where I had previously added the call to H5Fflush(), such that the code would produce a correct hdf5 file. So, in the past (i.e. with the 'new-margotree' branch), a two-node execution would produce an incorrect datafile without the H5Fflush call, but it would not abort like the current one does!