LLNL / UnifyFS

UnifyFS: A file system for burst buffers

Design: Lamination while file is being written #345

Open tonyhutter opened 5 years ago

tonyhutter commented 5 years ago

Describe the problem you're observing

Say you have two nodes, each writing 100MB to a file at non-overlapping offsets. Node 1 finishes writing and does a laminate while node 2 is still writing its last 10MB. What should the behaviour be?

  1. node 2's last 10MB of writes return EINVAL?
  2. node 2's writes complete, and it can continue writing to the file until it calls close(). Effectively, this means that "is this file laminated?" is only checked at open() time, and the file is only "laminated and set in stone" after all writers close() the file.
  3. node 2's writes complete successfully, but they can't be read() since they happened after the lamination.

According to page 6 of the docs (https://buildmedia.readthedocs.org/media/pdf/unifycr/dev/unifycr.pdf), after a lamination:

write: All writes are invalid.

...which sounds like it would be 1 or 3.

What behaviour do we expect?

adammoody commented 5 years ago

Well, we could argue that all three outcomes are possible. The details of our model are still up for discussion, but here is one proposal.

The proper use for lamination is for the application to ensure that all writes have been committed to the file before it is laminated. For example, in an MPI job, each writer would have to fsync() and/or call a local laminate() function to indicate that it is done writing and that all data it has written thus far should be committed. Then all writers would synchronize with each other somehow, say with MPI_Barrier(). Finally one process would execute a global lamination function. This sequence ensures that all writers have committed their data before the file is laminated. If an application fails to follow those semantics, the file system behavior is undefined.

Since the above scenario does not follow the required model, it would fall into the undefined clause, so we could say that any outcome could be possible.

tonyhutter commented 5 years ago

I like your proposal, with the only change being that close() would be the "local laminate()". So like:

When a client process does a laminate() RPC, the server only returns from the RPC after all writers have closed() the file. This should be fine, as the writers would be synchronized anyway through the application (via an MPI_Barrier()), so laminate would naturally get called after everyone had closed(). This allows us to check for "is the file laminated" at open() time only, but means that the client would have to re-open the file to read() it (which may not be a big deal).
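
Roughly, the flow on the application side could look like the sketch below. This is only an illustration of the proposal: unify_laminate() is a placeholder name for the proposed RPC, not an existing call, and the barriers stand in for whatever synchronization the app already does.

#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>
#include <mpi.h>

extern int unify_laminate(const char* path);  /* placeholder for the proposed laminate RPC */

void write_then_laminate(const char* path, const void* buf, size_t len, off_t off, MPI_Comm comm)
{
  /* Each writer writes its non-overlapping region and closes the file.
     Under this proposal, close() is what tells the server this writer is done. */
  int fd = open(path, O_WRONLY | O_CREAT, 0644);
  pwrite(fd, buf, len, off);
  close(fd);

  /* App-level synchronization ensures every writer has closed before laminate. */
  MPI_Barrier(comm);

  int rank;
  MPI_Comm_rank(comm, &rank);
  if (rank == 0) {
    /* The RPC would only return once the server has seen a close() from every writer. */
    unify_laminate(path);
  }
  MPI_Barrier(comm);

  /* To read the now-laminated file, a client re-opens it. */
  int rfd = open(path, O_RDONLY);
  /* ... read() as needed ... */
  close(rfd);
}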

adammoody commented 5 years ago

We have gone back and forth on whether a close() should be an implied local laminate(). One issue is that we know of apps where a given process will open and close a file multiple times while writing it. For those apps, close() can't be a local laminate. We'd need a separate API to serve as the local laminate. We could optionally run unify in a mode where the app tells us they want close to behave like a local laminate, but we may still want to support cases where close does not imply laminate.
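
For illustration, here is the kind of pattern those apps follow; if close() implied a local laminate, the file would be sealed after the first iteration even though the process still has more data to write.

#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Illustrative only: a process that opens and closes the same file
   multiple times while it is still writing it. */
void write_in_chunks(const char* path, const char* const* chunks,
                     const size_t* lens, int nchunks)
{
  off_t off = 0;
  for (int i = 0; i < nchunks; i++) {
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    pwrite(fd, chunks[i], lens[i], off);
    off += (off_t) lens[i];
    close(fd);   /* not "done with the file", just done with this chunk */
  }
  /* Only here is the process really finished writing, which is why a
     separate local-laminate call (or an opt-in mode) is needed. */
}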

adammoody commented 5 years ago

Another issue to consider is that different procs might operate on the file at different times. One process opens/writes/closes the file, then another process later in time opens/writes/closes the file. If we want to support those cases, the server doesn't know when the global laminate should be applied. That's a case where I think we need a separate global laminate API call.

craigsteffen commented 5 years ago

Craig Steffen here, new team member from NCSA.

Is there any reason not to make the unify_laminate_now() function call a strictly global collective function that can ONLY validly be called from all ranks simultaneously, and have it return an error otherwise? That would eliminate any ambiguity, and there wouldn't be an entire space of undefined behavior.

tonyhutter commented 5 years ago

Thanks @adammoody for the info on apps that open/write/close multiple times. What do you think of:

local laminate: change the file's permissions to read-only
global laminate: fsync()

Overall, I think it's beneficial if we can re-purpose existing system calls instead of creating Unify-specific API calls. The holy grail would be that a user doesn't need to recompile their application at all (other than using the "/unifycr" mount). Also, if we don't have a Unify-specific API, and we eventually get a FUSE driver working or a Unify-LD_PRELOAD'd shell environment, we could easily write our test cases in scripts.
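
A rough sketch of how an app would signal lamination under that convention, using only standard calls. This is just the proposal above, not current UnifyFS behavior; the function names are illustrative.

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Local laminate: each writer marks the file read-only when it is done writing. */
static void local_laminate(const char* path)
{
  chmod(path, S_IRUSR | S_IRGRP | S_IROTH);
}

/* Global laminate: after all writers have locally laminated (app-level
   synchronization, e.g. an MPI_Barrier, is still required), one process
   issues the fsync that seals the file. */
static void global_laminate(const char* path)
{
  int fd = open(path, O_RDONLY);
  fsync(fd);
  close(fd);
}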

adammoody commented 5 years ago

@tonyhutter , good point. We ended up trying to describe that in the Consistency Model, Section 3.2:

In the first version of UnifyCR, lamination will be explicitly initiated by a UnifyCR API call. In subsequent versions, we will support implicit initiation of file lamination. Here, UnifyCR will determine a file to be laminated based on conditions, e.g., fsync or ioctl calls, or a timeout on close operations. As part of the UnifyCR project, we will investigate these implicit lamination conditions to determine the best way to enable lamination of files without explicit UnifyCR API calls being made by the application.

To do this and to support a number of existing apps, I'm guessing we'll want some things to be configurable, e.g., App A wants to treat close() as local laminate while App B wants to use fsync() as local laminate.
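
One way that could look is a per-app knob the client library consults when deciding what counts as a local laminate. The setting name below is purely hypothetical, invented for illustration; it is not an existing UnifyFS option.

#include <stdlib.h>
#include <string.h>

/* Hypothetical: "UNIFY_LOCAL_LAMINATE" is an invented name for illustration. */
enum laminate_trigger { LAMINATE_ON_CLOSE, LAMINATE_ON_FSYNC, LAMINATE_EXPLICIT };

static enum laminate_trigger get_laminate_trigger(void)
{
  const char* v = getenv("UNIFY_LOCAL_LAMINATE");
  if (v && strcmp(v, "close") == 0) return LAMINATE_ON_CLOSE;  /* App A */
  if (v && strcmp(v, "fsync") == 0) return LAMINATE_ON_FSYNC;  /* App B */
  return LAMINATE_EXPLICIT;  /* default: require an explicit laminate call */
}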

adammoody commented 5 years ago

@craigsteffen , I think the other angle is that by not requiring a collective lamination, we can support different parallel programming models, including cases where the two writer procs might not be running at the same time. Having said that, it would likely be useful to provide an MPI wrapper around unify calls, where someone could pass in a file name and an MPI communicator, e.g.,


#include <mpi.h>

// unify_fsync() and unify_laminate() are placeholder names for the
// proposed local and global laminate calls.
void MPI_Unify_laminate(const char* file, MPI_Comm comm)
{
  unify_fsync(file);  // local laminate: commit this rank's writes
  MPI_Barrier(comm);
  // all MPI ranks in comm now know data has been committed

  int rank;
  MPI_Comm_rank(comm, &rank);
  if (rank == 0) {
    unify_laminate(file);  // one rank performs the global laminate
  }

  MPI_Barrier(comm);
  // all MPI ranks in comm now know file has been globally laminated
}

tonyhutter commented 5 years ago

Thinking about it more, we can get rid of the idea of a local lamination and just simplify it to "laminate() blocks until everyone (readers and writers) has closed the file". In the common case it will be fine, since we're assuming the app has self-synchronized when to call laminate(). It would also work for apps that do multiple open/write/closes before lamination.

MichaelBrim commented 5 years ago

@tonyhutter you need to be clear about the meaning of "everyone". Do you mean all readers and writers on the local node, or across all nodes? We have no easy way to determine the latter case.

tonyhutter commented 5 years ago

Across all nodes. The nodes would have to send an RPC on close(). Alternatively, you could remove the safety and simply say "if you call laminate(), you must be 100% sure that all nodes have closed the file, and you accept any badness or corruption that can happen if a node is still accessing it" (which is basically what it says in the design doc). I'd be fine with that too.
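
If we went with "laminate() blocks until everyone has closed", the server side could be little more than a reference count updated by open/close RPCs from all nodes. A minimal sketch, where the names, structure, and locking scheme are assumptions rather than UnifyFS internals (and the mutex/condvar are assumed to be initialized elsewhere):

#include <pthread.h>

/* Per-file server state: open()/close() RPCs from any node adjust the
   count, and a laminate RPC blocks until it reaches zero. */
struct file_state {
  pthread_mutex_t lock;
  pthread_cond_t  all_closed;
  int             open_count;   /* opens minus closes, across all nodes */
  int             laminated;
};

void handle_open_rpc(struct file_state* f)
{
  pthread_mutex_lock(&f->lock);
  f->open_count++;
  pthread_mutex_unlock(&f->lock);
}

void handle_close_rpc(struct file_state* f)
{
  pthread_mutex_lock(&f->lock);
  if (--f->open_count == 0)
    pthread_cond_broadcast(&f->all_closed);
  pthread_mutex_unlock(&f->lock);
}

void handle_laminate_rpc(struct file_state* f)
{
  pthread_mutex_lock(&f->lock);
  while (f->open_count > 0)          /* block until every client has closed */
    pthread_cond_wait(&f->all_closed, &f->lock);
  f->laminated = 1;
  pthread_mutex_unlock(&f->lock);
}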