Behavior when user passes a state_file to AXL_Init()

tonyhutter commented 4 years ago

How should AXL behave if a user passes in a state_file to AXL_Init()? This was discussed in #73.

Our current documentation is vague:

/** Read configuration from non-AXL-specific file
 * Also, start up vendor specific services */
int AXL_Init (const char* state_file);

We need to nail this down. Should it cancel old transfers? Should it try to restart them? Should it try to resume them? How do we get the old AXL IDs back? What happens if you AXL_Add() new files to an old ID?

tonyhutter commented 4 years ago

Just to get the ball rolling, here's a strawman suggestion for what the behaviour should be:

AXL_Init(state_file) should be thought of as resuming the exact state from before the app crashed. This means resuming/restarting old transfers from where they left off (sync and pthreads), and keeping any BB API transfers going that were active.

If the user wanted to cancel those transfers, they'd either call AXL_Stop() (big hammer), or get the IDs and AXL_Cancel() them individually.

We'd add the following functions:

AXL_Get_IDs(void): Return an allocated array of IDs (array is terminated with an entry of -1). Array must be freed.

AXL_Get_Files(int id, char **src, char **dst): Return arrays of source and dest paths for an ID. Arrays are terminated with NULL entries. They must be freed.

Example:

AXL_Init("/tmp/state_file");

...

int *ids;
char *src, *dest;
int i, j;

ids = AXL_Get_IDs();
for (i = 0; ids[i] != -1; i++) {
    printf("ID %d files:\n", ids[i]);

    AXL_Get_Files(ids[i], &src, &dest);
    for (j = 0; src[j] != NULL, j++) {
        printf("\t%s -> %s\n", src[i], dest[i]);
    free(src);
    free(dest);
}
free(ids);

These two functions should hopefully expose enough info to the upper layers to resume what they were doing.

tonyhutter commented 4 years ago

It might be better to return the number of entries rather than -1/NULL terminate the arrays for the new functions:

/*  
 * Allocate and write an array of IDs into ids[].  The number of entries is
 * returned (or -1 on failure).
 */
int AXL_Get_IDs(int **ids);

/*
 * For a given id, allocate an array of all the source file paths and all
 * the destination paths, and return them to the user in src and dest.
 * The function itself returns the number of entries in src and rest, or
 * -1 on failure.
 */
int AXL_Get_Files(int id, char **src, char **dest);

gonsie commented 4 years ago

I think @adammoody should really comment here. The in-source comment is confusing me:

Read configuration from non-AXL-specific file.

I’m not sure what the original intent was.

@tonyhutter I like the strawman you presented and I think those functions would definitely be helpful to any upper layers calling AXL.

adammoody commented 4 years ago

Thanks for starting this, @tonyhutter . I like the high-level stated goal, and those query functions I'm sure would be helpful. I haven't had a chance to think about things from SCR, but this is a situation where all of our hard work is figuring out what we need.

Ultimately, we want AXL to do whatever SCR/VeloC require it to do in order for them to be functional. There are two places where this AXL state file comes into play: 1) on restart, 2) on scavenge. Each of those is fairly complicated, given the multitude of failure cases, storage options, and AXL transfer modes that we have to deal with. To define what we need from AXL, we have to think through all combinations of these cases in turn, and we should be making changes in SCR along the way to be sure it works since it's too hard to think about abstractly.

Consider the restart use case first, where SCR restarts a run in the same compute node allocation, but perhaps not on the same nodes.

There are at least three common restart situations we need to think about: F1) no node failure -- the job is started on the exact name nodes, MPI ranks probably did not change nodes, but SCR allows for that if the user wanted to for some reason F2) node failure -- the job is running using some of the same nodes, some spare nodes for the failed nodes, and MPI ranks will have changed nodes F3) partial node failure -- there are failure cases where the node is no good for the job to restart, but it may still be doing a BB transfer, like a failed GPU

There are two options where SCR might direct AXL to save its file: S1) node-local storage, in the SCR control directory where SCR keeps its other meta data for the job S2) parallel file system, in the SCR_PREFIX directory where SCR keeps its flush file to track ongoing transfers, or a file per dataset in the respective dataset directory

And of course the different AXL transfer options, to start with: T1) pthreads T2) BBAPI T3) others for expected future systems?

My instinct says that we should try to figure out how we can have SCR store the AXL file in node-local storage (S1 above). As currently written, AXL reads and write its state file a lot, so moving that file to node-local storage would be really nice. The challenge with node-local storage is that the AXL state file will be lost on node failures (F2 and F3). We could manage. We'd just need SCR to encode the state file so it can be rebuilt on restart. However, that means both SCR and AXL will be reading/writing that state file, like when SCR encodes it, so we need to think about coordination (so perhaps we need AXL to provide lock/unlock api calls).

As a fallback, we could put the AXL state file in the parallel file system in the SCR_PREFIX directory. I still cringe about the performance loss here, but maybe we have to.

Some other things we have to figure out.

On a restart, should SCR map the AXL file back to the same MPI rank that created it, or should the file be mapped back to the compute node based on its hostname ignoring the MPI ranks that are running there?

A somewhat related question is should we have SCR run an AXL instance per MPI process or one per compute node? Currently, SCR runs an AXL instance per MPI rank in the job. I can think of some benefits to changing to an AXL instance per host, but we have some work to do to go that direction.

There are some interactions with the failure modes and transfer models we need to consider. In short, we want to do our best to cancel BB transfers, including the partial node failure (F3) where a BB transfer is still happening on a node that SCR can no longer use.

We should think about what happens if the job dies in the middle of AXL writing to its state file. It feels like we should try to protect against that, and perhaps detect it. What do we do if we detect corruption?

I suspect there are some subtle issues to consider between the points in time when AXL makes a state change and when it marks that state change in its file. For example, what happens if AXL starts a BB transfer but then the job dies before AXL updates its state file. The restarted job won't know the transfer was actually started, or at least it won't from the info in the file. Some of these consistency issues can be handled by having AXL update its state file before it starts the transfer, in which case, the restarted job knows that a transfer may have been started. We need to think about each such AXL state change with this in mind.

I'm sure there are other questions I haven't thought of yet.

That's restart. Then there's scavenge. Scavenge is different from restart because we may not want to cancel existing transfers, and we may need to initiate new transfers. Additionally, we should think about VeloC independently, which may impose different requirements.

Since this is complicated, we could start out by just worrying about a subset of these things ignoring the others, knowing that means we may need to redo/undo work later.

tonyhutter commented 4 years ago

Thanks @adammoody this is all really good info.

As a fallback, we could put the AXL state file in the parallel file system in the SCR_PREFIX directory. I still cringe about the performance loss here, but maybe we have to.

We should benchmark this. Maybe it's not as cringe-inducing as you think. I don't think AXL would be writing the state_file during the transfer, only during setup and cleanup of them, right? If so, then it might not be so bad. If the performance hit is in fact minor, we could just put state_file on PFS and it would simplify things greatly.

Since this is complicated, we could start out by just worrying about a subset of these things ignoring the others, knowing that means we may need to redo/undo work later.

I agree. I'm working in implementing what I have proposed in https://github.com/ECP-VeloC/AXL/issues/76#issuecomment-661316481, and we can use that as a starting point, and tweak the behavior. That will at least get the state_file stuff working in some form. Using that, we can experiment with it with a node-local or PFS state_file and see what the performance hit would be.

adammoody commented 4 years ago

A few more things when thinking of the parallel file system...

When storing the state files on the parallel file system, we have the opposite problem. Not only do the state files survive restarts within an allocation (a good thing), they persist across different compute node allocations (might be a bad thing). If that's a problem, we'll need to write up some code to detect/clean up old AXL state files that exist from previous allocations. SCR already has to deal with this kind of problem for other metadata it keeps on the parallel file system, like its flush, halt, and nodes files. So we don't have to write code to encode the state files, but we have to write some to clean up old stuff.

We're not doing this yet in AXL, but it might be useful to allow some transfer mechanisms to checkpoint their progress in the state file. For example, with pthreads, each thread might update the state file after it completes each file. Or for transferring really large files, AXL could periodically fsync the data it has written so far and record the current file offset in the state file. On a restart, it could then pick up from where it last checkpointed instead of having to start the transfer from the beginning. In that case, AXL might be writing to its state file more frequently.

We also need to consider using AXL to transfer between storage levels. SCR and VeloC could use AXL to copy a file from ramdisk to the node-local SSD, or from node-local storage to say rack-level storage. Those transfers might be much more frequent than transfers to the parallel file system itself. In this case, the node-local ramdisk --> node-local SSD transfer will require writing to the parallel file system. It's on the roadmap to do that in both SCR and VeloC, so that the app can always write its output to the fastest level (ramdisk) and rely on those frameworks to transfer to other levels in the background.

adammoody commented 4 years ago

We're not doing this yet in AXL, but it might be useful to allow some transfer mechanisms to checkpoint their progress in the state file. For example, with pthreads, each thread might update the state file after it completes each file. Or for transferring really large files, AXL could periodically fsync the data it has written so far and record the current file offset in the state file. On a restart, it could then pick up from where it last checkpointed instead of having to start the transfer from the beginning. In that case, AXL might be writing to its state file more frequently.

Oh, another idea. If AXL did checkpoint its progress, this would be a case where we would not want AXL to automatically restart its transfers, but instead we'd want AXL to start up in a paused state and wait for SCR to tell it to resume. SCR would rebuild/shuffle the source data files, and after that, it would be safe for AXL to resume. So then, we'd want something like:

SCR_Init()
  AXL_Init(state_file) --> AXL starts up and reads its state file, but it does not immediately restart transfers
  SCR shuffles and rebuilds source data files
  AXL_Resume(id) --> now AXL could pick up from where it left off

We could probably get this to work for pthreads, but BB would be harder. SCR might just have to cancel/restart BB transfers. But we'll have to track the type of each transfer, so maybe there should be a query function that reports the transfer type in addition to the id. I also wonder if AXL should report the state of each transfer, like whether it is dispatched or not.

adammoody commented 4 years ago

We used to have a lot of this capability in the pre-component version of SCR. In that case, async transfers were done using a separate "transfer" process that ran on each compute node along side the application processes. The SCR library coordinated with that transfer process using a shared file that was written in node-local storage (ramdisk actually), and it relied on file locking to control access between the SCR library and the transfer process. The transfer process did in fact record fields its progress, e.g.,:

https://github.com/LLNL/scr/blob/legacy/src/scr_transfer.c#L729 https://github.com/LLNL/scr/blob/develop/doc-dev/rst/developers/file_transfer.rst

In reality, this was used more so SCR could report progress to the user, but it could have also been used to checkpoint/resume a transfer after a restart. I hadn't gotten around to encoding the transfer files to protect them from node failure. They were deleted and recreated on a restart, but they could have been encoded/rebuilt.

We didn't carry this stuff over in our initial push to create the components, but I was hoping we could bring some of it back over time. I realize this complicates the design of the components, but I guess my drive there is coming from my regret of losing functionality that we used to have. Even so, the required complexity that would be needed to bring some of this back with components will be too high to justify the effort.

tonyhutter commented 4 years ago

1.

We're not doing this yet in AXL, but it might be useful to allow some transfer mechanisms to checkpoint their progress in the state file.

For pthreads and sync, I don't think checkpointing is necessary, as we can derive the state information we need by looking at the existing ._AXL destination file. The presence of an ._AXL extension means the file didn't finish transferring or is transferring and can be resumed. And the size of the ._AXL destination file tells you how much of the file has been transferred (and thus the resume offset). We do write the transfer status to the kvtree (AXL_KEY_STATUS), so we can look at that too.

2.

We also need to consider using AXL to transfer between storage levels. SCR and VeloC could use AXL to copy a file from ramdisk to the node-local SSD, or from node-local storage to say rack-level storage.

This is a really good point. We really should specify the state_file in AXL_Create() rather than in AXL_Init(). That would give the application the option to do both state_file and non-state_file transfers. It would also remove the need for AXL_Get_IDs(). We could simply remove the state_file argument from AXL_Init() and add it to AXL_Create(). We'd need to rework things internally to move all ID-specific values to their own kvtree.

The following would work to maintain backwards API compatibility:

#include <stdio.h>

#define ARG0(dummy, a0, ...) a0
#define GET_ARG0(...) ARG0(dummy, ## __VA_ARGS__, 0)

#define AXL_XFER_SYNC 1

void __AXL_Init(char *state_file)
{
        printf("%s: state_file is %s\n", __func__, state_file);
    if (state_file) {
        printf("WARNING: AXL_Init(state_file) is deprecated.  Please use AXL_Create(type, name, state__file)\n");
    }
}

#define AXL_Init(...) __AXL_Init(GET_ARG0(__VA_ARGS__))

static void __AXL_Create(int type, const char* name, char *state_file)
{
        printf("%s: state_file is %s\n", __func__, state_file);
}

#define AXL_Create(type, name, ...) \
        __AXL_Create(type, name, GET_ARG0(__VA_ARGS__)) \

int main(int argc, char**argv)
{
        char *state_file = "/tmp/state";
        AXL_Init();
        AXL_Init(state_file);
        AXL_Create(AXL_XFER_SYNC, "appname");
        AXL_Create(AXL_XFER_SYNC, "appname", state_file);

        return 0;
}

3. I like your idea about not auto-resuming transfers and explicitly calling AXL_Resume(id).

maybe there should be a query function that reports the transfer type in addition to the id. I also wonder if AXL should report the state of each transfer

Yea, we could rename and expand AXL_Get_Files() to a more general AXL_Get_Info():

AXL_Get_Info(int id, *type, *status, char **src, char **dst):

adammoody commented 4 years ago

And the size of the ._AXL destination file tells you how much of the file has been transferred (and thus the resume offset).

I don't know that it's correct to assume that the size of the file will tell us how much data was actually written. It may be valid to have a storage system that sees that we've written a bunch of data from the client and preallocates space to the file before it actually transfers those bits to disk. To be correct, I think we have to wait until fsync returns on the client to be sure the data has made it all the way to disk.

Or you could have a file system that transfers blocks of data in strange ways, so that perhaps the last byte of the most recent block has been written to disk, but some middle portion of the file has yet to be flushed.

adammoody commented 4 years ago

We don't need to bother with backwards compatibility (yet). So far the only users of AXL are SCR and VeloC, and we can change both of those. For now we can feel free to change the interface at will, and we'll probably need to for a while still while working through this issue.

tonyhutter commented 4 years ago

I don't know that it's correct to assume that the size of the file will tell us how much data was actually written. It may be valid to have a storage system that sees that we've written a bunch of data from the client and preallocates space to the file before it actually transfers those bits to disk.

If you're doing normal POSIX calls like we do with pthread/sync, then the file size would have to be consistent (unless the filesystem wasn't posix compliant). I know BBAPI does do preallocation and the filesize can't be trusted, but that's not using POSIX. We wouldn't be able to resume those transfers anyway since BB API doesn't provide a way to do that - we'd only be able to let the transfer keep going, or restart it. I dunno how DataWarp behaves.

A fsync wouldn't matter to us as far as file size goes. We don't necessarily care if the destination data is in cache or on disk, as we're not trying to protect against fileserver failure. That is, if our application crashes halfway though a transfer to GPFS, and we restart the app and resume the transfer, we don't care if the file we're resuming is in the GPFS's buffer cache or on disk. Even if the fileserver did crash without all the data being synced, then when the fileserver came back up, it would just report the size from it's last consistent write to disk, which we could still resume from.

adammoody commented 4 years ago

I guess my point is that because this can happen on some storage systems, we need to be prepared to only trust data up to the writes that came before an fsync. Some of those don't even attempt to be POSIX compliant.

As an optimization, we can have an option where the user can tell us to skip the fsync when they are using a file system where they know that they can trust the size.

BTW, do you have a pointer to where POSIX says that this is a POSIX requirement?

I actually didn't realize this is even a requirement to be POSIX-compliant. In my searches so far, I'm actually finding more articles that suggest the opposite, such that it's valid for a file to have corrupt/garbage data after a system crash.

tonyhutter commented 4 years ago

I guess my point is that because this can happen on some storage systems, we need to be prepared to only trust data up to the writes that came before an fsync.

That depends on what you mean. You can't trust the data on-disk to be up-to-date after a write(), but you can trust the values that the POSIX calls return to you to be correct. For example, POSIX says you should be able to read back any data that's been successfully written:

After a write() to a regular file has successfully returned:

    Any successful read() from each byte position in the file that was modified by that write shall return the data specified by the write() for that position until such byte positions are again modified.

    Any subsequent successful write() to the same byte position in the file shall overwrite that file data.

https://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html#tag_16_685

... that doesn't mean the data can be read back from disk. However, for the purposes of resuming an AXL transfer, we don't care if the partially-written-to destination file is in cache or on disk. We just care that the data is consistent.

BTW, do you have a pointer to where POSIX says that this is a POSIX requirement?

fstat() is defined in the POSIX spec to return the file size in stat.st_size.

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_stat.h.html

I don't see how a POSIX compliant filesystem could return anything other than the correct file size here. And by correct, I mean the file size at the last consistent state of the file (either in cache or on disk).

I actually didn't realize this is even a requirement to be POSIX-compliant. In my searches so far, I'm actually finding more articles that suggest the opposite, such that it's valid for a file to have corrupt/garbage data after a system crash.

All POSIX says is that data is consistent on disk after a sync/fsync. In fact it doesn't even say that. It just says the data "scheduled for writing out to all file systems." and "the writing, although scheduled, is not necessarily complete upon return from sync().". So even if you call fsync(), and it returns, POSIX technically does not guarantee that the data is committed to disk!

That said, any filesystem worth it's salt will at least guarantee that the data is consistent on disk, if not always up to date.

tonyhutter commented 4 years ago

So irrespective of how we decide to do checkpoint/resume transfers, I'm seeing the code changes as:

Add a state_file arg to AXL_Create(), remove it from AXL_Init().
Add AXL_Get_Info(int id, *type, *status, char **src, char **dst) to get the transfer type/state/paths.
Add AXL_Resume(id) to resume a transfer.

adammoody commented 4 years ago

Yes, I think that's a good place to start.

adammoody commented 4 years ago

I agree that fsync is not fool-proof either. In fact, at one time we had disabled fsync from actually doing anything useful in Lustre at one point -- don't know if we have since turned that back on. Though using fsync is as close as we can get. I don't know of anything stronger that we can use.

I don't see anything in that text that says POSIX has to return a correct size.

adammoody commented 4 years ago

Or rather, I should say I do believe the size will be accurate in the sense that the EOF returned by read() will match what fstat reports as the file size, but this doesn't require that the actual data bytes in the file are valid. The bytes in the file might all be garbage.

adammoody commented 4 years ago

The difference I'm worried about comes up from this sequence of events:


write(fd) <-- we can be pretty certain bytes from the write are correct in file
fsync(fd)
write(fd) <-- who knows? in particular, the file system might have updated the
              file size to account for this write, but then the node died before the
              file system could copy the data from the user buffer to disk,
              so now we have some garbage in the file
<< node dies >>

tonyhutter commented 4 years ago

I don't see anything in that text that says POSIX has to return a correct size.

I think it's implied that it returns the correct file size. If not, it would say something ambiguous like "st_size is filesystem specific" like it does for st_blksize

blksize_t st_blksize    A file system-specific preferred I/O block size 
                        for this object. In some file system types, this 
                        may vary from file to file.

The difference I'm worried about comes up from this sequence of events: ...

If the filesystem is doing that, then it's a bug. The filesystem should not be updating the filesize before it has read the data from the user.

adammoody commented 4 years ago

Where is it defined to be a bug?

adammoody commented 4 years ago

It seems to be permitted, even expected for ext3/4:

https://help.marklogic.com/Knowledgebase/Article/View/209

The 'data=writeback' mode does not preserve data ordering when writing to the disk, so commits to the journal may happen before the data is written to the file system. This method is faster because only the meta data is journaled, but is not good at protecting data integrity in the face of a system failure.

If there is a crash between the time when metadata is commited to the journal and when data is written to disk, the post-recovery metadata can point to incomplete, partially written or incorrect data on disk; which can lead to corrupt data files.

adammoody commented 4 years ago

Regarding POSIX, one thing I didn't realize is that this text you cited about a successful-read-after-write is taken to apply even across failures.

After a write() to a regular file has successfully returned: Any successful read() from each byte position in the file that was modified by that write shall return the data specified by the write() for that position until such byte positions are again modified.

That means that if a process writes to an existing part of a file, and then immediately tries to read back the bytes that it just wrote, the file system would be forced to either copy the newly written data somewhere safe or mark that part of the file as invalid before it can successfully return from the read. If it does neither and the node dies before the data is copied, the file system would return old (stale) data for that part of the file to any subsequent read, which would violate the requirement.

That requirement is more stringent than I thought.

adammoody commented 4 years ago

I always leaned on fsync for checkpoint use cases based on this kind of text:

https://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html

APPLICATION USAGE The fsync() function should be used by programs which require modifications to a file to be completed before continuing; for example, a program which contains a simple transaction facility might use it to ensure that all modifications to a file or files caused by a transaction are recorded.

RATIONALE The fsync() function is intended to force a physical write of data from the buffer cache, and to assure that after a system crash or other failure that all data up to the time of the fsync() call is recorded on the disk.

Summing up the situation so far, checkpointing the progress of a transfer is complicated. We can use the existing file size to restart the transfer on POSIX-compliant file systems. There are file systems where we can't trust the file size. fsync might help, but it also might not depending on the file system. File systems that aren't POSIX compliant don't have to do the right thing with fsync either. I/O programming is fun :-)

adammoody commented 4 years ago

If we move the state file from AXL_Init to AXL_Create, that leads to some questions about the ids that AXL returns during AXL_Create.

Right now, those ids tick up from a global counter with each call to AXL_Create. How does AXL recover that global counter after a restart, or are you thinking we drop the global id and move to something else?

You mentioned that with this change, we can drop the GetIds call you had originally proposed, but then how does the caller get its old id values? Does it need to remember those values on its own, or does it have to call AXL_Create again after restarting?

adammoody commented 4 years ago

I'm still stuck thinking about about whether knowing the size means we can always trust all data in the file even in the case of POSIX.

It seems like the following situation does not violate the read-after-write requirement above. Does this violate POSIX for some other reason?

fd = open("file")
write(fd, 100 bytes) --> write returns successfully
<< process crashes >>

<< on some other process >>
stat("file") --> returns and says st.st_size = 100
fd = open("file)
read(fd, 100 bytes) --> always returns EIO, so that read never returns successfully

tonyhutter commented 4 years ago

If the filesystem is doing that, then it's a bug. The filesystem should not be updating the filesize before it has read the data from the user.

Where is it defined to be a bug?

Because it hasn't verified the user data is valid:

char *buffer = NULL;
write(fd, buffer, 25);

If the FS updates the file size to 25 before checking 'buffer', then it's a bug. During and after that write(), the file size should be unchanged.

The 'data=writeback' mode does not preserve data ordering when writing to the disk, so commits to the journal may happen before the data is written to the file system

data=writeback is unsafe. If the sysadmin is doing risky things with their production filesystem, there's not much we can do about that. From your article:

Linus Torvalds comments on 'data=writeback'

"it makes things much smoother, since now the actual data is no longer in the critical path for any journal writes, but anybody who thinks that's a solution is just incompetent.  We might as well go back to ext2 then. If your data gets written out long after the metadata hit the disk, you are going to hit all kinds of bad issues if the machine ever goes down."

Summing up the situation so far, checkpointing the progress of a transfer is complicated. We can use the existing file size to restart the transfer on POSIX-compliant file systems. There are file systems where we can't trust the file size. fsync might help, but it also might not depending on the file system.

At the application level, you really have no choice but to trust the filesize the filesystem tells you. If the filesystem is lying to you, either to achieve some optimization, or because of corruption, there's nothing you can really do about it. You have to trust the software layers under you.

How does AXL recover that global counter after a restart, or are you thinking we drop the global id and move to something else?

All IDs would be per-AXL_Create() and not global to all callers of the library. So you could have app1 call AXL_Create() and get ID 1, and app2 also call AXL_Create() and get ID 1.

You mentioned that with this change, we can drop the GetIds call you had originally proposed, but then how does the caller get its old id values? Does it need to remember those values on its own, or does it have to call AXL_Create again after restarting?

They would call AXL_Create() after restarting. The ID it returns may or may not be the ID it had before.

It seems like the following situation does not violate the read-after-write requirement above. Does this violate POSIX for some other reason?

No, it's perfectly valid. The filesize is correct, it's just that the file data is on bad sectors or something.

adammoody commented 4 years ago

No, it's perfectly valid. The filesize is correct, it's just that the file data is on bad sectors or something.

If that example is valid under POSIX, then is shows there are cases where the size is incorrect from our perspective even on a POSIX-compliant file systems, meaning POSIX-compliance doesn't ensure we can trust the size.

tonyhutter commented 4 years ago

If that example is valid under POSIX, then is shows there are cases where the size is incorrect

I'm confused. Your example shows the file size is correct (stat returns st_size = 100). The file data itself simply can't be read. Just because the file data can't be read does not mean the file size is incorrect.

adammoody commented 4 years ago

Consider a file system that implements this write by atomically increasing the file size and marking the newly added region as invalid. It then copies the data from the user buffer to storage, after which it marks that new region as valid. Any read from an invalid region returns EIO.

That meets the read-after-write requirement so that a successful read only returns data written by the app. It solves that by just always returning an error. So then the question is whether this implementation violates POSIX in some other way.

tonyhutter commented 4 years ago

Ok I see what you're saying. Pedantically speaking, the example is not POSIX because read() should be returning EAGAIN instead of EIO. If read() did return EAGAIN, then I would agree it is POSIX compliant, and possibly a valid strategy for a filesystem to take. For example, you could be reading from a file on tape with O_DIRECT.

tonyhutter commented 3 years ago

This has essentially been resolved with With https://github.com/ECP-VeloC/AXL/pull/77:

The state_file argument was moved from AXL_Init() -> AXL_Create(). This gives us per-AXL-ID state_files.
AXL_Resume() was implemented.

ECP-VeloC / AXL

Behavior when user passes a state_file to AXL_Init() #76