LLNL / UnifyFS

UnifyFS: A file system for burst buffers

Design: client library API #257

Open MichaelBrim opened 5 years ago

MichaelBrim commented 5 years ago

Currently, all client library functionality is designed around direct interception of existing I/O APIs (either via linker wrapping or GOTCHA). As a result, there really isn't a defined client library API other than unifycr_mount and unifycr_unmount. This leads to quite a bit of redundant code, and doesn't adequately support alternate uses, such as:

  • direct application clients, like Hyogi's command line tools effort (#148)
  • VeloC memory-based API
  • memory-based object stores and shared/anonymous mmap regions (related to #248)

This issue is to track the code reorganization needed within the client library to cleanly abstract the UnifyCR functionality from the upper-level uses. Ideally, we would end up with a libunifycr that could be used directly by an application, or as a support library used by libunifycr-posix (and perhaps libunifycr-mem). My suggestion is that the libunifycr API needs a thoughtful design to support the various use cases, and should not be tied to POSIX APIs/semantics. The libunifycr implementation would contain all of the code to interact with the local unifycrd (i.e., RPCs and shmem communication, shared data types, and any associated serialization).

adammoody commented 5 years ago

Just below the posix wrappers, things largely funnel into the unifycr_fid* calls. That's a leftover artifact from the original CRUISE code, and that interface is still posix-like, but it should provide a better starting point than the wrappers themselves.

adammoody commented 5 years ago

The unifycr_fid abstraction assumes that a file is a linear array of bytes, where you have to explicitly allocate (extend) and free (shrink) storage capacity at the end of the array. That made a lot of sense in CRUISE, but it makes less sense in the case of unifycr.

MichaelBrim commented 5 years ago

full set of related issues: #43 #45 #46 #47 #141 #176 #197 #216 #225 #248 #252 #262

boehms commented 4 years ago

Should we start using Project boards to organize these kinds of efforts?

MichaelBrim commented 4 years ago

UnifyFS-Client-API-Proposal-v02.pdf

See attached pdf for current proposed API.

MichaelBrim commented 4 years ago

Here's the markdown writeup. https://github.com/LLNL/UnifyFS/blob/client-api-doc/docs/client-api.md

tonyhutter commented 4 years ago

So I've finally had some time to fully review the API doc and the comments in this issue. I'll first go over the comments, and then I'll go over what I see as the most important parts of the API proposal.

Currently, all client library functionality is designed around direct interception of existing I/O APIs (either via linker wrapping or GOTCHA). As a result, there really isn't a defined client library API other than unifycr_mount and unifycr_unmount.

That's a good thing. We should be happy when there's no custom, monolithic client library API that our users have to learn and link against (and that we have to document and maintain). I'll be happy when unifyfs_mount/unmount are gone. Imagine being able to use any binary with UnifyFS without having to build against Unify or do anything. That would be amazing.

This leads to quite a bit of redundant code, and doesn't adequately support alternate uses, such as

  • direct application clients, like Hyogi's command line tools effort (#148)
  • VeloC memory-based API
  • memory-based object stores and shared/anonymous mmap regions (related to #248)

In general, we should only provide an API for things that POSIX doesn't support, or that we can't tack onto POSIX in some way, and even then only if there is a real user need for it. If it's a nebulous "the user may want this at some point in the future, possibly..." then we should wait. Otherwise we risk wasting time developing a feature nobody uses, or one that's designed wrong (but that we still have to document and maintain).

For example, we could do everything listed in #148 using POSIX:

  1. current daemon status including the memory consumption across all nodes. Send SIGUSR1 to the daemon to tell it to print its stats.

  2. file system statistics, such as the number of files, space consumption, etc. Wrap statvfs() (see the sketch after this list).

  3. a small shell-like environment where a user can interactively explore the namespace (cd, ls, ...). Use busybox (#426) or FUSE.

  4. moving files between the unifycr volume and any other mountpoint (e.g., /lustre, /mnt/xfs, ...). Wrap splice() or copy_file_range(), or the FICLONE ioctl().
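
For item 2, here is a minimal sketch of what the user side could look like, assuming UnifyFS wrapped statvfs() and is mounted at /unifyfs (both assumptions for illustration, not current behavior):

#include <stdio.h>
#include <sys/statvfs.h>

int main(void) {
  struct statvfs vfs;
  /* a wrapped statvfs() would fill these fields via RPCs to the servers */
  if (statvfs("/unifyfs", &vfs) == 0) {
    printf("files: %lu\n", (unsigned long)(vfs.f_files - vfs.f_ffree));
    printf("bytes used: %llu\n",
           (unsigned long long)(vfs.f_blocks - vfs.f_bfree) * vfs.f_frsize);
  }
  return 0;
}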

I can't speak to the VeloC memory-based API or #248 since I'm unfamiliar with them.

So why is defaulting to POSIX a good thing?:

  1. Compatible with existing programs.
  2. Interfaces are reasonably well designed.
  3. Interfaces are already documented. We only need to document the differences if we do something odd with them (like chmod()/laminate).
  4. Can use default headers and libraries - you don't need to link against anything custom.
  5. Our users are already familiar with POSIX interfaces.

Regarding the doc itself:

https://github.com/LLNL/UnifyFS/blob/client-api-doc/docs/client-api.md

After reading the doc I'm still not clear on what the actual requirements are for this API. I was hoping there would be a section where it would list the requirements from users. As in, "the HDF5 folks want the following features: A B C... and this is why they want them". There are reasons given in the "Motivations" section, but I'm still skeptical:

Although the original design accomplished its goal of easy application integration, the resulting client library code was not developed in a modular fashion that exposed clean internal APIs for core functionality such as file I/O. The lack of modularity has limited the ability of UnifyFS to explore additional integrations that would increase its usability.

For example, existing commonly used distributed I/O storage libraries such as HDF5 and ADIOS have modular designs that permit implementation of new storage backend technologies, but UnifyFS provided no API that could be leveraged to support such developments.

This proposed API increases modularity in the same way this increases modularity:

int UNIFYFS_WRAP(printf)(const char *format, ...) {
  va_list args;
  va_start(args, format);
  int ret = unifyfs_printf(format, args);
  va_end(args);
  return ret;
}

int unifyfs_printf(const char *format, va_list args) {
  return vprintf(format, args);
}

There, printf() is now more "modular". Has this improved anything? No. It seems to me that the API is similar. For the most part it just provides another layer of indirection with functions that are largely just variants of the POSIX ones (with some exceptions). I don't see a benefit from this. It's not like SCR (https://github.com/llnl/scr), where there were discrete, self-contained parts of the code that could be spun off into separate modules (and it was beneficial to do so). I don't see how the proposed API would help permit implementation of "new storage backend technologies" any more than the current codebase does.

Further, users had no way of exploring the UnifyFS namespace from outside of their applications, since common system tools (e.g., Unix shells and file system commands) could not be used without explicitly modifying their source code. The UnifyFS team has explored various options to provide system tools (e.g., FUSE and custom command-line tools), but initial development on these approaches stalled due to the lack of appropriate APIs.

I agree, you can't implement a FUSE driver using the APIs we have now. For example, I don't think our opendir/readdir currently work. No doubt there are other functions we'd need to implement too. But the answer to this is to implement the missing functions, not design a totally new API from scratch. In fact, the proposed API would make it harder to implement a FUSE driver than if we were to implement the missing POSIX functions. Why? Take a look at the FUSE functions:

http://www.maastaar.net/fuse/linux/filesystem/c/2016/05/21/writing-a-simple-filesystem-using-fuse/

They're basically just analogs of the POSIX functions. fuse.chmod() would directly call UNIFYFS_WRAP(chmod), for example. It would be less awkward to call the POSIX functions than it would be to call the unifyfs_* ones.
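
To illustrate, here is a rough sketch against the FUSE 2.x operations table; the forwarding function name is hypothetical, and the libc call would resolve to our wrapper when linking is set up to intercept it:

#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <sys/stat.h>

/* hypothetical forwarding function: a FUSE chmod is just a POSIX chmod */
static int unifyfs_fuse_chmod(const char* path, mode_t mode) {
  return chmod(path, mode);  /* resolves to UNIFYFS_WRAP(chmod) when wrapped */
}

static struct fuse_operations unifyfs_fuse_ops = {
  .chmod = unifyfs_fuse_chmod,
  /* .getattr, .open, .read, .write, ... would follow the same one-to-one pattern */
};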

Finally, using interposition as the only means of affecting application I/O behavior meant that important UnifyFS semantics, particularly file lamination, required hijacking existing system calls like fsync() and chmod() to impart new meaning to their use in applications. For many applications, these system calls were not already used, and thus had to be added to the source code to effect the desired behavior.

If you're worried about it imparting new meaning to chmod(), we could change laminate to be something else, like an ioctl() or hijack fcntl(F_WRLCK) to set a "write lock" on the file you want to laminate. There are other ways to do it. chmod() was elegant since lamination would change the write bits anyway, and you could in theory laminate a file from the command line. Note that no matter whether you do a chmod() or a unifyfs_laminate(), you're already imparting new meaning on the filesystem, since the whole concept of laminating a file is not a normal thing.
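
As a minimal sketch of the fcntl() alternative (assuming a wrapped fcntl() that treats taking a whole-file write lock as the laminate request; the helper name and path are illustrative, not existing UnifyFS code):

#include <fcntl.h>
#include <unistd.h>

/* hypothetical helper, not existing UnifyFS code */
int laminate_via_fcntl(const char* path) {
  int fd = open(path, O_RDWR);  /* e.g., "/unifyfs/ckpt/rank0.dat" */
  if (fd < 0) {
    return -1;
  }

  struct flock lock = {
    .l_type   = F_WRLCK,
    .l_whence = SEEK_SET,
    .l_start  = 0,
    .l_len    = 0,  /* 0 means "lock the whole file" */
  };

  /* a wrapped fcntl() could interpret this as "laminate this file" */
  int rc = fcntl(fd, F_SETLK, &lock);
  close(fd);
  return rc;
}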

Also, the doc makes the point that a user would have to add chmod() into their code to laminate a file. That's true. However, a custom API would also require the user to add even more code. Compare:

#include <sys/stat.h>
chmod();

to:

#include "unifyfs_api.h"
unifyfs_handle fshdl;
unifyfs_initialize();
unifyfs_laminate();
unifyfs_finalize();
Makefile changes to add -lunifyfs_api -I/path/to/unifyfs/headers

The primary goal in providing a new client API is to expose all core functionality of UnifyFS. This functionality includes methods for file access, file I/O, file lamination, and file transfers.

95% of Unify's core functionality is already exposed through our POSIX wrappers. We should not duplicate that functionality in a custom API. For the 5% that is not, we should (and I'm repeating myself here):

  1. Decide if there's an actual real-world need to implement it.
  2. See if it can be exposed through POSIX in some way.
  3. If not, then provide APIs to do it.

For example, the doc proposes an API for file transfers. Why not consider wrapping splice() or copy_file_range(), or the FICLONE ioctl()? Node 0 could call one of these to tell the server to initiate an "all nodes copy your section of file X to the parallel filesystem". In fact, vanilla cp uses FICLONE by default, so if we ever do get FUSE working, cp could be really fast. That said, it may make sense to have a custom API call to transfer a list of files/dirs to save ourselves from having to issue multiple RPCs (open/splice/close) for each file we're transferring. We should gather data and benchmarks to see if that really is the case before implementing it, though.
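
As a rough sketch of that direction (assuming copy_file_range() is wrapped; the helper name and paths are illustrative, and a real implementation could hand the copy loop off to the servers to run in parallel):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* hypothetical helper: copy one file out of UnifyFS to another mountpoint */
int copy_out(const char* src, const char* dst) {
  int in  = open(src, O_RDONLY);                            /* e.g., "/unifyfs/ckpt.dat" */
  int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);  /* e.g., "/lustre/ckpt.dat" */
  if (in < 0 || out < 0) {
    if (in >= 0) close(in);
    if (out >= 0) close(out);
    return -1;
  }

  struct stat st;
  fstat(in, &st);

  off_t remaining = st.st_size;
  while (remaining > 0) {
    /* a wrapped copy_file_range() could trigger a server-side parallel copy */
    ssize_t n = copy_file_range(in, NULL, out, NULL, remaining, 0);
    if (n <= 0) {
      break;
    }
    remaining -= n;
  }

  close(in);
  close(out);
  return (remaining == 0) ? 0 : -1;
}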

Additionally, UnifyFS provides some user control over its behavior. Some of this behavior is already exposed via existing configuration settings that can be controlled by clients using environment variables. Where appropriate, it is useful to provide explicit control mechanisms in the API as well.

I assume this is referring to the proposed struct unifyfs_options that gets passed to unifyfs_initialize(). We could just wrap mount() and pass all those options as key=value pairs (in mount()'s 'data' field). That's going to be more extensible than using a fixed struct for configuration parameters.
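
For example (a sketch only; the option names below are made up, not existing UnifyFS configuration keys), a wrapped mount() could accept configuration as key=value pairs in its data argument:

#include <sys/mount.h>

/* hypothetical helper showing the call a client would make */
int unifyfs_mount_with_opts(void) {
  return mount("unifyfs", "/unifyfs", "unifyfs", 0,
               "consistency=laminated,spillover_size=1073741824");
}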

Lastly, I wanted to talk about this diagram:

libunifyfs_posix     libhdf5
             |        |
             libunifyfs

I know this is the dream, but I feel it will quickly turn into this:

libunifyfs_posix     libhdf5
  |         |        |
  |         libunifyfs
  |             |
  internal UnifyFS APIs

Why? Because the top diagram is basically saying our internal API is libunifyfs, and that's going to be a stable API. Internal APIs are never stable. They change all the time. Let me give an example. Currently we have an internal API function called unifyfs_fid_create_directory(path). What if we wanted to add permission bits to that function? Well, we can just change the function to be unifyfs_fid_create_directory(path, mode). The user would never notice. Now let's say unifyfs_fid_create_directory() was in libunifyfs. In that case, we couldn't change the function prototype without breaking the user's code. Or, using the API proposal as an example:

unifyfs_rc unifyfs_create(unifyfs_handle fshdl,
                          const int flags,
                          const char* filepath,
                          unifyfs_gfid* gfid);

...what if we needed to make flags unsigned? Or make it a uint64_t? We'd have to break users' code to change the API. The point is, we can't know exactly how our internal functions will evolve over time, and so we shouldn't export them to the users as a stable interface.

I think this is a more likely diagram to shoot for:

              libhdf5
               |   |
               | libunifyfs
               |   |   |
     libunifyfs_posix  |
                   |   |
       internal UnifyFS APIs

So what would I include in an UnifyFS API?

  1. Anything that HDF5 specifically asked for.
  2. Server statistics about Unify, like RPC count stats, RPC latency histograms, read/write size histograms.
  3. Current state of all clients. Are they all up and pingable? How much data have they transferred?
  4. Transfer stats in/out of the /unifyfs filesystem. How many files did we transfer? File size histograms. Bandwidth histograms.
  5. API for mass transfer of files/directories if benchmarks show we need that.

The API would expose these as nice, easy-to-use, stable functions to the user, and then call ioctls() or internal functions under the covers to actually make things work.

adammoody commented 4 years ago

We can wait for later versions, but we'll want to include calls that can be used to operate on many files at once.

All basic posix calls require the user to operate on one file at a time. HPC easily generates datasets that have millions of files, so one-at-a-time is too slow. We want to have calls where a single command can be broadcast down a tree to all servers, which can then operate in parallel.

A similar interface could exist for reading items from a directory. We'd want a function to return the total number of items, say statdir(), and then another function that lets one seek into the middle of the set, say lseekdir(offset). For example:

dir = opendir("path")

// return total number of entries in opened dir
long dirsize = statdir(dir)

// compute starting offset into directory items based on my rank
offset = (dirsize / nranks) * rank

// seek to starting offset for this rank
lseekdir(dir, offset, SEEK_SET)

// read my portion of the items
for (i = 0; i < dirsize / nranks; i++)
  entry = readdir(dir)
  process entry

closedir(dir)

And this could be combined with a range read to grab a whole collection of items at once:

end_offset = (dirsize / nranks) * (rank + 1);
struct dirent entries[end_offset - offset];
readdir_range(dir, offset, end_offset, &entries[0]);

MichaelBrim commented 4 years ago

@tonyhutter I will be brief in my answer. Your concerns are noted, but you are focusing on the wrong use case. We have two main classes of users - parallel applications and I/O middleware libraries. For existing applications, I am in complete agreement that we should be able to do 99% of what we want to do in the POSIX, MPI-IO, etc. calls they are already using in their application. Currently, the primary use case for this client API is embedding in other libraries, like HDF5. We have had several conversations with the HDF5 team about different non-POSIX behaviors we could offer them as useful capabilities. There is no reason why we should not provide them with a straightforward API for using those capabilities.

tonyhutter commented 4 years ago

Currently, the primary use case for this client API is embedding in other libraries, like HDF5. We have had several conversations with the HDF5 team about different non-POSIX behaviors we could offer them as useful capabilities. There is no reason why we should not provide them with a straightforward API for using those capabilities.

@MichaelBrim then you need to list exactly what HDF5's requirements are. What did HDF5 say they wanted in those conversations? HDF5 is only mentioned once in the Motivation section, and even then the requirements are vague:

For example, existing commonly used distributed I/O storage libraries such as HDF5 and ADIOS have modular designs that permit implementation of new storage backend technologies, but UnifyFS provided no API that could be leveraged to support such developments.

I brought up this lack of detail three months ago:

Regarding the doc itself:

https://github.com/LLNL/UnifyFS/blob/client-api-doc/docs/client-api.md

After reading the doc I'm still not clear on what the actual requirements are for this API. I was hoping there would be a section where it would list the requirements from users. As in, "the HDF5 folks want the following features: A B C... and this is why they want them". There are reasons given in the "Motivations" section, but I'm still skeptical:

Without knowing what HDF5's requirements are, how can we possibly know if this API is the best way to satisfy their requirements?

For example, I see you propose a unifyfs_remove() function:

 /* Remove an existing file from UnifyFS */
unifyfs_rc unifyfs_remove(unifyfs_handle fshdl,
                          const char* filepath);

I have no idea if HDF5 needs that or not. There's no "HDF5 asked for a way to remove files without using unlink() because of reason X, so here's what I propose" listed anywhere in the doc. How am I to know if this function is something that's really needed, or just re-inventing the wheel?

The design of APIs should be driven by requirements, and we all need to know what those specific requirements are. After we get the requirements we can then decide what is reasonable to implement and what that implementation would look like.

roblatham00 commented 3 years ago

I don't know if everyone on this issue was also on the "Unify/HDF5 discussion on MPI-IO" so I'll briefly repeat myself.

Here are three things I would like to see in a libunify API that you cannot get from wrapping posix open/write/read/close calls.

With these items in place, it is still possible to provide legacy posix interfaces, including semantics.

roblatham00 commented 3 years ago

Oh, a fourth thing!

POSIX asynchronous i/o is awful. An HPC-oriented async i/o interface would look a lot different and perform a lot better (as we demonstrated with PVFS).
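
For context, a minimal sketch of the POSIX AIO interface being criticized (one control block per request, polled for completion; link with -lrt on Linux; the path is illustrative):

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void) {
  static char buf[1 << 20];  /* 1 MiB buffer to write */
  int fd = open("/tmp/aio_demo.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) {
    return 1;
  }

  /* one control block per outstanding request */
  struct aiocb cb;
  memset(&cb, 0, sizeof(cb));
  cb.aio_fildes = fd;
  cb.aio_buf    = buf;
  cb.aio_nbytes = sizeof(buf);
  cb.aio_offset = 0;

  aio_write(&cb);

  /* the caller polls (or installs a signal/thread callback) per request */
  while (aio_error(&cb) == EINPROGRESS) {
    /* could overlap computation here */
  }
  ssize_t n = aio_return(&cb);

  close(fd);
  return (n == (ssize_t)sizeof(buf)) ? 0 : 1;
}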

qkoziol commented 3 years ago

I concur with Rob's points - it's worthwhile to work on different aspects of the visibility-asynchrony-performance "iron triangle" and think about what you are willing to ask for and give up. Today, it seems like giving up some visibility in favor of performance (by using more asynchrony) is a good choice.

With that in mind - I would suggest making all your API routines asynchronous, not just read/write/truncate/zero. Having asynchronous open/close/etc operations as well is quite useful.