NVIDIA / proxyfs


how to start using proxyFS with our existing saio? #550

Open tabeamiri opened 3 years ago

tabeamiri commented 3 years ago

hello everyone, I would like to use ProxyFS with my existing SAIO, but unfortunately I didn't find any useful documentation describing how ProxyFS works or what the architecture looks like. I don't know how to start. I appreciate any suggestions.

edmc-ss commented 3 years ago

Hi tabeamiri...and welcome to ProxyFS!

Sounds like you already have Swift up and running...and want to add ProxyFS...great!

First things first...we have a Slack channel set up (proxyfs.slack.com) that might be more conversational to get going. Please join.

Next, it is important to understand what ProxyFS is doing. In a nutshell, ProxyFS provides POSIX functionality on top of a Swift Account. Those Containers you normally see inside an Account become top level directories in the root of a file system. Objects in those Containers are files in those (Container) directories...with a very important distinction. If two Objects, for example, are named "foo/cat" and "foo/dog", they will actually appear as files named "cat" and "dog" inside a subdirectory named "foo". Indeed, in aligning Container/Object with POSIX hierarchical file system concepts, there is the added restriction that you can't also have an Object named "foo" as that would collide with the subdirectory named "foo" in the example. Finally, ProxyFS also happens to support symbolic links as well...and these should not be considered equivalent to the symbolic link capability recently added to Swift.
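For example, here is a minimal sketch of that mapping (the proxy host, account, token, Container name, and mountpoint below are placeholders for illustration, not values from any particular deployment):

curl -X PUT -H "X-Auth-Token: $TOKEN" http://swift-proxy:8080/v1/AUTH_test/docs
curl -X PUT -H "X-Auth-Token: $TOKEN" --data-binary @cat http://swift-proxy:8080/v1/AUTH_test/docs/foo/cat
curl -X PUT -H "X-Auth-Token: $TOKEN" --data-binary @dog http://swift-proxy:8080/v1/AUTH_test/docs/foo/dog

ls /mnt/CommonVolume/docs/foo
# cat  dog

The first PUT creates the Container "docs" (a top-level directory in the volume); the next two PUT Objects named "foo/cat" and "foo/dog", which the file system view presents as files "cat" and "dog" inside the subdirectory "docs/foo".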

Next up, it might be helpful to understand what protocols you intend to use to access a ProxyFS volume. There are, in fact, SIX distinct protocols simultaneously supported (i.e. all six can be used to talk to the files/objects in a ProxyFS volume):

SMB
NFS
Swift
S3
"Server" FUSE mount
"Client" FUSE mount

In a nutshell, ProxyFS presents its file system to each of these in subtly different ways:

SMB - A Samba "VFS" plug-in binds a volume shared via SMB to a ProxyFS volume.

"Server" FUSE - ProxyFS presents the file system locally (i.e. on the node where ProxyFS is running) via FUSE.

NFS - By listing the "Server" FUSE mountpoint in that node's /etc/exports, the node where ProxyFS is running can NFS serve the volume.

Swift & S3 - a special middleware (pfs_middleware) is inserted in the Swift Proxy pipeline. This enables both Object APIs to present the file system as if it were a normal Swift Account.

"Client" FUSE - PFSAgent is a program built for a client environment that remotely accesses Swift and ProxyFS. It presents the file system locally (i.e. on the client) via FUSE. PFSAgent is particularly effective at gaining back the true scale out performance of Swift...at least for reads and writes (i.e. metadata operations are still carried out by a central ProxyFS instance). This is much more scalable than for the SMB and NFS mounting options.

Architecturally, ProxyFS sits alongside a Swift Proxy. Indeed, there is a SECOND Swift Proxy instance - referred to as the NoAuth Swift Proxy since it lacks the "auth" middleware typical in Swift Proxy pipelines. This NoAuth Swift Proxy has a very short pipeline...that importantly includes the other middleware included in the repo: meta_middleware.

Configuring ProxyFS (and, indeed, PFSAgent as well) is accomplished with a .INI-style file passed as the lone argument to the proxyfsd program. As a rule, there are many config parameters...and there are no defaults. So it's very helpful to look at some examples (e.g. proxyfsd/defaults.conf). And, of course, for SMB & NFS, you'll want to provide smb.conf and /etc/exports as well.

File systems must be formatted on top of an empty Account. There is no mechanism to convert an existing Swift Account to a ProxyFS Volume. The mkproxyfs binary is used to format a volume based on the same config file ultimately passed to proxyfsd.
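As a rough sketch of that flow (the mkproxyfs flag and argument order here are my assumptions...check the repo's mkproxyfs usage and the saio scripts mentioned below for the real invocation):

mkproxyfs -N CommonVolume saio/proxyfs.conf
proxyfsd saio/proxyfs.conf

The first command formats the empty Account behind "CommonVolume"; the second serves it, with the same .INI-style config file as the lone argument.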

I'm sure this sounds like a lot...so where to start? Let me suggest you look in the saio subdirectory of the repo. There you will see a proxyfs.conf file with a handy example. There are scripts in a subdirectory of saio/ that will start everything up for you. Look particularly at start_and_mount_pfs...it starts Swift, then uses mkproxyfs to format "CommonVolume", then launches proxyfsd to serve it locally (via FUSE), then launches Samba and nfsd, then mounts via SMB & NFS. It will even launch pfsagentd and serve the file system that way as well (look in the pfsagentd directory for pfsagent.conf to see its config).

Well that's a start. There is a lot to ProxyFS...plenty to dig into based on your interest & needs.

Feel free to reach out via the Slack channel I posted above...and good luck!

tabeamiri commented 3 years ago

Many thanks 🙏 for your comprehensive response. Sorry, I could not join proxyfs.slack.com due to some technical reasons. I'll try again later. About ProxyFS and how it stores data, I do not understand the details. If I have ProxyFS and SAIO, is the data stored in a file-system-like structure by ProxyFS rather than the usual Swift structure, right? And does the proxy node, with the help of some middleware, convert this structure to the Swift model?

edmc-ss commented 3 years ago

Excellent questions...let me explain some inner workings a bit.

First the real problems of putting a file system on Swift:

Eventual Consistency is a real challenge. We need to be sure that when we PUT an Object and later GET it, the data returned is what we last wrote. To accomplish this, ProxyFS literally always writes new Objects under the covers. That way, the GET either works as desired or we simply retry it.

We cannot modify portions of an Object either (even setting aside the Eventual Consistency concerns above). So we write only the changes, each to a new Object. We call the sequence of Objects that make up a file's data LogSegments. Much like Swift Static Large Objects ("SLO"), each file is described by the equivalent of an SLO Manifest that pieces together ranges of these LogSegment Objects to form the contents of the file.

Finally, POSIX file system users expect to be able to rename files and hard link multiple references to a file. With Swift, a rename must necessarily copy the file to its new path such that a subsequent GET (of the new path) can find it. Hard links would be possible by creating a new clone of the SLO Manifest...but ensuring every clone of that SLO Manifest gets updated on subsequent modifications would be challenging. The closest Swift can come to hard links is actually more akin to symbolic links.

So...under the covers, ProxyFS is basically maintaining each file as an SLO Manifest-like sequence of references to LogSegment Objects...and has to garbage collect those LogSegment Objects as they become unreferenced when files are modified or deleted.
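A small illustration of what that means in practice (paths are hypothetical...the point is the write pattern, not the exact commands):

dd if=/dev/urandom of=/mnt/CommonVolume/foo/cat bs=1M count=1
dd if=/dev/urandom of=/mnt/CommonVolume/foo/cat bs=4K count=1 seek=16 conv=notrunc

The first command writes 1 MiB to a file through a ProxyFS mount; the second overwrites 4 KiB in the middle of it without truncating. Under the covers, each write lands in a brand-new LogSegment Object, the file's manifest now stitches together ranges of both Objects (most of the first plus the 4 KiB from the second), and the overwritten range of the first LogSegment becomes garbage to collect later.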

In addition, ProxyFS must maintain directories...which simply provide a mapping from a basename to a file. Each "file" referenced is actually termed an Inode. An Inode has a type (file, directory, or symbolic link). In the case of a FileInode, it includes that SLO Manifest-like structure describing each "extent" of the file. Indeed, we call this SLO Manifest equivalent an ExtentMap.

You can actually see what an ExtentMap looks like via HTTP. Issue a command like this:

curl http://&lt;proxyfs-host&gt;:15346/volume/&lt;volume-name&gt;/extent-map/foo/bar/cat

That port :15346 is to reach the embedded HTTP Server inside ProxyFS...and is, of course, configurable.

The /foo/bar/cat is the path to the file (equivalently, an Object named "bar/cat" inside a Container/Bucket named "foo").
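For instance, against the saio setup (assuming the volume is named "CommonVolume" and the embedded HTTP Server is reachable on localhost at the port shown above), the command would be:

curl http://localhost:15346/volume/CommonVolume/extent-map/foo/bar/cat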

As you can imagine, a file's ExtentMap can get very elaborate...e.g. if a file is written randomly. To scale up to this complexity, the ExtentMap is stored as a B+Tree....much more powerful than the SLO Manifest that is merely a linear list of "extents".

Directories, too, might get extremely large and yet need to support lookups and modifications quickly. So they are implemented as B+Trees as well. Finally, all those directory entries point to Inodes...and that table of Inodes is, itself, a B+Tree as well.

As the file system is modified, all that really happens is that updates to those three B+Tree types (FileInode ExtentMaps, DirectoryInode lists of dir-entries, and the InodeTable) are "logged" in a special Container: ".checkpoint". Since we record such B+Tree changes in logged form, ProxyFS even supports snapshots!

Finally, you asked about the way that middleware (pfs_middleware) enables Object API access to the file system. Let's take the easiest example: GET. As the GET moves thru the Swift Proxy pipeline, it hits pfs_middleware that determines (based on a Header on the Account) that the Account is "BiModal"...meaning it is managed by ProxyFS. So instead of simply passing thru the GET path, it asks (via a JSON RPC) ProxyFS for a copy of the ExtentMap for the range requested. Then, it "plays" the ExtentMap much like the "slo" middleware does...issuing GETs for each extent. PUTs are kind of the reverse...where pfs_middleware places the content of the PUT in some Object...and then tells ProxyFS where it PUT it. ProxyFS then constructs the ExtentMap equivalent, fills in a FileInode, and inserts the appropriate directory entry to point to the new FileInode.
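In other words, the same file is reachable both ways...for example (account, token, proxy host, and mountpoint are placeholders):

cat /mnt/CommonVolume/foo/bar/cat
curl -H "X-Auth-Token: $TOKEN" http://swift-proxy:8080/v1/AUTH_test/foo/bar/cat

The first reads the file via the "Server" FUSE mount; the second reads the very same bytes via the Object API through the normal, client-visible Swift Proxy, where pfs_middleware handles the GET against the BiModal Account.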

I hope I've addressed your questions...

edmc-ss commented 3 years ago

You can actually see the innards of how ProxyFS stores data by using the "NoAuth" Swift Proxy serving localhost:8090. Do an Account GET and you will see the checkpoint Container (sorry...having a hard time avoiding markdown translating my underscores as requesting boldface) as well as the other Containers used to hold file data. The Objects in each Container are merely 64-bit Hex named...so somewhat hard to know what file data is stored within them. That's what that extent-map URL is helpful for.

Finally, if you do a HEAD on that Checkpoint Container, you should notice a four-valued Header. Those four values make up what file system people would call the "superblock"...basically...the root of the file system. The four 64-bit Hex numbers are:

Version/Format# - in this case "3"

LastCheckpointObject - the name of the Object in the Checkpoint Container where we wrote the last Checkpoint of the file system.

LastCheckpointSize - the length of that Checkpoint, written at the "tail" end of that Object

ReservedToNonce - the next 64-bit number to use when uniquely naming an Object. Recall that we never want to reuse Object names lest we be unsure a successful GET actually returned the most recent Object Contents? This is how we do that.

If you do an Account GET, you should also see that Header that tells pfs_middleware this Account is "BiModal".
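Putting that together, a quick way to poke at the raw layout via the NoAuth Swift Proxy (the account name here is a placeholder, and the checkpoint Container name will be whatever your config specifies):

curl http://localhost:8090/v1/AUTH_test
curl -I http://localhost:8090/v1/AUTH_test/.checkpoint
curl -I http://localhost:8090/v1/AUTH_test

The first lists the Containers backing the volume; the second HEADs the checkpoint Container to reveal the four-valued "superblock" Header; the third HEADs the Account itself to show the Header that marks it "BiModal" for pfs_middleware.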

That's it in a nutshell...

tabeamiri commented 3 years ago

Oh great, now I know the basics about ProxyFS. One more question: what is the difference between pfs_middleware and meta_middleware? You described pfs_middleware and I understood it completely, but there are no details about meta_middleware. Should they both be activated in the proxy alongside ProxyFS? By the way, would you mind sending me an invitation email for your Slack channel? It seems there is no direct way to join except with an invitation link. Thank you dear edmc-ss

edmc-ss commented 3 years ago

When ProxyFS is deployed in a Swift cluster, there are actually TWO Swift Proxy pipelines. One is the full Swift Proxy that includes all selected middleware filters as desired... typically including some auth filter that restricts access to Accounts to only authorized Users. Into this Swift Proxy, we insert the pfs_middleware previously described. This way, Swift & S3 requests that arrive at this client-visible Swift Proxy gain support for accessing ProxyFS volumes. Indeed, if you don't add pfs_middleware, users will be exposed to the underlying raw storage of ProxyFS. That's akin to letting users open up a raw disk device and read/write the sectors of the drive. Not a good idea. Importantly, this pfs_middleware needs to be placed in every client-accessible Swift Proxy to prevent such raw access lest clients corrupt the file system.

Meanwhile, ProxyFS needs to perform its own GETs and PUTs to Swift. Necessarily, it needs this very raw access. Indeed, ProxyFS is a trusted element, so it also does not need the auth filter present in the client-visible Swift Proxy. As such, we refer to this second class of Swift Proxy as the NoAuth Swift Proxy. Given that the NoAuth Swift Proxy does not require authentication, it must not be visible to clients. So it should be configured to only listen on localhost. We only need a NoAuth Swift Proxy on each of the nodes that also host a ProxyFS instance.

So that's the difference between the two Swift Proxy types... but you also asked about the meta_middleware filter. Well, it turns out that ProxyFS wants to place a special Header on the Account of the ProxyFS volume. Importantly, we don't want normal users to be able to modify this Header, as it is what indicates that the Account is managed by ProxyFS...lest, again, users gain raw access to the Account and potentially corrupt the file system. To protect this particular Header, we need to work around the namespace protections provided in the normal Swift Proxy pipeline. To implement this workaround, we insert the meta_middleware in the NoAuth Swift Proxy pipeline.

So, to be clear & summarize:

You need both the normal Swift Proxy and this new NoAuth Swift Proxy in your deployment.

You insert pfs_middleware in the client-visible / normal Swift Proxy pipeline.

You insert meta_middleware in the localhost-only NoAuth Swift Proxy pipeline.

If you look in the saio/ directory, you will see examples of all this. In particular, check out the normal Swift Proxy config and the NoAuth Swift Proxy config. In the "normal" Swift Proxy config, notice how pfs_middleware (i.e. the pfs filter) is inserted right after dlo. Similarly, in the NoAuth Swift Proxy config we insert meta_middleware (i.e. the meta filter) right after dlo.
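As a sketch of what those pipelines look like (the file paths and the elided filters are illustrative...the configs under saio/ are the reference):

grep '^pipeline' /etc/swift/proxy-server.conf
# pipeline = ... dlo pfs ... proxy-server

grep '^pipeline' /etc/swift/proxy-server-noauth.conf
# pipeline = ... dlo meta ... proxy-server

The first is the client-visible Swift Proxy with the pfs filter right after dlo; the second is the NoAuth Swift Proxy (listening on localhost only) with the meta filter right after dlo.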

Hoping this clears up these great questions.

edmc-ss commented 3 years ago

I also invited you (amiritayebeh@yahoo.com) to the ProxyFS Slack Channel. Not quite sure why an invite was necessary, but hopefully it makes the process simple now...

tabeamiri commented 3 years ago

I really appreciate it; that solved my problem and I joined your channel. :)