droyo / styx

Go library for the 9P filesystem protocol
MIT License

Recursive mount-like behaviour for distributed 9p servers #25

Closed: dmorn closed this issue 4 years ago

dmorn commented 4 years ago

Background

I'm writing a piece of software that, given a task description, spawns a server running that task somewhere (usually in a containerised environment; we need to scale up and down easily, and containers make that easier to handle).

I'm modelling the thing like this: each worker that is executing a task remotely is called a "process", which is a 9p server exposing 4 files: ctl, retv, status (which might become log), and err. This way the process can just be mounted and inspected. There is a server providing this service; I call it "flexi". It is a 9p server too: it exposes a ctl file that clients can write task descriptions to (in my case, json-encoded payloads).

The idea is that flexi reads the task description, starts the remote process, and returns a number to the client identifying the folder that contains the process's mounted fs (pretty much how TCP connections are handled under plan9).

This is an example of a flexi fs with one running process:

flexi
├── 0
│   ├── ctl
│   ├── err
│   ├── retv
│   └── status
└── ctl

Main

It looks like I cannot recursively mount 9p servers using the osxfuse driver. I'm thinking about making flexi forward 9p messages to the process's server when necessary, in a man-in-the-middle proxy style, to work around the problem. What do you think @droyo? Do you have suggestions/alternative approaches? Thank you for the library btw 😊👍

marzhall commented 4 years ago

Just a thought from the peanut gallery: it somewhat breaks the abstraction, but could you mount the other 9p servers of running jobs in a different directory? Maybe pass flexi some other directory on the machine when it starts up, or have flexi make a temp directory and inform the user of it, then have flexi mount new jobs in that directory so that both flexi and you can inspect/control the clients from that folder. It loses the intuitive 'nesting' you'd get from presenting the remote folders in flexi's FS and proxying the communication, but it may be simpler. Otherwise, the proxying seems a reasonable approach, at least to me. In fact, it might even make a nice library.

dmorn commented 4 years ago

This is indeed the first approach I tested! I'm a little concerned about leaking umounts, but I do want to keep it as an open option. I'm more into the second one; let's see if we can make a library out of it! 😊 I'm not sure where the proxying should take place, though. Would it be a sane approach to use this "proxy" on a per-request basis? Maybe it would be nice to hijack the request and let the proxy just relay messages in a decode-encode (and vice versa) fashion, trimming and restoring paths!

droyo commented 4 years ago

I think it will be tricky to do with this package as it is today. I would use the styxproto package directly, since all of the "help" the styx package tries to provide in tracking fids and sessions could just get in the way.

You'll need to intercept all Tattach and Tauth messages and respond to them directly. Whether or not you want a 1:1 relationship between user -> proxy sessions and proxy -> backend sessions is up to you.

You have to intercept all Twalk messages and strip a prefix from them, and you'll need to intercept all Rwalk messages from the backend and prepend a prefix to them. Since Twalk is the only way to create new Fids, you can also intercept the Twalks to populate a mapping from Fid -> backend session. You'll need to intercept Tclunk and Tremove requests to remove items from the map. You can then create an interface to intercept anything with a Fid and redirect it:

// Fcall matches any message that carries a fid.
type Fcall interface {
    Fid() uint32
}

// fidToEncoder maps each fid to the backend connection
// it was established on.
var fidToEncoder = make(map[uint32]io.Writer)

func relay(d *styxproto.Decoder) {
    for d.Next() {
        switch m := d.Msg().(type) {
        case styxproto.Twalk:
            // Handled first so a Twalk is never blindly relayed:
            // store newfid in a pending map, commit after you
            // see an Rwalk response from the backend.
        case Fcall:
            fid := m.Fid()
            if w, ok := fidToEncoder[fid]; ok {
                styxproto.Write(w, m)
            }
        }
    }
}

Another tricky part is that you'll need to keep track of the request tags so that you can route Tflush commands to the appropriate server and so you can match Rwalk and Rerror messages to the appropriate Twalk to know if the new fid is valid.
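
For instance (a hypothetical sketch; the names and the shape of the map are illustrative, not part of this package), the proxy could record every outstanding tag together with the backend it was routed to and, for Twalk, the fid waiting to be committed:

// pending records, per in-flight tag, where the T-message was
// routed and, for Twalk, which newfid is waiting on the outcome.
type pending struct {
    backend io.Writer // connection the request was forwarded to
    newfid  uint32    // set for Twalk; commit on Rwalk, drop on Rerror
}

// inflight is keyed by the 16-bit 9P tag of the client's T-message.
// A Tflush naming oldtag is then routed to inflight[oldtag].backend.
var inflight = make(map[uint16]pending)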

Outside of those corner cases you should be able to act as a dumb pipe, calling styxproto.Decoder.Next on the client-facing connection, then styxproto.Write on the appropriate backend connection to relay messages, and doing the reverse to relay responses.
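
Concretely, the dumb-pipe part could look something like the sketch below, which ignores the fid and tag bookkeeping above and assumes a single backend connection:

// pipe relays raw 9P messages in both directions between a client
// and a single backend, with no inspection or rewriting.
func pipe(client, backend io.ReadWriter) {
    go func() {
        d := styxproto.NewDecoder(backend)
        for d.Next() {
            styxproto.Write(client, d.Msg())
        }
    }()
    d := styxproto.NewDecoder(client)
    for d.Next() {
        styxproto.Write(backend, d.Msg())
    }
}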

It sounds difficult enough and generic enough that it might make a good addition to the library.

dmorn commented 4 years ago

Thank you @droyo and @marzhall for the guidance. In the next few days I'll try to make an implementation within the flexi project that we can easily extract and integrate into the library, if we want to. I'll keep you up to date!

okvik commented 4 years ago

I would steer away from layer-breaking proxying unless or until you find that it absolutely is needed for performance, or something — which I doubt you will.

An alternative is to let flexi interface with the worker file servers exactly as a regular client would, that is, mount them in its namespace and map the requests it gets from clients to matching file operations on the worker trees, then translate (copy) the results into replies to the clients, which will be none the wiser.

Apart from mapping the walks and handling directory reads to export the tree as you want it, this is trivial to implement, and it'll simply keep working without any change if you happen to change the worker file API; flexi doesn't even need to know anything about it.

Examples of this approach are Plan 9 exportfs(4) and—shamelessly—unionfs(4).

dmorn commented 4 years ago

Hi @okvik, thank you for joining the conversation. I do agree that this issue should be addressed by mounting the remote fs within flexi's fs. mount under plan9 actually converts local 9p messages into RPCs, if I'm not wrong, so that would be the proxy we're talking about. Being practical though, we cannot recursively 9 mount under macos using the osxfuse driver, as far as I know! So, how do we want to proceed?

okvik commented 4 years ago

I'm not sure I understand why 'recursive mount' is needed here, or what exactly you mean by it. I'm not familiar with the FUSE 9P driver limitations but I can't imagine it being too much of a roadblock for the approach I'm suggesting. Perhaps I misinterpreted the problem to start with so let me summarize my assumptions here for the mutual benefit.

You have some set of machines that you wanna run stuff on. These machines are running a 9P server which provides an interface through which workload can be assigned to the process (as you call it) and through which progress and results may be returned. For the sake of discussion let's call this file server 'worker'.

Its exported file tree looks something like this:

worker/
    ctl
    err
    retv
    status

You've got loads of these that you want to (spawn?), collect, and control, all under a higher-level interface, another 9P server — which, if I'm not mistaken, would be flexi, a sort of orchestrator.

With two workers under its wing, flexi's file tree may look something like:

flexi/
    ctl
    0/
        ctl err ...
    1/
        ctl err ...

You could do as you intended: try filtering and rewriting, then routing the bare 9P flow destined for subtrees towards the target worker fs, but I don't think this would be easy to get right at all, and I highly doubt there's any benefit to doing it that way.

Instead, let flexi mount the worker trees just as you usually would; let it be a normal client to them:

mkdir -p workers/{0,1,2}
9 mount 1.2.3.4:123 workers/0
9 mount 1.2.3.4:312 workers/1
9 mount 1.2.3.4:231 workers/2
...

Now when flexi's clients try to Twalk to one of its directories, like 878/, it needs to map that to its local workers/878/ mountpoint and just check whether anything is mounted there; same for a walk to a file like 878/ctl, where you'd check for the file's existence. Then the client may ask to Topen this file, which you'd go try and do with open(2) on workers/878/ctl, saving an open descriptor for when the client next asks you to Twrite to it, in which case you'd perform a matching write(2) call on the descriptor, and so on. You'd do very much the same for Tread, Tclunk (close), Tcreate, Tremove, Tstat, and Twstat, though you probably don't need anything past Tclunk in this case.
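
As a rough illustration of that mapping using this library's styx package (a sketch only: it assumes the Session request loop and the Rwalk/Ropen/Rstat response methods, skips error handling, and the workers root is hard-coded):

// proxyFS answers each client request with the matching file
// operation on the locally mounted worker trees under root.
type proxyFS struct {
    root string // e.g. "workers"
}

func (fs proxyFS) Serve9P(s *styx.Session) {
    for s.Next() {
        switch t := s.Request().(type) {
        case styx.Twalk:
            // A walk to 878/ctl becomes a stat of workers/878/ctl.
            t.Rwalk(os.Stat(path.Join(fs.root, t.Path())))
        case styx.Topen:
            // The *os.File returned here serves the client's later
            // reads and writes; it is closed when the fid is clunked.
            t.Ropen(os.OpenFile(path.Join(fs.root, t.Path()), t.Flag, 0666))
        case styx.Tstat:
            t.Rstat(os.Stat(path.Join(fs.root, t.Path())))
        }
    }
}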

And that's really all there is to it. This is exactly how exportfs(4) in Plan 9 works, and unionfs(4) is very much like it except it tries to walk into multiple trees looking for the same file name, which effectively results in presenting their union.

I hope the idea is a bit clearer now.

dmorn commented 4 years ago

I think this solution is somewhat similar to what @marzhall was suggesting with his comment, right? There is just one thing I don't like about this approach: we mount a separate workers directory outside flexi's own tree. This is where "recursive mount" comes from: I was wondering whether we could do as you suggest, but instead of mounting the remote processes inside a separate directory (workers in this case), mount them straight inside flexi's mountpoint. Do you see my point?

dmorn commented 4 years ago

I mean, under plan9 even the fs root (/) is obtained from a 9p server (right?): that would mean that when I mount something in my namespace, I'm recursively mounting a 9p namespace inside another one. I wanted to make flexi follow this behaviour!

okvik commented 4 years ago

Indeed his comment describes the same thing.

that would mean when I mount something in my namespace, I'm recursively mounting a 9p namespace inside another one.

No, 9p servers are independent entities tasked with providing their and only their file tree(s) which can be incorporated, or mounted into "the namespace".

The namespace is a mechanism provided by an operating system. It is a system of names and operations on names.

In UNIX the namespace is a familiar global mount namespace with mount and umount operators. In Plan 9 this is extended to private process-group namespaces with additional semantics such as mountpoint unions.

The namespace is a concept external to 9p, and in particular external to serving 9p.

There is no way to directly expose the namespace over 9p, not in Plan 9 and imaginably less so elsewhere. You need to use exportfs(4) (or similar) to export a 9p interface to a part of your local namespace.

So, there is no such thing as 9p servers recursively mounting other 9p servers into their own exported namespace, not in the direct sense that you are thinking of.

Now, nothing prevents you from emulating Plan 9 namespace semantics and operations in the form of a 9p server! You could well implement a namespacefs which provides a '/' on top of which you can layer 9p client connections, mounts, and unions. But note well: this will be just another external mechanism of which your 9p file servers are totally ignorant. :)

dmorn commented 4 years ago

No, 9p servers are independent entities tasked with providing their and only their file tree(s) which can be incorporated, or mounted into "the namespace".

Thank you @okvik for the clarification 😊 I'm still new to plan9.

dmorn commented 4 years ago

Thank you all for your precious help! I'll implement flexi using the "use another mountpoint for the workers" approach then. If you are interested in the project, you'll find it at https://github.com/jecoz/flexi, just in case 😊

dmorn commented 4 years ago

Hey there, I've got some updates and I think it is relevant to continue the discussion here; feel free to stop me if you don't think so. We (our company) need to drop the fuse driver dependency, as docker containers require it from their host environment, which makes it difficult for us to distribute the workers on some cloud providers. Do we want to discuss the alternative solutions we proposed above a little?

From how I see it, the cleanest approach would be to create a styx client implementation (within this library) that can be used programmatically to forward fs function calls to an io.ReadWriteCloser (which might be a TCP connection or a posted service file in the local namespace, i.e. a unix socket under plan9port). What do you think @marzhall @droyo @okvik ?
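
To make the proposal concrete, the client-facing API could look something like this (purely illustrative; none of these names exist in the library today):

// FS is what flexi would program against: a 9p client speaking
// over any io.ReadWriteCloser (TCP connection, unix socket posted
// under plan9port, ...) instead of through a kernel mount.
type FS interface {
    Open(name string) (io.ReadWriteCloser, error)
    Create(name string, perm os.FileMode) (io.ReadWriteCloser, error)
    ReadDir(name string) ([]os.FileInfo, error)
    Stat(name string) (os.FileInfo, error)
    Remove(name string) error
    Close() error
}

// Mount would attach to the server reachable over rwc, e.g.
//     fs, err := Mount(rwc, "user", "")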

Please also have a look at how the flexi.FS is used in conjunction with the FSHandler, a styx.Handler implementation. I think we might really take advantage of this abstraction within the library itself somehow.