danilop / yas3fs

YAS3FS (Yet Another S3-backed File System) is a Filesystem in Userspace (FUSE) interface to Amazon S3. It was inspired by s3fs but rewritten from scratch to implement a distributed cache synchronized by Amazon SNS notifications. A web console is provided to easily monitor the nodes of a cluster.
http://danilop.github.io/yas3fs
MIT License
644 stars 98 forks source link

Support for POSIX file locking #155

Open mcassaniti opened 7 years ago

mcassaniti commented 7 years ago

I'm curious if YaS3fs could be extended to support file locking. FUSE does support "POSIX advisory record locking", and I was wondering if it's feasible to extend the code to support this. My thoughts are:

  1. Each node participating in a cluster would need to store the locked state of a file. This would simply be new values in the current cache code, and would need to store enough information to note the byte range of a lock, along with the type. I don't know if FUSE will supply the PID or not, but I assume that it will.
  2. The messaging code between nodes would need to be extended to include new cache update messages for locking. This will need to include a lock identifier that is based on node ID alone, since other nodes in the cluster don't care what PID on a remote node is holding the lock.
  3. File operations would need to check for lock conflicts between PIDs and prevent particular file operations from occurring as required.
  4. The case of a disappearing node in a cluster would need to be handled. If a node was holding locks on behalf of a process and the node suddenly died, the locks would remain forever across the cluster.

The first 2 tasks don't sound overly complex. The 3rd task will require a bit more effort, but once the details are nailed down with a few proper test cases then things should be fine.

The last task is the problem task. I would suggest that locks within the cluster are time based with an expiry, and a node holding locks needs to refresh locks before expiry.

For example, lets say that a single node receives a message noting that the first 100 bytes of file 'lockme' has a read lock held. The receiving node will add that lock to the cache and add a local timestamp for the received lock OR will just update the timestamp if the lock details already exist. A separate thread will run on the sending side to continue to send those lock details out at half the expiration time (lets say expiration is 5 seconds), and the lock checking code on the receiving side will automatically remove the lock from the cache if the timestamp is older than 5 seconds (lazy removal).

My question is, does this sound feasible? I'm not so much interested on whether a developer wishes to implement this change, but whether it would be possible and what would be required.

dacut commented 7 years ago

It's not trivial, but it is doable in theory. Devil -- as it always is with distributed locking -- is in the details. I've seen a few too many projects design themselves into a corner and ultimately end up abandoned. (Paxos is hard to get right. Really, really, hard...)

Rather than try to reinvent the wheel here, I would look at leveraging another system that already does some of this. Zookeeper comes up frequently. It's a neat system, but I don't know of anyone who enjoys maintaining it. Redis comes close (I'd love to be able to use a hosted system here), but doesn't guarantee consistency -- which is usually a non-starter for filesystems.

mcassaniti commented 7 years ago

If we're going to that level, then my thoughts on options are:

  1. Use an existing Raft implementation, there's a few Python libraries available after doing a quick GitHub search. Since the library is independent of the transport, it should be possible to use SQS or the HTTP listener for a transport layer.
  2. Use etcd from CoreOS as an established Raft server, and add options to YaS3fs for using etcd for locks. Since etcd supports TTLs on data, it would work for timing out locks. Additionally, other nodes could subscribe to events from etcd, allowing a blocking wait on a lock to be notified on as soon as the lock expires or is removed.

Thanks for the feedback so quickly. Naturally the devil is in the details.

dacut commented 7 years ago

Hadn't thought about etcd, but it does sound like an interesting approach. I discovered Raft last night as I started thinking about "How do folks do Paxos these days?", and I started reading the paper behind it (but then had to get back to Real Work™). John Ousterhout... ah, the good ol' Tcl days.

One constraint: depending on your design, your underlying locking mechanism may need to support fancier queries than simple key-value lookups. For example, client A has locked bytes 1000-2000 of file F1; now client B wants to lock bytes 500-1500 of file F1, but needs some way to discover the lock from A. To-end-of-file locks present an interesting edge case when the file is appended.

Deadlocks vs. timeouts are a potential issue. AFAIK, no filesystems try to avoid a deadlock (A locks F1; B locks F2; A then tries to lock F2; B then tries to lock F1). Today, the processes just hang. However, if we time these locks out, they might continue on -- with data corruption becoming a possibility here. I don't think I'd try to solve it -- just mark it as a "User Beware" issue if you set the lock timeout to a finite value, or keep the locks fresh though constant pinging ("Yes, client is still here, don't expire the lock").

Note that I'm not signing up to do this -- at least not right now. (Waaay too much on my Real Work plate.) But I'm more than happy to help guide folks who might be interested in playing with this.