Open mcassaniti opened 7 years ago
It's not trivial, but it is doable in theory. Devil -- as it always is with distributed locking -- is in the details. I've seen a few too many projects design themselves into a corner and ultimately end up abandoned. (Paxos is hard to get right. Really, really, hard...)
Rather than try to reinvent the wheel here, I would look at leveraging another system that already does some of this. Zookeeper comes up frequently. It's a neat system, but I don't know of anyone who enjoys maintaining it. Redis comes close (I'd love to be able to use a hosted system here), but doesn't guarantee consistency -- which is usually a non-starter for filesystems.
If we're going to that level, then my thoughts on options are:
Thanks for the feedback so quickly. Naturally the devil is in the details.
Hadn't thought about etcd, but it does sound like an interesting approach. I discovered Raft last night as I started thinking about "How do folks do Paxos these days?", and I started reading the paper behind it (but then had to get back to Real Work™). John Ousterhout... ah, the good ol' Tcl days.
One constraint: depending on your design, your underlying locking mechanism may need to support fancier queries than simple key-value lookups. For example, client A has locked bytes 1000-2000 of file F1; now client B wants to lock bytes 500-1500 of file F1, but needs some way to discover the lock from A. To-end-of-file locks present an interesting edge case when the file is appended.
Deadlocks vs. timeouts are a potential issue. AFAIK, no filesystems try to avoid a deadlock (A locks F1; B locks F2; A then tries to lock F2; B then tries to lock F1). Today, the processes just hang. However, if we time these locks out, they might continue on -- with data corruption becoming a possibility here. I don't think I'd try to solve it -- just mark it as a "User Beware" issue if you set the lock timeout to a finite value, or keep the locks fresh though constant pinging ("Yes, client is still here, don't expire the lock").
Note that I'm not signing up to do this -- at least not right now. (Waaay too much on my Real Work plate.) But I'm more than happy to help guide folks who might be interested in playing with this.
I'm curious if YaS3fs could be extended to support file locking. FUSE does support "POSIX advisory record locking", and I was wondering if it's feasible to extend the code to support this. My thoughts are:
The first 2 tasks don't sound overly complex. The 3rd task will require a bit more effort, but once the details are nailed down with a few proper test cases then things should be fine.
The last task is the problem task. I would suggest that locks within the cluster are time based with an expiry, and a node holding locks needs to refresh locks before expiry.
For example, lets say that a single node receives a message noting that the first 100 bytes of file 'lockme' has a read lock held. The receiving node will add that lock to the cache and add a local timestamp for the received lock OR will just update the timestamp if the lock details already exist. A separate thread will run on the sending side to continue to send those lock details out at half the expiration time (lets say expiration is 5 seconds), and the lock checking code on the receiving side will automatically remove the lock from the cache if the timestamp is older than 5 seconds (lazy removal).
My question is, does this sound feasible? I'm not so much interested on whether a developer wishes to implement this change, but whether it would be possible and what would be required.