Open mpilman opened 5 years ago
This is an area where we're actually doing some thinking as well. Although for us, the main problem has been when this happens on the transaction logs. A stalled transaction log means that you can't commit to the database, which is obviously pretty bad. I wouldn't expect a stalled storage server to result in the cluster going down on its own, but it could lead to indefinite spilling and a higher susceptibility to problems in the event of other failures. Do you have any details about how this process brought the cluster down?
At any rate, one of our current ideas was to have a process detect when it has a condition like what you described (i.e. can't do IO for some period of time) and register in a degraded state with the cluster controller. In that state, the cluster controller would prefer other processes for recruiting roles on and would only resort to degraded processes if needed. This is perhaps a little more straightforward with non-storage roles where we aren't recruiting them everywhere possible.
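A rough sketch of what that recruitment preference could look like, purely for illustration (the struct and function names here are hypothetical and not FDB's actual cluster controller code):

```cpp
#include <algorithm>
#include <optional>
#include <string>
#include <vector>

// Hypothetical worker record: 'degraded' would be set when the worker
// registered itself with the cluster controller after detecting stuck I/O.
struct WorkerInfo {
    std::string address;
    bool degraded;
};

// Pick a worker for a (non-storage) role: prefer any healthy worker,
// and only fall back to a degraded one if nothing else is available.
std::optional<WorkerInfo> pickWorker(const std::vector<WorkerInfo>& workers) {
    auto healthy = std::find_if(workers.begin(), workers.end(),
                                [](const WorkerInfo& w) { return !w.degraded; });
    if (healthy != workers.end())
        return *healthy;
    if (!workers.empty())
        return workers.front();  // last resort: recruit on a degraded worker
    return std::nullopt;         // no workers available at all
}
```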
The main thing we want to avoid is a bunch of processes independently deciding to take themselves offline and then losing availability. Perhaps in the case of storage servers, we could have a budget for how many such processes we'll let drop out, after which the cluster controller won't allow any more to leave. @etschannen probably would have some thoughts as well.
First I need to clarify: the broken disk didn't bring the cluster down, it just resulted in the TLogs growing. Without human intervention it would've brought the cluster down (remember we run on FDB 3, which doesn't have TLog spilling - but even with spilling, the disks on the TLogs would've eventually filled up).
Honestly, if a process (storage or TLog) detects a disk issue, I would rather have it go down. Obviously we prefer not to have an outage, but an outage is very much preferred over data loss or data corruption. The way we currently operate is: if a disk has an issue we immediately remove the process and only add it back once we have either replaced the disk or verified that the disk is fine. This means we will most probably have an outage if more than 2 disks fail simultaneously.
But just in general: a system that uses potentially broken disks sounds very very scary.
but an outage is very much preferred over data loss or data corruption.
May I ask when data corruption would happen in this situation? One scenario I can think of is: the disk replies SUCCESS for the write operation, while the operation actually failed or corrupted the data. Can this really happen? (If it can, it sounds scary.)
My understanding of this problem is that the disk never returns from the operation and blocks the following operations. This should not increase the risk of data corruption or data loss, except that the data cannot be persisted on the problematic disk.
I should have also mentioned that part of the reason we don't usually have this sort of problem with storage servers is that we use a parameter called --io_trust_seconds that has the effect of timing out reads and writes that take longer than 20s, resulting in the process restarting (or becoming a zombie, I don't remember which). It doesn't exactly avoid the problem entirely, because many IO operations aren't specifically calling read or write, but it usually results in the process dying before too long. It was actually designed for a different purpose, which is why it's not a perfect fit for this problem.
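For reference, --io_trust_seconds is a command-line parameter of fdbserver. If you run processes under fdbmonitor, my understanding is that extra keys in the [fdbserver] section of foundationdb.conf get passed through to fdbserver as command-line options, so setting it to e.g. 20 seconds would look roughly like this (please verify against the documentation for your version):

```
[fdbserver]
io_trust_seconds = 20
```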
I think we'd be pretty hesitant to have processes make isolated decisions to terminate for the reasons I described above (especially if they aren't restarting). For the most part, disk issues have manifested for us in ways that don't pose a threat to the integrity of the data, and if they do, it's not really ideal if we're relying on the disk stalling to protect us from it (i.e. we need to be protecting ourselves from corruption in general). Of course, proactively removing disks that are demonstrating problems does reduce the risk of corruption that manifests in ways we aren't protected against.
I don't think there's any reason why the above features I described couldn't be configurable, though. The storage server budget idea I threw out could be made to have an infinite budget. With respect to processes registering themselves in a degraded state, it seems feasible also to make it configurable whether a degraded process can actually be used. There is perhaps an extra wrinkle in the idea that this process should also not be reused if it restarts, which is a capability I think we were willing to forgo, at least in early versions.
May I ask when data corruption would happen in this situation? One scenario I can think of is: the disk replies SUCCESS for the write operation, while the operation actually failed or corrupted the data. Can this really happen? (If it can, it sounds scary.)
We've seen a version of this where a particular type of disk would freeze, the kernel would reset some things, and then it would acknowledge some of the writes that were supposed to have taken place but didn't. This is a dangerous thing for us, because our checksums don't protect us from it. It's actually the reason we introduced --io_trust_seconds. The timeout we set is shorter than the time it takes for the kernel task abort to happen.
So for the failure we saw in production, a restart wouldn't have helped. To be more precise: we were unable to restart the process (even manually) - kill -9 didn't do anything.
Disk failures in general: I saw disk failures in my career where a disk would just deliver random byte-strings for each request. So my general assumption is: if a disk failed, anything can happen (and there are way too many manufacturers and storage providers out there to make any general assumptions about what can and what can't happen).
To some degree, processes will always make isolated decisions. You probably would agree that a storage server should remove itself if all IO operations fail. I would say that a process should also fail if it has strong reasons to believe that the disk is broken. I don't think FDB should ever value availability over consistency (or in this case data integrity).
I wildly agree: detecting disk failures can't be done by relying on timeouts. If we have enough checksumming in place, I am fine with restarts (as long as a process doesn't go into a restart loop). It also might make sense to generally have a --paranoid option for customers like us ;)
Another part is: we sometimes see disk failures where the operating system remounts the file system as read-only. With checksums everywhere in place, we could use those storage processes in read-only mode (so basically they would still help with data replication).
It's hard for me to say whether your process was wedged badly enough that io_trust_seconds wouldn't have been able to help, though I'm not sure what a different mechanism would do differently to work better.
I'm also definitely not suggesting that we value availability over durability or integrity. However, I think it's hard to make assumptions about whether a disk is really lost (for example, we've seen disks stall and then resume, and we've also seen disks exhibit behavior like this in a non-independent way). As I said earlier, I don't think this solution solves any durability or integrity problems, so if there are any, they should be addressed directly while hopefully avoiding adding availability risk.
And I generally agree. All I am saying is that as long as there is any data on disk that isn't checksummed, I would rather be pessimistic when it comes to weird disk behavior.
That's a fair position, and I think we generally take a similar approach.
For what it's worth, since you may or may not have much experience yet with the checksumming that was added since 3.0, the ssd storage engine now has checksums on each page. There is still a susceptibility to lost writes (described above) where a write is acknowledged but not successfully completed and the disk then proceeds as normal, because the page in that case is unchanged and still valid. I'm not sure whether redwood is better in that regard.
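To make that lost-write gap concrete, here is a toy sketch (hypothetical page layout and checksum, not the actual ssd engine format): a per-page checksum only proves that the page you read back is internally consistent, not that it is the page you expected to read back.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical on-disk page: the checksum covers the version and the data.
struct Page {
    uint64_t version;   // version of the data when the page was written
    uint32_t checksum;  // checksum over 'version' and 'data'
    char     data[4084];
};

// Toy checksum for illustration only; a real engine would use CRC32C or similar.
uint32_t computeChecksum(const Page& p) {
    uint32_t sum = 0;
    const unsigned char* v = reinterpret_cast<const unsigned char*>(&p.version);
    for (std::size_t i = 0; i < sizeof(p.version); ++i) sum = sum * 31 + v[i];
    for (char c : p.data) sum = sum * 31 + static_cast<unsigned char>(c);
    return sum;
}

// Catches torn writes and bit rot: the page no longer matches its checksum.
bool checksumOk(const Page& p) {
    return computeChecksum(p) == p.checksum;
}

// Does NOT catch a lost write: if the disk acknowledged a write that never
// happened, a later read simply returns the previous page, whose checksum is
// still valid. Detecting that requires knowing which version you expected.
bool lostWriteSuspected(const Page& p, uint64_t expectedVersion) {
    return checksumOk(p) && p.version != expectedVersion;
}
```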
I think the disk queue has checksums even in 3.0.
Tangentially, we've added checksums to network packets as well.
So we experienced this quite interesting failure scenario on a production cluster:
Due to a strange disk failure, an IO-operation did not return (got blocked forever). Usually disk failures result in errors, but not in this case. This local problem grew into a global one:
The updateStorage actor (I think) was waiting forever on a disk operation.

Now there is not a lot we can do against disk failures in general. However, a failing disk should never be able to bring down a cluster.
So I would propose the following change: if a filesystem operation takes longer than X seconds to complete (X would need to be determined - I think something like 300 might be a good default), we should throw an error and kill the corresponding role on that process (without restarting it). That way a disk failure would simply remove that process and the data distributor could do its thing.
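A minimal sketch of that idea in plain C++ (not Flow code, and the names and the 300s default are just placeholders from the proposal above): run the blocking operation on a separate thread and stop waiting after a deadline, at which point the process would give up the role instead of retrying.

```cpp
#include <chrono>
#include <exception>
#include <functional>
#include <future>
#include <memory>
#include <thread>

// Returns true if the disk operation finished in time, false if the deadline
// expired. On a timeout, the caller would mark the disk as failed and stop
// the affected role (without restarting it), letting data distribution
// re-replicate the data elsewhere.
bool runWithDeadline(std::function<void()> diskOp,
                     std::chrono::seconds deadline = std::chrono::seconds(300)) {
    auto done = std::make_shared<std::promise<void>>();
    std::future<void> fut = done->get_future();

    // Detached so a stuck syscall does not also wedge the waiter; the thread
    // may linger forever if the disk never responds.
    std::thread([op = std::move(diskOp), done]() {
        try {
            op();
            done->set_value();
        } catch (...) {
            done->set_exception(std::current_exception());
        }
    }).detach();

    if (fut.wait_for(deadline) == std::future_status::timeout)
        return false;

    fut.get();  // rethrows if the operation itself failed with an error
    return true;
}
```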
Any thoughts on that?