Open sfc-gh-abeamon opened 3 years ago
I believe this was already fixed with the `ioDegradedOrTimeoutError` function. It will throw an io_error if the tlog cannot commit for a long time.
I agree that does seem like it would help, and I think the situation we saw here may have been on a version prior to this change. That said, the timeout for that error is much longer at 2 minutes, and so if we can improve our reaction time to this process being degraded it would still be beneficial.
Also, it seems that we don't use `ioDegradedOrTimeoutError` in all of the old tlog implementations. Instead, it's just the current one and 6.0.
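To make the reaction-time concern concrete, here is a toy model of a two-threshold check: a pending commit is first treated as degraded, and only much later as a hard timeout. This is a sketch for discussion, not the FoundationDB implementation. The names `IoState`, `IoThresholds`, and `classifyPendingIo` are hypothetical, and the 5-second degraded threshold is an invented value; only the 2-minute error timeout comes from the comment above.

```cpp
#include <chrono>

// Hypothetical states for a commit that has not completed yet.
enum class IoState { Healthy, Degraded, TimedOut };

// Hypothetical thresholds. errorAfter mirrors the 2-minute timeout mentioned
// in the thread; degradedAfter is an assumed value chosen for illustration.
struct IoThresholds {
    std::chrono::seconds degradedAfter{5};
    std::chrono::seconds errorAfter{120};
};

// Classify a commit by how long it has been pending. The gap between the two
// thresholds is the window in which the process is known to be degraded but
// has not yet failed outright.
IoState classifyPendingIo(std::chrono::seconds pendingFor,
                          IoThresholds thresholds = IoThresholds{}) {
    if (pendingFor >= thresholds.errorAfter) {
        return IoState::TimedOut;
    }
    if (pendingFor >= thresholds.degradedAfter) {
        return IoState::Degraded;
    }
    return IoState::Healthy;
}
```

Under these assumed values, a commit that has been stuck for 30 seconds is already classified as degraded, well before the 2-minute error would fire, which is why acting on the degraded signal could shorten the reaction time.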
When a transaction log is unable to commit or do a few other things during its local recovery, it gets marked degraded. This status gets reported to the cluster controller, who would then attempt to recruit a new transaction subsystem without any degraded logs.
If a log gets reported degraded during recovery, though, and that degradation prevents the recovery from completing, then the cluster controller will not try to replace it. If I understand correctly, this is because `betterMasterExists` does not attempt to reevaluate the cluster layout if it is not sufficiently recovered: https://github.com/apple/foundationdb/blob/5a5f724d9c7f1c1fac47a610264effc4b44d300e/fdbserver/ClusterController.actor.cpp#L2223
This behavior was observed in 6.2, and while the line above still exists, I'm not sure whether this is affected by some of the newer changes to the degradation logic.
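The interaction described above can be modeled roughly as follows. This is a simplified sketch based on my reading of the behavior, not the actual code in ClusterController.actor.cpp. All of the types and names here (`RecoveryState`, `LogProcess`, `ClusterView`, `shouldRerecruit`) are stand-ins, and using `AcceptingCommits` as the "sufficiently recovered" threshold is an assumption.

```cpp
#include <vector>

// Simplified stand-in for recovery progress, ordered from least to most
// recovered. The real system tracks more states than this.
enum class RecoveryState { Recruiting, Recovering, AcceptingCommits, FullyRecovered };

// Hypothetical view of one transaction log as seen by the cluster controller.
struct LogProcess {
    bool degraded;
};

// Hypothetical snapshot of what the cluster controller knows.
struct ClusterView {
    RecoveryState recoveryState;
    std::vector<LogProcess> logs;
};

bool anyLogDegraded(const ClusterView& cluster) {
    for (const LogProcess& log : cluster.logs) {
        if (log.degraded) {
            return true;
        }
    }
    return false;
}

// Models the early exit: if recovery has not progressed far enough, the
// layout is never reevaluated, so a degraded log is not replaced.
bool shouldRerecruit(const ClusterView& cluster) {
    if (cluster.recoveryState < RecoveryState::AcceptingCommits) {
        return false; // assumed threshold; degraded logs are ignored here
    }
    return anyLogDegraded(cluster);
}
```

In this model, a cluster stuck in `Recovering` with a degraded log never triggers a re-recruitment, and because the degraded log is what prevents recovery from advancing, the condition cannot clear on its own.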