container-storage-interface / spec

Container Storage Interface (CSI) Specification.
Apache License 2.0

Fix lifecycle and ControllerUnpublishVolume description #533

Closed YuikoTakada closed 8 months ago

YuikoTakada commented 1 year ago

What type of PR is this? This PR adds a new transition that moves directly from PUBLISHED / VOL_READY to CREATED via the ControllerUnpublishVolume RPC when the control plane can't reach the node (for example, the node shut down due to a hardware failure or a software problem).

What this PR does / why we need it: The current CSI volume lifecycle is not designed for the case where the Node is unreachable.

When the Node is shut down or in a non-recoverable state such as a hardware failure or a broken OS, the Node Plugin cannot issue NodeUnpublishVolume / NodeUnstageVolume. In this case, we want to move the status to CREATED (the volume is detached from the node, and the pods are evicted to another node and running). But in the current CSI volume lifecycle, there is no transition from PUBLISHED / VOL_READY / NODE_READY to CREATED. As a result, k8s doesn't follow the CSI spec: the status moves from PUBLISHED to CREATED directly without going through the VOL_READY and/or NODE_READY status.

We need to update the CSI volume lifecycle to consider the case where the Node is unreachable.
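For illustration only (not normative spec text), the flow described above looks roughly like this from the CO side, using the generated Go bindings from this repo; the `releaseVolume` function and its `nodeReachable` / `nodeCleanup` parameters are placeholders for this sketch, not anything defined in the spec:

```go
import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// releaseVolume sketches the CO-side behavior discussed above. When the node
// is reachable, the normal node-side cleanup (NodeUnpublishVolume /
// NodeUnstageVolume, represented here by nodeCleanup) runs first; when the
// node is shut down or unreachable, the CO goes straight to
// ControllerUnpublishVolume, i.e. the PUBLISHED / VOL_READY -> CREATED
// transition this PR adds.
func releaseVolume(ctx context.Context, ctrl csi.ControllerClient,
	nodeReachable bool, nodeCleanup func(context.Context) error,
	volumeID, nodeID string) error {

	if nodeReachable {
		if err := nodeCleanup(ctx); err != nil {
			return err
		}
	}
	_, err := ctrl.ControllerUnpublishVolume(ctx, &csi.ControllerUnpublishVolumeRequest{
		VolumeId: volumeID,
		NodeId:   nodeID,
	})
	return err
}
```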

Which issue(s) this PR fixes:

Fixes #512

Special notes for your reviewer:

Does this PR introduce an API-breaking change?:

none
YuikoTakada commented 1 year ago

@jdef This PR fixes #512 . Please take a look.

jdef commented 1 year ago

(a) is there value having the CO notify the SP that it's bypassing proper cleanup calls when invoking ControllerUnpublishVolume? (i.e. via some flag field in the request message) this could possibly be a valuable signal to the SP that additional validation/GC/state-repair needs to happen on the backend. perhaps SP authors can chime in here.

(b) this is a breaking change w/ respect to SP expectations re: volume lifecycle. it may make sense to guard this w/ a specific capability in the controller service. and only then should a CO have expectations that such state transitions are reliable. otherwise SPs may receive calls in an unexpected/unsupported order. k8s isn't the only CO on the block, and just because it breaks the rules currently doesn't mean that all SPs need to support that pattern.
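For concreteness, a hypothetical Go sketch of suggestion (b): the CO probes the controller plugin via the existing ControllerGetCapabilities RPC and only relies on the shortcut transition when the capability is advertised. The capability name below does not exist in the spec and is purely illustrative:

```go
import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// supportsBypassUnpublish checks whether the controller plugin advertises a
// (hypothetical) capability allowing ControllerUnpublishVolume to be called
// without prior node-side cleanup. A real spec change would add a new
// ControllerServiceCapability_RPC_Type enum value; comparing the String()
// form here merely stands in for that.
func supportsBypassUnpublish(ctx context.Context, ctrl csi.ControllerClient) (bool, error) {
	resp, err := ctrl.ControllerGetCapabilities(ctx, &csi.ControllerGetCapabilitiesRequest{})
	if err != nil {
		return false, err
	}
	for _, c := range resp.GetCapabilities() {
		if rpc := c.GetRpc(); rpc != nil && rpc.GetType().String() == "UNPUBLISH_WITHOUT_NODE_CLEANUP" {
			return true, nil
		}
	}
	return false, nil
}
```

Suggestion (a) would pair with this as a new boolean field on ControllerUnpublishVolumeRequest that the CO sets when it was unable to run the node-side cleanup calls.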

YuikoTakada commented 1 year ago

There are already COs whose implementations violate the CSI spec. They violate it reluctantly because there is no transition for the case where a node is shut down or in a non-recoverable state. Making the spec match the actual implementations makes sense.

jdef commented 1 year ago

reluctant COs or otherwise, this change seems to break the contract defined in the spec. it's unclear whether my proposals are under consideration. do you have any feedback re: the suggestions I made?

yosshy commented 1 year ago

This PR looks like it has room to describe more detail, but I believe its direction is right because:

  1. The CSI spec lacks consideration for forced volume detachment on a failed node.
  2. K8s already has an undocumented (but reasonable) ControllerUnpublishVolume call to detach a volume attached to a failed node.
  3. CSI plugins including NetApp Trident allow 2.

See logs for details.

If someone appends ANOTHER call like #477 to the spec for this case, we will have to modify K8s and the CSI plugins that support 2. We don't want to change them because they are working very well now. So it's better to modify the spec to follow the current as-is workflow, which is what this PR does.

tgross commented 1 year ago

> (a) is there value having the CO notify the SP that it's bypassing proper cleanup calls when invoking ControllerUnpublishVolume? (i.e. via some flag field in the request message) this could possibly be a valuable signal to the SP that additional validation/GC/state-repair needs to happen on the backend. perhaps SP authors can chime in here.

With Nomad, our experience is that the primary reason we need to bypass the proper cleanup calls is that we can't communicate with the node plugin (e.g. its host isn't running). Right now we don't track the failure state specifically and just log the event and bump the internal state machine along, but it's totally feasible for us to implement. For SP plugins like AWS EBS, the controller could use that information to force-detach instead of detach.
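As a minimal sketch of that last point (not taken from any real driver), an EBS-backed SP receiving the hypothetical "cleanup was bypassed" signal could turn it into a force-detach, assuming the aws-sdk-go EC2 client:

```go
import (
	"context"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// detachForNode sketches how an EBS-like controller plugin might honor a
// "node cleanup was bypassed" signal from the CO: the same DetachVolume call,
// but with Force set, since the instance can no longer unmount cleanly.
// cleanupBypassed stands in for the hypothetical flag discussed in this thread.
func detachForNode(ctx context.Context, svc *ec2.EC2, volumeID, instanceID string, cleanupBypassed bool) error {
	_, err := svc.DetachVolumeWithContext(ctx, &ec2.DetachVolumeInput{
		VolumeId:   aws.String(volumeID),
		InstanceId: aws.String(instanceID),
		// Force-detach only when node-side cleanup could not run; a normal
		// detach remains the default, safer path.
		Force: aws.Bool(cleanupBypassed),
	})
	return err
}
```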

> (b) this is a breaking change w/ respect to SP expectations re: volume lifecycle. it may make sense to guard this w/ a specific capability in the controller service. and only then should a CO have expectations that such state transitions are reliable. otherwise SPs may receive calls in an unexpected/unsupported order. k8s isn't the only CO on the block, and just because it breaks the rules currently doesn't mean that all SPs need to support that pattern.

As a CO implementer I'd suggest that if we were to have such a capability, the spec should provide specific guidance on what the CO and SP should expect as the correct behavior if that capability is missing. Because by a strict reading of the current spec, if the node plugin is unavailable, the state machine cannot continue and the volume is unrecoverably lost from the perspective of the CO. Needless to say, this is an awful user experience. So if the spec is going to be particular about it, I'd love to be able to point to the plugin author who isn't providing that capability so that we can suggest to our users that they use a better plugin. :grin:

> K8s already has an undocumented (but reasonable) ControllerUnpublishVolume call to detach a volume attached to a failed node.

fwiw, Nomad implements something similar as well.

YuikoTakada commented 1 year ago

@jdef

> (a) is there value having the CO notify the SP that it's bypassing proper cleanup calls when invoking ControllerUnpublishVolume? (i.e. via some flag field in the request message) this could possibly be a valuable signal to the SP that additional validation/GC/state-repair needs to happen on the backend. perhaps SP authors can chime in here.

Thank you for the good advice. I also think your idea of having the "CO notify the SP that it's bypassing proper cleanup calls when invoking ControllerUnpublishVolume", "i.e. via some flag field in the request message", is good. I've updated just the figure for now, so PTAL. If you think this change is good, I'll update the ControllerUnpublishVolume description as well.

bswartz commented 1 year ago

It's true that today k8s already violates the spec, but only in situations where it's known to be safe. The workarounds that exist to ensure safety when doing this still rely on external knowledge, though, which means they either need a human in the loop (not suitable for lights-out operation) or they rely on some external component with special access to machine state (not available in all deployments).

I'm going to put some more effort into bringing back my proposed solution in #477. The remaining objections seem to be related to the fact that the proposal only covers the CSI layer and people want a proposal that covers the upper layers too. To this end I will write up a KEP that describes a comprehensive proposal.

jsafrane commented 1 year ago

> (a) is there value having the CO notify the SP that it's bypassing proper cleanup calls when invoking ControllerUnpublishVolume?

Yes, as mentioned above, some cloud-y CSI drivers could invoke "force detach". Some more traditional (FC, iSCSI) drivers could at least try to clean up the LUN mapping on the server (target) side, because the client (initiator) is down / unreachable.

jsafrane commented 1 year ago

> (b) this is a breaking change w/ respect to SP expectations re: volume lifecycle. it may make sense to guard this w/ a specific capability in the controller service. and only then should a CO have expectations that such state transitions are reliable. otherwise SPs may receive calls in an unexpected/unsupported order. k8s isn't the only CO on the block, and just because it breaks the rules currently doesn't mean that all SPs need to support that pattern.

> As a CO implementer I'd suggest that if we were to have such a capability, the spec should provide specific guidance on what the CO and SP should expect as the correct behavior if that capability is missing. Because by a strict reading of the current spec, if the node plugin is unavailable, the state machine cannot continue and the volume is unrecoverably lost from the perspective of the CO.

As another CO implementer (Kubernetes), I have the same point. Only a human can bring the volume back to a working state, because the CO does not have any message that would tell the SP to recover the volume from such a state.

YuikoTakada commented 1 year ago

@bswartz Thank you for your comments. If you can refine #477 or create another new PR, I'm glad to discuss it.

YuikoTakada commented 1 year ago

@bswartz I read #477 again. #477 suggests creating new states QUARANTINED_SP and QUARANTINED_S. This means that if #477 is merged, COs (of course, kubernetes included) will not work well and many COs will need to be updated. That has a big impact. Could you please consider refining #477 so that it doesn't add new states, and instead just has the CO notify the SP that it's bypassing proper cleanup calls when invoking ControllerUnpublishVolume (i.e. via some flag field in the request message)? Thanks!

yosshy commented 1 year ago

> I'm going to put some more effort into bringing back my proposed solution in #477. The remaining objections seem to be related to the fact that the proposal only covers the CSI layer and people want a proposal that covers the upper layers too. To this end I will write up a KEP that describes a comprehensive proposal.

We found that k8s calls NodeStageVolume and NodePublishVolume for attached volumes on a restarted node. The status of an attached volume is "PUBLISHED", so this is another violation of the state transition diagram in Fig. 6, but it's reasonable behavior for COs because an attached volume might have been detached before the node restart. Please consider this when you update #477.

BTW, #477 looks like storage fencing, which was the first plan for non-graceful shutdown in k8s before it was finally changed to use node fencing. Is my understanding incorrect?

bswartz commented 8 months ago

We intend to fix the broken behavior in Kubernetes: https://github.com/kubernetes/kubernetes/pull/120344

Initially the fix will be opt-in, but eventually the old spec-violating behavior will be removed entirely. Based on that plan, this PR is no longer needed.

YuikoTakada commented 8 months ago

@bswartz Thank you for your notice. I'll check it.