Open gnufied opened 5 years ago
cc @jingxu97 @jsafrane @msau42
IMO this is covered in "Timeouts" chapter: https://github.com/container-storage-interface/spec/blob/master/spec.md#timeouts. It IMO applies not only to timeouts, but also to similar errors like interrupted gRPC connections, where the caller cannot be sure how the call ended and must either retry (in most cases) or cancel (when the volume does not need to be staged/published any longer).
There are a lot of gray areas in:
In some cases, a CO MAY NOT be able to cancel a pending operation because it depends on the result of the pending operation in order to execute the "negation" call.
and it might be best for the spec to be clearer on this point.
But I think for now we can work with the assumption that NodeUnstage can be issued to cancel a previously in-progress NodeStage, and similarly NodeUnpublish can be issued to cancel a previously in-progress NodePublish. The exact semantics of whether an operation can be cancelled, though, depend on what the SP does in its NodeStage and NodePublish calls.
I agree that this is very unclear. The real issue arises when we cannot "cancel" the NodeStageVolume/NodePublishVolume requests: what is the correct thing to return? Can the SP return Pending(Aborted) when NodeUnstageVolume/NodeUnpublishVolume cannot cancel the request? It is not clear from the documentation whether that is a valid option.
We have come across an issue where the CSI spec does not offer enough clarification about what happens if NodeUnstageVolume is called while NodeStageVolume is in progress for the same volume, and similarly for NodePublishVolume and NodeUnpublishVolume.

Let's say a user schedules a workload to node A, but NodeStageVolume takes time, and before it has a chance to finish the workload gets evicted from node A. Now two things can happen to the volume that was staged on node A:

1. NodeStageVolume may just be taking time, and the CO can wait for it to finish successfully before calling NodeUnstageVolume. The spec currently says:
2. NodeStageVolume may never succeed (because of an error or topology constraints) and the CO might keep retrying, but is the CO allowed to make a NodeUnstageVolume call while NodeStageVolume may be in progress?

I think we need to codify this in a better way.