container-storage-interface / spec

Container Storage Interface (CSI) Specification.
Apache License 2.0

Add force detach #477

Open bswartz opened 3 years ago

bswartz commented 3 years ago

Add new controller capability:

Add new node capability:

xing-yang commented 3 years ago

@bswartz did you modify csi.proto manually? I see that you added the new capabilities in csi.proto, but I didn't find them in spec.md. Changes should only be made in spec.md. After "make", csi.proto will be updated automatically.

bswartz commented 3 years ago

@bswartz did you modify csi.proto manually? I see that you added the new capabilities in csi.proto, but I didn't find them in spec.md. Changes should only be made in spec.md. After "make", csi.proto will be updated automatically.

Seriously?!? I didn't know that! I'll update.

bswartz commented 3 years ago

If CSI puts a node in a quarantine state, which component is responsible (and how) for marking the node as recovered so that it can leave the quarantine state?

All of the state management in CSI is implicit. Nowhere in the RPCs are the states represented explicitly. The CO needs to represent the new quarantine states for the scheme to work correctly, and the SP needs to relax its assumptions about what order some RPCs may be called in if they assert the new capabilities.

I guess that the quarantine state is per node+volume, making it possible for other volumes (potentially from the same csi-driver) to keep functioning correctly. Is that understanding correct?

In theory the states are per volume, although the expectation is that this feature would be used on all of the volumes on a given node at once, both for the fenced ControllerUnpublish and for the various node cleanup functions. COs still have to track state per volume, but I would expect an implementation to move all of the volumes to a quarantine state upon detection of a node problem, and then to move all of the volumes out of a quarantine state before attaching any new volumes upon resumption of a node's participation in a cluster. That's not strictly necessary, but it's cleaner and simpler.
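
To make the bookkeeping concrete, here is a minimal Go sketch of how a CO might track the implicit per-node+volume quarantine state and move all of a node's volumes into (and out of) it together. All of the type and function names are hypothetical; nothing here is part of the CSI spec or any CSI library.

```go
package main

import "fmt"

// Hypothetical CO-side bookkeeping; the CSI RPCs themselves never carry an
// explicit "quarantined" state, so the CO has to track it out of band.
type volumeKey struct {
	NodeID   string
	VolumeID string
}

type attachmentTracker struct {
	quarantined map[volumeKey]bool
}

func newAttachmentTracker() *attachmentTracker {
	return &attachmentTracker{quarantined: make(map[volumeKey]bool)}
}

// QuarantineNode marks every volume published to the node as quarantined,
// e.g. right before calling ControllerUnpublishVolume with fencing.
func (t *attachmentTracker) QuarantineNode(nodeID string, volumeIDs []string) {
	for _, v := range volumeIDs {
		t.quarantined[volumeKey{NodeID: nodeID, VolumeID: v}] = true
	}
}

// ReleaseNode clears the quarantine once node-side cleanup (or a reboot)
// has completed; only then should new volumes be attached to the node.
func (t *attachmentTracker) ReleaseNode(nodeID string, volumeIDs []string) {
	for _, v := range volumeIDs {
		delete(t.quarantined, volumeKey{NodeID: nodeID, VolumeID: v})
	}
}

func main() {
	t := newAttachmentTracker()
	t.QuarantineNode("node-1", []string{"vol-a", "vol-b"})
	fmt.Println(t.quarantined) // both volumes quarantined together
	t.ReleaseNode("node-1", []string{"vol-a", "vol-b"})
	fmt.Println(t.quarantined) // empty again
}
```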

nixpanic commented 3 years ago

After a little more discussion with colleagues, there seems to be some further (practical) clarification needed.

With Ceph-CSI we can use this proposal when the CSI Pods use host-networking. However, if there is any other form of networking configured, the IP-addresses or hostnames of the Pods may change, and block-listing is not trivial anymore.

This spec modification may not be the right place for this discussion, but we wonder if there has been any thought about this already.

bswartz commented 3 years ago

With Ceph-CSI we can use this proposal when the CSI Pods use host-networking. However, if there is any other form of networking configured, the IP-addresses or hostnames of the Pods may change, and block-listing is not trivial anymore.

I assume you're referring here to the ControllerUnpublish portion of the solution? Nobody said this is an easy thing to implement. That's why it's an optional capability.

I feel fairly confident that the higher-level problem of how to safely recover from a node failure when there are pods with RWO volumes is unsolvable without either this capability or some cloud provider plugin that can reliably kill/reboot problem nodes.

jdef commented 3 years ago

"fencing" nodes and "forcing" detach sound like complementary features/capabilities. so, from that perspective, it's nice to see the capability bits split up.

(a) what if only fencing capability is specified? is that allowed, w/o also supporting "forcing"? if so, what's the use case? (b) what if only forcing capability is specified? is that allowed, w/o also supporting "fencing"? if so, what's the use case?

if these features cannot be disentangled from each other, then maybe they are actually part of the same capability? there's a lot to think about in this proposal, and i certainly need to spend more time w/ it.

bswartz commented 3 years ago

"fencing" nodes and "forcing" detach sound like complementary features/capabilities. so, from that perspective, it's nice to see the capability bits split up.

(a) what if only fencing capability is specified? is that allowed, w/o also supporting "forcing"? if so, what's the use case? (b) what if only forcing capability is specified? is that allowed, w/o also supporting "fencing"? if so, what's the use case?

if these features cannot be disentangled from each other, then maybe they are actually part of the same capability? there's a lot to think about in this proposal, and i certainly need to spend more time w/ it.

In situation (a), if the driver supports fencing but not force unpublish, it still gives the CO everything it needs to cope with node failures by moving workloads to other nodes and fencing off the bad node. What you would be missing in this situation is a safe way for the bad node to rejoin the cluster when it becomes responsive again. In general, the normal cleanup procedure of NodeUnpublish+NodeUnstage can't be expected to succeed while the node can't access the relevant volumes (i.e., while it's fenced), because proceeding anyway might result in data loss, and a well-behaved node plugin would be expected to fail repeatedly until it can safely detach the volumes. This could be worked around by rebooting the node, though, which is what I'd recommend.

In short, situation (a) still gives the CO all the benefits of rapid recovery from node failure, but requires the use of a reboot to clean up afflicted nodes, because ordinary cleanup procedures can't be expected to work reliably.

In situation (b) the CO doesn't have any additional ways to cope with a node failure, but the force unpublish capability could still come in handy when dealing with storage controller failures. In the status quo, failure of a storage controller can result in nodes getting "stuck" because node unpublish/node unstage can never succeed, due to inability to cleanly unmount/detach the volumes while the storage controller itself is down. In a world where the CO knows about the health of the storage controller, it might choose to more aggressively clean up a node if it believes that data loss is inevitable, even if it lacks the capability to fence the node.

In short, situation (b) gives the CO the capability to reliably clean up nodes AFTER data loss has become inevitable.
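
For illustration, here is a rough Go sketch of the decision logic described above. The capability flags and step names are hypothetical; a real CO would drive this from the capabilities the plugin actually reports.

```go
package main

import "fmt"

// Hypothetical capability flags reported by an SP.
type pluginCaps struct {
	Fence          bool // controller can fence a node during ControllerUnpublishVolume
	ForceNodeClean bool // node plugin accepts forced NodeUnpublish/NodeUnstage after fencing
}

// recoverNode returns the recovery steps a CO could take for an unreachable
// node, given the plugin's capabilities.
func recoverNode(caps pluginCaps) []string {
	var steps []string
	if caps.Fence {
		// Situation (a): fence first, then it is safe to start the workload elsewhere.
		steps = append(steps, "fence node via ControllerUnpublishVolume", "reschedule workload")
		if caps.ForceNodeClean {
			steps = append(steps, "force NodeUnpublish/NodeUnstage when node returns")
		} else {
			steps = append(steps, "reboot node before it rejoins the cluster")
		}
		return steps
	}
	if caps.ForceNodeClean {
		// Situation (b): no fencing, but the node can still be cleaned up once
		// data loss is already unavoidable (e.g. storage controller failure).
		return []string{"force node cleanup after data loss is unavoidable"}
	}
	return []string{"wait for node to recover or for manual intervention"}
}

func main() {
	fmt.Println(recoverNode(pluginCaps{Fence: true}))
	fmt.Println(recoverNode(pluginCaps{Fence: true, ForceNodeClean: true}))
	fmt.Println(recoverNode(pluginCaps{ForceNodeClean: true}))
}
```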

bswartz commented 3 years ago

Thank you for all the review feedback. I think I've addressed all the comments. Let me know when it's time to squash and merge.

bswartz commented 3 years ago

@saad-ali I expect there will be merge conflicts with the other spec changes, due to conflicting capability numbers, so we need to merge the important ones first and the followers will need to resolve the conflicts by choosing higher numbers.

bswartz commented 3 years ago

Mostly nits; this seems well thought out. I feel that it's worth reiterating that unless CO's offer an API for a user to indicate "I don't care about data loss", manual administrator effort is still involved - and then it's unclear what this API actually buys anyone. If the goal is complete automation (w/ respect to cleanup and unpinning volumes) in the case of problematic nodes, then CO's need to address this at the API level as well. What's the plan for that, and if that hasn't landed yet - why?

@jdef Thanks for the review, I will take another pass and clean up the language further by including your suggestions.

I don't agree that manual effort is still needed. The idea is that, by invoking ControllerUnpublishVolume with the fence flag set, you've already accepted the data loss; everything after that is just cleanup. This is expected to happen in situations where the node is not responsive, which means that either (1) the node actually failed and the workload effectively crashed, leaving the volume in whatever state it was in at the moment the node crashed, or (2) the node is still running but the workload WILL crash when it loses access to its storage (the equivalent of yanking a USB drive out while an application is writing to it).

The real question is under what circumstances a CO might choose to make this call, and that IS a more challenging question (answered below). But I think we should at least be able to agree that once the call is made, there's no recovering the lost data; at that point we're cutting our losses and trying to restore the system to a usable state in an automated way, which is why force node unpublish/unstage are critical.

To answer the question of why anyone would make the ControllerUnpublishVolume call with the fence flag set, we have to think about workloads that highly value uptime. Historically, these applications have been a poor fit for container orchestrator systems like Kubernetes, and people have run them in extremely expensive compute clusters. Now that people are trying to run these kinds of applications in container orchestrators, they're looking for ways to achieve uptime guarantees similar to what they were able to get in the old world. The reality is that nodes do fail, and while workloads can migrate, shared state makes moving the workload unsafe unless you're SURE the old node is dead and not merely uncommunicative.

Clustered systems have historically solved this with a technique called STONITH (Shoot The Other Node In The Head), which kills the node whether it's dead or alive, making it safe for shared state to be taken over by a new node. There are various ways of achieving the shooting of the other node, including cutting its power (if you have programmatic access to the power supply) or killing the VM running the node (if you have access to the hypervisor), both of which are perfectly valid. This CSI spec proposal adds a third way, which relies only on access to the storage device and on the storage device having the capability to fence a node (not all do, which is why it's an optional capability). COs are then able to choose the most appropriate technique to make nodes definitely dead when they're trying to rapidly migrate a workload from a node of questionable state to a healthy node.

Presumably COs will also offer policy settings that let end users specify their preference for either aggressively high uptime or a more restrained response that leaves workloads on dead nodes in place in the hope that they might still be alive. In no case does this need to be a manual response. The policy can be set up front and the system can execute it, fencing workloads for which the policy says to prefer uptime very quickly after the node becomes unreachable.
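
One way to picture that policy knob, as a sketch only (neither the policy type nor the timings come from the spec or Kubernetes):

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical per-workload policy: how long to tolerate an unreachable node
// before fencing its volumes and rescheduling the workload.
type failoverPolicy struct {
	PreferUptime bool
	FenceAfter   time.Duration
}

func shouldFence(p failoverPolicy, unreachableFor time.Duration) bool {
	if !p.PreferUptime {
		return false // leave the workload in place and hope the node returns
	}
	return unreachableFor >= p.FenceAfter
}

func main() {
	aggressive := failoverPolicy{PreferUptime: true, FenceAfter: 10 * time.Second}
	conservative := failoverPolicy{PreferUptime: false}
	fmt.Println(shouldFence(aggressive, 30*time.Second))   // true: fence and reschedule
	fmt.Println(shouldFence(conservative, 30*time.Second)) // false: wait
}
```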

jdef commented 3 years ago

Thanks for the detailed response. Can you provide a policy example of how a CO allows a user to state that they are OK with data loss? Because otherwise this feels like "build it and they will come" .. but maybe they won't because users aren't asking for this? Or maybe they are.. can you share?

It's a cool feature, and I don't see how CO's are ready for it. Do you?


xing-yang commented 3 years ago

@jdef There's a KEP in Kubernetes that is planning to use this CSI spec change. See here: https://github.com/kubernetes/enhancements/pull/1116

bswartz commented 3 years ago

Thanks for the detailed response. Can you provide a policy example of how a CO allows a user to state that they are OK with data loss? Because otherwise this feels like "build it and they will come" .. but maybe they won't because users aren't asking for this? Or maybe they are.. can you share? It's a cool feature, and I don't see how CO's are ready for it. Do you?

The way I think sophisticated users think about this is: node failures aren't some unlikely event to be avoided, they're an inevitable certainty at scale, and you can calculate their expected frequency with knowledge of the underlying infrastructure and its failure modes. Given that node failures can be challenging to detect reliably, most systems (like Kubernetes) rely on a simpler heuristic, like the absence of X heartbeats, to determine a node failure. Heuristics like this will yield false positives with some frequency, and treating a node as failed when it hasn't actually failed can lead to even bigger problems when the workload in the pod has access to external storage.

I want to stress that the choice the end user has to make is not between accepting data loss and not accepting data loss -- it's between handling the false-positive case aggressively or conservatively. When nodes actually fail, data gets lost and the user's only choice is how patiently to wait to find out that's what really happened. When the system falsely detects a failed node, the choice is between causing a failure by evicting the workload forcefully and waiting to discover that the failure detection was in fact false.

The knobs that Kubernetes gives end users are pod-level tolerations to the node taints that Kubernetes applies to nodes that it deems unreachable or not-ready. By default pods tolerate these taints for 5 minutes, after which the pod eviction manager kills them and tries to reschedule them. Users that value higher workload uptime would tune these tolerations down from 5 minutes to something like 10 seconds. Users that prefer to not disturb pods might tune this value even higher.
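
As a concrete illustration of those knobs, here is a sketch using the upstream k8s.io/api/core/v1 types; the 10-second value is just an example, not a recommendation.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Tolerate the "unreachable" and "not-ready" taints for only 10 seconds
	// instead of the default 5 minutes, so the pod is evicted (and can be
	// rescheduled) quickly after the node stops heartbeating.
	seconds := int64(10)
	tolerations := []corev1.Toleration{
		{
			Key:               "node.kubernetes.io/unreachable",
			Operator:          corev1.TolerationOpExists,
			Effect:            corev1.TaintEffectNoExecute,
			TolerationSeconds: &seconds,
		},
		{
			Key:               "node.kubernetes.io/not-ready",
			Operator:          corev1.TolerationOpExists,
			Effect:            corev1.TaintEffectNoExecute,
			TolerationSeconds: &seconds,
		},
	}
	fmt.Printf("%+v\n", tolerations)
}
```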

Administrators have additional knobs that let them tune the node-failure detection thresholds to be more aggressive (lower timeouts) or more relaxed (higher timeouts).

I don't think you need more control than allowing users to set these timeout values. The proposed change in this PR and the KEP that @xing-yang linked is to adjust what we do after the pod eviction manager decides to kill the pod and reschedule it elsewhere. Today the workload can get stuck: the system wants to move the workload to a new node but is unable to do so safely because of the presence of an external volume that could get corrupted if 2 nodes mount it at the same time. The PR simply allows the eviction to occur safely by ensuring that no more than 1 node has access to the volume at a time.

bswartz commented 3 years ago

@jdef Thank you for all the grammar suggestions, they have been implemented.

I plan to squash this PR when everyone is satisfied with it.

saad-ali commented 3 years ago

https://github.com/container-storage-interface/spec/pull/476 and https://github.com/container-storage-interface/spec/pull/468 have merged.

I will hold off on cutting CSI.next RC until tomorrow (Tuesday, June 8) to see if we can get all the approvals in for this PR as well. If not, we will proceed with CSI.next RC without this PR.

bswartz commented 3 years ago

Rebased, will squash after approvals from @jdef @gnufied

jingxu97 commented 3 years ago

/hold

jdef commented 3 years ago

I've thought more about this and I'm still not sold. This smells like a bigger infra problem, not a volume one. E.g. we could intro a controller rpc called "NodeGone" that the CO uses to tell the SP that a node was shot, and so bookkeeping is needed. Let someone else handle shooting the node, and once bookkeeping is done the workloads are migrated to another node with access to required volumes.


jingxu97 commented 3 years ago

I see @jdef raised a number of questions in the KEP. Since this change depends on the logic in the KEP, it may be better to hold off on merging this one.

bswartz commented 3 years ago

I've thought more about this and I'm still not sold. This smells like a bigger infra problem, not a volume one. E.g. we could intro a controller rpc called "NodeGone" that the CO uses to tell the SP that a node was shot, and so bookkeeping is needed. Let someone else handle shooting the node, and once bookkeeping is done the workloads are migrated to another node with access to required volumes.

I like this proposal a lot from the SP implementor's side, because fencing ALL of the volumes from a given node is technically easier than fencing volumes one by one. However I worry this makes the CO implementor's job harder, because I'm presuming that volume state is tracked individually, and if there are a large number of volumes published to a particular node, with perhaps a mix of more than one SP, you'd end up doing some kind of batch processing of the volumes on a per-SP basis. I'm open to this idea though.
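
To sketch the per-SP batching mentioned above (names hypothetical): the CO groups the failed node's attachments by driver and issues one node-scoped call per SP, while still tracking per-volume state itself.

```go
package main

import "fmt"

type attachment struct {
	Driver   string
	VolumeID string
}

// groupByDriver batches a failed node's attachments per SP so that a single
// node-scoped fencing call (e.g. a hypothetical NodeGone RPC) can be issued
// to each driver, while per-volume state is still tracked by the CO.
func groupByDriver(attachments []attachment) map[string][]string {
	byDriver := make(map[string][]string)
	for _, a := range attachments {
		byDriver[a.Driver] = append(byDriver[a.Driver], a.VolumeID)
	}
	return byDriver
}

func main() {
	atts := []attachment{
		{Driver: "ceph.csi.example.com", VolumeID: "vol-1"},
		{Driver: "nfs.csi.example.com", VolumeID: "vol-2"},
		{Driver: "ceph.csi.example.com", VolumeID: "vol-3"},
	}
	for driver, vols := range groupByDriver(atts) {
		fmt.Printf("fence %v on driver %s for the failed node\n", vols, driver)
	}
}
```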

jdef commented 3 years ago

I've thought more about this and I'm still not sold. This smells like a bigger infra problem, not a volume one. E.g. we could intro a controller rpc called "NodeGone" that the CO uses to tell the SP that a node was shot, and so bookkeeping is needed. Let someone else handle shooting the node, and once bookkeeping is done the workloads are migrated to another node with access to required volumes.

I like this proposal a lot from the SP implementor's side, because fencing ALL of the volumes from a given node is technically easier than fencing volumes one by one. However I worry this makes the CO implementor's job harder, because I'm presuming that volume state is tracked individually, and if there are a large number of volumes published to a particular node, with perhaps a mix of more than one SP, you'd end up doing some kind of batch processing of the volumes on a per-SP basis. I'm open to this idea though.

I'd much rather add the majority of the complexity to the CO side of the house, instead of plugins.

Of course, with an API like NodeGone the question becomes "is the node really GONE, or will it come back - and if it comes back, will it have been rebooted (to start w/ a clean slate) or was it just disconnected from the network for a while?". So we'd probably need to carefully define the expectations here. I know that Mesos wrestled with this for a bit. I'm not sure if there's already an equivalent of "gone" in k8s. Either way, the CO should really be certain that a node is "gone" before invoking the CSI RPC to signal that state.
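
Purely for discussion, here is a rough sketch of what such a node-scoped RPC could look like; the message and RPC names are hypothetical and not part of the CSI spec.

```go
package main

import (
	"context"
	"fmt"
)

// Hypothetical request/response types for a node-scoped "NodeGone"-style RPC.
type NodeGoneRequest struct {
	NodeID string
	// Whether the CO believes the node was rebooted/destroyed, as opposed to
	// merely partitioned from the network; these semantics would need to be
	// pinned down precisely in the spec.
	PermanentlyGone bool
}

type NodeGoneResponse struct{}

// A controller plugin implementing the hypothetical RPC would fence the node
// and release the bookkeeping for every volume published to it in one call.
type Controller interface {
	NodeGone(ctx context.Context, req *NodeGoneRequest) (*NodeGoneResponse, error)
}

type fakeController struct{}

func (fakeController) NodeGone(ctx context.Context, req *NodeGoneRequest) (*NodeGoneResponse, error) {
	fmt.Printf("fencing all volumes published to %s (permanent=%v)\n", req.NodeID, req.PermanentlyGone)
	return &NodeGoneResponse{}, nil
}

func main() {
	var c Controller = fakeController{}
	_, _ = c.NodeGone(context.Background(), &NodeGoneRequest{NodeID: "node-1", PermanentlyGone: true})
}
```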

YuikoTakada commented 2 years ago

This comment was posted: https://github.com/kubernetes/enhancements/pull/1116/files#diff-0225593bb1191b37cc24ad60c172668c3df10b62f2fd748ceb1bbe85ddf078ceR107

This would trigger the deletion of the volumeAttachment objects. 
This would allow ControllerUnpublishVolume to happen before NodeUnpublishVolume and/or NodeUnstageVolume are called.
Note that there is no additional code changes required for this step. 

Is this true? In short, once the volumeAttachment has been deleted, can ControllerUnpublishVolume happen before NodeUnpublishVolume and/or NodeUnstageVolume are called, without the change suggested in this PR?

jdef commented 2 years ago

Out of order CSI calls are not spec compliant. I've said as much in the KEP


bswartz commented 2 years ago

Out of order CSI calls are not spec compliant. I've said as much in the KEP

I agree with this. And because reliably and quickly recovering from a node failure requires making the CSI calls out of order, this PR proposes allowing the out-of-order calls under special and strict conditions. Today what we see happening in Kubernetes is just willful violation of the spec because there's no "correct" way to do this, and I'd prefer to update the spec with a correct way.
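
For clarity, here is a sketch of the two orderings being discussed. The function names merely stand in for the corresponding CSI RPCs, and the fence flag is illustrative rather than the actual field name; the fenced path is what this PR would make spec-legal under the new capabilities.

```go
package main

import "fmt"

// These helpers just stand in for the corresponding CSI RPCs.
func nodeUnpublish(vol string) { fmt.Println("NodeUnpublishVolume", vol) }
func nodeUnstage(vol string)   { fmt.Println("NodeUnstageVolume", vol) }
func controllerUnpublish(vol string, fence bool) {
	fmt.Println("ControllerUnpublishVolume", vol, "fence =", fence)
}

// normalTeardown is the ordering the spec requires today: node-side cleanup
// strictly before the controller-side detach.
func normalTeardown(vol string) {
	nodeUnpublish(vol)
	nodeUnstage(vol)
	controllerUnpublish(vol, false)
}

// fencedTeardown is the out-of-order flow for an unreachable node: detach and
// fence first so the workload can move, then clean the node up later (or
// reboot it) once it comes back.
func fencedTeardown(vol string) {
	controllerUnpublish(vol, true)
	// ...workload is rescheduled on a healthy node here...
	nodeUnpublish(vol) // only valid if the SP advertises the new capabilities
	nodeUnstage(vol)
}

func main() {
	normalTeardown("vol-a")
	fencedTeardown("vol-b")
}
```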

cnmcavoy commented 3 months ago

Hi, as an author of a CSI driver, I'm interested in the status of this change. Does it need to be rebased and re-approved, or are there more substantial changes needed to cross the finish line?

xing-yang commented 3 months ago

cc @bswartz