Open bswartz opened 3 years ago
As stated in the writeup of this ticket, there's already a mitigation for this case: the plugin encrypts data before passing it back in the volume context, and the CO plays back the same, per-volume context in future calls. To expand upon that case, plugins could, instead, pass back a verifiable, signed token (e.g. JWT, keytab, x509 cert, or something similar that is, perhaps, not time-limited) in the volume context, and the CO replays that to the plugin - at which point the plugin validates the token and exchanges that for a secret that it needed in order to interact securely w/ a backend system. There are pros/cons of each approach. An important bit here is that the CO is permitted to safely log volume context, and so care should be taken to avoid leaking anything too sensitive here.
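To make the signed-token variant concrete, here is a minimal sketch using only the Python standard library. It assumes an HMAC-signed token rather than a JWT or x509 cert, and every name in it (`PLUGIN_KEY`, `issue_token`, `verify_token`, the `auth_token` context key) is hypothetical rather than part of CSI. The token carries only a non-sensitive backend reference, so it stays safe for the CO to log; the plugin would exchange that reference for the real secret after validation.

```python
import base64
import hashlib
import hmac
import json

# Hypothetical signing key known only to the plugin (never handed to the CO).
PLUGIN_KEY = b"example-key-known-only-to-the-plugin"


def issue_token(volume_id: str, backend_ref: str) -> str:
    """Build a signed token the plugin can place in the volume context."""
    payload = json.dumps({"vol": volume_id, "ref": backend_ref}, sort_keys=True).encode()
    sig = hmac.new(PLUGIN_KEY, payload, hashlib.sha256).digest()
    return (
        base64.urlsafe_b64encode(payload).decode()
        + "."
        + base64.urlsafe_b64encode(sig).decode()
    )


def verify_token(token: str) -> dict:
    """Validate a token replayed by the CO; raise ValueError if it was altered."""
    payload_b64, _, sig_b64 = token.partition(".")
    try:
        payload = base64.urlsafe_b64decode(payload_b64.encode())
        sig = base64.urlsafe_b64decode(sig_b64.encode())
    except ValueError:
        raise ValueError("malformed token") from None
    expected = hmac.new(PLUGIN_KEY, payload, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking how many signature bytes matched.
    if not hmac.compare_digest(sig, expected):
        raise ValueError("token signature mismatch")
    return json.loads(payload)


# The plugin returns this map; the CO stores it and replays it unchanged.
volume_context = {"auth_token": issue_token("vol-1", "lun-7")}
claims = verify_token(volume_context["auth_token"])
```

Note that a signature only proves the token came from the plugin; it does not hide the payload, which is why nothing sensitive goes into it here.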
Since there's already a way to mitigate this from a plugin perspective, it does not seem mandatory to solve for this particular use case via CSI spec changes. If any change is needed, it seems that the minimal requirement is that a plugin can describe to the CO "maybe don't log some subset of information that I'm handing to you". Then again, if it's not safe to log .. is it really safe to persist to disk in cleartext? We could craft a "classification" scheme which differentiates among several sensitivities, but that seems a bit over-engineered? Any "must be persisted securely" classification brings us back to the original motivation of this ticket.
ALL THAT SAID...
Let's boil down the scenarios being mentioned here:
A. (today) CO passes secrets to plugin in RPC requests
B. (proposed) plugin passes secrets to CO in RPC results; CO replays those secrets in future RPC requests
To be clear, the intended scope of (A) is not "static" secrets (which are generally configured by the thing that executes the plugin binary and passes this information via envvar or flag), but "dynamic" secrets that can change from call-to-call. Passing dynamic secrets from CO to plugin does not dictate where this secret information is initially configured and/or stored. There's an implicit assumption that either an administrator or user knows enough about the volume provisioning process to wire things up so that the CO has access to the right secret at runtime, particularly at the point in time at which the RPC is executed. There is no implicit assumption or explicit requirement that the CO is involved in the storage of such secret information. A lowest common denominator CO simply acts as a pass-through system for secrets that are persisted elsewhere. COs are never expected to safely store sensitive information for any amount of time, under any circumstance. The spec does say that sensitive information passed through RPCs like this should be treated as such, and so LCD CO implementations are responsible for the safe handling of it, in transit.
Some consequences of (A):
It's somewhat orthogonal, but worth noting that the CSI spec plays no role in actually securing data in transit. The spec recommends the use of a UNIX socket, but in practice gRPC calls may also be made over TCP sockets and CSI does not prescribe a solution for securing such communication. It's assumed (probably generally, but at least by myself) that uses of TCP w/ CSI gRPC probably involve a TLS-secured channel, and that such approaches are "good enough" for securing information in transit.
Somewhat more directly related is that CSI APIs are idempotent. While different perspectives exist on the philosophy of idempotent operations, a common ground seems to be that the same API call made repeatedly, with the same request parameters, should not result in a distinctive state change of the target system: once the first call has succeeded, repeated calls have no side effects. This seems possibly important w/ respect to replay attacks, and secret lifetime, if we begin to ask a CO to participate more actively in secret lifecycle management (B).
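As a rough illustration of the idempotency point, the sketch below models a controller whose `create_volume` can be retried safely. The class and field names are hypothetical, and compatibility is simplified to an exact capacity match; the one behavior taken from the spec is that a repeated `CreateVolume` with the same name but incompatible parameters is answered with `ALREADY_EXISTS`.

```python
from dataclasses import dataclass


class AlreadyExists(Exception):
    """Stands in for the gRPC ALREADY_EXISTS status code."""


@dataclass(frozen=True)
class Volume:
    volume_id: str
    name: str
    capacity_bytes: int


class InMemoryController:
    """Hypothetical controller plugin that keeps its state in a dict."""

    def __init__(self) -> None:
        self._by_name = {}
        self.backend_calls = 0  # counts real provisioning work

    def create_volume(self, name: str, capacity_bytes: int) -> Volume:
        existing = self._by_name.get(name)
        if existing is not None:
            if existing.capacity_bytes != capacity_bytes:
                # Same name, different parameters: not a replay of the first call.
                raise AlreadyExists(f"{name} exists with a different capacity")
            # Exact replay: return the earlier result, change nothing.
            return existing
        self.backend_calls += 1
        volume = Volume(f"vol-{self.backend_calls}", name, capacity_bytes)
        self._by_name[name] = volume
        return volume


controller = InMemoryController()
first = controller.create_volume("pvc-a", 1024)
again = controller.create_volume("pvc-a", 1024)  # retried call, no new side effects
```

If results start carrying secrets under (B), a replayed call would presumably have to return the same secret again (or a freshly minted one), which is where secret lifetime gets tangled up with idempotency.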
The scope of (B) seems considerably more involved than that of (A).
In summary:
š on this proposal
xref #116 #123
I'm sympathetic to the idea that COs should not be treated like a general purpose data store, and that storing secrets is a uniquely complex responsibility. The problem is that somebody has to do this work, and pushing the responsibility away from the CO doesn't make it go away.
SPs currently rely somewhat heavily on the volume context / publish context mechanism of controller/node communication, because there are good architectural reasons to avoid node plugins being able to access the storage device for any purpose other than data access. This means that a communication channel between controller and node is required, but in order to ensure scalability we use the CO to provide this channel rather than asking the SPs to invent their own communication channel.
I feel like the mere existence of the volume/publish context string maps is an admission that it's better for the CO to handle this controller to node communication than to require SPs to sort that out themselves. Clearly there was an attempt to balance this burden on the COs by limiting its scope (only 4KiB of data) and direction (only controller to node, with the exception of the NodeIDs). The constraints on this communication channel, however (in this case, lack of security), continually generate incentives for SPs to not use the CO-supplied mechanism and to instead invent their own communication mechanisms.
I'm struggling to know if we've drawn the line in the right place, given that SP authors keep finding it insufficient. I wonder if it was a mistake to even attempt to obviate the need for a SP-managed communication channel between nodes and controllers, and if it would have been better to encourage direct communication from the beginning.
I'm struggling to know if we've drawn the line in the right place, given that SP authors keep finding it insufficient.
This is an interesting point. There was another, related discussion about CSI introducing a general purpose communication-hub/bus API, and we very intentionally decided that was out of scope: plugins that need anything other than simple cookies are on their own to implement (or leverage, via some other infra component) a more complicated communication channel. This mostly seems to resurface every time a plugin author wants their node plugin to communicate back to the controller plugin.
The constraints on this communication channel however, (in this case, lack of security) continually generate incentives for SPs to not use the CO-supplied mechanism.
I think it probably depends on the use case here, but I could be wrong. It's been a while since I've dug into various OSS CSI implementations to see how this is being leveraged. One of the challenges is that the spec tries to accommodate KISS plugins alongside those that are much more heavyweight, while also supporting multiple plugin deployment architectures. If we cover 80% of use cases, is that good enough? Are we even hitting that mark?
Back to what's being proposed in this ticket: I've had a brief chat w/ core Mesos folks. Takeaways:
Another thought: given solutions like Vault's "transit" engine (or other things that e.g. SOPS can plug into), I'm wondering why asking plugins to do the work of encrypting sensitive information (which only some plugins need to do) for use within an insecure context is overly burdensome. After all, it seems good enough for gitops use cases.
The way we've handled CO "capabilities" in the past is by adding new optional arguments to RPC calls. Callers that support the new capability assert so by setting the argument to the non-default value. This is a signal to SPs that they can leverage additional return values. The spec would have to make clear this requirement -- that the return value is ignored unless an input parameter has a particular value.
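A minimal sketch of that opt-in pattern, assuming a hypothetical request flag `accepts_secret_context` and a hypothetical response field `secret_context` (neither exists in the CSI spec today). The default value of the flag means "not supported", so an older CO that never sets it keeps working unchanged.

```python
from dataclasses import dataclass, field


@dataclass
class CreateVolumeRequest:
    name: str
    # Hypothetical opt-in flag; the default signals "capability not supported".
    accepts_secret_context: bool = False


@dataclass
class CreateVolumeResponse:
    volume_context: dict = field(default_factory=dict)
    # Hypothetical extra return value, only meaningful when the CO opted in.
    secret_context: dict = field(default_factory=dict)


def plugin_create_volume(req: CreateVolumeRequest) -> CreateVolumeResponse:
    """SP side: populate the new field only when the caller asked for it."""
    resp = CreateVolumeResponse(volume_context={"pool": "default"})
    if req.accepts_secret_context:
        resp.secret_context = {"chap_secret": "example-only"}
    return resp


def co_persist(req: CreateVolumeRequest, resp: CreateVolumeResponse) -> dict:
    """CO side: ignore the new field unless this CO set the flag itself."""
    stored = {"volume_context": dict(resp.volume_context)}
    if req.accepts_secret_context:
        stored["secret_context"] = dict(resp.secret_context)
    return stored


legacy_req = CreateVolumeRequest(name="pvc-a")
modern_req = CreateVolumeRequest(name="pvc-b", accepts_secret_context=True)
legacy_stored = co_persist(legacy_req, plugin_create_volume(legacy_req))
modern_stored = co_persist(modern_req, plugin_create_volume(modern_req))
```

Both sides check the flag: the SP so that it never hands secrets to a CO that cannot protect them, and the CO so that the return value is ignored unless the input parameter was set.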
I'm with you on the fact that it's very hard to make a one-size-fits-all solution at the CSI spec level, and that it makes more sense to aim for the 80% case. Maybe where we could do some useful work would be to spell out what we think the limits of the architecture are, and give some guidance to the 20% on what they should consider instead when they run into the limits of what the spec allows.
The example you raised 3 comments above, about secret rotation, is an excellent example of where the spec doesn't offer the kinds of tools one really needs, and it would be helpful to spell out how SPs are expected to tackle those kinds of problems. I'd like to have a library of proofs-by-example that these problems are in fact solvable without changing the spec, so that we can point developers there first when they complain about perceived deficiencies in the existing architecture.
In the CSI spec today, all secrets are stored on the CO side, and sent to the SP side at appropriate times. This covers use cases where the secrets are administrator-created (such as login credentials for the storage device) or user-generated (such as per-volume encryption keys), but there's a third class of use cases where the secret is meant to be used by the node to securely connect to the storage device.
For example, iSCSI CHAP secrets fall into the third category. The node needs them to connect to an iSCSI LUN, and they're sensitive information, because if an attacker obtains the CHAP secrets, they could use them maliciously. Typically, neither the user nor the administrator is interested in knowing such secrets -- they just need to be securely arranged between the storage device and the CO node.
Given the existing design of CSI, the options for supporting this third use case are all suboptimal. An SP may:
It would be better to modify the CSI spec to allow SPs to return both secret and non-secret context information for volumes at CreateVolume and ControllerPublish times. We could mark the additional returned string map as "secret" thus preventing sidecars from logging it, and allowing the CO to store such information securely (as securely as any other secret required for correct SP operation).
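To illustrate what "marked as secret" could mean for a sidecar, here is a small sketch. The `secret_publish_context` field and the `log_line` helper are assumptions made for this example, not existing CSI or sidecar APIs; the point is only that a separate map lets logging code redact values without guessing which keys are sensitive.

```python
from dataclasses import dataclass, field


@dataclass
class PublishResult:
    """Hypothetical ControllerPublishVolume result with a separate secret map."""

    publish_context: dict = field(default_factory=dict)
    # repr=False keeps the secret map out of accidental repr()/print() output.
    secret_publish_context: dict = field(default_factory=dict, repr=False)


def log_line(result: PublishResult) -> str:
    """Sidecar-style log message: secret keys are shown, secret values are not."""
    redacted = {key: "***" for key in result.secret_publish_context}
    return (
        "ControllerPublishVolume ok "
        f"context={result.publish_context} secret_context={redacted}"
    )


result = PublishResult(
    publish_context={"target_portal": "10.0.0.5:3260"},
    secret_publish_context={"chap_password": "example-only-value"},
)
message = log_line(result)
```

Storage is a separate question from logging: the CO would still need to persist the secret map with whatever mechanism it uses for other secrets.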