basho / riak

Riak is a decentralized datastore from Basho Technologies.
http://docs.basho.com
Apache License 2.0

Add Security to Riak #355

Closed Vagabond closed 10 years ago

Vagabond commented 11 years ago

This is a tracking meta-issue for the cross-repo task of adding security to Riak.

Rationale

Riak (not Riak CS, which has its own application layer security model) currently has no authentication or authorization, nor does it provide encryption for the Protocol Buffers API (HTTPS is optional for the REST interface). In most deployments, Riak is deployed on a trusted network and unauthorized access is restricted by firewall/routing rules. This is usually fine, but if unauthorized access is obtained, Riak offers no protection to the data that it stores.

Thus, I propose we add authentication, authorization, TLS, and auditing to Riak, to make it more resilient to unauthorized access. In general, I took the design cues from PostgreSQL. Another goal was to make this applicable to riak_core, so any reliance on KV primitives or features is intentionally avoided.

Authentication

All authentication should be done over TLS, for two reasons: to avoid MITM attacks and to prevent eavesdroppers from sniffing credentials. Self-signed certificates are acceptable if the client checks the server's certificate against a local copy of the CA certificate (and thus we can avoid the complicated 'web of trust' used by regular HTTPS). CRL checks should also be performed when a CRL is appropriately configured.
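As a rough illustration of what "pin a local copy of the CA certificate" means on the client side, here is a minimal Python sketch using the standard-library ssl module. The CA path is a hypothetical example, and none of this reflects any actual Riak client code:

```python
import ssl

def pinned_context(ca_path=None):
    """Build a client TLS context that trusts ONLY a local copy of the
    cluster's CA certificate (no system trust store, no web of trust).
    ca_path is a hypothetical example path, e.g. "/etc/riak/ca.crt"."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # PROTOCOL_TLS_CLIENT enables these by default; kept explicit for clarity.
    ctx.check_hostname = True
    ctx.verify_mode = ssl.CERT_REQUIRED
    if ca_path:
        ctx.load_verify_locations(cafile=ca_path)  # pin the self-signed CA
        # When a CRL is configured, also load it via load_verify_locations
        # and enable revocation checking on the server's leaf certificate:
        ctx.verify_flags |= ssl.VERIFY_CRL_CHECK_LEAF
    return ctx
```

A client would pass this context when opening the socket; the handshake then fails unless the server presents a certificate signed by the pinned CA.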

Once TLS has been negotiated and verified, the client supplies username/password credentials. The password is transmitted in the clear; this facilitates writing pluggable security backends. This is not a major problem because, at this point, the connection should be proof against eavesdropping.

The pluggable security backends we propose to implement are the following:

Postgres auth methods

Authentication information is split into two pieces, users and sources. A user cannot authenticate without a corresponding source that matches username/peer address.

Postgres authentication source configuration

To add a user named andrew and to trust all connections from localhost, you'd do:

riak-admin security add-user andrew
riak-admin security add-source all 127.0.0.1/32 trust

To add a user that you wanted to authenticate against the local password table and be allowed to connect from anywhere:

riak-admin security add-user sean password=justopenasocket
riak-admin security add-source sean 0.0.0.0/0 password

The password provided at user-creation time is hashed via PBKDF2 and stored.
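PBKDF2 verification can be sketched with Python's standard library; the iteration count, digest, and salt size below are illustrative choices, not Riak's actual parameters:

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None, iterations=100_000):
    """Hash a password with PBKDF2-HMAC-SHA256 for storage.
    Parameters are example values, not Riak's real configuration."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return salt, iterations, digest

def verify_password(password, salt, iterations, stored_digest):
    """Recompute the hash and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return hmac.compare_digest(candidate, stored_digest)
```

The point of PBKDF2 here is that a stolen credential store yields only salted, slow-to-brute-force digests rather than plaintext passwords.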

To trust users on the LAN but to force everyone else to authenticate against PAM:

riak-admin security add-source all 192.168.1.0/24 trust
riak-admin security add-source all 0.0.0.0/0 pam service=riak

The service=riak option tells PAM to submit any provided credentials to that particular PAM service configuration. Sources are compared from most to least specific, both by user match and by CIDR match (specific usernames sort before 'all', and a /24 sorts before a /0). Only the first matching source is tested; if that fails, the authentication fails.
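The "most to least specific, first match wins" rule can be sketched as follows; the data layout and function names are illustrative, not the actual implementation:

```python
import ipaddress

def match_source(sources, username, peer_ip):
    """Return the auth method of the first (most specific) matching source,
    or None if no source matches. Specific usernames sort before 'all',
    and narrower CIDRs (larger prefix length) sort before wider ones."""
    addr = ipaddress.ip_address(peer_ip)

    def specificity(src):
        user, cidr, _method = src
        return (0 if user != "all" else 1,
                -ipaddress.ip_network(cidr).prefixlen)

    for user, cidr, method in sorted(sources, key=specificity):
        if user in (username, "all") and addr in ipaddress.ip_network(cidr):
            return method  # first match wins; no fallback to later sources
    return None            # no matching source: authentication is refused

# The LAN-trust / PAM-everywhere example from the text:
SOURCES = [
    ("all", "0.0.0.0/0", "pam"),
    ("all", "192.168.1.0/24", "trust"),
]
```

With these sources, a client on 192.168.1.0/24 is trusted outright, while everyone else falls through to PAM, regardless of the order the sources were added in.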

Authorization

Riak currently has a completely permissive approach to data access: if you can connect, you can get/put/delete anything you want. Providing authentication, as in the section above, raises the bar to that kind of access, but it still leaves your data vulnerable to a compromised client, especially if you have something like a lower-security reporting application (or even a remotely hosted one with a hole punched in the firewall). It also makes anything like multi-tenancy impossible (think hosting multiple phpBB instances on a single MySQL server).

Thus, in addition to authentication, we also need authorization. This is a major change to Riak's semantics, especially when it comes to creating buckets, which in Riak now is as simple as writing a key to the bucket you want to create. For applications that want to dynamically create buckets, we need to provide some way to give them authorization to do so, without compromising the ability to provide security.

To that end, I propose that authorization be checked on a per-bucket basis. Users are granted granular permissions (registered by the individual riak_core applications):

riak_core:register(riak_kv, [{permissions, [get, put, delete]}, ...])

The permissions are namespaced by the registering application, so the above permissions become riak_kv.get, riak_kv.put, etc. These permissions convey no meaning to riak_core, the application is in charge of indicating what permissions are required for each operation.

Examples of granting permissions:

riak-admin security grant riak_kv.get ON mybucket TO andrew
riak-admin security grant riak_kv.get,riak_kv.put ON mybucket TO sean

To preserve the ability to dynamically create buckets whose name is not known beforehand (think buckets per-username or something), I propose the ability to GRANT based on a bucket prefix:

riak-admin security grant riak_kv.put ON myapp_* TO andrew

Thus, the application connecting with the 'andrew' credential can create unlimited buckets that begin with 'myapp_', but has no access to buckets outside that prefix space.

Additionally, perhaps you want to give a user access to everything, Riak could support the ALL permission, and the ANY target:

riak-admin security grant ALL ON ANY TO andrew

This would effectively provide the old unlimited access that Riak currently has, but still provide some security.
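The grant semantics described so far (exact buckets, prefix wildcards, and ALL/ANY) can be sketched as a simple lookup; the table and names below are illustrative only:

```python
# Hypothetical grant table: user -> set of (permission, target) pairs.
GRANTS = {
    "andrew": {("riak_kv.put", "myapp_*")},
    "sean":   {("riak_kv.get", "mybucket"), ("riak_kv.put", "mybucket")},
    "admin":  {("ALL", "ANY")},
}

def is_authorized(user, permission, bucket):
    """Check a user's grants: exact bucket match, prefix wildcard
    (trailing '*'), or the blanket ALL permission / ANY target."""
    for perm, target in GRANTS.get(user, ()):
        if perm != permission and perm != "ALL":
            continue
        if target == "ANY":
            return True
        if target.endswith("*") and bucket.startswith(target[:-1]):
            return True
        if target == bucket:
            return True
    return False
```

Note this sketch does not handle the per-application permission wildcard ('riak_kv.*') mentioned below, which would need a similar prefix test on the permission string.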

It may also be interesting to wildcard permissions by application, e.g. 'riak_kv.*'.

As the superuser giveth, he may also taketh away:

riak-admin security revoke riak_kv.get ON mybucket FROM andrew

Grants and revokes are currently stored separately. The goal is to make users/sources/grants strongly consistent and revokes eventually consistent. That way, during an outage (possibly caused by a malicious/co-opted user account), you can revoke without requiring complete cluster availability, but you avoid problems with partial grants, etc.

Auditing

Since every operation will now be tied back to a user account, we should be able to audit what user did what and when. To that end I plan to extend lager to support alternate event streams (with a separate gen_event) and use that as an audit logging facility. Pairing that with the syslog backend, you'd be able to ship the logs off the machine and so make them harder to tamper with. This is a stretch goal for this development cycle.

Migration

When this work drops in the next major release, existing deployments will have to migrate. For existing deployments at least, security should default to off. When the user is ready to turn it on, they'll need to have upgraded their client libraries to support it, as well as deployed SSL certificates, signed by the same CA, to all the nodes. Until that switch gets flipped, clients will work exactly as they do now.

Open Questions

HTTP: https://gist.github.com/Vagabond/05b7dc8ae6d3ca4af6c2
PBC: https://gist.github.com/Vagabond/6222793a1d352f1ccdd2

Work in Progress

Partial implementations of all of this may be found in the 'adt-security' branch of the following repos:

coderoshi commented 11 years ago

Concerning user groups, I'd vote yes. If a group of permissions could be bundled (aka roles), then users could be assigned to a group/role rather than granted/revoked permissions individually. This could prove helpful, not only in implementing RBAC, but also in reducing the complexity of defining multiple users with similar, complex roles.

glassresistor commented 11 years ago

+1

bkerley commented 11 years ago

Certificate - The client provides a certificate, signed by the same CA as the server's certificate. If this certificate is validated and the common-name matches the requested username, the user is authenticated.

This should include a configurable Certificate Revocation List; otherwise untrusted clients can't be removed without basically starting the CA from scratch.
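The common-name check described above can be sketched against the subject structure Python's ssl module returns from getpeercert(); the certificate dict below is a hypothetical example, shown only to illustrate the matching rule:

```python
def cn_matches(peercert, requested_username):
    """Return True if the validated certificate's commonName equals the
    username the client is requesting. peercert follows the nested-tuple
    shape of ssl.SSLSocket.getpeercert()."""
    for rdn in peercert.get("subject", ()):
        for attr, value in rdn:
            if attr == "commonName":
                return value == requested_username
    return False

# Hypothetical example of a validated peer certificate's subject:
EXAMPLE_CERT = {"subject": ((("commonName", "andrew"),),)}
```

The CN comparison only matters after chain validation (and any CRL check) has already succeeded; a matching name in an unvalidated certificate proves nothing.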

peschkaj commented 11 years ago

+1

Groups are important, especially in the LDAP/ActiveDirectory world.

aphyr commented 11 years ago

There are fundamentally three modes for TLS' authentication model:

  1. Assume you create a separate client key and certificate for each user. Both client and server verify each other's certificate. These channels are secure against MITM attacks, even where the client's CA chain is tampered with. Completing the TLS handshake proves the client's identity to Riak. A username and password is not required.
  2. Assume clients generate their own certs, and keep the server's CA chain on hand to verify the server's certificate during channel negotiation. These channels are secure against MITM attacks so long as an attacker cannot manipulate the client's CA store. A username and password is still required to authenticate the channel, since the client is anonymous.
  3. Assume totally anonymous mode: neither the client nor the server are authenticated. This mode is trivially vulnerable to MITM attacks. Since usernames and passwords are required in this case, attackers can readily capture credentials and impersonate any user.

I recommend either 1.) fully authenticated or 2.) server-authenticated TLS channels. This mandates the use of a certificate store on each client. While you're going to that trouble, it might make sense to also generate and store client certificates as well; indeed, the proposal requires the secure storage of client keys and certificates, which means you'll need a key distribution scheme for clients.

Given the presence of client secure storage for client keys and certs, you might as well encode the user credentials in the certificate directly. This removes the need for passwords and their secure storage on the server, which reduces the attack profile. It also removes the need for a separate username/password auth channel in the Riak protocol. That'd make it simpler for client maintainers to add auth support to their clients, since they can rely on the TLS protocol to do the work for them. Clients only have to store/configure [key, cert], instead of [key, cert, username, password]. User access can be cancelled via the usual CRL techniques.

jj1bdx commented 11 years ago

Canola (a PAM authenticator module) at https://github.com/basho/canola should be included in the Work In Progress list.

Vagabond commented 11 years ago

@bkerley Yes, CRL support is something I forgot to cover, will update the text.

@aphyr Yes, I plan to support both 1 and 2 (but not 3). If you want to use the certificate authentication mode (with no additional password), we require you to handshake via method 1. This is already sort of implemented for the PBC protocol and the riak_test shows its use:

https://github.com/basho/riak_test/blob/adt-security/tests/pb_security.erl#L94 https://github.com/basho/riak_test/blob/adt-security/tests/pb_security.erl#L133

randysecrist commented 11 years ago

Will the ACL-style permissions preclude (or make it difficult) to use a more capability-driven security model (OAuth scopes) in a future phase? Has OAuth been discussed?

aggress commented 11 years ago

How would this work with MDC?

cdahlqvist commented 11 years ago

It would be useful to also have authorization for list_keys, list_buckets and secondary index queries as well as for the ability to run mapreduce queries.

seancribbs commented 11 years ago

@cdahlqvist From my reading of the proposal, all operations will be authorized and audited.

Vagabond commented 11 years ago

@aggress MDC doesn't deal with client authentication at all; it already has support for TLS-based authentication, where both sides verify the other's certificate.

@cdahlqvist Yes, EVERY operation (or almost all of them) will need a permission associated; I just don't know what some of them will look like yet. The only exception to the rule may be the stats and ping endpoints; you may be able to hit them simply by being authenticated, not sure yet.

aggress commented 11 years ago

@Vagabond I was thinking more along the lines of: will users/roles created in cluster a) be replicated over to cluster b), or will they need to be set up individually? And how might things work with things like cascading writes?

Vagabond commented 11 years ago

@aggress Yes, that is a good question, will add it to the open questions section. My initial feeling is to NOT replicate that information, but we'll see.

aggress commented 11 years ago

How might commit hooks be handled? E.g. stopping user A from using a commit hook that updates a bucket only user B has access to.

ghost commented 11 years ago

Oh yeah, that reminds me: Erlang mapreduce is basically a wide-open door for arbitrary code execution, including, I suppose, modifying the ACLs themselves, so it should only be accessible to the highest privilege levels.

jrwest commented 11 years ago

re: replication between clusters, that isn't something we planned to support in cluster metadata (where this info will be stored) -- at least initially.

Either way, I agree w/ @Vagabond's initial feeling. Even if the cluster is accessed by the same logical users I would assume they are typically accessed by different hardware (external LB at the least, probably different application servers).

tarcieri commented 11 years ago

Something Riak might consider is a capability-based security model for granting access to buckets. I think capability-based security could fit extremely well with Riak's key/value storage model and have done a bit of work in this space.

Under this model, authentication could be handled using whatever mechanism is desired (e.g. mutual TLS), but to authorize access to a particular bucket, the client would need to present a bucket-specific token, which could actually be a combination of cryptographic keys (known as a crypto-capability model).

I've implemented a generic key/value store encryption system which works with Riak among other key/value stores here, if you're interested in seeing a real-world example of what I'm describing. My scheme encrypts both keys and values, allows data to be accessed using only encrypted keys, and allows clients to decrypt the key names if so desired:

https://github.com/cryptosphere/keyspace

The best part of this approach is that it has minimal impact on Riak. In fact, encryption is orthogonal, and something only clients would have to support. The only thing that would have to be added to Riak itself is a digital signature check (along with a timestamp check to prevent replay attacks) to ensure values being written are authentic.

randysecrist commented 11 years ago

@tarcieri I like your work on keyspace, and it pairs with what I was going for when I asked about a capability based model a bit earlier. +1 for this.

camshaft commented 11 years ago

+1 capabilities. It's generally easier to understand and more secure. Managing ACLs becomes cumbersome very quickly from my experience.

Vagabond commented 11 years ago

So, after a bunch of reading and internal discussion, I think we're going to stick with ACLs, for the following reasons:

However, I think I will add postgres style roles, as a way to implement 'groups'.

tarcieri commented 11 years ago

Sad to hear that :(

I buy the familiarity argument, but that's really the only thing ACLs have going for them over capabilities. Since capabilities solve the AuthZ problem, you can still use MTLS to solve AuthN and revoke access that way. Audit logging can be used to spot abuse of capabilities.

Relevant: Zed Shaw - The ACL Is Dead

Vagabond commented 11 years ago

I am watching that talk, but I'm struggling to extract much of relevance from it. It is sort of like a tech-related street performance that occasionally touches on ACLs en-route to bagels, smoked meat, corporate greed, the incompetence of MBAs, etc.

The three key points he seems to make are:

Maybe I'm dense, but I don't understand why those are even a problem. I understand his point about ridiculous business rule requirements about time and situation dependent ACLs, but Riak does not really have that problem.

Right now my takeaway is this: ACLs for people != ACLs for applications. Applications rarely need time-dependent or situation dependent access to data, they have their data and they want to access it whenever they need to, and these access rules change rarely. Riak is not a document management solution, it is a database. It is used by applications, not people.

I'm happy to have a discussion about this, but providing references to things like the 'authz problem' that are not a 1:10 stream-of-consciousness rant about all sorts of unrelated things would help your case a lot more. It is fairly telling that none of the questions at the end were even about ACLs at all, beyond one question about making what Zed did into a product.

tarcieri commented 11 years ago

Haha, sorry about that. But I hope it drives home that ACLs are in an uncanny valley between a capability based system and Turing-complete code for providing AuthZ.

Waterken Web describes some of the tradeoffs of capabilities vs ACLs:

http://waterken.sourceforge.net/

You might also take a look at how Tahoe-LAFS implements "writecaps" and "readcaps" for its mutable files. You wouldn't need anything so elaborate, just a digital signature:

http://eprint.iacr.org/2012/524.pdf

Tahoe ends up providing something that looks an awful lot like an encrypted version of Riak, sans many of the features that make Riak compelling as a database (read repair, vector clocks, 2I, etc)

Vagabond commented 11 years ago

Maybe we can narrow the conversation here. When would the confused deputy problem occur for Riak, along the lines of the compiler example here:

http://waterken.sourceforge.net/aclsdont/current.pdf

Vagabond commented 11 years ago

I guess my biggest sources of mystification are the following:

tarcieri commented 11 years ago

@Vagabond that's not a question I can answer until you have defined a threat model. Only then can you enumerate potential attacks and choose defenses.

I can perhaps enumerate why ACLs don't work in practice with an example threat model:

Threat: We want to give Alice, but not Mallory, AuthZ to X, even though both Alice and Mallory can AuthN to the service providing X, and Alice and Mallory are conspirators.

Capability attack scenario: Alice gives Mallory the capability to access X. Mallory can then access X. Our audit logs reflect Mallory accessing X.

ACL attack scenario: Alice downloads X and gives it to Mallory. Mallory now has X. Our audit logs reflect Alice accessing X, not Mallory. Now we have the problem that Alice is authorized to access X, so this may appear to be normal behavior, combined with the fact that Mallory gaining access to the content is not reflected in the audit logs at all.

In the end the result is the same, with some caveats: In the capability scenario, we see Mallory accessing the resource illicitly, but don't learn that Alice is a conspirator. In the ACL scenario, we don't learn about Mallory's involvement at all, as it appears that Alice accessed the resource. In the ACL scenario, Alice's behavior in the audit logs looks "normal", because Alice is authorized to access X. In the capability scenario, we can cross check the audit logs with our records of who should be able to access what, and determine that Mallory accessed X illicitly.

Thus, while capabilities are shareable, it's probably in Mallory's best interest to act as if they weren't and obtain X through a conspirator, lest his actions show up in the audit logs. In other words, while the fact capabilities are shareable appears to be disadvantageous, it's actually in the attacker's best interest not to take advantage of this fact, lest their actions appear in the audit logs. A sophisticated attacker will want to piggyback their attack on normal looking behavior as this will make it harder to detect.

What does issuing a capability to a user look like, and how would it work

This is a fairly open-ended question as there are many ways that capabilities can be implemented. I can roughly detail what you could do with the sort of crypto-capabilities model implemented by Tahoe (although in this case I'm only describing how you'd ensure authenticity of data, not confidentiality. Tahoe provides both)

In general capability tokens are considered necessary and sufficient in and of themselves for accessing a particular resource. This doesn't preclude adding an additional mutual TLS layer or what have you to AuthN to the service.

Ideally every part of the system has an associated set of capabilities. All data is individually, uniquely, and securely identifiable. So for starters: every bucket would have separate write/authenticate capabilities, if not every key.

So, at the time you create a bucket, a public/private digital signature key pair would be generated. The server would store the public key and use it to authenticate writes. The private key would allow new data to be written. The server would mandate that all writes be digitally signed (hopefully with a timestamp to prevent replay attacks).

Requests to write would include some type of request parameter containing a digital signature produced client side by the holder of a private key for a particular bucket or bucket:key combination. The server would authenticate digital signatures before accepting the write.
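A minimal sketch of that signed-write check follows. A real crypto-capability scheme would use an asymmetric signature (e.g. Ed25519), so the server holds only a verification key; HMAC is used here purely as a standard-library stand-in, and all names and parameters are illustrative:

```python
import hashlib
import hmac
import time

MAX_SKEW = 30  # seconds; example freshness window to reject replayed writes

def sign_write(bucket_key, bucket, key, value, timestamp):
    """Client side: bind bucket, key, value, and timestamp into one tag.
    Stand-in for a real digital signature made with the bucket's private key."""
    msg = b"|".join([bucket.encode(), key.encode(), value,
                     str(timestamp).encode()])
    return hmac.new(bucket_key, msg, hashlib.sha256).digest()

def server_accepts(bucket_key, bucket, key, value, timestamp, signature,
                   now=None):
    """Server side: reject stale timestamps (replay protection), then
    verify the signature in constant time before accepting the write."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > MAX_SKEW:
        return False  # stale or replayed request
    expected = sign_write(bucket_key, bucket, key, value, timestamp)
    return hmac.compare_digest(expected, signature)
```

The essential property is that possession of the bucket's signing key is itself the write capability; the server never sees a password, only proof that the writer holds the key.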

Vagabond commented 11 years ago

Right, I see that it is nice to separate authentication from capabilities, from an accounting standpoint and arguably from a flexibility one. However, the overhead seems extremely expensive: a public/private key pair per bucket or key, for something like Riak, would be massive overhead.

There's also the problem that buckets in Riak are 'created' in response to a key being written into them; if they don't have any custom properties, they only exist by virtue of their contents. If you did something like create a cryptographic token the first time you see a bucket without one, you'd still be vulnerable to race conditions, because the cluster may be partitioned, or two users may simply be 'creating' that bucket simultaneously. Bucket types may help with that somewhat, but they won't provide everything we need, at least in 2.0.

Another concern I have: Riak usually has N of the same thing working with its data; you have N clients for 'Application 1' and then another N for 'Application 2'. It isn't like a local filesystem, where my user has a bunch of shared resources it could use as a capability store; these are clients running on discrete machines that are using Riak as the 'shared state'. Adding more requirements so they can share capabilities outside of Riak seems onerous.

Also, I'm still concerned about the client management of these capabilities: what happens if the client's HDD crashes and it loses all the credentials? Does the server retain a copy of the private key? If so, how can you be sure you're giving it to the right users?

Basically, my problem is this: I remain unconvinced that "ACLs are bad for user access" == "ACLs are bad for application access", and I feel that the complexity we'd incur implementing a crypto-capabilities system would not be worth it for the vast majority of Riak's users. Another factor is time: I have a fixed window to get something implemented for 2.0, and, all my above concerns aside, I just don't think it is doable in that kind of timeframe.

webhat commented 11 years ago

@Vagabond rather than using a public/private key pair for each bucket, you would use the pair only for en-/decrypting a request for a symmetric key; that symmetric key is then used to en-/decrypt the data, in the same way TLS does it. Some information and references can be found in Key Management.

In the ACLs vs. capabilities discussion I've - perhaps wrongly - viewed them as a question of who has control: Does the user control what can be done with an object? Or can an object control what is done to it by the user?

tarcieri commented 11 years ago

@Vagabond

There's also the problem that buckets in Riak are 'created' in response to a key being written into them, if they don't have any custom properties, they only exist by virtue of their contents

How is this going to work in an ACL scenario? Won't you still have to declare what buckets are accessible to a given user?

I guess more generally: do you plan to restrict the capability to create/modify buckets in any way as part of your security model? What security guarantees do you intend to provide if you don't do this?

Vagabond commented 11 years ago

Right, but you can say what buckets are accessible without caring if the bucket exists right now or not, since the bucket itself doesn't have to track anything.

The plan, somewhat modified from the above RFC, is to use 'bucket types' as part of the ACL. Bucket types MUST be explicitly created, unlike buckets, and we can do things like grant a user write access to any bucket under a particular bucket type, or again define a list of buckets that they can access. Effectively, we don't consider bucket creation a thing; get/put/delete and read/write bucket properties are the only permissions that matter for a bucket (well, there's also mapreduce/2i and such, but that is not relevant here).

Another advantage of this approach is that you can reject a request cheaply compared to the amount of work needed to check a capability, and you can do it early in the request without pulling anything off the disk. With a capability per bucket or per key, you'd need to actually read something from somewhere to compare against the capability the user provided.

tarcieri commented 11 years ago

Here's another paper on capabilities you might consider reading:

http://www.links.org/files/capabilities.pdf

danostrowski commented 11 years ago

+1 and thanks!

glagnar commented 10 years ago

I am looking into Riak for a project requiring a secure distributed datastore. I need to make sure that if one node is compromised, i.e. the server has been taken over, it will not be possible to break the entire cluster. For example, by preventing the altering of permissions or the changing of commit hooks. Will either be possible with Riak 2.0?

aphyr commented 10 years ago

@Glagnar: I doubt you'll satisfy that property in any major distributed database without end-to-end cryptographic verification of writes by both all servers and all clients. As an example, take a look at what's required to build http://www.pmg.csail.mit.edu/bft/castro99correctness-abstract.html

coderoshi commented 10 years ago

@Glagnar This is a different sort of security altogether. If a box itself is compromised, the user can simply give themselves any permissions they want via riak-admin.

glagnar commented 10 years ago

@coderoshi Thanks, I know. That was my exact source of worry. In a situation where the server is compromised, could an 'admin password' not solve this issue? I.e. without password authentication, it should not be allowed to change, for example, permissions within the cluster of nodes? @aphyr I am not sure my issue has this requirement, as it is not initially the 'data' writes I am worried about.

tarcieri commented 10 years ago

@Glagnar if you really want a "trust no one" system where the compromise of a single node has zero impact on the rest of the grid, you might look at Tahoe-LAFS. It satisfies those properties (namely end-to-end cryptographic confidentiality and integrity of all content as @aphyr described): http://tahoe-lafs.org

aphyr commented 10 years ago

@Glagnar @tarcieri Note that Tahoe-LAFS does not provide robustness to a single compromised gateway or client node; only storage nodes.

tarcieri commented 10 years ago

@aphyr well yes, but ideally you separate the Tahoe nodes which provide storage service from the clients which are accessing the content, in which case only the clients see the capabilities/secrets, and the storage nodes are otherwise completely oblivious and see only ciphertexts. In such a deployment, the servers could be compromised without worry.

glagnar commented 10 years ago

Is it possible to perhaps set up Riak in a unidirectional replication manner? I.e. A is master, and B & C are slaves. This means that it does not matter if B or C are compromised. Then 3 clusters could be set up, one where in turn A, B or C is master. A client would then be able to detect if one master had been compromised, by looking at the difference between the three clusters.

coderoshi commented 10 years ago

No, this is not possible. All Riak nodes are equivalent.

sogabe commented 10 years ago

I tried the security extensions with user/CIDR authentication. It seems to work fine, but I can't find how to remove sources. Could anyone tell me when I will be able to try most of the functions?

Vagabond commented 10 years ago

See #434 and related PRs, not all of the security code has landed yet.

sogabe commented 10 years ago

Thanks, @Vagabond

jaredmorrow commented 10 years ago

Closing this as most of what was described here landed in 2.0 pre builds.