Open knz opened 4 years ago
We'll use this in at least two places (three, if you count SQL and UI auth separately): for authentication requests and for the pgwire cancellation protocol (not yet implemented, but see #34520) which exists outside the authentication regime.
For mitigating some of the resource usage problems due to bcrypt hashing, we could put a semaphore on concurrent hashing which we control with a cluster setting. Moving it down to 1 seems likely help and is extremely cheap to add in terms of engineering time.
I guess rate limit vs concurrency semaphore are broadly the same in terms of complexity of implementation given the quotapool
. We should probably do something simple here in the short term even if it doesn't mean doing anything distributed or origin-oriented.
Sounds good. What's the desired timeline?
I think this sounds good too. Could we do something similar for the pgwire cancellation protocol? There has been more desire from Product for us to implement it, and as described by @bdarnell this protocol exists outside of any authentication mechanism so we need a separate rate limit for it. I feel like a per-node concurrency semaphore could be sufficient to prevent DoS cancellations as opposed to making a full IP-based solution or a cluster-wide limit.
Edit: I wonder if the concurrency factor of this semaphore could change dynamically based on the number of nodes in the cluster. That would give us a bit more safety as per the back-of-the-envelope math in https://github.com/cockroachdb/cockroach/pull/34520#discussion_r408366441
A semaphore for bcrypt hashing is a good idea and we should do it right away to mitigate outages caused by connection pool restarts. But I don't think it satisfies the need for rate limiting. This would prevent login attempts from overloading a node and having too much impact on concurrent queries. The problem is that it makes another kind of DoS easy - without something smarter, an attacker can queue up a ton of login attempts and keep the password-hashing thread busy enough that legitimate login attempts can't get through in a reasonable amount of time.
For cancellation, I think the semaphore probably would help things. I'd rather have real IP-based blackholing based on failed cancellations, but the semaphore alone would probably keep an attacker from getting enough guesses per second to make it an attractive target to cancel random queries.
an attacker can queue up a ton of login attempts and keep the password-hashing thread busy enough that legitimate login attempts can't get through in a reasonable amount of time.
We could add a sleep of some configurable duration after each failed attempt before reporting the failure. At some level, if the attacker can keep dialing new connections then we're SOL. Dealing with that requires rate limiting connections, probably at a higher level.
Hi folk, can you help me synthetize the work requested so far? The cancellation protocol is not implemented at this time so that's a no-op for now.
What are the remaining semaphores needed? Is this something that can be achieved using a local algorithm in the server acceptor, or do we need integration with some acceptance algorithm elsewhere in crdb?
(generally I am willing to contribute some cycles into implementing a solution and a unit test, but I don't have cycles available to participate in the design and the thinking. So please spell that out for me.)
The immediate work that is requested is to put a global mutex or semaphore around calls to bcrypt.GenerateFromPassword
and bcrypt.CompareHashAndPassword
so we can limit the amount of CPU these can consume. Semaphore is probably better - it'll have a limit of 1 on small nodes but a mutex might be an issue for connection pool restarts on larger nodes. Maybe default it to something like max(1, NumCPU/16)
(customizability is nice but I wouldn't spend too much time on it). We also need to make this cancellation-aware - don't waste time hashing the password for clients that drop their connections while waiting for the semaphore. Other than adding context arguments to the relevant methods, I think this can all be localized to pkg/security/password.go.
The linked PR #66028 takes care of protecting the bcrypt calls behind a semaphore.
We also discussed whether it would be useful to protect the entire authn handshake: even with the bcrypt semaphore, an attacked can still cause a DoS by forcing username lookups with an invalid username (sql.GetUserHashedPassword
is called unconditionally for every new SQL connection).
@bdarnell explains that a semaphore is not the right tool to provide DoS protection in the general case:
the semaphore is a node-local control. However, the KV lookups by sql.GetUserHashedPassword
use a shared resource (the replica group that supports the system.users range). So there is a chance that the per-node semaphore is not sufficient to avoid starving the capacity of the shared resource.
also, a semaphore on the entire authn mechanism risks locking out legitimate accesses by operators.
Instead, for the general case we should look at an acceptance control mechanism, with different rate limiting behavior per client IP address.
@bdarnell @RaduBerinde am I right to udnerstand that the problem stated in this issue is obviated by acceptance control?
I don't know what acceptance control you are referring to. If it's the new admission control system that @sumeerbhola created, that does not deal with session establishment.
For now, we'll focus on completing https://github.com/cockroachdb/cockroach/issues/35428 first -- it's a limit on the total number of open connections on a single node, which is simpler to implement and also has less possible unintended consequences when someone hits the limit.
Removing from multi-tenant board, as this isn't strictly a multi-tenancy issue.
We need rate limiting to prevent pre-auth DoS attacks. Both in SQL and UI which may be exposed in public networks.
Two levels of rate limiting:
cc @aaron-crl @bdarnell
Epic CRDB-13333
Jira issue: CRDB-4396