cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
29.92k stars 3.78k forks source link

pgwire,ui: rate-limit connection attempts #47602

Open knz opened 4 years ago

knz commented 4 years ago

We need rate limiting to prevent pre-auth DoS attacks. Both in SQL and UI which may be exposed in public networks.

Two levels of rate limiting:

cc @aaron-crl @bdarnell

Epic CRDB-13333

Jira issue: CRDB-4396

bdarnell commented 4 years ago

We'll use this in at least two places (three, if you count SQL and UI auth separately): for authentication requests and for the pgwire cancellation protocol (not yet implemented, but see #34520) which exists outside the authentication regime.

ajwerner commented 3 years ago

For mitigating some of the resource usage problems due to bcrypt hashing, we could put a semaphore on concurrent hashing which we control with a cluster setting. Moving it down to 1 seems likely help and is extremely cheap to add in terms of engineering time.

ajwerner commented 3 years ago

I guess rate limit vs concurrency semaphore are broadly the same in terms of complexity of implementation given the quotapool. We should probably do something simple here in the short term even if it doesn't mean doing anything distributed or origin-oriented.

knz commented 3 years ago

Sounds good. What's the desired timeline?

rafiss commented 3 years ago

I think this sounds good too. Could we do something similar for the pgwire cancellation protocol? There has been more desire from Product for us to implement it, and as described by @bdarnell this protocol exists outside of any authentication mechanism so we need a separate rate limit for it. I feel like a per-node concurrency semaphore could be sufficient to prevent DoS cancellations as opposed to making a full IP-based solution or a cluster-wide limit.

Edit: I wonder if the concurrency factor of this semaphore could change dynamically based on the number of nodes in the cluster. That would give us a bit more safety as per the back-of-the-envelope math in https://github.com/cockroachdb/cockroach/pull/34520#discussion_r408366441

bdarnell commented 3 years ago

A semaphore for bcrypt hashing is a good idea and we should do it right away to mitigate outages caused by connection pool restarts. But I don't think it satisfies the need for rate limiting. This would prevent login attempts from overloading a node and having too much impact on concurrent queries. The problem is that it makes another kind of DoS easy - without something smarter, an attacker can queue up a ton of login attempts and keep the password-hashing thread busy enough that legitimate login attempts can't get through in a reasonable amount of time.

For cancellation, I think the semaphore probably would help things. I'd rather have real IP-based blackholing based on failed cancellations, but the semaphore alone would probably keep an attacker from getting enough guesses per second to make it an attractive target to cancel random queries.

ajwerner commented 3 years ago

an attacker can queue up a ton of login attempts and keep the password-hashing thread busy enough that legitimate login attempts can't get through in a reasonable amount of time.

We could add a sleep of some configurable duration after each failed attempt before reporting the failure. At some level, if the attacker can keep dialing new connections then we're SOL. Dealing with that requires rate limiting connections, probably at a higher level.

knz commented 3 years ago

Hi folk, can you help me synthetize the work requested so far? The cancellation protocol is not implemented at this time so that's a no-op for now.

What are the remaining semaphores needed? Is this something that can be achieved using a local algorithm in the server acceptor, or do we need integration with some acceptance algorithm elsewhere in crdb?

knz commented 3 years ago

(generally I am willing to contribute some cycles into implementing a solution and a unit test, but I don't have cycles available to participate in the design and the thinking. So please spell that out for me.)

bdarnell commented 3 years ago

The immediate work that is requested is to put a global mutex or semaphore around calls to bcrypt.GenerateFromPassword and bcrypt.CompareHashAndPassword so we can limit the amount of CPU these can consume. Semaphore is probably better - it'll have a limit of 1 on small nodes but a mutex might be an issue for connection pool restarts on larger nodes. Maybe default it to something like max(1, NumCPU/16) (customizability is nice but I wouldn't spend too much time on it). We also need to make this cancellation-aware - don't waste time hashing the password for clients that drop their connections while waiting for the semaphore. Other than adding context arguments to the relevant methods, I think this can all be localized to pkg/security/password.go.

knz commented 3 years ago

The linked PR #66028 takes care of protecting the bcrypt calls behind a semaphore.

We also discussed whether it would be useful to protect the entire authn handshake: even with the bcrypt semaphore, an attacked can still cause a DoS by forcing username lookups with an invalid username (sql.GetUserHashedPassword is called unconditionally for every new SQL connection).

@bdarnell explains that a semaphore is not the right tool to provide DoS protection in the general case:

Instead, for the general case we should look at an acceptance control mechanism, with different rate limiting behavior per client IP address.

knz commented 3 years ago

@bdarnell @RaduBerinde am I right to udnerstand that the problem stated in this issue is obviated by acceptance control?

RaduBerinde commented 3 years ago

I don't know what acceptance control you are referring to. If it's the new admission control system that @sumeerbhola created, that does not deal with session establishment.

rafiss commented 2 years ago

For now, we'll focus on completing https://github.com/cockroachdb/cockroach/issues/35428 first -- it's a limit on the total number of open connections on a single node, which is simpler to implement and also has less possible unintended consequences when someone hits the limit.

ajstorm commented 2 years ago

Removing from multi-tenant board, as this isn't strictly a multi-tenancy issue.