Apache Accumulo
https://accumulo.apache.org

Seeing lots of lock contention and CPU around sha512 computation during authentication #2700

Closed: keith-turner closed this issue 2 years ago

keith-turner commented 2 years ago

Describe the bug

While testing the scan server feature #2665 I was running a query client I wrote that was doing 512 concurrent small random lookups. These lookups were going to scan servers, which were not behaving as expected. I ran Java Flight Recorder and saw contention on three locks. Two locks were scan server specific and one was related to the sha512 computation. I also saw that, outside of lock contention, the sha512 computation was the top method in the profile. Even though I saw these issues in the scan server, I suspect the tserver would have the same problem because it uses the same code.

Expected behavior

It would be good to avoid the lock contention and to research the performance of the sha512 computation relative to small scans.

Screenshots

The following is the JFR data that shows lock contention.

Screenshot 2022-05-12 192749

The following is the JFR data that shows hot methods.

Screenshot 2022-05-12 193928

The JFR data came from profiling the scan server for 2 minutes while my test code was running 512 concurrent small lookups.

dlmarion commented 2 years ago

Which version of the JVM are you using? I found this, which was recently backported to 11.0.12. Also found this and this. I would be curious to see if the performance is better with Java 17, if you are using Java 11. https://github.com/apache/accumulo/pull/2374 also popped up in my searching...

dlmarion commented 2 years ago

I'm wondering if there is any danger in caching the password and zkData input parameters passed to ZKSecurityTool.checkCryptPass when the method returns true. You could even cache them within ZKSecurityTool so that it's encapsulated. I think I'll put up a PR for this idea and we can discuss it.
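For illustration, here is a minimal sketch of that caching idea; the names (CryptCheckCache, slowCryptCheck) are hypothetical stand-ins for the real ZKSecurityTool code and whatever the eventual PR does:

```java
import java.util.Base64;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not the actual Accumulo or PR code: remember
// password/zkData pairs that already passed the expensive crypt check so
// repeated authentications of the same credentials skip the sha512 work.
public class CryptCheckCache {

  private final ConcurrentHashMap<String, Boolean> verified = new ConcurrentHashMap<>();

  public boolean checkCryptPass(byte[] password, byte[] zkData) {
    // Encode both inputs so the key is a plain string; only successes are cached.
    String key = Base64.getEncoder().encodeToString(password) + '|'
        + Base64.getEncoder().encodeToString(zkData);
    if (Boolean.TRUE.equals(verified.get(key))) {
      return true;
    }
    boolean ok = slowCryptCheck(password, zkData); // the existing Commons Codec comparison
    if (ok) {
      verified.put(key, Boolean.TRUE);
    }
    return ok;
  }

  // Stand-in for ZKSecurityTool's current sha512crypt comparison.
  private boolean slowCryptCheck(byte[] password, byte[] zkData) {
    throw new UnsupportedOperationException("illustrative placeholder");
  }
}
```

A real implementation would presumably bound the cache size and avoid keeping the raw password bytes as part of the key.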

dlmarion commented 2 years ago

Looking at the stack trace in the first image above, the issue is that the Apache Commons Codec Sha2Crypt.sha2Crypt method does not cache and reset the MessageDigest objects that it uses. The sha2Crypt method creates many new MessageDigest objects per method call, and at the top of the stack trace in the image, ProviderConfig.getProvider is synchronized.
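To make the pattern concrete, here is a rough sketch (not the commons-codec source) contrasting a fresh MessageDigest lookup per call, which funnels concurrent threads through the synchronized provider lookup, with a reused per-thread digest:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DigestPatterns {

  // Pattern seen in the profile: every call asks for a new MessageDigest,
  // which goes through the provider lookup that contends under heavy load.
  static byte[] hashWithFreshDigest(byte[] data) throws NoSuchAlgorithmException {
    MessageDigest md = MessageDigest.getInstance("SHA-512");
    return md.digest(data);
  }

  // Alternative: keep one MessageDigest per thread and reset it between uses,
  // so the provider lookup happens only once per thread.
  private static final ThreadLocal<MessageDigest> SHA512 = ThreadLocal.withInitial(() -> {
    try {
      return MessageDigest.getInstance("SHA-512");
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
  });

  static byte[] hashWithCachedDigest(byte[] data) {
    MessageDigest md = SHA512.get();
    md.reset();
    return md.digest(data);
  }
}
```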

ctubbsii commented 2 years ago

Looking at the stack trace in the first image above, the issue is that the Apache Commons Codec Sha2Crypt.sha2Crypt method does not cache and reset the MessageDigest objects that it uses. The sha2Crypt method creates many new MessageDigest objects per method call, and at the top of the stack trace in the image, ProviderConfig.getProvider is synchronized.

Is there an upstream issue or patch available for commons-codec?

dlmarion commented 2 years ago

Is there an upstream issue or patch available for commons-codec?

Not that I could find.

Edit: But this is just one inefficiency here. Even if the commons-codec code were optimized, we would still be continuously checking the user's password without the caching.

dlmarion commented 2 years ago

Looking at NIST SP800-63B Section 5.1.1.2, it says:

Verifiers SHALL store memorized secrets in a form that is resistant to offline attacks. Memorized secrets SHALL be salted and hashed using a suitable one-way key derivation function. Key derivation functions take a password, a salt, and a cost factor as inputs then generate a password hash. Their purpose is to make each password guessing trial by an attacker who has obtained a password hash file expensive and therefore the cost of a guessing attack high or prohibitive. Examples of suitable key derivation functions include Password-based Key Derivation Function 2 (PBKDF2) [SP 800-132] and Balloon [BALLOON]. A memory-hard function SHOULD be used because it increases the cost of an attack. The key derivation function SHALL use an approved one-way function such as Keyed Hash Message Authentication Code (HMAC) [FIPS 198-1], any approved hash function in SP 800-107, Secure Hash Algorithm 3 (SHA-3) [FIPS 202], CMAC [SP 800-38B] or Keccak Message Authentication Code (KMAC), Customizable SHAKE (cSHAKE), or ParallelHash [SP 800-185]. The chosen output length of the key derivation function SHOULD be the same as the length of the underlying one-way function output.

This section doesn't deal strictly with passwords, but I wrote a test to compare the performance of the Commons Codec Crypt approach currently in use against PBKDF2WithHmacSHA512, as suggested in NIST SP800-63B. The latter looks to be roughly 3x faster.
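For reference, a minimal sketch of hashing with the JCA-provided PBKDF2WithHmacSHA512 algorithm; this is not the actual test code, and the iteration count and key length are illustrative values:

```java
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

public class Pbkdf2Example {

  // Derive a 512-bit hash from a password and salt with the given iteration count.
  public static byte[] hash(char[] password, byte[] salt, int iterations) throws Exception {
    PBEKeySpec spec = new PBEKeySpec(password, salt, iterations, 512);
    SecretKeyFactory factory = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA512");
    try {
      return factory.generateSecret(spec).getEncoded();
    } finally {
      spec.clearPassword();
    }
  }

  public static void main(String[] args) throws Exception {
    byte[] salt = new byte[16];
    new SecureRandom().nextBytes(salt);
    byte[] h1 = hash("secret".toCharArray(), salt, 10_000);
    byte[] h2 = hash("secret".toCharArray(), salt, 10_000);
    System.out.println("hashes match: " + Arrays.equals(h1, h2));
  }
}
```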

ctubbsii commented 2 years ago

This section doesn't deal strictly with passwords, but I wrote a test to compare the performance of the Commons Codec Crypt approach currently in use against PBKDF2WithHmacSHA512, as suggested in NIST SP800-63B. The latter looks to be roughly 3x faster.

Commons-codec Crypt produces a Linux-standard hash suitable for placing in /etc/shadow for a user password (see man 5 crypt). I don't want to do anything custom. If the faster approach can be incorporated upstream into commons-codec, I'd strongly prefer that route over any other alternative. Modern Linux defaults use yescrypt, $y$ hashes, whereas we're using sha512crypt, $6$. I'd be fine with defaulting to a more efficient algorithm, if it is supported by commons-codec and is widely considered at least as secure as sha512crypt.

Something else to keep in mind: faster hashing makes brute force attacks more efficient. The fact that sha512crypt is slower isn't necessarily a bad thing. It would be nice if it didn't use as much CPU, though.

dlmarion commented 2 years ago

I don't want to do anything custom.

To be clear, I wasn't doing anything custom; PBKDF2WithHmacSHA512 is a JCA-supported algorithm. Looking at CODEC-133, the commons-codec classes implement the Linux crypt algorithm, but I'm not quite sure of its lineage. There are several things mentioned in that ticket that seem to be merged together.

Something else to keep in mind: faster hashing makes brute force attacks more efficient. The fact that sha512crypt is slower isn't necessarily a bad thing. It would be nice if it didn't use as much CPU, though.

Agreed, my test performed 1000 iterations to match what the commons-codec code is doing by default. I don't think we are providing commons-codec with any number of iterations. NIST SP800-63B suggests doing at least 10,000 iterations; it says:

For PBKDF2, the cost factor is an iteration count: the more times the PBKDF2 function is iterated, the longer it takes to compute the password hash. Therefore, the iteration count SHOULD be as large as verification server performance will allow, typically at least 10,000 iterations.

So, my test is faster doing an apples-to-apples comparison, but would be slower when performing 10,000 iterations. We would still need the caching, IMO, since we are currently checking that the passwords match on each API call.

ctubbsii commented 2 years ago

I don't want to do anything custom.

To be clear, I wasn't doing anything custom; PBKDF2WithHmacSHA512 is a JCA-supported algorithm.

Right, so, there's the algorithm, but then there's the serialization format. Crypt emits them in a well known serialization format. We'd have to do something custom for the serialization, even if we're using a built-in algorithm. I'd prefer to stick to the crypt standard for serialization, and limit ourselves to the algorithms it supports, rather than use a different algorithm and customize the serialization.

Looking at CODEC-133, the commons-codec classes implement the Linux crypt algorithm, but I'm not quite sure of its lineage. There are several things mentioned in that ticket that seem to be merged together.

Linux's crypt evolved out of the need to use something stronger than the DES algorithm, but it needed to be serialized in a backwards-compatible, human-readable way since it was going in a text file. They came up with the well-known prefix $<algorithm_id>$. The format of everything after that prefix depends on the algorithm, but ultimately, the hash is encoded using printable ASCII characters. The SHA-2 algorithms support a rounds parameter from 1000 to 999,999,999 (also known as the "CPU time cost parameter" in the man page), which defaults to 5000, and a 6 to 96 bit salt that is base-64 encoded.

If there are several things mentioned in the ticket, it's probably because there are multiple algorithms that are supported.

Something else to keep in mind: faster hashing makes brute force attacks more efficient. The fact that sha512crypt is slower isn't necessarily a bad thing. It would be nice if it didn't use as much CPU, though.

Agreed, my test performed 1000 iterations to match what the commons-codec code is doing by default. I don't think we are providing commons-codec with any number of iterations. NIST SP800-63B suggests doing at least 10,000 iterations; it says:

Even though the javadoc for Crypt incorrectly says the salt is specified without the rounds= parameter, you can specify that parameter to override the default of 5000. However, the Crypt API doesn't make this easy for us... we would have to randomly generate our own salt, base-64 encode it, and pass it in, rather than rely on the library generating a random salt for us. And their utility code to generate the salt and efficiently base-64 encode it is not public, so we'd have to rewrite it or copy it.
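For illustration, this is roughly what overriding the rounds looks like with the current Crypt API; the salt characters below are hard-coded for readability, whereas real code would generate them randomly from the crypt base-64 alphabet:

```java
import org.apache.commons.codec.digest.Crypt;

public class CryptRoundsExample {
  public static void main(String[] args) {
    // "$6$" selects sha512crypt; "rounds=10000" overrides the default of 5000.
    // The 16 characters after the last "$" are the (hard-coded, illustrative) salt.
    String hash = Crypt.crypt("secret", "$6$rounds=10000$0123456789abcdef");
    // Prints something like: $6$rounds=10000$0123456789abcdef$<86-character hash>
    System.out.println(hash);
  }
}
```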

It would be nice to have an upstream change that makes it easier to specify the rounds (and also to support newer algorithms, like yescrypt). commons-codec is a bit out of date in this regard.

For PBKDF2, the cost factor is an iteration count: the more times the PBKDF2 function is iterated, the longer it takes to compute the password hash. Therefore, the iteration count SHOULD be as large as verification server performance will allow, typically at least 10,000 iterations.

So, my test is faster doing an apples-to-apples comparison, but would be slower when performing 10,000 iterations. We would still need the caching, IMO, since we are currently checking that the passwords match on each API call.

Yours is not faster. The default is 5000, not 1000. Your test code uses 1000. When I ran your test with 5000, Crypt was faster. When I pre-generate the salts for Crypt, like you did with PBKDF2, it's slightly faster still.

But the question remains: do we want fast and efficient checks, or do we want slow checks/large iterations? The original issue was about performance... but if the performance is intentionally slower here, and the lock contention acts as a throttle to protect against brute force attacks, perhaps there's no bug here to fix? Can we have both? Throttled, but less CPU usage?

dlmarion commented 2 years ago

The default is 5000, not 1000.

You're right, I conflated that with ROUNDS_MIN.

But the question remains: do we want fast and efficient checks, or do we want slow checks/large iterations?

I get the rationale behind the slow checks / large iterations. I don't know that the lock contention is intentional as a throttle mechanism. I have no issue with that happening the first time to make sure that the password passed in is in fact the correct one. I think we need to do something here based on @keith-turner's comment in #2707 (regarding the difference between caching and not caching successful password checks under load):

With these changes and 500 concurrent queries the average time is 400ms and the max times are around 1000ms (ignoring the first few lines because JIT probably has not kicked in on the scan servers). Without these changes the average times are around 1600ms to 1700ms and the max times are 15,000ms to 20,000ms.

I can't imagine that this is the desired behavior. Was it this bad before the change to commons-codec in #1798?

ctubbsii commented 2 years ago

I can't imagine that this is the desired behavior. Was it this bad before the change to commons-codec in #1798?

I don't know. We didn't use commons-codec before, but we did use a SHA-2 algorithm, SHA-256. I think those changes are worth any side-effects, though, especially if it's being mitigated by #2707 or another means.

keith-turner commented 2 years ago

#2707 resolves this problem.