cfrg / draft-irtf-cfrg-hash-to-curve

Hashing to Elliptic Curves

new hash_to_field code / expand_message_md revisited #214

Closed kwantam closed 4 years ago

kwantam commented 4 years ago

This PR updates the code in poc/ to use the new hash_to_field function from #212.

I have not yet regenerated the test vectors. Shall I do that?

burdges commented 4 years ago

Is pairing-plus or whatever up to date?

kwantam commented 4 years ago

Is pairing-plus or whatever up to date?

Not yet, that's next on my list.

mratsim commented 4 years ago

New test vectors would be very helpful

armfazh commented 4 years ago

Using this new method for obtaining field elements, does it still preserve the guarantees required by the random oracle construction? Previously, u0 and u1 were obtained from different RO instantiations by appending an extra bit: u0 = H(M || 0) and u1 = H(M || 1). Now, it seems to me that we are querying the same RO to obtain both u0 and u1.

kwantam commented 4 years ago

Using this new method for obtaining field elements, does it still preserve the guarantees required by the random oracle construction?

Yes.

There are two ways to argue this. One is to recall that the underlying RO-to-curve construction is really defined in terms of independent elements of F. Using two hash functions H0 and H1 that are modeled as independent random oracles satisfies this requirement, but it's not the only way to satisfy it---hash_to_field (with count=2) also gives two independent elements of F.

The other, slightly less direct way is to observe that the output of expand_message inside of hash_to_field gives 2 * m * L bytes that are uniformly distributed, and therefore we can define two independent random oracles in terms of expand_message, just by splitting its output into two non-overlapping chunks of m * L bytes. This is precisely what happens inside of hash_to_field, so we can understand that function as essentially returning the evaluations of H0 and H1.

(These are equivalent views of the world.)
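
The splitting argument above can be sketched in Python. This is an illustrative toy, not the draft's actual expand_message: the expander, the parameters (p, m, L), and the function names here are stand-ins chosen for the example.

```python
import hashlib

# Illustrative parameters (assumptions for this sketch, not from the draft):
# Curve25519's field modulus, extension degree m = 1, L = 48 bytes/element.
p = 2**255 - 19
m, L = 1, 48

def toy_expand(msg: bytes, n: int) -> bytes:
    # Stand-in expander producing n uniform-looking bytes; NOT the draft's
    # construction. Any expander with uniform output works for the argument.
    out, ctr = b"", 0
    while len(out) < n:
        out += hashlib.sha256(msg + bytes([ctr])).digest()
        ctr += 1
    return out[:n]

def hash_to_field(msg: bytes, count: int = 2) -> list:
    uniform = toy_expand(msg, count * m * L)
    # u_i comes from its own non-overlapping m*L-byte chunk, so the u_i are
    # independent: this is the "two oracles H0, H1" view described above.
    return [int.from_bytes(uniform[i * m * L:(i + 1) * m * L], "big") % p
            for i in range(count)]

u0, u1 = hash_to_field(b"example input")
```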

armfazh commented 4 years ago

Let's assume H(m, 2), and that b0 and b1 exactly match the size of a certain prime field. Then u0 = b0 mod p and u1 = b1 mod p. My concern is that b1 depends only on b0, since b1 was calculated as the hash of b0 (plus some prefix). So, take a message m and calculate b0. At this point, b1 is completely determined by b0, not by another call to a RO on the original message (in some sense, b0 fixes the output of b1). I am probably missing something; thoughts?

kwantam commented 4 years ago

I see what you're saying.

The scenario you describe doesn't quite work because b0 isn't output, only b0_chopped. But I think the idea is correct: consider the case where b0_chopped || b1 is the first element and a substring of b2 || b3 is the second element. Then b2 and b3 are fully determined by b1, and the adversary sees all of b1, so we still have the issue.

I went back and looked at CDMP05, and now I see where the issue is coming from. Section 5 talks about building longer outputs by computing H(X || 1), H(X || 2), etc. The important thing here is that there's no chaining of output blocks in that construction. And as you say, using chaining on values that the attacker knows makes it possible to distinguish.

(In contrast, HKDF does use chaining, but always combines the chained value with a secret value that the attacker never learns, namely, PRK.)

I think the fix is straightforward (don't chain, just hash b0 over and over), but I'm going to think carefully about it and push some new text / code.

Thank you!!!

armfazh commented 4 years ago

The scenario you describe doesn't quite work because b0 isn't output, only b0_chopped. But I think the idea is correct: consider the case where b0_chopped || b1 is the first element and a substring of b2 || b3 is the second element. Then b2 and b3 are fully determined by b1, and the adversary sees all of b1, so we still have the issue.

This is exactly the scenario I had in mind, but I simplified to b0 and b1.

I went back and looked at CDMP05, and now I see where the issue is coming from. Section 5 talks about building longer outputs by computing H(X || 1), H(X || 2), etc. The important thing here is that there's no chaining of output blocks in that construction. And as you say, using chaining on values that the attacker knows makes it possible to distinguish.

(In contrast, HKDF does use chaining, but always combines the chained value with a secret value that the attacker never learns, namely, PRK.)

Exactly, another approach is to generate n values of msg_prime, one per field element requested.

kwantam commented 4 years ago

OK, consider the following:

Steps:
1. ell = ceil((len_in_bytes + k_in_bytes) / b_in_bytes)
2. ABORT if ell > 256
3. DST_prime = I2OSP(len(DST), 1) || DST
4. b_0 = H(DST_prime || I2OSP(0, 1) || I2OSP(len_in_bytes, 2) || msg)
5. for i in (1, ..., ell - 1):
6.   b_i = H(DST_prime || I2OSP(i, 1) || b_0)                   #### change is here
7. b_0_chopped = b_0[0 : (b_in_bytes - k_in_bytes)]
8. pseudo_random_bytes = b_0_chopped || b_1 || ... || b_(ell - 1)
9. return pseudo_random_bytes[0 : len_in_bytes]
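
A minimal Python sketch of the steps above, assuming SHA-256 (b_in_bytes = 32) and k = 128 bits (k_in_bytes = 16); the function name is mine:

```python
import hashlib
import math

B_IN_BYTES = 32   # SHA-256 output size
K_IN_BYTES = 16   # k = 128 bits

def expand_message(msg: bytes, dst: bytes, len_in_bytes: int) -> bytes:
    # Sketch of the proposal above: every b_i is derived from b_0 alone
    # (no chaining), and b_0 itself is only exposed through b_0_chopped.
    ell = math.ceil((len_in_bytes + K_IN_BYTES) / B_IN_BYTES)
    assert ell <= 256, "requested output too long"
    dst_prime = bytes([len(dst)]) + dst          # I2OSP(len(DST), 1) || DST
    b_0 = hashlib.sha256(dst_prime + b"\x00"
                         + len_in_bytes.to_bytes(2, "big") + msg).digest()
    blocks = [b_0[:B_IN_BYTES - K_IN_BYTES]]     # b_0_chopped
    for i in range(1, ell):
        blocks.append(hashlib.sha256(dst_prime + bytes([i]) + b_0).digest())
    return b"".join(blocks)[:len_in_bytes]
```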

This avoids the chaining issue, because the attacker never sees b_0.

If we were extremely paranoid, we might not even output b_0_chopped. This would mean that the minimum number of hash invocations is 2 instead of 1, which will likely make some people unhappy. But that seems unnecessary, by the security argument for chopMD. Informally, an attacker who doesn't know b_0 cannot compute any more blocks, and here the attacker would have to guess k bits of b_0 in order to reconstruct.

So I guess I'm slightly in favor of outputting b_0_chopped, but I can be convinced otherwise. Thoughts @armfazh?


One issue with the above that makes me quite nervous is that now only 1 byte (really, in the worst case, only 1 bit!!!) of the input of H is changing for each b_i. This is quite aggressive.

A slightly better solution would be to do something like

b_i = H(DST_prime || I2OSP(i, 1) || b_0 || b_(i-1))

The big downside of this is that it significantly increases the number of hash invocations. We could get around that with something like:

b_i = H(DST_prime || I2OSP(i, 1) || b_0 ^ b_(i - 1))

where ^ is bitwise XOR. This would require a special case for b_1, obviously!

kwantam commented 4 years ago

A third approach is to chain, but chop the output of every hash function, not just the first one. (The last block does not need to be chopped in this case.)

Let's think about the cost of all three approaches. For all three approaches, we have

b_0 = H(DST_prime || I2OSP(0, 1) || I2OSP(len_in_bytes, 2) || msg)

For subsequent blocks, the three approaches compute:

- Type 1: b_i = H(DST_prime || I2OSP(i, 1) || b_0 || b_(i - 1))
- Type 2: b_i = H(DST_prime || I2OSP(i, 1) || b_0 ^ b_(i - 1))
- Type 3: b_i = H(DST_prime || I2OSP(i, 1) || b_(i - 1)), chopping every block except the last

Types 4, 5, and 6 use the same recurrences as types 1, 2, and 3 (respectively), but never output b_0_chopped. Type 6 doesn't really make any sense, but I've included it for completeness.

Let's look first at output rates for different cases:

| case | first block | middle blocks | last block |
|------|-------------|---------------|------------|
| SHA-256, k = 128, type 1 or 2 | 16 bytes | 32 bytes | 32 bytes |
| SHA-256, k = 128, type 3 | 16 bytes | 16 bytes | 32 bytes |
| SHA-512, k = 128, type 1 or 2 | 48 bytes | 64 bytes | 64 bytes |
| SHA-512, k = 128, type 3 | 48 bytes | 48 bytes | 64 bytes |
| SHA-512, k = 256, type 1 or 2 | 32 bytes | 64 bytes | 64 bytes |
| SHA-512, k = 256, type 3 | 32 bytes | 32 bytes | 64 bytes |

Types 4, 5, and 6 correspond to types 1, 2, and 3 (respectively), except that the "first block" column is always 0 (because b_0 is never output).

Now, let's look at number of compression function invocations. Assume that the input message is 32 bytes and DST is 16 bytes; this is roughly worst case (because long messages will dominate hashing cost).

| case | first block | subsequent blocks |
|------|-------------|-------------------|
| type 1 | 1 | 2 |
| type 2 | 1 | 1 |
| type 3 | 1 | 1 |

Closed-form expressions for ell and the number of invocations follow from the rates above. Below, ell is the number of blocks that contribute to the output. For types 4, 5, and 6, one must first compute b_0 and then hash ell blocks.

Using the above, let's consider the cost of different scenarios:

1. k = 128, p = 256 bits, m = 1 (like Curve25519)
   a. count = 1 (len_in_bytes = 48), SHA-256
   b. as above, SHA-512
   c. count = 2 (len_in_bytes = 96), SHA-256
   d. as above, SHA-512

2. k = 128, p = 381 bits, m = 2 (like BLS12-381 G2)
   a. count = 1 (len_in_bytes = 128), SHA-256
   b. as above, SHA-512
   c. count = 2 (len_in_bytes = 256), SHA-256
   d. as above, SHA-512

3. k = 256, p = 521 bits, m = 1 (like P521)
   a. count = 1 (len_in_bytes = 98), SHA-512
   b. count = 2 (len_in_bytes = 196), SHA-512

4. k = 256, p = 581 bits, m = 8 (like BLS48-581 G2)
   a. count = 1 (len_in_bytes = 840), SHA-512
   b. count = 2 (len_in_bytes = 1680), SHA-512

| case | type 1 | type 4 | type 2 | type 5 | type 3 | type 6 |
|------|--------|--------|--------|--------|--------|--------|
| 1a | 3 | 5 | 2 | 3 | 2 | 3 |
| 1b | 1 | 3 | 1 | 2 | 1 | 2 |
| 1c | 7 | 7 | 4 | 4 | 5 | 6 |
| 1d | 3 | 5 | 2 | 3 | 2 | 3 |
| 2a | 9 | 9 | 5 | 5 | 7 | 8 |
| 2b | 5 | 5 | 3 | 3 | 3 | 4 |
| 2c | 17 | 17 | 9 | 9 | 15 | 16 |
| 2d | 9 | 9 | 5 | 5 | 5 | 6 |
| 3a | 5 | 5 | 3 | 3 | 3 | 4 |
| 3b | 7 | 9 | 4 | 5 | 6 | 7 |
| 4a | 27 | 29 | 14 | 15 | 26 | 27 |
| 4b | 53 | 55 | 27 | 28 | 52 | 53 |

Not surprisingly, type 2 is the cheapest across the board.

Maybe slightly more surprisingly, types 1 and 3 have close to the same cost, but type 3 is basically always better.


~So my take is that type 2 is the winner, assuming that folks are OK requiring bitwise XOR. Otherwise, we should go with Type 3.~

EDIT I think the "right" choice is really type 5, even though it costs one extra hash invocation for very short messages and very short outputs. So that's my vote.

Type 2 costs fewer hashes, but I'm less certain about its security. I'm roughly as confident in type 4 as type 5, but it's much more expensive. I'm least confident about type 3. And as I said above, type 6 really doesn't make any sense...

kwantam commented 4 years ago

Just to be clear, I'll write down the pseudocode for each one:

All of them start in roughly the same way:

1. ell = ((compute using closed form given above))
2. ABORT if ell > 256
3. DST_prime = I2OSP(len(DST), 1) || DST
4. chopped_length = b_in_bytes - k_in_bytes
5. b_0 = H(DST_prime || I2OSP(0, 1) || I2OSP(len_in_bytes, 2) || msg)

Type 1

6.  for i in (1, ..., ell - 1):
7.    b_i = H(DST_prime || I2OSP(i, 1) || b_0 || b_(i - 1))
8.  b_0_chopped = b_0[0 : chopped_length]
9.  pseudo_random_bytes = b_0_chopped || b_1 || ... || b_(ell - 1)
10. return pseudo_random_bytes[0 : len_in_bytes]

Type 2

6.  b_1 = H(DST_prime || I2OSP(1, 1) || b_0)
7.  for i in (2, ..., ell - 1):
8.    b_i = H(DST_prime || I2OSP(i, 1) || b_0 ^ b_(i - 1))
9.  b_0_chopped = b_0[0 : chopped_length]
10. pseudo_random_bytes = b_0_chopped || b_1 || ... || b_(ell - 1)
11. return pseudo_random_bytes[0 : len_in_bytes]

Here, ^ is bitwise xor (always applied to equal-length bytestrings in the above).

Type 3

6.  for i in (1, ..., ell - 1):
7.    b_i = H(DST_prime || I2OSP(i, 1) || b_(i - 1))
8.    b_(i-1)_chopped = b_(i-1)[0 : chopped_length]
9.  pseudo_random_bytes = b_0_chopped || b_1_chopped || ... || b_(ell - 1)
10. return pseudo_random_bytes[0 : len_in_bytes]
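
To compare the three variants side by side, here is a hedged Python sketch of the per-block recurrences above, assuming SHA-256 and k = 128. It simplifies things in two ways: it loops until enough output is produced instead of computing ell in closed form, and for type 3 it chops the last block too (which costs at most one extra hash relative to the pseudocode).

```python
import hashlib

B, K = 32, 16  # SHA-256 digest size; k = 128 bits

def strxor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def expand(msg: bytes, dst: bytes, n: int, typ: int) -> bytes:
    # Sketch of types 1-3 above; typ selects the block recurrence.
    dst_prime = bytes([len(dst)]) + dst
    H = lambda data: hashlib.sha256(data).digest()
    b_0 = H(dst_prime + b"\x00" + n.to_bytes(2, "big") + msg)
    out, prev, i = [b_0[:B - K]], b_0, 1       # start with b_0_chopped
    while sum(len(c) for c in out) < n:
        if typ == 1:                           # chain b_0 and b_(i-1)
            b_i = H(dst_prime + bytes([i]) + b_0 + prev)
        elif typ == 2:                         # XOR-chain (b_1 is special)
            b_i = H(dst_prime + bytes([i])
                    + (b_0 if i == 1 else strxor(b_0, prev)))
        else:                                  # type 3: chain, chop output
            b_i = H(dst_prime + bytes([i]) + prev)
        prev = b_i                             # chain the full block
        out.append(b_i[:B - K] if typ == 3 else b_i)
        i += 1
    return b"".join(out)[:n]
```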

kwantam commented 4 years ago

Actually, it occurs to me that the construction as currently written (i.e., where b_1 chains into b_2) probably is OK (though it probably still makes sense to try and improve it, as long as it's not too costly).

The reason is, the field elements at the output of hash-to-field do not uniquely determine the bytes at the output of expand-message. In all cases, we take (log(q) + k) bits and reduce them to log(q) bits. Hand-wavily, this means that the indifferentiability simulator has about 2^k choices it can make when mapping field elements to byte strings.

The proof is much cleaner if we can just make expand-message a random string, so probably it's best to do that. For this purpose I favor approach 2, since it adds the least overhead...

kwantam commented 4 years ago

OK, I revamped my comment above to include options that don't output any part of b_0.

My vote is to go with "type 5." This is maybe slightly conservative and will make embedded folks slightly sadder. I'll update the document and code with proposed changes (we can always back them out).

chris-wood commented 4 years ago

Type 5 seems like the right compromise on balance. We maintain chaining while also mixing in b_0 and keeping it hidden from the distinguisher, and the concatenated output is indistinguishable from a random string. And the additional XOR cost is negligible.

Thanks @armfazh for pointing out the issue!

kwantam commented 4 years ago

I talked with Dan (Boneh) about "type 5" today, and he had two suggestions:

  1. When computing b_0 = H(DST_prime || I2OSP(0, 1) || I2OSP(ell, 2) || msg), there's a danger that a badly-chosen DST will cause a short msg to be split across the first block boundary. The reason this is bad is that the entropy of msg ends up broken up across two invocations of the compression function, which could end up making some attacks (think: meet-in-the-middle) easier. He suggested that we fix this by adding extra padding before msg to ensure that it starts right at the first block boundary.

    This is slightly unfortunate because it guarantees one extra invocation of the compression function even for short inputs. In principle we could avoid the extra compression by adding a special case for very short DST/msg combinations. But that's pretty ugly, because it requires knowing a lot about the hash function's construction (e.g., SHA-256 has a minimum of 65 bits of padding after the message, so msg only fits completely in the first block when len(DST) + len(msg) <= 51 bytes).

  2. Similarly, if DST is too long, then b_0 or b_0 XOR b_(i - 1) might end up broken across a block boundary in the b_i calculations. For SHA-256, limiting DST_prime to 22 bytes fixes this; for SHA-512, the magic number is 46 bytes.

It occurs to me that an alternative solution to both of the above issues is to switch the order of the inputs to the hash function; this guarantees that the critical inputs are always right at the beginning. In particular,

b_0 = H(msg || I2OSP(len_in_bytes, 2) || I2OSP(0, 1) || DST_prime)
b_1 = H(b_0 || I2OSP(1, 1) || DST_prime)
b_i = H(strxor(b_0, b_(i - 1)) || I2OSP(i, 1) || DST_prime)

This means we're doing domain separation as a suffix rather than as a prefix of the message. Does this make life more difficult when interacting with other protocols? I'm not exactly sure, but I can't see why it should.

In fact, it arguably makes life better, because any protocol that needs to compute another value of H(msg || ...) can reuse an intermediate state of that computation to save time when computing b_0!

Note that both of the worries above apply to SHA-2; for SHA-3, we don't really need to worry about any of this.
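
A Python sketch of the suffix-DST variant above, assuming SHA-256 and the "type 5" output rule from earlier in the thread (b_0 itself is never output); the function name is illustrative.

```python
import hashlib
import math

B = 32  # SHA-256 digest size

def strxor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def expand_suffix_dst(msg: bytes, dst: bytes, len_in_bytes: int) -> bytes:
    # Domain separation moves to a suffix, so msg (for b_0) and
    # b_0 / strxor(b_0, b_(i-1)) (for b_i) always start the hash input.
    dst_prime = bytes([len(dst)]) + dst
    H = lambda data: hashlib.sha256(data).digest()
    ell = math.ceil(len_in_bytes / B)            # b_0 contributes no output
    b_0 = H(msg + len_in_bytes.to_bytes(2, "big") + b"\x00" + dst_prime)
    blocks = [H(b_0 + b"\x01" + dst_prime)]      # b_1
    for i in range(2, ell + 1):
        blocks.append(H(strxor(b_0, blocks[-1]) + bytes([i]) + dst_prime))
    return b"".join(blocks)[:len_in_bytes]
```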

kwantam commented 4 years ago

Quick note: I pinged Dan about swapping the input order and he said it looked fine.

kwantam commented 4 years ago

Arrgh. As I was writing down a proof sketch I found an issue with swapping the argument order. Fortunately, it's not hard to fix, and there's a reasonably easy way of avoiding any performance ramifications.

Next, I'm going to push a commit that implements this. As always, we can revert...

Executive summary

Above I proposed computing

b_0 = H( msg || I2OSP(len_in_bytes, 2) || I2OSP(0, 1) || DST_prime )

To make the proof go through, we should change this to

b_0 = H(I2OSP(0, input_block_size) || msg || I2OSP(len_in_bytes, 2) || I2OSP(0, 1) || DST_prime)

In other words, we should prepend 1 (input) block of 0s to msg before hashing. For SHA-256 this is 64 bytes, for SHA-512 it's 128 bytes. This lets us rely directly on the security proofs for NMAC given in CDMP05, as well as other analysis like Bel06.

In the rest of this post I'll first discuss why we ought to do this, and then describe how implementors can avoid actually computing another hash function invocation.

Why do we need this change?

Previously we were arguing that our construction was an implementation of the NMAC construction (really, the confusingly-named HMAC^f construction; for clarity I'll just call it NMAC) in Section 3.5 of CDMP05. The basis for this claim was that the first block of each invocation of H was unique, since we had the DST and the counter in there. (We can understand Dan's suggestion (1) above, padding the first block of the b_0 computation, as strengthening this argument.)

Unfortunately, if we're putting the message first, we can't directly rely on the NMAC proof: since the adversary controls the first input block to the hash function, we have no guarantee that it's unique.

CDMP05 fixes this problem by prepending a block of 0s to the message. This works because it prevents the adversary from controlling the contents of the first block. Using our notation, this is secure essentially because the probability is negligible that b_0 == 0 or strxor(b_0, b_(i - 1)) == 0.

What is the cost?

Directly implementing this change costs an extra compression function invocation. But in practice it's possible to avoid this, just by saving the state of the hash function after it ingests 1 block of 0s, and hashing msg starting from that state.

The storage cost to implement this optimization is minimal (less than 100 bytes for the SHA-2 family), and in practice lots of SHA-2 implementations already let you do this. (This is true, for example, of the OpenSSL implementation.)

The other cost here is that it puts us back in a universe where we can't share a hash prefix with other computations of H(msg || ...). But that seems like it's OK---upper-layer protocols are welcome to define a "prehash mode" if they're really worried about the cost of hashing messages.
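
As a concrete illustration of the saved-state trick, Python's hashlib exposes it directly via copy(); shown here for SHA-256, whose input block size is 64 bytes.

```python
import hashlib

ZERO_BLOCK = b"\x00" * 64  # one SHA-256 input block of zeros

# Compress the all-zero block once, up front. Every later hash resumes
# from a clone of this state, so the prefix costs nothing per message.
_prefixed = hashlib.sha256(ZERO_BLOCK)

def h_with_zero_prefix(data: bytes) -> bytes:
    h = _prefixed.copy()  # clone the saved internal state
    h.update(data)
    return h.digest()
```

This computes the same value as hashing ZERO_BLOCK || data from scratch, which is the behavior the optimization must preserve.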

chris-wood commented 4 years ago

Referencing the proof wholesale is great! Most hash function implementations allow the internal state to be saved and restored, so I don't foresee this being a problem in practice (as you mention).

chris-wood commented 4 years ago

@armfazh are you OK with this change? If so, I’ll merge it.

armfazh commented 4 years ago

I am not sure whether the XOR maintains the security or gives the attacker a new attack surface; I am not able to prove or disprove this claim. Let's move forward.