clearmatics / zeth

Integration of Zerocash on Ethereum: https://arxiv.org/abs/1904.00905
https://clearmatics.github.io/zeth/
GNU Lesser General Public License v3.0

Use a more efficient hash function in the Merkle tree #4

Closed AntoineRondelet closed 5 years ago

AntoineRondelet commented 5 years ago

See this interesting thread: https://github.com/zcash/zcash/issues/2234 See MiMC: https://eprint.iacr.org/2016/492.pdf

rrtoledo commented 5 years ago

This reminds me (via Pedersen commitments) of Waters signatures and smooth projective hash functions from "Round-Optimal Privacy-Preserving Protocols with Smooth Projective Hash Functions" by Blazy et al. [1]. More interestingly, that paper builds on "Programmable Hash Functions and Their Applications" by Hofheinz et al. [2], which introduces the notion of "programmability", a property that helps determine the security of a group hash function, i.e. its collision resistance.

[1] https://link.springer.com/content/pdf/10.1007/978-3-642-28914-9_6.pdf [2] https://link.springer.com/content/pdf/10.1007/978-3-540-85174-5_2.pdf

p.s. in "Short Blind Signatures" by Blazy et al. [3], the authors discuss a generalization of Waters signatures and group hash functions with input from a field Z_p rather than from GF(2).

[3] https://pdfs.semanticscholar.org/4d1e/34e9d2da9a334e3986c69ef217c2f20a93e9.pdf

HarryR commented 5 years ago

Ethsnarks has implemented two relevant hash functions suitable for SNARK circuits:

rrtoledo commented 5 years ago

Note: It could also be worth looking at accumulators and proofs of inclusion.

HarryR commented 5 years ago

There are a handful of problems with accumulators (that aren't merkle trees):

1) For RSA accumulators, you have to do large-field arithmetic inside the SNARK circuit
2) Accumulators using bilinear pairings need a pairing-friendly curve inside the SNARK circuit
3) RSA accumulators require a trusted setup
4) Primality tests inside a SNARK circuit are very expensive, making RSA accumulators infeasible in practice
5) There must be easier ways to hash data than using an accumulator

There is an open problem though, related to RSA accumulators inside a SNARK, when you don't need to do primality tests inside the SNARK: how do you efficiently perform operations like exponentiation modulo a 2048-bit RSA modulus?

Say we have F_p, the SNARK scalar field; we then need a way of representing elements modulo a known RSA modulus as polynomials of degree k - 1 over F_p, i.e. elements of F_{p^k}.
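As a sketch of that representation, a large integer can be split into base-p limbs, i.e. coefficients of a degree k - 1 polynomial over F_p (the BN254 scalar field and the limb count here are illustrative assumptions, not a worked-out scheme):

```python
# Sketch: represent a 2048-bit RSA-group element as k limbs < p,
# i.e. coefficients of a degree k - 1 polynomial over F_p.
p = 21888242871839275222246405745257275088548364400416034343698204186575808495617  # BN254 scalar field (~254 bits)

def to_limbs(x, k):
    """Decompose x into k base-p limbs (little-endian)."""
    limbs = []
    for _ in range(k):
        limbs.append(x % p)
        x //= p
    assert x == 0, "k limbs are too few for this value"
    return limbs

def from_limbs(limbs):
    """Recombine base-p limbs into an integer."""
    return sum(c * p**i for i, c in enumerate(limbs))

x = (1 << 2048) - 12345          # a 2048-bit value
k = 9                            # ceil(2048 / 253) limbs suffice
assert from_limbs(to_limbs(x, k)) == x
```

The hard part, left open above, is performing multiplication and modular reduction on this limb representation inside the circuit without blowing up the constraint count.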

See:

rrtoledo commented 5 years ago

Agreed, but have you looked at ECC-based accumulators? (So far I have only found one [1].)

If there was more research in this area, it could be interesting to see how well this could be used in conjunction with Pedersen hashes.

[1] "Distillation Codes and Applications to DoS Resistant Multicast Authentication": http://people.eecs.berkeley.edu/~tygar/papers/Distillation_code.pdf

HarryR commented 5 years ago

Could you elaborate on how error-correction codes could be used as an accumulator? The paper you linked seems to use a Merkle tree, a UOWHF and RSA signatures as part of the construction, rather than using error-correction codes as a type of hash.

A good candidate for a UOWHF is something abstractly similar to Poly1305, which requires one constraint per message element. However, it is unsuitable for many uses (such as general-purpose cryptographic hashing) because it doesn't offer preimage or full collision resistance. As a MAC or data-integrity tag, where the key is known only by the sender and receiver, it can be used securely, since it (like pretty much every other UOWHF) only offers target collision resistance.

One question I'm still pondering is whether you could use Poly1305 together with the Merkle-Damgård, Davies-Meyer or Shoup constructions to gain preimage resistance and other desirable properties suitable for a Merkle tree, which would result in a very low number of constraints (4-5 per element being hashed, rather than ~200).
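For reference, the polynomial-evaluation style of hash being discussed can be sketched as follows; this is a generic Horner-evaluation MAC over the SNARK scalar field, not Poly1305 itself (the field and function signature are assumptions for illustration):

```python
p = 21888242871839275222246405745257275088548364400416034343698204186575808495617  # BN254 scalar field

def poly_eval_mac(key, msg):
    """Evaluate H(m) = m_1*k^n + m_2*k^(n-1) + ... + m_n*k mod p via Horner.
    One multiply-accumulate per message element, i.e. roughly one constraint.
    Only target-collision resistant: fine as a keyed MAC, not as a
    general-purpose hash."""
    acc = 0
    for m in msg:
        acc = (acc + m) * key % p   # one multiply-accumulate per element
    return acc

tag = poly_eval_mac(key=123456789, msg=[1, 2, 3])
```

With key = 1 the evaluation degenerates to a plain sum, which makes the need for a secret, uniformly random key obvious.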

Currently, MiMC with an exponent of 7 and 46 rounds requires ~190 constraints per element being hashed, and the Pedersen hash requires many more (though it is still very cheap compared to SHA256 and Blake). Reducing that to 5-10 per element would give sub-second proving time for many simple circuits that do 50-100 hashes (which would be awesome).
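For concreteness, the MiMC permutation with exponent 7 looks roughly like this (the round constants below are toy values; the real implementation derives them from a seed, see ethsnarks):

```python
p = 21888242871839275222246405745257275088548364400416034343698204186575808495617  # BN254 scalar field

def mimc_e7(x, k, rounds=46):
    """MiMC permutation, e = 7: each round computes x <- (x + k + c_i)^7 mod p.
    Toy round constants i^3 stand in for the real seed-derived ones."""
    for i in range(rounds):
        c = pow(i, 3, p)                 # illustrative constants only
        x = pow((x + k + c) % p, 7, p)
    return (x + k) % p                   # final key addition

y = mimc_e7(3, 5)
```

Each round is a few quadratic constraints (building x^7 from squarings and multiplications), which is where the ~190-constraints-per-element figure comes from.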

See:

rrtoledo commented 5 years ago

My bad, I meant elliptic curve based accumulators (combining them with Pedersen hashes should make more sense). This paper is cited quite a few times for its EC accumulator, but there actually isn't much on it at all.

HarryR commented 5 years ago

All of the elliptic curve based accumulators I've seen seem to rely on bilinear pairings and have a trusted setup phase. Bilinear pairings are not very efficient within a SNARK (although Daira, O1-labs etc. are working on that) and require the construction of a pairing-friendly curve which can operate within a zkSNARK circuit.

The only options so far seem to be the MNT4/MNT6 curve cycles, which are not compatible with Ethereum. If ditching Ethereum compatibility is an option, you can switch to a different curve and then do pairings inside a SNARK circuit.

However, it will be more expensive (in terms of the number of constraints) to prove membership of an item in the accumulator when using one of the bilinear-pairing accumulator schemes, especially compared to MiMC and the ZCash Pedersen hash scheme.

Accumulators using bilinear pairings:

rrtoledo commented 5 years ago

Thanks for the list, I went through it and looked at the RSA ones and the others. I hoped there were more papers in the domain, but it does not appear very popular.

I have found a description of the EC-based (non-bilinear-pairing) accumulator; it is rather simple but not useful, as it is a static accumulator (c.f. "Performances of Cryptographic Accumulators"): "Karlof et al. in [16] use elliptic curves to build an accumulator. To accumulate the values (scalars), they are multiplied with the public key (i.e. a scalar times the base point of the curve). Witness generation follows the same algorithm but excludes the corresponding value. Verification is simple and involves checking for equality, i.e. whether the multiplication of the witness and the value equals the accumulated value."
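That static scheme can be sketched in a generic group. Here a multiplicative subgroup of Z_q* stands in for the elliptic-curve group, so the scalar multiplication v·G becomes the exponentiation g^v (all parameters are toy values for illustration):

```python
# Toy stand-in: the multiplicative group mod a prime replaces the EC group,
# so "scalar * point" becomes "g ** scalar mod q".
q = 2**127 - 1           # Mersenne prime modulus (toy parameter)
g = 3                    # stand-in for the curve base point

def accumulate(values):
    """acc = (v1 * v2 * ... * vn) . G, written here as g^(v1*...*vn)."""
    e = 1
    for v in values:
        e = e * v % (q - 1)
    return pow(g, e, q)

def witness(values, i):
    """Same product of scalars, excluding values[i]."""
    e = 1
    for j, v in enumerate(values):
        if j != i:
            e = e * v % (q - 1)
    return pow(g, e, q)

def verify(acc, w, v):
    """Check v . W == acc, written here as w^v == acc."""
    return pow(w, v, q) == acc

vals = [7, 11, 13]
acc = accumulate(vals)
assert verify(acc, witness(vals, 1), 11)
```

The static nature is visible directly: adding or removing a value changes the exponent product, so every witness has to be recomputed from scratch.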

HarryR commented 5 years ago

I like that paper, it starts off very specific and thorough with the initial definitions, but then gets very very wishy washy.

However, the reference for "Karlof et al. in [16] use elliptic curves to build an accumulator..." is wrong: nowhere in the Distillation Codes and Applications to DoS Resistant Multicast Authentication paper do they use elliptic curves.

§4.1.1 As discussed by Barić and Pfitzmann [1], a stronger notion of security called collision-freeness is achieved when the values added to an accumulator are primes, or prime representatives of non-prime values.

I assume they mean that a HashToPrime function is used to map non-prime values to distinct items which can be accumulated.

§4.1.2

They specify no mechanism for their elliptic curve based accumulator without bilinear pairings, but they go into enough detail about RSA and symmetric accumulators (aka Bloom filters) for me to understand what they mean. The only paper they referenced regarding elliptic curve based accumulators does not use elliptic curves in any way... nor is there any code to back up their benchmarks (lol, academia...)

My understanding is that outside of Bloom filters and Merkle trees, there is no trapdoor-free accumulator which is both space efficient and has a sub log(n) witness size.

There are interesting properties of elliptic curve group signatures though, which could possibly be abused to prove membership of a single bit with a constant-sized witness of one or two group elements, without using bilinear pairings... From which you could build something like a Bloom filter. I will ponder that one (check the MuSig paper, but it wouldn't be a dynamic accumulator...)

rrtoledo commented 5 years ago

However, the reference for "Karlof et al. in [16] use elliptic curves to build an accumulator..." is wrong: nowhere in the Distillation Codes and Applications to DoS Resistant Multicast Authentication paper do they use elliptic curves.

I completely agree, I spent quite some time looking at other citations / full papers... but couldn't find the actual paper.

nor is there any code to back up their benchmarks (lol, academia...)

Academia is not that bad... and code-wise it is getting better (technically, I'm still in academia).

My understanding is that outside of Bloom filters and Merkle trees, there is no trapdoor-free accumulator which is both space efficient and has a sub log(n) witness size.

Agreed. But that is not a surprise: for soundness you need a linear cost (in the number of commitments) somewhere. I don't see that as much of a problem, since for a Merkle tree you do have to store the hashes of all commitments. If we can get a better proof (computation-wise, for SNARKs) in exchange for a hopefully-not-too-big overhead in space, that could be acceptable.

There are interesting properties for elliptic curve group signatures though [...]

I was thinking about that last night (whether we could reuse a standard group-signature membership process as a simple accumulator). We would need a partially dynamic group signature for our purpose, though (addition but no revocation of group members): https://eprint.iacr.org/2016/368.pdf

(check the MuSig paper, but it wouldn't be a dynamic accumulator...)

What is that paper?

HarryR commented 5 years ago

MuSig:

Some papers referenced by 2016/368:

TBH I think the only options are:

The best at the moment is a Merkle tree with MiMC, followed by the Pedersen hash. However, the Pedersen hash cannot be efficiently computed on the EVM, which leaves MiMC (~20k gas, ~200 constraints) and SHA256 (~10 gas, ~27k constraints).

riemann89 commented 5 years ago

Hi @HarryR, thanks for your insights, really appreciated. Where can I find more info about the 20k gas cost of MiMC you mentioned? Thanks.

HarryR commented 5 years ago

@riemann89 see https://github.com/HarryR/ethsnarks/blob/master/contracts/MiMC.sol

rrtoledo commented 5 years ago

Hi Harry, @riemann89 and I had a look at your implementation and we are not sure we understand your algorithm. We cannot see any capacity, rate or squeezing phases. Could you explain the logic of your hash in more detail? :)

HarryR commented 5 years ago

Hi @rrtoledo

This uses the Miyaguchi-Preneel one-way compression function construction.
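For reference, Miyaguchi-Preneel chains a block cipher E as H_i = E_{H_{i-1}}(m_i) + H_{i-1} + m_i. A minimal sketch, with a toy MiMC-style cipher standing in for the real one (the constants, round count and IV here are illustrative, not the ethsnarks parameters):

```python
p = 21888242871839275222246405745257275088548364400416034343698204186575808495617  # BN254 scalar field

def toy_cipher(k, m):
    """Stand-in block cipher E_k(m): a few x^7 MiMC-style rounds (toy constants)."""
    x = m
    for c in (11, 17, 23):
        x = pow((x + k + c) % p, 7, p)
    return x

def mp_hash(msgs, iv=0):
    """Miyaguchi-Preneel chaining: H_i = E_{H_{i-1}}(m_i) + H_{i-1} + m_i mod p."""
    h = iv
    for m in msgs:
        h = (toy_cipher(h, m) + h + m) % p
    return h

digest = mp_hash([1, 2, 3])
```

The feed-forward of both the chaining value and the message block is what makes the compression function one-way even though the underlying cipher is invertible.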

This ticket has relevant details on the decision, tradeoffs, etc.: https://github.com/HarryR/ethsnarks/issues/119

We weren't able to use the sponge construction: without being able to choose the prime for the field, the rate required to have a good capacity would be far too low, or it would require 10x-100x more constraints to achieve an acceptable security level (which would be a bit pointless, no?)

HarryR commented 5 years ago

There are two other candidates for efficient ciphers / hash functions which are worth investigating:

I am interested in how fast they would be on the EVM. Alternatives using the SIS problem and lattice cryptography could make an interesting homomorphic primitive usable as a CRHF: http://www.wisdom.weizmann.ac.il/~oded/COL/cfh.pdf

While the number of constraints is low (just n polynomial evaluations), it's not practical on the EVM and is very memory intensive: e.g. m would be ~30k, and n*m 32-byte field elements is quite huge. But you can use this as a basis for sigma protocols too, rather than just hashing, e.g. https://github.com/Silur/SIS-gadget/issues/2

rrtoledo commented 5 years ago

I haven't read Vision and co. yet. We'll take some time next week and come back to you. :)

As for Poseidon and co., it does look good indeed. Using affine functions in the permutation should really reduce the gas cost. What bothers me, however, is that the Hades strategy paper is not available; I'd like to read it to understand the security guarantees better.

rrtoledo commented 5 years ago

We weren't able to use the sponge construction: without being able to choose the prime for the field, the rate required to have a good capacity would be far too low, or it would require 10x-100x more constraints to achieve an acceptable security level (which would be a bit pointless, no?)

I am not sure I understand why you say we cannot choose the prime. I thought we could choose any prime, preferably of size lower than 256 bits to work well with Ethereum. Then it's rather straightforward: we want security s = 128, which gives us t = 64, hence n = 257, r = 128 and c = 129 (if using MiMCHash-n/n). If we want to hash 512 bits, we would need 4 absorption rounds and 256/64 = 4 rounds of squeezing to get a hash of 256 bits.
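The absorb/squeeze schedule described here can be sketched with a generic sponge. The permutation below is a toy stand-in, the state is a bit string rather than field elements, and with r = 128 the 256-bit output takes two squeezing calls rather than four (all illustrative assumptions):

```python
import hashlib

RATE, CAPACITY = 128, 129      # bits: the r = 128, c = 129 example
WIDTH = RATE + CAPACITY        # n = 257

def perm(state):
    """Toy stand-in for the permutation on a WIDTH-bit state (not a real
    permutation, just a fixed scrambling function for illustration)."""
    digest = hashlib.sha3_256(state.to_bytes(64, 'big')).digest()
    return int.from_bytes(digest, 'big') % (1 << WIDTH)

def sponge(msg_bits, out_bits):
    """Absorb RATE message bits per permutation call, then squeeze
    RATE bits per call until out_bits have been produced."""
    state = 0
    for i in range(0, len(msg_bits), RATE):
        chunk = msg_bits[i:i + RATE]
        block = int(''.join(map(str, chunk)).ljust(RATE, '0'), 2)
        state = perm(state ^ (block << CAPACITY))   # XOR into the rate part
    out = []
    while len(out) < out_bits:
        out += [(state >> (CAPACITY + j)) & 1 for j in range(RATE)]
        state = perm(state)
    return out[:out_bits]

# 512-bit message: 4 absorption calls; 256-bit output: 2 squeezing calls
h = sponge([1, 0] * 256, 256)
```

The circuit-cost problem Harry raises below is visible here: every block boundary forces conversions between field elements and bit strings, which is exactly what a SNARK-friendly design wants to avoid.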

HarryR commented 5 years ago

With zkSNARKs on Ethereum we're stuck with a specific prime, not just any prime that fits into a 256-bit word (unless EIP-1829 or EIP-1962 gets included). The prime we're stuck with, the curve's scalar field, is about ~253 bits.

So: 4 absorption rounds and 4 rounds of squeezing, with an extra 2 field->bits decompositions on the inputs, 4 bits->field recompositions on absorption, then an additional 4 field->bits decompositions on squeezing and one bits->field recomposition for the resulting field element. Each field->bits requires 254 constraints, and MiMC with an exponent of 7 and 91 rounds requires ~365 constraints per invocation. To hash two field elements to one field element: 8 invocations of MiMC (2920 constraints), ~512 constraints for inputs->bits and ~1024 constraints for outputs->bits, which is about 4500 constraints in total.
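Tallying the figures quoted above:

```python
# Figures quoted in this thread: MiMC (e = 7, 91 rounds) per invocation,
# and one field->bits decomposition of 254 constraints.
mimc = 365
decomp = 254
hash_cost = 8 * mimc            # 4 absorption + 4 squeezing invocations
bits_in = 2 * decomp            # two inputs -> bits (~512)
bits_out = 4 * decomp           # four squeezed outputs -> bits (~1024)
total = hash_cost + bits_in + bits_out
print(total)                    # 4444, i.e. "about 4500 constraints"
```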

That's still about 6x better than SHA256 on pure constraints-vs-constraints, but the constraints for MiMC are more complex (the relationship between constraints and CPU time is not linear), and it's still 2.5x worse than the Pedersen hash scheme that ZCash uses. Oh, and on the EVM it's going to cost you a lot more than 20k gas, although it may still be cheaper than the Pedersen hash.

The goal, at the time, was to find a scheme which can be used as a CRHF for a Merkle tree, avoids field<->bits decompositions, has fewer constraints and faster proving time than ZCash's Pedersen hash, and can compute a full ~30-depth Merkle tree path on the EVM in about 0.5m gas. MiMC + OWF seemed to hit all those goals.

rrtoledo commented 5 years ago

@HarryR As you have worked quite extensively on MiMC, I wonder if you can answer a few questions I have in mind.

When talking about mimc_hash, it seems you are referring to an iteration of the MP compression function (over the MiMC encryption scheme). Why don't you call this Merkle-Damgård? Is it because you do not do the padding or finalization phase? In that case, don't we lose some security (though the main attack I can see, length extension, should not apply in our case)?

I was also wondering about using MiMC Miyaguchi-Preneel as a PRF. Specifically, in ZCash the tags 0x00, 0x01... are used to create non-overlapping PRFs. We were wondering if we could get the same (non-overlapping) property by using different IVs for the round constants (c_0 = sha3(iv), c_i = sha3(c_{i-1})) of MiMC's underlying block cipher. We'll try to formalize this; do you see any security problems with it? I ask because I am not sure about the PRF-ness of the MiMC Miyaguchi-Preneel Merkle-Damgård construction. I believe MD has its own problems, and proving that MiMC-MP-MD is a PRF is not that straightforward (or requires the padding / finalization).

HarryR commented 5 years ago

Why don't you call this Merkle-Damgård? Is it because you do not do the padding or finalization phase?

Yes, there is no padding or finalisation to prevent length-extension attacks. You would need to wrap your input string with some kind of padding and a length prefix / finalisation to gain that property.

I figured that using a different IV for every instance of the hash function is sufficient to prevent those kinds of attacks, because each individual use has its own domain and a fixed number of inputs. This also means there's less overhead in terms of the number of constraints.

Specifically, in ZCash the tags 0x00, 0x01... are used to create non-overlapping PRFs. We were wondering if we could get the same (non-overlapping) property by using different IVs for the round constants (c_0 = sha3(iv), c_i = sha3(c_{i-1}))

Yes, using a different set of round constants will create a separate domain for the hash. This would be equivalent to changing the seed used for the round constants.

Using a different IV also creates a different-ish domain, but in a weaker sense. (It's a different starting point within the same domain, so the work to interpolate one domain can be re-used across different IVs; with different round constants, you have to redo the work for each set of constants.)
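The seed-derived constant chain being discussed (c_0 = sha3(iv), c_i = sha3(c_{i-1})) can be sketched as follows; reducing each digest into the BN254 scalar field and the tag names are illustrative choices:

```python
import hashlib

p = 21888242871839275222246405745257275088548364400416034343698204186575808495617  # BN254 scalar field

def constants_from_iv(iv, n):
    """Derive n round constants: c_0 = sha3(iv), c_i = sha3(c_{i-1}),
    each digest reduced into F_p."""
    c = hashlib.sha3_256(iv.encode()).digest()
    out = [int.from_bytes(c, 'big') % p]
    for _ in range(n - 1):
        c = hashlib.sha3_256(c).digest()
        out.append(int.from_bytes(c, 'big') % p)
    return out

# Distinct tags yield distinct constant chains, hence separate hash domains
# ("prf_addr" / "prf_nf" are hypothetical tag names for illustration).
assert constants_from_iv("prf_addr", 4) != constants_from_iv("prf_nf", 4)
```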

AntoineRondelet commented 5 years ago

Closing this issue as it has been merged into develop