Open petersn opened 3 years ago
Required context: https://peter.website/meow-hash-cryptanalysis
Thank you for doing this. I'm impressed by the work you put into this, far beyond just producing proof of a break.
Thoughts and comments in no particular order:
As far as I recall I had instruction patterns without the 1-byte offset with some 60 to 70ish bits of security against patterns like these, but that wasn't 128 bits, so I had to get creative. I thought the offset was an improvement, but my brute force check couldn't run in reasonable time any more.
Casey really wanted to hash 16 bytes per clock cycle, and I really wanted the key to do something meaningful, in hindsight that combination was obviously a stretch.
I still feel like there is a lot of good ideas in the design. It doesn't seem like the attacks demonstrated are quite as damning as the comparison to other hash functions suggest they should be.
I'm somewhat surprised that the known-key attacks took as much work as they did.
This write-up is awesome! Just fantastic.
As the resident non-cryptographer, other than just changing the README to remove the level 3 claim (and linking to the write-up?), are there recommendations for updates as we move toward v1? Before the crypto part, Meow was originally just supposed to be an asset matching tool, so that is still my primary concern, and was wondering if there were things we should be doing to make it less likely that Meow would collide by accident.
Of course, if @NoHatCoder still wants to pursue level 3 security, that's fine with me too - but my priority is just to make sure that the hash requires an attacker before it fails, since the use cases I always envisioned for it have no attacker. And generally speaking, although there is a certain attitude sometimes of "well if you're just checking assets, use whatever", in my experience thus far I have found that trivially insecure hashes often do produce collisions by accident under heavy use, so, I think it's still worth it to try to have a fast hash that is "as strong as it can be".
Thanks, - Casey
I'd have to think very carefully to give more solid recommendations, but a few off the top of my head:
aesdec
instructions, but just more message injection in general would probably help.aesdec
instructions one can have in flight at a time (on my laptop it seems like maybe four from my extremely naive testing?), but if at all possible chaining the AES rounds more sequentially would probably help a fair bit.aesdec
instructions) and two additions. Further, the mix-columns step in the AES round itself is a 32x32 matrix multiplication over GF(2), which is of course algebraically compatible with xor but not addition. In short, I don't think you're at much risk of becoming too ℤ/2⁶⁴ℤ-linear, but I think the design is currently a bit too ℤ/2ℤ-linear. (There are currently about six ℤ/2ℤ-linear operations per meow mix, and only two ℤ/2⁶⁴ℤ-linear operations.) As an example of this, if Meow hash were changed to only use xor instead of addition, the vanishing characteristic I gave would be 2²⁷ times as likely due to the lack of carries, and key recovery would become six or seven orders of magnitude faster. But perhaps it's nice that the message bytes are incorporated two different ways; one would have to think about this extremely carefully.i1, i2, i3, i4
values being block[15:31], block[:16], block[1:17], block[16:]
, while the residual blocks meow mix with the four values being "\0" + block[:15], block[:16], block[17:] + block[:1], block[16:]
. If there was no particular reason for this, I would simplify things and absorb them the same.Overall though, I think aiming for MAC security at this speed will be tricky. It's possible I'm not aware of some advances in the field, but I think it's plausible achieving a 50 GiB/s single-threaded 128-bit MAC is in the realm of a serious research project (non-doctoral). Although I'd have to check how fast the fastest parallel Wegman-Carter style MACs run, perhaps 50 GiB/s is already common. If you do wish to attempt to fix up Meow hash for level 3 security without sacrificing any performance, perhaps you could:
aesdec
. I'm not familiar enough with performance engineering at this level to know what's possible, but it seems tricky to get enough in there to really be confident in MAC security. From comparison to functions like PELICAN I'd very roughly guess that Meow hash has somewhere around 25%-50% of the non-linear mixing needed to be secure with its general high level design.It turns out that there aren't a whole lot of other suitable instructions, multiplication is the only real candidate, it won't allow us to do more instructions per clock, but maybe we can once in a while get a multiplication instead of an add. I assume that you are familiar with the pitfalls of multiplications. I won't rule it out completely, but I don't think it is going to save the day.
For a lot of utility functions like compression, encryption and hashing it initially seems like a cool idea to build it for multi-core parallelism. In reality it turns out that for most cases the overhead of thread coordination is significant, and the program itself can utilize more cores much more efficiently by partitioning tasks in a different manner. So for hitting some marketing number, yes, for actual useful speed, not so much.
Actually, 256 and 512 bit AES instructions are being introduced to X86, so utilizing those could be an option. The reasons not to do so is that wide instructions makes modern X86 processors clock down, so while a program that use a small hash here and there might be faster when hashing, everything else will run at a slower speed. Unless of course if the program is committed to using wide instructions, but it is such an awkward decision to take in a library. And of course it won't work on a lot of old processors, so you have to provide a 128 bit version, but with the large state design that will overflow the registers, so now we need a lot of extra mov instructions to keep it working.
Recent benchmarks are hard to come by for most old functions, but I can't find anything suggesting that any of the resident MAC functions are anywhere close to this speed. Wikipedia says "3.1n + 780 Athlon cycles" for Poly1305 (apparently a 2000 Athlon was the best Bernstein could get his hands on in 2005), but if we check out the numbers at https://www.bearssl.org/speed.html there are some way deeper depths of inefficiency that Poly1305 can apparently go to if not carefully hand-optimized. It seems like we could ditch a lot of speed and still be handily faster.
Speaking of changing the rules, I have also done some work on the question "What if we had a way better crypto instruction?", the process of first making a primitive, and then designing instructions for it doesn't seem to produce very good instructions. So https://github.com/NoHatCoder/Chaotic-Permutation-Circuit But of course this is ~10 years from mainstream use even if we convinced CPU manufacturers to adopt a new instruction today. I'm not sure if I'm the right person to do this, but nobody else is doing it (or they are doing something slightly wrong in a way that breaks critical properties), and I can't unsee the potential for getting a 2-3x speedup over AES-NI primitives.
@cmuratori I'd estimate the combined chance that a non-adversarial change would match the attack pattern and actually succeed to produce the collision scenario is below 2^-128, but only just. I don't think it is likely that we will see any big "improvements", and given that 128 bits really is plenty for any level 1 use case I'm not particularly concerned.
I think that if we downgrade to level 1 we should leave out the seed, ditch the shifted input, and concentrate a bit more on good ARM performance, if we used an XOR-AES-XOR permutation it could be implemented on both X86 and ARM without wasting the built-in XOR on either platform, toying a bit with that idea at the moment.
We need a carefully worded update on the security status. Unlike a level 5 hash broken to 2^45 strength, which we would call trivially exploitable, this attack is inherently online, meaning that an attacker would typically have to send or tamper with 2^45 messages. Depending on exact settings this would be most often not practically feasible. Still I think we should recommend migration for level 3 purposes, especially as we can't exclude the possibility of an improved attack. But we don't need to cause too much panic, and I think it is important to point out that it is still a better and/or faster level 1 hash than pretty much anything else.
4. I think the naive change would be a bad idea for a level 3 function, you'd expose a number of known-state lanes, turning part of a probabilistic online attack into a deterministic offline attack. So with a shorter key we need to do proper key expansion.
That's a good point, what I was suggesting was a bit naive. Presumably you'd actually want to do something more like what SipHash does, and initialize your lanes something like:
lane0 = const0 + key0
lane1 = const1 + key1
lane2 = const2 + key2
lane3 = const3 + key3
lane4 = const4 + key0
lane5 = const5 + key1
lane6 = const6 + key2
lane7 = const7 + key3
Where you'd want to think carefully about the implications of this symmetry. Also this suggestion is relatively unimportant, because actually weak keys (like the symmetrize32
one I show) are highly contrived, and even if a user accidentally provides a somewhat weak symmetrical key and accidentally only hashes symmetrical input Meow hash still has 512 bits of internal state, and symmetry will be broken by the length block. It might not be worth thinking about.
Again as the resident non-cryptographer, I can only really add performance commentary:
On x64, AESDEC generally issues once per clock. So you can do one every cycle. It takes four cycles to complete, so you can't use the results immediately, though.
XORs vs. ADDs doesn't make any difference usually, unless it's some kind of special add. If you want more adds and less xors, they should be free to substitute (on x64 - ARM I have less experience with, so that may be different).
Regarding "explicitly parallel", I don't think threading is particularly interesting because any hash can always be used threaded to improve the speed, so I'm not sure that building this into the hash does anything in particular. We could issue a standard and/or example saying "here is how you compute the hash in a tree form" or something, just to make it easier to standardize.
That said, I do think "parallelizing" across 256 or 512-bit lanes might be the right way to go. AVX 256-bit instructions in particular are terrible, and they hate crossing the two 128-bit lanes. So something that was natively designed to process two independent 128-bit streams would vastly outperform something that wasn't on AVX.
AVX-512, on the other hand, is very hard to say. It's extremely janky right now and basically doesn't exist on the desktop (except in laptops). So I think ti's way too early to say what would be a smart design there, although I suspect the same thing holds. So designing a "four wide" input pipe might not be a bad idea, just looking forward. I don't know how this relates to Neon, however.
- Casey
Did a take on a disclaimer. I have worded it with a higher priority for authentication customers than hash table customers as the hash table case requires some pretty serious timing attacks that are very hard to pull off in the required quantity in the wild. Comments?
Due to recent discoveries by Peter Schmidt-Nielsen, we have decided to reclassify Meow hash 0.5/calico from level 3 to level 1. This means that we recommend not to use this hash for message authentication codes, or for hash tables in scenarios where collision induced denial-of-service attacks are a concern.
We have seen no evidence that the hash is unfit for non-adversarial/non-cryptographic purposes, and continue to believe that it is amongst the best in this regard.
For level 3/MAC capabilities consider migrating to SipHash. Do not migrate to any hash not advertising MAC capabilities as these are almost certainly much weaker than Meow 0.5. If the performance of SipHash is not satisfying, continuing to use Meow 0.5 for hash tables is better than migrating to another fast hash. While Meow 0.5 also continue to provide some useful strength for message authentication codes, we have to stress that we strongly recommend migration in this case.
Seems good to me.
For the time being, I went ahead and pasted that security warning into the README, just so there would be something up there while we work out what to do going forward.
- Casey
Maybe include a link to the analysis, seems weird not to have that.
Done.
- Casey
The "about" section still states that its a level 3. Maybe change that too for the time being to not be confusing.
Roger that - should be updated now?
- Casey
Looks good to me :)
Hello,
Your web page for Meow hash says to report collisions on GitHub. I have two PNG files that
meow_search
reports as colliding:The files are here:
Thanks.