Full 128-bit collision between two files

petersn commented 3 years ago

Hello,

Your web page for Meow hash says to report collisions on GitHub. I have two PNG files that meow_search reports as colliding:

meow_search 0.5/calico results:
    Root: .
    Completed on: Sat Jun 26 08:38:34 2021
    Files: 2
    Total size: 19.00kb
    Duplicate files: 0
    Files changed during search: 0
    Access failures: 0
    Allocation failures: 0
    Read failures: 0
    [Meow128] Meow 128-bit AES-NI collisions: 1
        E936866F-6965CB25-D2F8F5AA-D04E8914:
            ./icon2.png
            ./icon1.png
    [Meow64] Meow 64-bit AES-NI collisions: 1
        00000000-00000000-D2F8F5AA-D04E8914:
            ./icon2.png
            ./icon1.png
    [Meow32] Meow 32-bit AES-NI collisions: 1
        00000000-00000000-00000000-D04E8914:
            ./icon2.png
            ./icon1.png

The files are here: icon1 icon2

Thanks.

NoHatCoder commented 3 years ago

Required context: https://peter.website/meow-hash-cryptanalysis

Thank you for doing this. I'm impressed by the work you put into this, far beyond just producing proof of a break.

Thoughts and comments in no particular order:

As far as I recall I had instruction patterns without the 1-byte offset with some 60 to 70ish bits of security against patterns like these, but that wasn't 128 bits, so I had to get creative. I thought the offset was an improvement, but my brute force check couldn't run in reasonable time any more.

Casey really wanted to hash 16 bytes per clock cycle, and I really wanted the key to do something meaningful, in hindsight that combination was obviously a stretch.

I still feel like there is a lot of good ideas in the design. It doesn't seem like the attacks demonstrated are quite as damning as the comparison to other hash functions suggest they should be.

I'm somewhat surprised that the known-key attacks took as much work as they did.

cmuratori commented 3 years ago

This write-up is awesome! Just fantastic.

As the resident non-cryptographer, other than just changing the README to remove the level 3 claim (and linking to the write-up?), are there recommendations for updates as we move toward v1? Before the crypto part, Meow was originally just supposed to be an asset matching tool, so that is still my primary concern, and was wondering if there were things we should be doing to make it less likely that Meow would collide by accident.

Of course, if @NoHatCoder still wants to pursue level 3 security, that's fine with me too - but my priority is just to make sure that the hash requires an attacker before it fails, since the use cases I always envisioned for it have no attacker. And generally speaking, although there is a certain attitude sometimes of "well if you're just checking assets, use whatever", in my experience thus far I have found that trivially insecure hashes often do produce collisions by accident under heavy use, so, I think it's still worth it to try to have a fast hash that is "as strong as it can be".

Thanks, - Casey

petersn commented 3 years ago

I'd have to think very carefully to give more solid recommendations, but a few off the top of my head:

I'd love to see the message injection include every byte at least twice, unlike the current design where the first and last byte of each 32-byte block are only included once. I'm not familiar enough with what can run in parallel on modern CPUs to know what operations you can hide during the latency of the aesdec instructions, but just more message injection in general would probably help.
If at all possible I'd love to see a tighter feedback path from AES round to AES round. Accumulating just one AES round of depth per 128 bytes of message is quite sparse. I don't know how this interacts with pipelining, or how many aesdec instructions one can have in flight at a time (on my laptop it seems like maybe four from my extremely naive testing?), but if at all possible chaining the AES rounds more sequentially would probably help a fair bit.
One would have to think carefully about the implications of this, but I might consider using more additions, and fewer xors. The meow mix operation currently uses four xors (two as part of the aesdec instructions) and two additions. Further, the mix-columns step in the AES round itself is a 32x32 matrix multiplication over GF(2), which is of course algebraically compatible with xor but not addition. In short, I don't think you're at much risk of becoming too ℤ/2⁶⁴ℤ-linear, but I think the design is currently a bit too ℤ/2ℤ-linear. (There are currently about six ℤ/2ℤ-linear operations per meow mix, and only two ℤ/2⁶⁴ℤ-linear operations.) As an example of this, if Meow hash were changed to only use xor instead of addition, the vanishing characteristic I gave would be 2²⁷ times as likely due to the lack of carries, and key recovery would become six or seven orders of magnitude faster. But perhaps it's nice that the message bytes are incorporated two different ways; one would have to think about this extremely carefully.
While you're at it I might consider reducing the key size to 512 bits or 256 bits, and fix the remaining lanes as asymmetric constants, although this doesn't matter much.
Not particularly a security suggestion: Personally I would simplify the message absorption by changing the out-of-order absorption near the end, unless that was for some reason I'm missing. Relatedly, the final padded block and the message length block (I believe you call these two the residual blocks in the code?) are absorbed differently from all other blocks. Specifically, a normal block is meow mixed with the i1, i2, i3, i4 values being block[15:31], block[:16], block[1:17], block[16:], while the residual blocks meow mix with the four values being "\0" + block[:15], block[:16], block[17:] + block[:1], block[16:]. If there was no particular reason for this, I would simplify things and absorb them the same.

Overall though, I think aiming for MAC security at this speed will be tricky. It's possible I'm not aware of some advances in the field, but I think it's plausible achieving a 50 GiB/s single-threaded 128-bit MAC is in the realm of a serious research project (non-doctoral). Although I'd have to check how fast the fastest parallel Wegman-Carter style MACs run, perhaps 50 GiB/s is already common. If you do wish to attempt to fix up Meow hash for level 3 security without sacrificing any performance, perhaps you could:

Find other non-linear operations to parallelize with the aesdec. I'm not familiar enough with performance engineering at this level to know what's possible, but it seems tricky to get enough in there to really be confident in MAC security. From comparison to functions like PELICAN I'd very roughly guess that Meow hash has somewhere around 25%-50% of the non-linear mixing needed to be secure with its general high level design.
Simply change the rules of the game, and decide to make an explicitly parallel hash. This was e.g. the design of MD6, which used a Merkle tree construction by default to allow for parallel hashing. Given that PELICAN uses 4x the AES rounds per message byte, perhaps Meow hash could provide MAC-level hashing that saturates memory bandwidth if you were willing to use four threads. I'm not familiar enough with the performance engineering to be sure.

NoHatCoder commented 3 years ago

I think whole shifted input thing is out in any future designs, it didn't do what it was supposed to do, and performance-wise it only works on X86. I don't think there is any more instructions we can do along with AES, many new CPUs can actually do 2 AES operations per clock, but only 4 SIMD operations in total, from that point of view we are already using too many non-AES instructions.
I don't think there is actually much value in tighter coupling, it doesn't matter how long a partially mixed input lingers, what matter is how exposed it is to getting cancelled out in that time, and that is pretty much just a function of the total ingestion to mixing ratio.
Many CPUs have an execution port where the only useful instruction is a bitwise operation, so more additions would be slower in some cases. It is all a balance of trying to choose a mix of operations where you get as much value from each operation as possible, while running as many operations per clock as possible.
I think the naive change would be a bad idea for a level 3 function, you'd expose a number of known-state lanes, turning part of a probabilistic online attack into a deterministic offline attack. So with a shorter key we need to do proper key expansion.
Something something a few cycles, I think Casey might have gone a bit far here.

It turns out that there aren't a whole lot of other suitable instructions, multiplication is the only real candidate, it won't allow us to do more instructions per clock, but maybe we can once in a while get a multiplication instead of an add. I assume that you are familiar with the pitfalls of multiplications. I won't rule it out completely, but I don't think it is going to save the day.

For a lot of utility functions like compression, encryption and hashing it initially seems like a cool idea to build it for multi-core parallelism. In reality it turns out that for most cases the overhead of thread coordination is significant, and the program itself can utilize more cores much more efficiently by partitioning tasks in a different manner. So for hitting some marketing number, yes, for actual useful speed, not so much.

Actually, 256 and 512 bit AES instructions are being introduced to X86, so utilizing those could be an option. The reasons not to do so is that wide instructions makes modern X86 processors clock down, so while a program that use a small hash here and there might be faster when hashing, everything else will run at a slower speed. Unless of course if the program is committed to using wide instructions, but it is such an awkward decision to take in a library. And of course it won't work on a lot of old processors, so you have to provide a 128 bit version, but with the large state design that will overflow the registers, so now we need a lot of extra mov instructions to keep it working.

Recent benchmarks are hard to come by for most old functions, but I can't find anything suggesting that any of the resident MAC functions are anywhere close to this speed. Wikipedia says "3.1n + 780 Athlon cycles" for Poly1305 (apparently a 2000 Athlon was the best Bernstein could get his hands on in 2005), but if we check out the numbers at https://www.bearssl.org/speed.html there are some way deeper depths of inefficiency that Poly1305 can apparently go to if not carefully hand-optimized. It seems like we could ditch a lot of speed and still be handily faster.

Speaking of changing the rules, I have also done some work on the question "What if we had a way better crypto instruction?", the process of first making a primitive, and then designing instructions for it doesn't seem to produce very good instructions. So https://github.com/NoHatCoder/Chaotic-Permutation-Circuit But of course this is ~10 years from mainstream use even if we convinced CPU manufacturers to adopt a new instruction today. I'm not sure if I'm the right person to do this, but nobody else is doing it (or they are doing something slightly wrong in a way that breaks critical properties), and I can't unsee the potential for getting a 2-3x speedup over AES-NI primitives.

NoHatCoder commented 3 years ago

@cmuratori I'd estimate the combined chance that a non-adversarial change would match the attack pattern and actually succeed to produce the collision scenario is below 2^-128, but only just. I don't think it is likely that we will see any big "improvements", and given that 128 bits really is plenty for any level 1 use case I'm not particularly concerned.

I think that if we downgrade to level 1 we should leave out the seed, ditch the shifted input, and concentrate a bit more on good ARM performance, if we used an XOR-AES-XOR permutation it could be implemented on both X86 and ARM without wasting the built-in XOR on either platform, toying a bit with that idea at the moment.

We need a carefully worded update on the security status. Unlike a level 5 hash broken to 2^45 strength, which we would call trivially exploitable, this attack is inherently online, meaning that an attacker would typically have to send or tamper with 2^45 messages. Depending on exact settings this would be most often not practically feasible. Still I think we should recommend migration for level 3 purposes, especially as we can't exclude the possibility of an improved attack. But we don't need to cause too much panic, and I think it is important to point out that it is still a better and/or faster level 1 hash than pretty much anything else.

petersn commented 3 years ago

4. I think the naive change would be a bad idea for a level 3 function, you'd expose a number of known-state lanes, turning part of a probabilistic online attack into a deterministic offline attack. So with a shorter key we need to do proper key expansion.

That's a good point, what I was suggesting was a bit naive. Presumably you'd actually want to do something more like what SipHash does, and initialize your lanes something like:

lane0 = const0 + key0
lane1 = const1 + key1
lane2 = const2 + key2
lane3 = const3 + key3
lane4 = const4 + key0
lane5 = const5 + key1
lane6 = const6 + key2
lane7 = const7 + key3

Where you'd want to think carefully about the implications of this symmetry. Also this suggestion is relatively unimportant, because actually weak keys (like the symmetrize32 one I show) are highly contrived, and even if a user accidentally provides a somewhat weak symmetrical key and accidentally only hashes symmetrical input Meow hash still has 512 bits of internal state, and symmetry will be broken by the length block. It might not be worth thinking about.

cmuratori commented 3 years ago

Again as the resident non-cryptographer, I can only really add performance commentary:

On x64, AESDEC generally issues once per clock. So you can do one every cycle. It takes four cycles to complete, so you can't use the results immediately, though.
XORs vs. ADDs doesn't make any difference usually, unless it's some kind of special add. If you want more adds and less xors, they should be free to substitute (on x64 - ARM I have less experience with, so that may be different).
Regarding "explicitly parallel", I don't think threading is particularly interesting because any hash can always be used threaded to improve the speed, so I'm not sure that building this into the hash does anything in particular. We could issue a standard and/or example saying "here is how you compute the hash in a tree form" or something, just to make it easier to standardize.

That said, I do think "parallelizing" across 256 or 512-bit lanes might be the right way to go. AVX 256-bit instructions in particular are terrible, and they hate crossing the two 128-bit lanes. So something that was natively designed to process two independent 128-bit streams would vastly outperform something that wasn't on AVX.

AVX-512, on the other hand, is very hard to say. It's extremely janky right now and basically doesn't exist on the desktop (except in laptops). So I think ti's way too early to say what would be a smart design there, although I suspect the same thing holds. So designing a "four wide" input pipe might not be a bad idea, just looking forward. I don't know how this relates to Neon, however.

Hearing from @NoHatCoder that "the whole shifted thing is out in future designs" is music to my ears, because that was a big problem. Although technically the L1 cache can fill two cache lines per clock, in practice it actually can't because if it needs to service L2, then it loses one of those fills. This means that, if you are trying to run at full L2 bandwidth, you actually run into problems because the L1 can't service the load rearrangement. So a design that did not do duplicative loads would potentially have higher performance, or could afford to do more work and still be the same speed.

- Casey

NoHatCoder commented 3 years ago

Did a take on a disclaimer. I have worded it with a higher priority for authentication customers than hash table customers as the hash table case requires some pretty serious timing attacks that are very hard to pull off in the required quantity in the wild. Comments?

Due to recent discoveries by Peter Schmidt-Nielsen, we have decided to reclassify Meow hash 0.5/calico from level 3 to level 1. This means that we recommend not to use this hash for message authentication codes, or for hash tables in scenarios where collision induced denial-of-service attacks are a concern.

We have seen no evidence that the hash is unfit for non-adversarial/non-cryptographic purposes, and continue to believe that it is amongst the best in this regard.

For level 3/MAC capabilities consider migrating to SipHash. Do not migrate to any hash not advertising MAC capabilities as these are almost certainly much weaker than Meow 0.5. If the performance of SipHash is not satisfying, continuing to use Meow 0.5 for hash tables is better than migrating to another fast hash. While Meow 0.5 also continue to provide some useful strength for message authentication codes, we have to stress that we strongly recommend migration in this case.

petersn commented 3 years ago

Seems good to me.

cmuratori commented 3 years ago

For the time being, I went ahead and pasted that security warning into the README, just so there would be something up there while we work out what to do going forward.

- Casey

NoHatCoder commented 3 years ago

Maybe include a link to the analysis, seems weird not to have that.

cmuratori commented 3 years ago

Done.

- Casey

Daniele122898 commented 3 years ago

The "about" section still states that its a level 3. Maybe change that too for the time being to not be confusing.

cmuratori commented 3 years ago

Roger that - should be updated now?

- Casey

Daniele122898 commented 3 years ago

Looks good to me :)

cmuratori / meow_hash

Full 128-bit collision between two files #80