aappleby / smhasher

Automatically exported from code.google.com/p/smhasher

[Request] Why not provide an intermediate MurmurHash3_x64_64? #61

Open wiz0u opened 6 years ago

wiz0u commented 6 years ago

It would output a uint64_t, easier to manipulate than a 128-bit buffer and less collision-prone than the 32-bit variant.

Thanks

lemire commented 6 years ago

What is wrong with just using the 64 bits you need?
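(For illustration, a minimal sketch of what that looks like using this repo's MurmurHash3_x64_128; the wrapper name is hypothetical:)

#include <stdint.h>
#include "MurmurHash3.h"

/* hypothetical wrapper: compute the 128-bit hash, keep only the low 64 bits */
uint64_t MurmurHash3_x64_64(const void *key, int len, uint32_t seed) {
  uint64_t out[2];
  MurmurHash3_x64_128(key, len, seed, out);
  return out[0];
}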

wiz0u commented 6 years ago

If you mean "take the low-order 64 bits from the 128-bit output": what mathematical proof do I have that these 64 bits are as collision-resistant as a dedicated 64-bit MurmurHash3-type algorithm could be?

I feel it would be the same as for a GUID: a substring of it has no guarantee of being unique. Then taking a substring of the MurmurHash3-128 output could have unexpected effects, like raising the collision risk much more than anticipated for a 64-bit output.

wiz0u commented 6 years ago

However, I note that answers at crypto.stackexchange.com seem more positive about taking parts of a hash.

sebres commented 6 years ago

If you mean "take the low-order 64 bits from the 128-bit output"...

Why not both (with any mathematical expression in-between)...

uint64_t *h = (uint64_t *)mmh128_res;  /* view the 128-bit result as two 64-bit halves */
/* either simply sum */
return h[0] + h[1];
/* or simply xor with shift */
return h[0] ^ (h[1] << 1);
/* or sum by masked [1..4] multiplication of fmix of lower bits */
return h[1] * ((fmix64(h[0]) & 3) + 1) + h[0];
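(For reference, the fmix64 used above is MurmurHash3's 64-bit finalizer, as defined in this repo's MurmurHash3.cpp:)

uint64_t fmix64(uint64_t k) {
  k ^= k >> 33;
  k *= 0xff51afd7ed558ccdULL;
  k ^= k >> 33;
  k *= 0xc4ceb9fe1a85ec53ULL;
  k ^= k >> 33;
  return k;
}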

As regards the collision resistance of this, please read https://github.com/aappleby/smhasher/issues/58#issuecomment-380217772. In short: for estimating how often collisions occur, the data you hash matters more than the mixing algorithm (as long as the mixing keeps the same or similar mathematical logic with respect to bit-distribution quality).

I feel it would be the same as for a GUID

A GUID is also not really unique, but the time is usually mixed in there (so either the probability of getting a duplicate within the same millisecond is very low, or, depending on the algorithm used, it is quasi-impossible only because of mixing in some world-pseudo-unique parameters like MAC/pid/tid/date/time/ms, as with sequential GUIDs). In this case cutting a GUID short is a very bad idea.

Anyway, I have already seen two equal non-sequential GUIDs in a very large database of a large clustered application (the second was generated 10 years after the first).

Then taking a substring of the MurmurHash3-128 output could have unexpected effects, like raising the collision risk much more than anticipated for a 64-bit output.

This happens in any case when you decrease the hash size (independent of the algorithm used).

lemire commented 6 years ago

What mathematical proof do I have that these 64 bits are as collision-resistant as a dedicated 64-bit MurmurHash3-type algorithm could be?

If you can find a 64-bit hash function family that is far superior to MurmurHash3-truncated-to-64-bits, then I submit to you that you have found a glaring flaw in MurmurHash3.

You might enjoy this related reference:

raeburn commented 6 years ago

wiz0u writes:

If you mean "take the low-order 64 bits from the 128-bit output": what mathematical proof do I have that these 64 bits are as collision-resistant as a dedicated 64-bit MurmurHash3-type algorithm could be?

Probably none. If you’re asking for “mathematical proof”, I wonder if your requirements might be strict enough that you want a stronger hash function. It’s pretty easy to intentionally generate messages that’ll collide under MMH3.

I will suggest this: Many failure modes you might imagine for a truncated hash (output values unevenly distributed, correlated output bits, output bits correlated with input bits) would also imply similar failures in the full version of the hash, and they’d have to have slipped past testing so far, for whatever that’s worth.

I’d be curious to see how a MurmurHash3_x64_64 would perform vs the 128-bit version.

"Sergey G. Brester" writes:

If you mean "take the low-order 64 bits from the 128-bit output"...

Why not both (with any mathematical expression in-between)...

uint64_t *h = (uint64_t *)mmh128_res; /* either simply sum */ return h[0] + h[1];

In general, but especially for a non-cryptographic hash like MMH3, I’d be wary of inventing some additional mixing operation like this without regard to how the hash itself operates.

I certainly understand the temptation to throw on an extra layer of mixing, like shuffling a deck of cards one or two more times before using it. If there’s any semblance of order still left, it’ll probably help, right? And even if not, what harm could it do? Been there, done that...

The MMH3 algorithm ends with:

h1 = fmix64(h1);
h2 = fmix64(h2);
h1 += h2;
h2 += h1;
...store h1, h2...

So if your “any mathematical expression” is doing subtraction, you might wind up essentially negating the last mix step or two of MMH3. You could wind up using one of the fmix64 return values and discarding the rest of the internal state at that point!

How bad is that? I don’t know, but the author presumably had the final additions there for a reason. (Perhaps a reason that only really makes sense if you’re keeping more than 64 bits of the output?)

A lot of hash functions cycle between different types of operations. You see it in MMH3, too: multiply, rotate, multiply, XOR, rotate, multiply+add. If you are going to do some additional mixing after MMH3, since MMH3 ends with additive mixing, I'd suggest doing something different: logical operations (shift/rotate several positions and then XOR) or multiplication (ignoring the bottom bits of the result, which are more likely to be 0 than 1).
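(A minimal sketch of that kind of follow-up mix, with an XOR-rotate followed by a multiply; the function name, rotation amount, and odd constant here, borrowed from splitmix64, are arbitrary illustrative choices, not a vetted construction:)

#include <stdint.h>

static inline uint64_t rotl64(uint64_t x, int r) {
  return (x << r) | (x >> (64 - r));
}

/* fold the two 64-bit halves using operations unlike MMH3's final additions */
uint64_t fold128to64(uint64_t h0, uint64_t h1) {
  uint64_t x = h0 ^ rotl64(h1, 29);  /* rotate several positions, then XOR */
  x *= 0x9e3779b97f4a7c15ULL;        /* odd multiplicative constant */
  return x ^ (x >> 32);              /* fold the better-mixed high bits down */
}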

/* or simply xor with shift */ return h[0] ^ (h[1] << 1);

My intuition says that shifting by only one bit before XORing, right after MMH3's final addition that uses one of the same operands, is something to look at very closely in case it interacts in some detrimental way (a carry during addition also affects the next bit position up), but I haven't done the work to figure out whether it's an issue.

When I find myself throwing together something like this, I try to stop and ask myself: Is this actually better, or is it just harder for me (but maybe not an experienced, determined attacker) to see any problems because I’ve made it more complicated?

Brian Kernighan is quoted as saying, “Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.” I think something similar holds for analyzing encryption and hash functions.

/* or sum by masked [1..4] multiplication of fmix of lower bits */ return h[1] * ((fmix64(h[0]) & 3) + 1) + h[0];

Okay... same question applies, better or just more complicated?

Here we get similar linear combinations of the fmix64 return values as we have with h[0]+h[1]. And if you choose this because you think there’s some sort of problem with using just h[0]+h[1], well, you’ll still wind up with the value h[0]+h[1] about one time in four, so you’ve probably still got a problem, even if you think h[0]+4*h[1] is okay.

Ken

sebres commented 6 years ago

I wonder if your requirements might be strict enough that you want a stronger hash function.

My intuition says that shifting by only one bit before XORing, right after MMH3's final addition that uses one of the same operands, is something to look at very closely in case it interacts in some detrimental way

I think something similar holds for analyzing encryption and hash functions.

so you’ve probably still got a problem, even if you think h[0]+4*h[1] is okay.

I'll repeat again (see https://github.com/aappleby/smhasher/issues/58#issuecomment-380217772):

The MMH3 function does not belong to the class of strong cryptography. It is a hash function whose primary role is to provide a good distribution within a short time, so it trades off "correctness" for speed.

For these reasons your arguments are unfounded. And one extra summation (or even four) does nothing in the case of an MMH hash, simply because of its deterministic character and because summation is already a large part of the algorithm itself (repeated multiple times within each iteration).

The fact is, a 64-bit hash value will produce a collision roughly 2^32 times as often as a 128-bit hash (2^64 / 2^32 = 2^32; the birthday paradox tells us we'll hit the first collision after at most about 2^(N/2) values). That's in the "ideal" case of pseudo-random data, which the data of @wiz0u probably is not.
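(For a concrete sense of scale, a back-of-the-envelope check assuming uniformly distributed hash values; the n(n-1)/2^(d+1) expected-pairs formula is the standard birthday approximation, not something from this thread:)

#include <math.h>
#include <stdio.h>

int main(void) {
  double n = 5e6;  /* 5M keys, as in the tests below */
  for (int d = 32; d <= 128; d += 32) {
    /* expected number of colliding pairs among n keys with a d-bit hash */
    double pairs = n * (n - 1) / pow(2.0, d + 1);
    printf("%3d-bit hash: ~%g expected colliding pairs\n", d, pairs);
  }
  return 0;
}
/* prints roughly 2910 at 32 bits, 6.8e-07 at 64 bits, ~0 at 96 and 128 bits */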

What I want to explain here is that, when estimating collision properties, the "quality" of the data used for hashing plays a more important role than the way the 128-bit hash is "truncated" to a 64-bit value.

As an example of such an (experimental) estimation, I tried to reproduce the same test that @derste2 once made. Hashing 5M file paths (UTF-8, but mostly ASCII) produces in my case:

Then I repeated the same process, this time for 5M unique pseudo-random strings (byte arrays, 100 bytes long):

You see, the only difference was the data, or rather the class of data: ASCII strings vs. byte arrays (in one case one algorithm was better suited, in the other case vice versa).
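(For anyone wanting to reproduce this kind of experiment, a minimal sketch assuming this repo's MurmurHash3_x64_128 and pseudo-random 100-byte keys, counting collisions of the truncated 64-bit value:)

#include <cstdint>
#include <cstdio>
#include <random>
#include <unordered_set>
#include <vector>
#include "MurmurHash3.h"

int main() {
  std::mt19937_64 rng(42);             /* fixed seed for reproducibility */
  std::vector<uint8_t> key(100);
  std::unordered_set<uint64_t> seen;
  uint64_t collisions = 0;
  for (int i = 0; i < 5000000; ++i) {
    for (auto &b : key) b = (uint8_t)rng();  /* 100-byte pseudo-random key */
    uint64_t out[2];
    MurmurHash3_x64_128(key.data(), (int)key.size(), 0, out);
    if (!seen.insert(out[0]).second)         /* truncated to the low 64 bits */
      ++collisions;
  }
  std::printf("64-bit collisions among 5M keys: %llu\n",
              (unsigned long long)collisions);
}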