daokoder / dao

Dao Programming Language
http://daoscript.org
Other
199 stars 19 forks source link

Change MurmurHash 3 to a faster hash without sacrificing quality #528

Open dumblob opened 7 years ago

dumblob commented 7 years ago

TL;DR

Jump to the end of this thread to see the latest proposal.

Original proposal:

As per tests on https://github.com/Cyan4973/xxHash it seems xxHash is a way faster (2x) option to the currently used MurmurHash 3 while offering the same quality. xxHash is BSD licensed (i.e. compatible with MIT).

daokoder commented 7 years ago

The MurmurHash3 was not only selected for speed but also for simplicity. Since MurmurHash3 is already fast enough and hashing comprises only a tiny fraction of the total execution time in most applications, doubling the hashing speed will not bring noticeable improvements in speed. For applications that really need fast hashing, providing appropriate modules will be a better solution.

dumblob commented 7 years ago

I agree, that it's not needed at the moment and that the linked library is quite horrible. I'm though still convinced the algorithm should be built in the DaoVM reference implementation. My Dao source code uses hash maps a lot (and for critical stuff as fast network data processing) and internally Dao uses also hashing, so I believe in bigger applications this will play a significant role.

xxHash is also quite simple (look at a "high-level" implementation in JS - 32bit 64bit - which is way nicer than the original C implementation full of ifdefs and other unnecessary clutter).

In my opinion it's worth it to rewrite this high-level implementation in C99 just for the DaoVM purpose - it'll be small. The linked xxHash C library should not be incorporated as you said because it's bloated and might become a Dao module at most.

rurban commented 7 years ago

xxHash is only fast for large buffers. For small buffers or usage in a hash table FNV1 is still the fastest. Murmur 3 is also not really recommended.

daokoder commented 7 years ago

For small buffers or usage in a hash table FNV1 is still the fastest.

Due to its simplicity, it should be quite fast. Just curious, are there many FNV primes for a given bit length? If not, there will be not be enough hash seeds to choose randomly, when it is necessary to defend agains hash key collision attacks (e.g. in web application).

Murmur 3 is also not really recommended.

Why is that? Any benchmarks to compare FNV1 and Murmur3?

dumblob commented 7 years ago

xxHash is only fast for large buffers. For small buffers or usage in a hash table FNV1 is still the fastest. Murmur 3 is also not really recommended.

I'm certainly missing something as both arguments seem void based on the tests in your, @rurban, repository: https://github.com/rurban/smhasher .

rurban commented 7 years ago

You have to use the right benchmarks :)

If you just test the hash function by itself, the longer wordwise ones dominate of course. But this is only useful for file or networking digests.

If you test them inside a typical hash table, usually inlined if small enough, the smaller ones dominate. You have to follow the link to "When used in a hash table the instruction cache will usually beat the CPU and throughput measured here. In my tests the smallest FNV1A beats the fastest crc32_hw1 with Perl 5 hash tables." https://github.com/rurban/perl-hash-stats#average-case-perl-core-testsuite

dumblob commented 6 years ago

Well, the primary motivation for the switch was speed, but I do trust quite a lot the smhasher "benchmark" and mainly the quality classification whereas I would never accept any hash function, which doesn't perform at least at the level GOOD (otherwise there are security risks with severe effects like DoS originating in bad distribution of keys or O(n^2) worst-case scenarios triggered by a specially crafted inputs/packets/..., which is not acceptable for default built-in algorithms of general purpose languages like Dao).

Should this come again to question, it should be quite easy to implement xxHash portably based on the xxHash specification.

rurban commented 6 years ago

dumblob notifications@github.com schrieb am Do., 17. Aug. 2017, 19:42:

xxHash is only fast for large buffers. For small buffers or usage in a hash table FNV1 is still the fastest. Murmur 3 is also not really recommended.

I'm certainly missing something as both arguments seem void based on the tests in your, @rurban https://github.com/rurban, repository: https://github.com/rurban/smhasher .

Then read it again, what is tested there.

You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/daokoder/dao/issues/528#issuecomment-323143890, or mute the thread https://github.com/notifications/unsubscribe-auth/AACjUdZo-GhBrxLQfC0OQMgVbL8jDbaSks5sZHtVgaJpZM4KxLxW .

dumblob commented 6 years ago

It seems you mean FNV1a_YT instead of FNV1a. Then if we'll ignore CRC-based hashes it's the fastest. But also one of the very few with the absolute worst quality 😉 .

I can see in your smhasher-list though, that even faster than xxHash (especially for smaller keys as you noted) is t1ha - and without sacrificing any quality (according to smshasher). I have to admit I'm really surprised, that something can be so quick without sacrificing quality while being totally portable (yeah, it'll be slower on HW without native 64bit arithmetics, but that's a negligible issue nowadays).

rurban commented 6 years ago

Partially. FNV1a and it's mult. variants are the fastest for hash tables, besides _mm_crc32_u64 and t1ha. Even if the quality is a bit worse, the icache (the hash function) and dcache (the hash data structure) are more important than everything else.

Regarding security against DOS attacks, forget about the hash function, as even siphash can be brute forced. Only the collision resolution scheme can protect against DOS attacks.

dumblob commented 5 years ago

There is a new hash function wyhash, which seems to raise the bar of hashing (including PRNG) by a giant step. It's less complicated than xxHash, but more than MurmurHash. So if there will be any need, then there is something we can grab immediately as it's significantly faster and with a state of the art distribution of keys.

See also smhasher tests which @rurban regularly updates.

daokoder commented 5 years ago

as it's significantly faster and with a state of the art distribution of keys.

Where did you see this?

Please note speed is not the most important factor in choosing a hash function. wyhash seems quite new, not many people have looked into it to spot potential problems. It would be very unwise to switch to it.

Also, just with a glimpse of the wyhash source code, it is obvious that it will produce different results on machines with different endianness. This is quite alarming for me.

dumblob commented 5 years ago

Where did you see this?

https://github.com/rurban/smhasher/blob/master/doc/wyhash https://github.com/rurban/smhasher/blob/master/doc/wyhash32low

Please note speed is not the most important factor in choosing a hash function.

Sure, that's why I noted "if there will be any need" :wink:.

This is quite alarming for me.

This is interesting. Are there any places in Dao which require that? Or do we expose it somewhere, so that it allows e.g. sending a computed hash over network and blindly using it on another machine? Bytecode? Or something else?

wangyi-fudan commented 5 years ago

师兄好,我是2000级复旦生科的。wyhash是小弟开发的。 ^_^

rurban commented 5 years ago

dumblob notifications@github.com schrieb am Do., 13. Juni 2019, 12:44:

Where did you see this?

https://github.com/rurban/smhasher/blob/master/doc/wyhash https://github.com/rurban/smhasher/blob/master/doc/wyhash32low

Please note speed is not the most important factor in choosing a hash function.

Sure, that's why I noted "if there will be any need" 😉.

This is quite alarming for me.

Murmurhash3 is also not endian compatible. Besides much slower, bigger and worse quality than wyhash. And both are 64bit only.

There are some good endian-portable and 32bit only hashes also.

This is interesting. Are there any places in Dao which require that? Or do

we expose it somewhere, so that it allows e.g. sending a computed hash over network and blindly using it on another machine? Bytecode? Or something else?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/daokoder/dao/issues/528?email_source=notifications&email_token=AAAKGUNWQJ4VPAWELHWWVZLP2IQHDA5CNFSM4CWEXRLKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODXTJJ7Q#issuecomment-501650686, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAKGULA4HGEMHMAUIVEIGTP2IQHDANCNFSM4CWEXRLA .

daokoder commented 5 years ago

This is interesting. Are there any places in Dao which require that? Or do we expose it somewhere, so that it allows e.g. sending a computed hash over network and blindly using it on another machine? Bytecode? Or something else?

Dao does not really require that. In fact, I didn't pay attention to this until recently. In some networked applications (such as certain types of games), the clients and server are required to have deterministic and identical behavior, which cannot be guaranteed if endian-dependent hashes are used for map keys.

There are some good endian-portable and 32bit only hashes also.

I had a look at MurmurHash3, it turns out it is not endian-neutral either, but it can be easily fixed. I will also look into theses hashes when I am less busy.

@rurban Curious, what platforms and compilers do you use for smhasher tests? I noticed that wyhash uses intrinsics, could its speed is mainly a result of this? If this is the case, it should not be very difficult to modify some other hash functions to use intrinsics too.

daokoder commented 5 years ago

师兄好,我是2000级复旦生科的。wyhash是小弟开发的。 ^_^

你好,不错:thumbsup:

dumblob commented 5 years ago

In some networked applications (such as certain types of games), the clients and server are required to have deterministic and identical behavior, which cannot be guaranteed if endian-dependent hashes are used for map keys.

Well, this can be (and shall be IMHO) easily ensured in the serialization/deserialization routines from the Dao standard library before/after sending/receiving it over network. Another (even better) solution would be to use e.g. Cap'n Proto or sproto for such "shared" structures to avoid having to pay attention to such details.

So in the end I'm still not convinced endianess neutrality really matters in Dao :wink:.

daokoder commented 5 years ago

Well, this can be (and shall be IMHO) easily ensured in the serialization/deserialization routines from the Dao standard library before/after sending/receiving it over network.

It is not about sending data over network. I was talking about something like Deterministic Lockstep for Networked Games

dumblob commented 5 years ago

It is not about sending data over network. I was talking about something like Deterministic Lockstep for Networked Games

Thanks for the article. I'm though still not there as I'm not aware of any Dao/Dao_modules API which would expose the internals of a hash function and I can't imagine how e.g. the hash table API would cause a different result when replaying inputs as in the deterministic lockstep scenario (assuming just the Dao APIs were involved). I must be still overlooking something - I'm eager to get enlightened :wink:.

daokoder commented 5 years ago

I must be still overlooking something

You indeed overlooked something: for-in iteration over hash map😄.

dumblob commented 5 years ago

Yeah, it's tedious, but again - look at the scenario below. In this case it absolutely doesn't matter, that one machine is little endian and the other big endian.

unstable_hash00

So what is the hidden secret I'm missing?

daokoder commented 5 years ago

So what is the hidden secret I'm missing?

Ok, maybe you are unaware that Dao hash map is an unordered associative array. If a hash map has multiple keys, the keys will be sorted by their hash values (which may appear unordered if the ordering by hash values is different from ordering by their actual values). If the hash function is not endian-neutral, different endianness can generate different hash values for the keys, so they can be visited in different orders in for-in iteration.

And forget about data over network, the application could use hash maps for internal data that are not communicated over network. As an example, game engine such as Urho3D uses hash maps for scene node components and event registration etc., their handling will depend on the ordering of the keys.

dumblob commented 5 years ago

Oh yeah, thanks for explanation. I must say I didn't see anywhere in Dao doc, that hash maps are ordered and that they're even deterministically iterable and that the order is guaranteed cross-machine. And because it's a hash map I would never get the idea, that there is some defined ordering and - crosses himself - never expected that a hash map will be iterated in the same order even if it's the same invariable map immediately iterated two times in a row.

With a tree-based (e.g. weighted red-black) map that's a different story, but with hash map it's really surprising that Dao needs something like endianess independence. In the worst case I would call it an optimization for the case when one wants to sort the iterated hash map pairs (key + value) using an algorithm which strongly benefits from a partially sorted sequence (e.g. Tim sort), but even then I'm not sure whether the requirement of having an endianess independent hash function is justified.

daokoder commented 5 years ago

Oh yeah, thanks for explanation. I must say I didn't see anywhere in Dao doc, that hash maps are ordered and that they're even deterministically iterable and that the order is guaranteed cross-machine. And because it's a hash map I would never get the idea, that there is some defined ordering and - crosses himself - never expected that a hash map will be iterated in the same order even if it's the same invariable map immediately iterated two times in a row.

Hash map in many languages/implementations including Dao does guarantee a deterministic ordering of the keys, it's just that it is not a meaningful ordering. It is also called unordered map or unordered associative array so that users are reminded that they should not depend on the ordering of hash keys for their programs. Nevertheless, hash map in Dao as well as in many other languages/implementations is deterministically iterable, for tasks that does not depend on the ordering of hash keys.

but with hash map it's really surprising that Dao needs something like endianess independence.

but even then I'm not sure whether the requirement of having an endianess independent hash function is justified.

It's not a requirement, but a feature that is nice to have in Dao. This feature will remove some pitfalls related to hash map. Maybe most users will not encounter such pitfalls, but who knows how they will use hash maps and what kind of problems they will try to solve.

I believe language designers should never speculate how users will use their languages and expect them to use their languages properly or in certain ways, and programming languages should be designed and implemented to contain as less pitfalls as possible.

daokoder commented 5 years ago

Just checked, Lua and the game engine I am using uses endianness-neutral hash functions for hash map. I haven't look into other languages, I think many of them are also using endianness-neutral hash functions (at least any serious language should).

Interesting, for string keys, Lua (5.2.3) does not use the whole strings for hashing.

rurban commented 5 years ago

Limin Fu notifications@github.com schrieb am Mi., 19. Juni 2019, 17:15:

Oh yeah, thanks for explanation. I must say I didn't see anywhere in Dao doc, that hash maps are ordered and that they're even deterministically iterable and that the order is guaranteed cross-machine. And because it's a hash map I would never get the idea, that there is some defined ordering and - crosses himself - never expected that a hash map will be iterated in the same order even if it's the same invariable map immediately iterated two times in a row.

Hash map in many languages/implementations including Dao does guarantee a deterministic ordering of the keys, it's just that it is not a meaningful ordering. It is also called unordered map or unordered associative array so that users are reminded that they should not depend on the ordering of hash keys for their programs. Nevertheless, hash map in Dao as well as in many other languages/implementations is deterministically iterable, for tasks that does not depend on the ordering of hash keys.

deterministic ordering leads to your insecure hashmaps. secure hashmaps need to be fully indeterministic, randomly ordered and randomly seeded. Any language which provide ordered iterations are either slow or insecure.

but with hash map it's really surprising that Dao needs something like

endianess independence.

but even then I'm not sure whether the requirement of having an endianess independent hash function is justified.

It's not a requirement, but a feature that is nice to have in Dao. This feature will remove some pitfalls related to hash map. Maybe most users will not encounter such pitfalls, but who knows how they will use hash maps and what kind of problems they will try to solve.

platform independent deterministic hashing has the same problems as above. best is not to expose any random ordering, eg in JSON or YAML exported maps. clients need to assume fully random ordering.

I believe language designers should never speculate how users will use

their languages and expect them to use their languages properly or in certain ways, and programming languages should be designed and implemented to contain as less pitfalls as possible.

Just checked, Lua and the game engine I am using uses endianness-neutral hash functions for hash map. I haven't look into other languages, I think many of them are also using endianness-neutral hash functions (at least any serious language should).

no, for sure not. serious languages should not expose security-critical internals, should not rely on internal secrets. random old popular languages do, but they were not properly designed and have to carry now the burden of old mistakes.

daokoder commented 5 years ago

deterministic ordering leads to your insecure hashmaps. secure hashmaps need to be fully indeterministic, randomly ordered and randomly seeded. Any language which provide ordered iterations are either slow or insecure.

I think the default behaviour should be deterministic. If you want non-deterministic behaviour, you can introduce it explicitly. In Dao, you can actually construct randomly seed hash maps using the explicit constructor:

map<@K=any,@V=any>( count: int, hashing: enum<none,auto,random>|int = $none )
            [index: int => tuple<@K,@V>] => map<@K,@V>

or reset the hashing seed randomly using:

reset( self: map<@K,@V>, hashing: enum<none,auto,random> )

Also, Dao modules web.cgi and web.http use randomly seeded hash maps for http get, post and cookie data.

dumblob commented 5 years ago

It's not a requirement, but a feature that is nice to have in Dao.

Understood. I also feel it's a nice to have feature.

This feature will remove some pitfalls related to hash map. Maybe most users will not encounter such pitfalls, but who knows how they will use hash maps and what kind of problems they will try to solve.

I tend to agree with @rurban that the default setting shall be secure rather then deterministically iterable. One can then use the explicit constructor or reset() to make the iteration deterministic or just use sort().

I'm also certain, that if we really want to support deterministically iterable maps, we can implement both hashing alogrithms. By default the faster one with random ordering with the constraint, that calling reset($auto) will make a complete rehashing due to switch to a different algorithm. I'm actually certain, that such "rehashing" won't be used much in practice and I'm also certain, that even the deterministic iteration will be used very sporadically.