On the choice of integer / value hashing algorithm

plajjan commented 1 year ago

We should have a proper algorithm for hashing of ints and other things. We should probably aim for something DoS resistant too.

Conversation started in https://github.com/actonlang/acton/pull/992#issuecomment-1292166708:

The hash functions for int should be considered temporary: it uses only the last 64-bit digit. But this is part of the more general question to change all hash functions in builtin/hash.c. This was one of the first C files written in this project, and reiles too much on Python hash decisions. There are several much better proposals out there; what is annoying is that it is difficult to find good advice as to which of the modern alternatives to choose. This has partly to do with the question of whether protection against hash flooding attacks are important; when Acton conquers the clooud this may be the case...

@plajjan found this the other day: https://twitter.com/pcwalton/status/1583931446305038336

So .NET appears to switch from a simple & fast hashing algorithmm to something DoS resistant. Can we do the same?

sydow commented 1 year ago

There is one obvious candidate: siphash. It is widely used and developed by people with a very good reputation. For proper DoS protection, it should be combined with a truly random key, generated from a random source at program startup. This also requires that we rethink serialization of dictionaries and sets; we will need to store a list of key/value pairs and a list of elements, respectively, and rebuild at deserialization. We cannot just memcopy the content of the hash table, since the hash function will have a new key the next time we run the program.

sydow commented 1 year ago

I missed your link to the comment on siphash and Rust. Yes, my understanding is that siphash is a bit slow, in particular on long strings, which I doubt is our typical use case. But my understanding is not very well-founded...

plajjan commented 1 year ago

I think .NET is using xxHash and people are talking about it as a very fast hash algorithm: https://github.com/Cyan4973/xxHash

Is this idea of using one hash and switching to a more expensive one feasible? Like could we switch? How would we switch? I imagine we would want to decide this per table/dict, right? So fort most places we'd need a hash we can run with the fast one but if we notice lots of collisions in some dict we can switch to another?

I'm really not used to thinking about this problem so maybe I'm going about it all wrong.

plajjan commented 1 year ago

If we have to pick one, I agree that the safer siphash is a better choice.

actonlang / acton

On the choice of integer / value hashing algorithm #1009