blachlylab / dklib

D ports of klib algorithms
MIT License
5 stars 5 forks source link

Khashl addition #3

Closed charlesgregory closed 4 years ago

charlesgregory commented 4 years ago

Added khashl and made some changes to the benchmark and dub file. Khashl is hash table that performs deletions without tombstones allowing it to have a smaller memory footprint for some sacrifices in speed. It uses fibonacci hashing. Described here.

attractivechaos says this about khashl:

TL;DR: With linear probing, we can delete elements from an open addressing hash table without tombstones.

Deletion speed vs bytes per key

Deletion speed vs bytes per key

Interestingly, khashl is slower than my older khash.h. This may be caused by a combination of two effects. First, due to the presence of tombstones, khash.h has to double the number of buckets, resulting in fewer collisions. It implicitly trades memory for speed. Second, khashl may need to move multiple elements upon a deletion. Copying buckets can be slow. That said, the new deletion algorithm is only a little slower than khash.h and is faster than many other libraries for this particular task. It might also become faster than khash under a different payload (e.g. large batch of deletions).

Most notably:

In addition, khashl has simpler insertion. It is faster than khash even if no deletions are involved.

Considering the clear advantage in memory, I think the new deletion algorithm without tombstones is overall better than traditional algorithms. It should become the preferred way to implement hash tables.

Time, in msec, for n=500,000 operations benchmarked on WSL (Ryzen 1700); ldc2 -release

Operation HashMap khash khashl khashl (cached)
Insert 931 398 261 142
Retrieve (Serial) 14 1 18 --
Retrieve (Random) 50 25 36 --

There are no retrieval times for the cached version as that benchmark uses integer keys and my khashl template prevents that at compile time as there should be no real benefit for integers.

attractivechaos says this about hash-caching:

When we use long strings as keys, comparing two keys may take significant time. This comparison is often unnecessary. Note that the hash of a string is a good summary of the string. If two strings are different, their hashes are often different. We can cache the hash and only compare two keys when their hashes are equal.

jblachly commented 4 years ago

@charlesgregory How much of khashl is identical to khash? A bit it seems. Would it be sensible to factor out (e.g. the hash functions), or is it more important that we retain source-level homology for future updates?

charlesgregory commented 4 years ago

@charlesgregory How much of khashl is identical to khash? A bit it seems. Would it be sensible to factor out (e.g. the hash functions), or is it more important that we retain source-level homology for future updates?

It appears that the uint and ulong hashing functions may be different for khash vs khashl. I think a factored-out pool of shared hashing functions is not out of the question. We can use aliases to keep decent source-level homology. Up to you though. https://github.com/charlesgregory/dklib/blob/bca6747bbeb4e1ff55063118a14b834152bb9716/source/dklib/khashl.d#L413-L436 https://github.com/charlesgregory/dklib/blob/bca6747bbeb4e1ff55063118a14b834152bb9716/source/dklib/khash.d#L456-L472

jblachly commented 4 years ago

Interesting, didn't notice that. I was also thinking of the _put and _resize functions

I am happy to accept the code as is , if you'll move the benchmark to examples/ and revert the dubfile

jblachly commented 4 years ago

@charlesgregory ping

jblachly commented 4 years ago

@charlesgregory You'll also need to rebase on current master since we updated dubfile

charlesgregory commented 4 years ago

As a side note I have found that the free statements in the dtors of both khash and khashl can cause SEGSEVs so in normal use I usually comment this out. I notice this most when making a hashmap of hashmaps.

kfree(cast(void*) this.keys);
kfree(cast(void*) this.flags);
kfree(cast(void*) this.vals);

Edit: I forgot I actually opened an issue for this #2

charlesgregory commented 4 years ago

Did a rebase but I think I should've just done a merge. Either way I would do a squash merge to get rid of the commit history mess.

jblachly commented 4 years ago

Also do you need to make the changes to khasl that @n8sh made to khash? (regarding inlineing, which I accepted, and parameter in qualifier which is in the pending PR) ?

charlesgregory commented 4 years ago

I messed up the rebase/merge so we are closing this pull request and opening a new one.