Number of objects used README experiments

RagnarGrootKoerkamp commented 11 months ago

The readme provides Throughput numbers, but I'm wondering what is the dataset size used. Typically I would assume that throughput is higher for small sets and lower for larger datasets due to caching and such, so it's useful to provide this number.

ByteHamster commented 11 months ago

Thanks, I added the input set size to the readme and also brought the table up-to-date with the improved numbers from the final version of the paper. We use rather small datasets with 5 million objects. The reason is that we are here mainly interested in seeing what's possible in terms of space usage. Close to the lower bound, the construction performance drops quite a lot. The query throughput (probably more dependent on n than the construction because it needs random memory accesses) is quite a bit lower than with other, way less space-efficient (>25% more space) approaches. If you are interested, there are more details and measurements in our paper.

If you are interested in extremely space-efficient perfect hashing, we also have a more recent result called bipartite ShockHash which brings the practically achievable space down to just 1.489 bits/key (lower bound is 1.442). It doesn't have a GPU implementation, though.

I'm curious: Are you working on perfect hashing directly or are you using it as a building block?

RagnarGrootKoerkamp commented 11 months ago

Hey :)

Yeah it sounds totally expected that performance gets ~exponentially worse the closer you get to the lower bound. I'm quite impressed by how low RecSplit is able to push this! (I didn't spend enough time on the original paper yet to fully understand the details of how though.) For ShockHash it seems like you're approaching this from the direction of the theoretical lower bound, which intuitively makes sense to get low overhead results.

Some very biased thoughts (see below ;)

One the one hand it's super cool you're pushing the theoretical lower bound, on the other hand I doubt it matters for applications whether you use 1.45 or 1.5 bits/key (or even ~1.7 sgtm).
So all these methods have relatively slow construction time and you only test them up to millions of keys. Can they scale to (hundreds of) billions of keys? I suppose you would want to add some partitioningand parallellization layer on top (alike PTHash-HEM).
I suppose query speed is inherently slow for RecSplit based methods. Surely it can be squeezed a bit but it sounds difficult if you inherently need multiple dependent memory accesses. (I am assuming that is the case.)

So I'm currently working on PTRHash, an improved version of PTHash. This:

Currently supports up to 2^32 keys but will be extended to 2^40 keys very soon.
Builds 1e9 keys in 120s (1 thread), so 120ns/key. But note that it parallellizes extremely well. (For 5M keys single thread it's 60ns/key construction, and 6ns/key query.)
Queries are only a single lookup 98% of the time, and if not a single extra lookup is needed. In practice this means that query speed is limited by the memory bandwidth, and with 3 threads you can get 3ns/key throughput.

But then of course there is the big drawback compared to RecSplit:

It uses 2.3-2.7 bits/key depending on the exact configuration used.

So basically this is focussed on applications where query throughput is important, especially focussing on bioinformatics. It's also used as part of some larger indexes where the total storage is more like 8 bits/key, so the overhead here is relatively less on the total.

ByteHamster commented 11 months ago

I doubt it matters for applications whether you use 1.45 or 1.5 bits/key

Maybe, at least if you are interested in query performance. However, if we find a way to make the queries fast (by making the base case larger like in ShockHash, we already move a bit into that direction), this would still be interesting :) But I agree that much of what we are doing is more in the area of fundamental research :) Our older SicHash is a bit more focused on query speed. Seems to have a similar space usage and construction performance as FMPHGO, but slightly faster queries.

Can they scale to (hundreds of) billions of keys? I suppose you would want to add some partitioning and parallellization layer on top (alike PTHash-HEM).

SIMDRecSplit does have a parallel implementation, and the construction for larger space consumption is quite a bit faster than PTHash (see plots in readme file). An advantage is that it does not simply use the standard trick of partitioning (which causes some small query overhead) but uses the internal structure to parallelize without affecting the queries. However, based on RecSplit, the queries are still quite slow, even in the less space efficient configurations.

Surely it can be squeezed a bit but it sounds difficult if you inherently need multiple dependent memory accesses.

I think computational overhead also plays quite a big role in RecSplit's query speed. It needs to traverse a tree of variable-length encoded seed values for each query.

So I'm currently working on PTRHash

Interesting. I currently have a student who also worked on PTHash and he had very similar ideas. Fixed-width pilots, displacing and partitioning are also described in his masters thesis. The main ideas in the thesis seem to be orthogonal to your ideas, though.

RagnarGrootKoerkamp commented 11 months ago

SIMDRecSplit does have a parallel implementation, and the construction for larger space consumption is quite a bit faster than PTHash (see plots in readme file). An advantage is that it does not simply use the standard trick of partitioning (which causes some small query overhead) but uses the internal structure to parallelize without affecting the queries.

Ah yes, this is also what I'm currently doing in PTRHash.

Interesting. I currently have a student who also worked on PTHash and he had very similar ideas. Fixed-width pilots, displacing and partitioning are also described in his masters thesis. The main ideas in the thesis seem to be orthogonal to your ideas, though.

Would you have a link to this (either the thesis or code)? Would be interesting to read and/or cite.

ByteHamster commented 5 months ago

Hi @RagnarGrootKoerkamp, it has been long since you asked for it but we finally have something to show.

Here is the master thesis that mentions fixed-width pilots (section 4.5): https://doi.org/10.5445/IR/1000164413

We ended up not using the fixed-width pilots in the final implementation but we do use partitioning and displacing. The main new idea is to not use the 60%/30% distribution of expected bucket sizes but a more granular distribution where each bucket has its own expected size. Additionally, for encoding the pilots, we use an "interleaved" approach that encodes the i-th pilot of all partitions together. We just uploaded the TR yesterday: https://arxiv.org/pdf/2404.18497

Let me know if you have any questions or want to discuss more on perfect hashing :)

Closing this issue now because the number of objects is now mentioned in the readme.

ByteHamster / GpuRecSplit

Number of objects used README experiments #2