AMDComputeLibraries / morton_filter

A compressed, sparse cuckoo filter (see https://www.vldb.org/pvldb/vol11/p1041-breslow.pdf)
MIT License

Self-resizing #1

Closed by sebdeckers 3 years ago

sebdeckers commented 5 years ago

The readme mentions self-resizing and the code has a resize function. However I could not find reference to this in the paper. Is this mechanism (and potential trade-offs) described somewhere?

abreslow commented 5 years ago

As of today, there is no publicly available description of the algorithm and its trade-offs. However, an extended version of the VLDB'18 paper has been accepted at the VLDB Journal, and it describes how resizing works and some associated caveats. That extended work also describes how to backport the algorithm to standard cuckoo filters. I backported the algorithm to Fan et al.'s codebase and got resizing throughput that was a bit better than the best-performing Morton filter configuration. Morton filters have additional overheads during resizing that arise from their compressed block format and associated embedded metadata arrays.

I will try to see if I can get AMD's approval for putting the review copy on arXiv.

abreslow commented 5 years ago

I wanted to give some context on what happens during resizing, since it may be a while before the full text is released and the resizing operation is a key feature. The inspiration for self-resizing comes from quotient filters, another approximate set membership data structure. Unfortunately, quotienting doesn't work straight out of the box on a Morton filter. Specifically, the common formulation of quotienting assumes power-of-two table sizes and is not blocked, whereas a Morton filter needs both non-power-of-two table sizes and a blocked layout to maintain all of its features upon resizing.

[Image: resize]

Above is an image depicting self-resizing as it is done in a Morton filter. The technique is similar to quotienting, although not exactly the same. Morton filters currently support growing the filter's capacity multiplicatively by powers of two (e.g., 2x, 4x, or 8x larger); for now, the resize has to be called manually. A resize thus involves two block stores: the originating block store (Old Block Store in the figure) and the block store to which the Old Block Store's fingerprints will be relocated (New Block Store). In the figure above, we resize by a factor of 4x, which means that there are four blocks in the New Block Store for every block in the Old Block Store.

To map fingerprints from the Old Block Store to the New Block Store, we define a mapping that for each block in the Old Block Store remaps its fingerprints to four adjacent child blocks in the New Block Store. To determine which block in the New Block Store receives each fingerprint, we use a deterministic subset of the bits of each fingerprint. In the figure, we assume that the Morton filter has already been resized by 2x. During that resizing, the most significant bit of each fingerprint was used to select between the two candidate blocks.

What the figure depicts is a subsequent resizing where we increase the filter's size by 4x. Because the leading bit of each fingerprint was already used to place it in the sample block, all of the leading bits in that block are the same (0). For the current resizing, we thus use the next two most significant bits of each fingerprint, which determine the child block that receives it.
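To make the bit selection concrete, here is a minimal Python sketch of choosing a child block. It is an illustration, not the repository's code: the function name, its parameters, and the assumption that the children of old block `i` occupy new blocks `i * factor` through `i * factor + factor - 1` are all hypothetical.

```python
def child_block_index(old_block: int, fingerprint: int, fp_bits: int,
                      bits_consumed: int, log2_factor: int) -> int:
    """Return the New Block Store index that receives `fingerprint`.

    fp_bits       -- total fingerprint length in bits
    bits_consumed -- leading bits already used by earlier resizings
    log2_factor   -- 1 for a 2x resize, 2 for 4x, 3 for 8x
    """
    # Position of the lowest bit in the group we are about to consume.
    shift = fp_bits - bits_consumed - log2_factor
    if shift < 0:
        raise ValueError("not enough unused fingerprint bits for this resize")
    # Extract the next log2_factor most significant unused bits.
    selector = (fingerprint >> shift) & ((1 << log2_factor) - 1)
    # Assumed layout: the children of an old block are adjacent new blocks.
    return (old_block << log2_factor) | selector


# 8-bit fingerprint 0b01011010: the leading bit (0) was consumed by an
# earlier 2x resize, so a 4x resize reads the next two bits (0b10).
print(child_block_index(2, 0b01011010, fp_bits=8, bits_consumed=1, log2_factor=2))  # 10
```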

You'll notice that the block-local logical bucket indexes are unperturbed during resizing. Rather, only the choice of output block changes based on the leading bits of the fingerprint. This feature is key because it leaves the block structure unmodified. Also notice that we redundantly store the leading bits after a resizing. Strictly speaking, since they are also part of the global bucket index, we don't have to. However, this choice simplifies the implementation and means that we don't have to reconfigure block formats on subsequent resizings. We think that this space/performance tradeoff is most often worth it in practice.
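The following toy sketch shows the relocation step under simplifying assumptions: a block is modeled as a plain list of buckets and a bucket as a list of integer fingerprints, so the compressed block format and its metadata arrays are ignored entirely. The `resize` function and the adjacent-children layout are hypothetical, chosen only to show that the bucket index is carried over unchanged and the fingerprint is stored in full.

```python
def resize(old_store, fp_bits, bits_consumed, log2_factor):
    """Relocate every fingerprint into a store that is 2**log2_factor larger."""
    if not old_store:
        return []
    shift = fp_bits - bits_consumed - log2_factor
    if shift < 0:
        raise ValueError("not enough unused fingerprint bits for this resize")
    factor = 1 << log2_factor
    buckets_per_block = len(old_store[0])
    new_store = [[[] for _ in range(buckets_per_block)]
                 for _ in range(len(old_store) * factor)]
    # One sequential pass over the old store.
    for old_block, block in enumerate(old_store):
        for bucket_idx, bucket in enumerate(block):
            for fp in bucket:
                selector = (fp >> shift) & (factor - 1)
                new_block = old_block * factor + selector  # assumed layout
                # Same block-local bucket index; the full fingerprint is kept,
                # including the bits that were just used as the selector.
                new_store[new_block][bucket_idx].append(fp)
    return new_store
```

Because each child bucket receives only a subset of its parent bucket's fingerprints, no bucket in the new store can hold more entries than its parent did.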

We also have to modify the hash functions H1, H2, and H' to take into account the resizing. However, the changes are not too bad. As for throughput, I noticed a small dip when enabling resizing, but nothing that is a deal-breaker.

Now in terms of the false positive rate, with each filter doubling you effectively reduce the fingerprint length by one bit. However, at the time of resizing, your load is also less, so resizing in a vacuum does not affect your false positive rate. Once more items are added to the filter, however, your false positive rate will increase. Thus, for each factor of two increase in size for a target error rate, you typically have to increase the initial fingerprint length by about one bit assuming everything else is constant.
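The arithmetic behind this can be checked with a rough model. The sketch below assumes the false positive rate is approximately (fingerprints compared per lookup) / 2^(effective fingerprint bits), where the number of fingerprints compared scales linearly with load. That is a common first-order approximation for fingerprint-based filters, not the exact Morton filter formula, and `approx_fpr` and its `probes_at_full_load` parameter are hypothetical names.

```python
def approx_fpr(fp_bits, doublings, load, probes_at_full_load=8.0):
    """First-order false positive estimate after `doublings` 2x resizes.

    Each doubling consumes one leading fingerprint bit, so only
    fp_bits - doublings bits still discriminate between items.
    """
    effective_bits = fp_bits - doublings
    return load * probes_at_full_load / 2 ** effective_bits


before = approx_fpr(fp_bits=12, doublings=0, load=0.9)
# Right after a 2x resize the load halves, which cancels the lost bit.
just_resized = approx_fpr(fp_bits=12, doublings=1, load=0.45)
# Once the filter fills back up, the estimate is twice the original.
refilled = approx_fpr(fp_bits=12, doublings=1, load=0.9)
# Starting with one extra fingerprint bit restores the original target.
one_more_bit = approx_fpr(fp_bits=13, doublings=1, load=0.9)
```

Under this model, `just_resized` equals `before`, `refilled` is twice `before`, and `one_more_bit` equals `before` again, which matches the one-bit-per-doubling rule of thumb.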

Let me know if this explanation is sufficiently clear.

I also backported the algorithm to cuckoo filters. Below is a plot comparing the resizing throughput of a rank-and-select quotient filter, a cuckoo filter using a simplified variant of the algorithm above, and Morton filters with 3-, 7- and 15-slot buckets. Since resizing accesses memory in a streaming, sequential pattern rather than a random one, it has much better data reuse. This improved locality explains why self-resizing is so much faster than the other operations (e.g., lookups).

[Image: resize_throughput]

You'll note that the cuckoo filters and Morton filters can resize very rapidly. I have subsequently improved the resizing algorithm and gotten about another 5x improvement for cuckoo filters. I hope to write a paper on this shortly or at the very least put something on arXiv.

sebdeckers commented 5 years ago

Thank you very much for the detailed, understandable explanation.

> Also notice that we redundantly store the leading bits after a resizing. Strictly speaking, since they are also part of the global bucket index, we don't have to. However, this choice simplifies the implementation and means that we don't have to reconfigure block formats on subsequent resizings. We think that this space/performance tradeoff is most often worth it in practice.

I suppose a space-constrained encoding of the data store could do a pass to eliminate this redundancy? E.g. when sending the state across a network.

abreslow commented 5 years ago

Yes, what you describe should work.
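A possible shape for such a pass, as a hedged sketch: it assumes that the low `bits_consumed` bits of a block's index equal the leading bits of every fingerprint stored in that block, which holds only if children of a block are laid out adjacently on every resize. Both function names are hypothetical and nothing here reflects the repository's serialization code.

```python
def strip_leading_bits(fingerprint, fp_bits, bits_consumed):
    """Drop the leading bits that the block index already encodes."""
    return fingerprint & ((1 << (fp_bits - bits_consumed)) - 1)


def restore_leading_bits(stripped, block_index, fp_bits, bits_consumed):
    """Rebuild the full fingerprint from the block index (assumed layout)."""
    leading = block_index & ((1 << bits_consumed) - 1)
    return (leading << (fp_bits - bits_consumed)) | stripped


# Round trip for an 8-bit fingerprint after three consumed bits (0b010),
# stored in block 42, whose low three bits are also 0b010.
fp = 0b01011010
packed = strip_leading_bits(fp, fp_bits=8, bits_consumed=3)
assert restore_leading_bits(packed, 42, fp_bits=8, bits_consumed=3) == fp
```

The receiver would need to know `fp_bits` and `bits_consumed` to undo the packing, so those two values would have to travel with the encoded state.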

abreslow commented 5 years ago

The VLDBJ version of the paper has now been posted at the following link: https://link.springer.com/article/10.1007/s00778-019-00561-0

I hope to soon post an open version on my website. Stay posted for details.

abreslow commented 3 years ago

For those reading this thread after December 2020, as of last month, the VLDBJ paper is available free of charge on ResearchGate with a valid account. Link here