Callidon / bloom-filters

JS implementation of probabilistic data structures: Bloom Filter (and its derived), HyperLogLog, Count-Min Sketch, Top-K and MinHash
https://callidon.github.io/bloom-filters/
MIT License
378 stars 43 forks source link

Compact serialized form for BloomFilter #42

Closed dleppik closed 2 years ago

dleppik commented 3 years ago

The serialized form of BloomFilter is often larger than the original data. I haven't tested in-memory usage, but since it uses an array of 64-bit 1's and 0's, the same is likely true. This implies that there is no advantage to a BloomFilter over a JavaScript Set when hashing short strings.

For example, an 8.2 Mb file (3.6 Mb gzip compressed) of 1 million leaked passwords serializes to 27 Mb (2.3 Mb compressed). This is with false positive tolerance of 0.1%.

I have written a fix for this which replaces the array with a BitSet backed by a Uint8Array and serializes the array with base64 encoding. This changes the 27 Mb filter to 2.3 Mb (1.3 Mb compressed).

I'm prepared to make a pull request for this fix, if you give me permission. Otherwise I can send you the changes a different way.

Callidon commented 3 years ago

Hello

Indeed, that's a bit of an issue. It's great if you already found a solution. Your implementation idea seems good, so feel free to send a pull request with the final version šŸ‘

On a side note, do you think this issue can happen with other data structures in the framework, aside from bloom filters? And if so, do you think your fix can be applied too?

dleppik commented 3 years ago

Looks like I need to be a collaborator to do a pull request.

This can be applied as a drop-in replacement anywhere that you're just storing booleans. So, for example, you can't use it for a counting Bloom filter. However, the general principal applies. If your counting Bloom filter only counts up to 255, you can replace the array with a Uint8Array.

Right now you're using 64-bit Floats for everything. JavaScript implementations are smart enough to implement these with native 64-bit arrays. (I confirmed with "node --inspect" that the BitSet implementation is 64x more memory efficient.) So it makes sense to use a TypedArray where it will make a significant difference. The good news is that in many cases you can replace your allocateArray() with a constructor for the appropriate TypedArray and immediately see an improvement. Since what you really want in most cases is a 32-bit (or less) integer, you can cut your memory usage in half. Also, TypedArrays are initially all 0's, so you can eliminate that code.

I am using a Uint8Array so I don't have to deal with endianness when serializing. It's just a few extra lines of code to use a DataView, but then I'd feel the need to test serialization on multiple architectures.

Callidon commented 3 years ago

Nice news šŸ˜ƒ You should be able to do a pull request If you first fork the repository and then request a PR from it. Otherwise, I will look into this when I have some spare time next week. I'm very interested in generalizing this idea to as many data structures as possible because the memory gain will be significant.

folkvir commented 3 years ago

Except for IBLTs where Buffers are used, it might be worth to implement all structures with Uint8Arrays yes! But it must be usable in both server and browser sides, the library is used on both sides so the optimized structures must be usable in both environments. Another question, 0s and 1s where used primarily for simplicity of implementations but it is also useful for using the structure back and forth from/to another language/implementation without having to transform all the time the core structure into its simplest form. In consequence if we can safely choose which core implementation to use, then I completely agree with all future modifications. It will be a significant improvements to the memory usage šŸ”„

dleppik commented 3 years ago

So long as you avoid 64-bit numbers, support goes all the way back to IE 10.

Iā€™m getting a 403: Forbidden error when I do a pull request. Iā€™m pretty sure I need to be listed as a collaborator.

On Nov 27, 2021, at 3:53 AM, A.G @.***> wrote:

Except for IBLTs where Buffers are used, it might be worth to implement all structures with Uint8Arrays yes! But it must be usable in both server and browser sides, the library is used on both sides so the optimized structures must be usable in both environments. Another question, 0s and 1s where used primarily for simplicity of implementations but it is also useful for using the structure back and forth from/to another language/implementation without having to transform all the time the core structure into its simplest form. In consequence if we can safely choose which core implementation to use, then I completely agree with all future modifications. It will be a significant improvements to the memory usage šŸ”„

ā€” You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android.

-- David Leppik http://www.leppik.net/david/ @.***

"As a computer shrinks, the gravitational force that its components exert on one another becomes stronger and eventually grows so intense that no material object can escape." -- Scientific American, Nov 2004, p. 56