Todo list - Githubissues

folkvir commented 5 years ago

[x] HyperLogLog
[x] Top-K or Mine Top-K: Using Bloom Filters for Mining Top-k Frequent Itemsets in Data Streams, Younghee Kim et Al.
[x] Min Hash Wikipedia source, need to find the paper
[x] Scalable Bloom Filter, as the partitioned bloom filter is fixed now the Scalable bloom filter can be easily implemented.
[x] XorFilter #29

Feel free to modify this list or add suggestions in comments.

Chat-Wane commented 4 years ago

I don't need it. I didn't try it. But for the sake of completeness: https://lemire.me/blog/2019/12/19/xor-filters-faster-and-smaller-than-bloom-filters/ Links to the paper and implementations (no js ones) are available there.

( Hello btw! :3 )

Callidon commented 4 years ago

Some update on this topic:

The HyperLogLog has been implemented in release v1.1.0.
The TopK has been implemented in release v1.2.0. It uses a Count Min Sketch for frequency estimation and a Minheap for implementing a sliding window of values.

Callidon commented 4 years ago

Some update: The MinHash has been released in v1.3.0.

ShisiJu commented 4 years ago

Thanks for your jobs.

Callidon commented 2 years ago

Update: XOR filters has been released in the 2.0.0 version of the npm package, following #52 merge.

folkvir commented 2 years ago

Update: the Scalable Bloom Filter with optimizations and bug fixes have been released in version 3.0.0

adevine commented 2 years ago

Hi @Callidon and @folkvir - thanks for a really useful library!

I had a suggestion for an additional data structure, as well as a question/suggestion for improvement on the CountingBloomFilter.

I have a need for a Time-To-Live, or "expiring", CountingBloomFilter. That is, when entries are added to the CountingBloomFilter, they will automatically be removed after some number of milliseconds. In addition, upon creation of the CountingBloomFilter, the creator can optionally specify a "dispose" callback function that will get called when the entry expires. The idea is generally to merge the concepts of a TTL Cache (e.g. https://github.com/isaacs/ttlcache) with a CountingBloomFilter. Obviously this could just be built as a separate data structure on top of CountingBloomFilter that manages the timeouts, but I think there is a lot of utility to having it in one single data structure. I am going to implement this for a project I need, but please let me know if you would like me to contribute this back to your repository.

My other question relates to the implementation of the _filter member of the CountingBloomFilter class. This Array stores a tuple pair at each index that is [bit 0 or 1, count of items at this index]. But that first bit is always set to 1 if count of items is > 0, and set to 0 if count of items === 0. So I might be missing something but why not just make _filter an array of numbers where each entry is the count of items at that index, and then the has method would just look at whether that value is > 0? Seems like having each value in the _filter member storing an Array object with 2 numbers is a lot of unnecessary memory.

Callidon / bloom-filters

Todo list #8