KenanHanke / rbloom

A fast, simple and lightweight Bloom filter library for Python, implemented in Rust.
MIT License
228 stars 15 forks source link

Adding library for comparisons #14

Open jiwidi opened 1 day ago

jiwidi commented 1 day ago

Hi!

First thanks for the work done with rbloom. Are you aware of the library https://pypi.org/project/shaped-bloom-filter/ which is a wrapper in top of a go bloom filter. Curious to know performance between this and your rust implementation.

Thanks!

Dr-Emann commented 1 day ago

shaped-bloom-filter appears to only support 32 bit integers for its BloomFilter type (which seems pretty limiting).

However, its BloomFilterExtended supports bytes (with separate function names).

But when comparing performance, it appears to be the slowest implementation of any we compare with by far:

For example, using the following code:

import time

import rbloom
import shaped_bloom_filter

NUM_ITEMS = 100_000

def time_run(bloom_type) -> float:
    start = time.perf_counter()
    bf = bloom_type(NUM_ITEMS, 0.01)
    for i in range(NUM_ITEMS):
        bf.add(i)
    for i in range(NUM_ITEMS):
        if i not in bf:
            raise ValueError("Should be no false negatives")
    end = time.perf_counter()
    return end - start

print(f"rbloom              {time_run(rbloom.Bloom):.4f} sec to run {NUM_ITEMS} items")
print(f"shaped-bloom-filter {time_run(shaped_bloom_filter.BloomFilter):.4f} sec to run {NUM_ITEMS} items")

Outputs the following on my machine (m1 pro mac):

rbloom              0.0106 sec to run 100000 items
shaped-bloom-filter 31.3293 sec to run 100000 items

(approximately 3000 times slower).

I'm using Python 3.11.10, and installed with pip install shaped-bloom-filter setuptools (shaped-bloom-filter seems to require setuptools to be installed, but doesn't specify it as a dependency).

Unless I've done something wildly wrong, I don't think shaped-bloom-filter is even close to competitive, even when using ints directly.

jiwidi commented 1 day ago

shaped-bloom-filter appears to only support 32 bit integers for its BloomFilter type (which seems pretty limiting).

However, its BloomFilterExtended supports bytes (with separate function names).

But when comparing performance, it appears to be the slowest implementation of any we compare with by far:

For example, using the following code:

import time

import rbloom
import shaped_bloom_filter

NUM_ITEMS = 100_000

def time_run(bloom_type) -> float:
    start = time.perf_counter()
    bf = bloom_type(NUM_ITEMS, 0.01)
    for i in range(NUM_ITEMS):
        bf.add(i)
    for i in range(NUM_ITEMS):
        if i not in bf:
            raise ValueError("Should be no false negatives")
    end = time.perf_counter()
    return end - start

print(f"rbloom              {time_run(rbloom.Bloom):.4f} sec to run {NUM_ITEMS} items")
print(f"shaped-bloom-filter {time_run(shaped_bloom_filter.BloomFilter):.4f} sec to run {NUM_ITEMS} items")

Outputs the following on my machine (m1 pro mac):

rbloom              0.0106 sec to run 100000 items
shaped-bloom-filter 31.3293 sec to run 100000 items

(approximately 3000 times slower).

I'm using Python 3.11.10, and installed with pip install shaped-bloom-filter setuptools (shaped-bloom-filter seems to require setuptools to be installed, but doesn't specify it as a dependency).

Unless I've done something wildly wrong, I don't think shaped-bloom-filter is even close to competitive, even when using ints directly.

thanks this is what I was looking, any chance to test with int32s too?

but very promising for rbloom!

Dr-Emann commented 1 day ago

The test above does use int32s, it doesn't use BloomFilterExtended.