l15k4 opened this issue 7 years ago
Thanks for the benchmarks! It's awesome that the library seems to beat even Guava's Bloom filters, at least with your setup. I haven't had anyone share benchmarks before, so seriously thank you. Feel free to update the benchmarks section of the readme with actual numbers if you want.
Just realized I didn't mention in the readme that the filter is actually already serializable. I went to considerable lengths to make sure it was as well... I will update the readme.
Are you using multi-threading on the filter? If your application supports it, the concurrent multi-threading should be several times faster than single thread operation. The builder interface exposes an "expected concurrency" property similar to Java's ConcurrentHashMap that you can use to optimize the filter against the size of your thread pool. A mild warning that multi-threading has only been tested by me :) so there may be some bugs if you're doing things in an odd order. I would be very grateful if you're willing to test it, this is the only Cuckoo/Bloom library I'm aware of that supports concurrent multi-threading.
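If you do end up testing it, the usage pattern I have in mind is roughly the following. The filter type below is just a stand-in interface; the real builder exposes the expected-concurrency option described above, so check the Javadoc for the exact class and method names:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConcurrentFilterSketch {
    // Stand-in for the real filter type; substitute the library's class, built with
    // the expected-concurrency option mentioned above (exact builder names may differ).
    interface MembershipFilter {
        boolean put(String item);          // assumed thread-safe
        boolean mightContain(String item); // assumed thread-safe
    }

    // Insert items from a fixed-size pool whose size matches the expected concurrency.
    static void loadConcurrently(MembershipFilter filter, int threads, int items)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int shard = t;
            pool.submit(() -> {
                // Each worker inserts a disjoint slice of the key space.
                for (int i = shard; i < items; i += threads) {
                    filter.put("item-" + i);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.MINUTES);
    }
}
```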
The memory usage is not quite as small as it could be (but the filter is faster) because I don't implement the bucket semi-sort mentioned in the main paper here. The figure on page 9 for the Cuckoo filter without semi-sort shows how it should perform. Adding a semi-sort option would save maybe 10-20% of the space, but I'm worried the CPU cost would be substantial. It's on my todo list, but low priority unless someone needs it. There is a method to get the size, but I just realized I have a typo in the source file that makes getStorageSize() non-public. I will fix this tonight and the Javadoc should resync by tomorrow. getStorageSize() gives you the filter size in bits (not bytes).
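Since the value is in bits, remember to convert it before comparing against byte counts in a benchmark, e.g.:

```java
// getStorageSize() reports bits, not bytes; convert before comparing sizes.
static double storageSizeInMiB(long storageSizeBits) {
    return storageSizeBits / 8.0 / (1024.0 * 1024.0);
}
```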
There shouldn't be a large difference in collisions unless you're running the filter very close to full. Care to share the code running your benchmark so I can look into it?
A new version with a visible getStorageSize() is deploying to Maven Central; you should be able to update the package soon. Keep me updated on the benchmark results!
Thanks
Found your bench repo; I'll patch it to use the new functionality and make sure the settings match between filters. It will take me a few days, since I need to set up my environment for Scala.
If size is an issue for you, I can improve the lib in two ways to make the filters smaller:

1. Remove the power-of-two limit on filter size. This restriction is there to prevent modulus bias when selecting a bucket for items; I may be able to find another way that adds negligible overhead without the size restriction.
2. Add an option for bucket semi-sort. This will save approximately 15% at a 1% false positive rate; if you need a rate much lower than that, it doesn't save much. I'll do this second if the first optimization doesn't make the filter small enough, as it could be a lot of work and will definitely impact performance by some amount.

With these two improvements the filters should be significantly smaller than Bloom filters, and hopefully remain fast.
Thanks
Hey,
sorry for answering late, I've been quite busy these days. Regarding concurrency, I don't have a single use case for it, because everywhere I use bloom filters I depend on sequential processing due to the get && put operations, and I'm collecting stats about duplicates along the way. Processing such an event stream with multiple threads might give misleading numbers: even though the resulting cuckoo filter would be the same, I would get skewed stats about duplicates if the get && put were done by 8 threads.
If you give me some time I can improve the benchmarks later.
As to serialization, do you mean I should use Java serialization for that? I don't see a way to serialize it.
Yes, it implements the Serializable interface, so it will serialize with the built-in mechanism, Jackson, etc. I have some unit tests you can check where I'm using Google's serializer tester. Space on disk may not be optimal; I know Google's Bloom filter uses its own serializer internally, but I'm not sure whether that's for speed or space efficiency.
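If plain JDK serialization is enough for you, the round-trip is just the standard pattern (a sketch, with error handling omitted; substitute whatever concrete filter type you built):

```java
import java.io.*;

public class SerializeSketch {
    // Write any Serializable filter to disk.
    static <T extends Serializable> void save(T filter, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(filter);
        }
    }

    // Read it back later.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (T) in.readObject();
        }
    }
}
```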
Serialization is quite a bottleneck. I was considering using it for deduplication of many billions of incoming impressions, so persistence (throughput & size) is essential, not only for performance reasons.
The idea was that I would use both Bloom and cuckoo filters with different hash functions, thus reducing the collision rate, because imho it would be highly unlikely that both filters produce a false positive for the same element.
I think that using both Bloom and cuckoo filters with a high FPP (0.01) would lead to a far lower collision rate than using only a Bloom or Cuckoo filter with a low FPP (0.00001), and the overall size and performance would be better as well.
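Roughly what I have in mind, as a sketch: the Bloom side is Guava's BloomFilter, the cuckoo side is just a stand-in interface here (assuming put/mightContain style methods). If the two false positive rates were independent, the combined rate would be roughly their product (e.g. 0.01 * 0.01 = 1e-4):

```java
import com.google.common.hash.BloomFilter;

public class CombinedFilterSketch {
    // Stand-in for the cuckoo filter; method names assumed to mirror Guava's API.
    interface Cuckoo<T> {
        boolean put(T item);
        boolean mightContain(T item);
    }

    // An item is treated as "already seen" only if BOTH filters claim to contain it,
    // then it is inserted into both so future duplicates are caught.
    static <T> boolean seenBefore(BloomFilter<T> bloom, Cuckoo<T> cuckoo, T item) {
        boolean inBoth = bloom.mightContain(item) && cuckoo.mightContain(item);
        bloom.put(item);
        cuckoo.put(item);
        return inBoth;
    }
}
```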
UPDATE:
Sorry, I made a mistake: Java serialization of the cuckoo filter is not that bad after all. There is a penalty of roughly 30% in size and performance, but it's acceptable.
I benchmarked the "combined" filter (guavaBF Murmur3_128 & cuckoo sip24) and the results are not as good as I thought for 300M unique UUIDs and fpp=0.01:

- guavaBF Murmur3_128 finished in 520.975 seconds with 412860 collisions and size 359439702 bytes
- cuckooF Murmur3_128 finished in 564.212 seconds with 2454872 collisions and size 536872541 bytes
- combinedF finished in 706.076 seconds with 5861 collisions and size 896312241 bytes

It is approximately the same result in all aspects as using guavaBF Murmur3_128 with fpp=0.0001.
Jakob, I'm excited to have my work used for something useful, so I'll do what I can to make the library competitive with Bloom filters. I will run my own tests this week for:

- serialization speed
- serialization space efficiency
- RAM use
If a low false positive rate is important, I would use my lib with the Sip24 or SHA setting. Most (maybe all) Bloom filters use Murmur3, which is good but does have some collisions.
At a false positive rate far below 1%, Cuckoo filters are more space efficient. Above a 1% false positive rate, Bloom filters are more space efficient. Using both together won't actually help compared to a single Cuckoo filter with a lower false positive rate, as long as you use a secure hash like SHA or Sip24.
I attached a chart image from one of the academic papers on cuckoo filters. It shows the bits per item needed to achieve a given false positive rate. You can see that for a 0.001% FP rate a Bloom filter takes about 25 bits per item and a cuckoo filter takes about 20. The pink line is the theoretical limit; it is not possible to use less space than this. You can see from the chart why cuckoo filters are much better for low false positive rates. Internally, the lower your target false positive rate, the more bits the filter uses from the hash of each item. This is why there is no need to combine filters; it just takes more time and space.
You can also see on the chart that Cuckoo filters get more space efficient the lower the false positive rate is: the ratio of bits used to the theoretical minimum decreases. This isn't true of Bloom filters, which diverge from the theoretical ideal as they get bigger. Basically, if you want really low false positive rates, Cuckoo filters will use much less space (at least as soon as I fix the space calculation :) )
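For reference, the rough formulas behind that chart, as given in the cuckoo filter paper (assuming bucket size 4 and a ~0.955 load factor; just a back-of-the-envelope sketch):

```java
public class BitsPerItemSketch {
    // Approximate bits per item for a target false positive rate epsilon.

    // Optimal Bloom filter: ~1.44 * log2(1/eps)
    static double bloomBitsPerItem(double epsilon) {
        return 1.44 * log2(1.0 / epsilon);
    }

    // Cuckoo filter with bucket size 4: (log2(1/eps) + 3) / loadFactor, loadFactor ~0.955
    static double cuckooBitsPerItem(double epsilon) {
        return (log2(1.0 / epsilon) + 3) / 0.955;
    }

    // Information-theoretic lower bound (the pink line): log2(1/eps)
    static double theoreticalMinimum(double epsilon) {
        return log2(1.0 / epsilon);
    }

    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    public static void main(String[] args) {
        // At eps = 1e-5 (the 0.001% point): Bloom ~24, cuckoo ~20.5, minimum ~16.6 bits per item.
        double eps = 1e-5;
        System.out.printf("bloom=%.1f cuckoo=%.1f min=%.1f%n",
                bloomBitsPerItem(eps), cuckooBitsPerItem(eps), theoreticalMinimum(eps));
    }
}
```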
EDIT: the cuckoo filter size is also much bigger than it needs to be right now. I will patch the filter size calculator to no longer allocate a bunch of extra space "just in case" and remove the restriction that the filter size is a power of two. This should make the Cuckoo filter smaller than Bloom in RAM.
Just an update on this. I've got my own benchmarks running and I'm actively working on increasing hashing speed and removing the power-of-two limit on the number of buckets using https://github.com/lemire/fastrange .
I'll keep you updated on results. I should be able to decrease the memory footprint by ~50% and hopefully roughly double single-thread performance.
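For the curious, the fastrange trick replaces `hash % numBuckets` with a multiply-and-shift, which is what lets the bucket count stop being a power of two. A sketch of the idea (not the library's actual code):

```java
public class FastRangeSketch {
    // Map a 32-bit hash uniformly into [0, numBuckets) without a modulo:
    // treat the hash as an unsigned 32-bit value, multiply by the range,
    // and keep the high 32 bits (Lemire's fastrange32).
    static int bucketIndex(int hash, int numBuckets) {
        return (int) (((hash & 0xffffffffL) * (long) numBuckets) >>> 32);
    }
}
```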
Hi again, I have a local build that decreases the size of the filter substantially. Working on speed benchmarks now. I'll publish a new release as soon as the performance optimizations are finished and tested. :smiley: The size penalty for serialization is less than a 1% increase over the in-memory size. If you need better serialization speed, look into the Kryo or FST serializers; the library should be compatible with those.
Filter | Items | FP Rate | Serialized Size (bytes)
---|---|---|---
Cuckoo | 1000000 | 0.01 | 1047302
Bloom | 1000000 | 0.01 | 1198174
Cuckoo | 1000000 | 0.001 | 1439972
Bloom | 1000000 | 0.001 | 1797240
Cuckoo | 1000000 | 0.0001 | 1832643
Bloom | 1000000 | 0.0001 | 2396306
Cuckoo | 1000000 | 0.00001 | 2356204
Bloom | 1000000 | 0.00001 | 2995372
Cuckoo | 1000000 | 0.01 | 1048944
Bloom | 1000000 | 0.01 | 1198550
Cuckoo | 1000000 | 0.001 | 1441616
Bloom | 1000000 | 0.001 | 1797614
Cuckoo | 1000000 | 0.0001 | 1834288
Bloom | 1000000 | 0.0001 | 2396686
Cuckoo | 1000000 | 0.00001 | 2357848
Bloom | 1000000 | 0.00001 | 2995750
Cuckoo | 500000 | 0.01 | 525384
Bloom | 500000 | 0.01 | 599486
Cuckoo | 500000 | 0.001 | 721720
Bloom | 500000 | 0.001 | 899014
Cuckoo | 500000 | 0.0001 | 918056
Bloom | 500000 | 0.0001 | 1198550
Cuckoo | 500000 | 0.00001 | 1179840
Bloom | 500000 | 0.00001 | 1498086
Cuckoo | 100000 | 0.01 | 106536
Bloom | 100000 | 0.01 | 120230
Cuckoo | 100000 | 0.001 | 145808
Bloom | 100000 | 0.001 | 180134
Cuckoo | 100000 | 0.0001 | 185072
Bloom | 100000 | 0.0001 | 240046
Cuckoo | 100000 | 0.00001 | 237432
Bloom | 100000 | 0.00001 | 299950
@MGunlogson Great work! Thanks for the benchmark.
Hey,
I find it better in some aspects than Bloom filters, but I depend on storing it to a file and using it later.
Are you planning to add support for that in the future? Do you see any technical problems in doing so?
Also, I'm wondering how to get the size of a cuckoo filter in memory or in its persisted state. I'm doing some benchmarks, and without knowing this I cannot really use them ...
Btw, the results of the benchmark that tests the false positive rate on 100M unique UUIDs: