MGunlogson / CuckooFilter4J

High-performance Java implementation of a Cuckoo filter (Apache License 2.0)

Question: How to get size and serialize cuckoo filter #1

Open l15k4 opened 7 years ago

l15k4 commented 7 years ago

Hey,

I find it better in some respects than Bloom filters, but I depend on storing it to a file and using it later.

Are you planning to add support for that in the future? Do you see any technical problems in doing so?

Also, I'm wondering how to get the size of a cuckoo filter in memory or in its persisted state. I'm doing some benchmarks, and without knowing this I can't really use them ...

Btw, the results of the benchmark that tests the false positive rate on 100M unique UUIDs:

sparkBF                 finished in 205.543 seconds with 0 collisions and size 299533092 bytes ...
guavaBF                 finished in 229.518 seconds with 79 collisions and size 299533086 bytes ...
sipHash24 cuckooF       finished in 183.235 seconds with 1093 collisions  ...
Murmur3_128 cuckooF     finished in 181.725 seconds with 1131 collisions ...
sha256 cuckooF          finished in 229.018 seconds with 1084 collisions ...
MGunlogson commented 7 years ago

Thanks for the benchmarks! It's awesome that the library seems to beat even Guava's Bloom filters, at least with your setup. I haven't had anyone share benchmarks before, so seriously thank you. Feel free to update the benchmarks section of the readme with actual numbers if you want.

Just realized I didn't mention it in the readme: the filter is actually already serializable. I went to considerable lengths to make sure of that... I will update the readme.

Are you using multi-threading on the filter? If your application supports it, concurrent multi-threaded access should be several times faster than single-threaded operation. The builder interface exposes an "expected concurrency" setting, similar to Java's ConcurrentHashMap, that you can use to tune the filter for the size of your thread pool. A mild warning: multi-threading has only been tested by me :) so there may be some bugs if you're doing things in an odd order. I would be very grateful if you're willing to test it; this is the only Cuckoo/Bloom library I'm aware of that supports concurrent multi-threading.
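
A minimal sketch of what that configuration might look like, assuming a Guava-style builder; the constructor signature and method names (withFalsePositiveRate, withExpectedConcurrency) are illustrative guesses based on the description above, not the library's confirmed API:

```java
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

// Hypothetical builder usage: size the filter for 100M items at 1% FPP and
// tune it for an 8-thread pool. Method names are assumptions, not confirmed API.
CuckooFilter<CharSequence> filter =
    new CuckooFilter.Builder<>(Funnels.stringFunnel(StandardCharsets.UTF_8), 100_000_000)
        .withFalsePositiveRate(0.01)
        .withExpectedConcurrency(8)
        .build();
```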

The memory usage is not quite as small as it could be (but it is faster) because I don't implement the bucket semi-sort mentioned in the main paper. The figure on page 9 for the cuckoo filter without semi-sort shows how it should perform. Adding a semi-sort option would save maybe 10-20% space, but I'm worried the CPU cost would be substantial. It's on my todo list, but low priority unless someone needs it. There is a method to get the size, but I just realized I have a typo in the source file making getStorageSize() non-public. I will fix this tonight, and the Javadoc should resync by tomorrow. getStorageSize() will give you the filter size in bits (not bytes).
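
A tiny usage sketch once the method is public (the filter variable is illustrative; the bits-to-bytes conversion is my own arithmetic, not library code):

```java
// getStorageSize() returns bits, per the note above; round up to get bytes.
long bits = filter.getStorageSize();
long bytes = (bits + 7) / 8;
System.out.println("Filter storage: " + bits + " bits (~" + bytes + " bytes)");
```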

There shouldn't be a large difference in collisions unless you're running the filter very close to full. Care to share the code running your benchmark so I can look into it?

MGunlogson commented 7 years ago

A new version with a visible getStorageSize() is deploying to Maven Central; you should be able to update the package soon. Keep me updated on benchmark results!

Thanks

MGunlogson commented 7 years ago

Found your bench repo; I will patch it to use the new functionality and make sure the settings match between filters. It will take me a few days, since I need to set up my environment for Scala.

If size is an issue for you, I can improve the lib in two ways to make the filters smaller:

1) Remove the power-of-two limit on filter size. This is there to prevent modulus bias when selecting the bucket for items; I may be able to find another way that adds negligible overhead without the size restriction.
2) Add an option for bucket semi-sort. This will save approximately 15% at a 1% false positive rate; if you need much lower than that, it doesn't save much. I will do this second if the first optimization doesn't make the filter small enough, as it could be a lot of work and will definitely impact performance by some amount.

With these two improvements filters will be significantly smaller than Bloom, and hopefully remain fast.

Thanks

l15k4 commented 7 years ago

Hey,

sorry for answering late, I've been quite busy these days. Regarding concurrency, I don't have a single use case for it, because everywhere I use Bloom filters I depend on sequential processing due to get && put operations, and I'm collecting info about them. Processing such an event stream with multiple threads might lead to misleading results. Even though the resulting cuckoo filter would be the same, I would get skewed stats about duplicates if I were processing get && put with 8 threads.

If you give me some time I can improve the benchmarks later.

As to serialization, do you mean I should use Java serialization for that? I don't see a way to serialize it.

MGunlogson commented 7 years ago

Yes, it implements the Serializable interface, so it will serialize using the built-in methods, Jackson, etc. I have some unit tests you can check where I'm using Google's serializer tester. Space on disk may not be optimal; I know Google's Bloom filter uses its own serializer internally, but I'm not sure whether that's for speed or space efficiency.
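
A minimal sketch of the built-in route, assuming only that the filter class implements java.io.Serializable as stated above; the variable and file names are illustrative, and exception handling is omitted:

```java
import java.io.*;

// Write the filter to disk with plain Java serialization...
try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("filter.bin"))) {
    out.writeObject(filter);
}

// ...and read it back later.
try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("filter.bin"))) {
    @SuppressWarnings("unchecked")
    CuckooFilter<CharSequence> restored = (CuckooFilter<CharSequence>) in.readObject();
}
```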

l15k4 commented 7 years ago

The serialization is quite a bottleneck. I was considering using it for deduplication of many billions of incoming impressions, so persistence (throughput & size) is essential, and not only for performance reasons.

The idea was that I would use both Bloom and cuckoo filters with different hash functions, thus reducing the collision rate, because IMHO it would be highly unlikely for both filters to give a false positive on the same element.

I think that using both Bloom and cuckoo filters with a high FPP (0.01) would lead to a way lower collision rate than using only a Bloom or cuckoo filter with a low FPP (0.00001), and that the overall size and performance would be better too.

UPDATE:

Sorry, I made a mistake; Java serialization of the cuckoo filter is not that bad after all. There is a penalty of about 30% in size and performance, but it's acceptable.

I benchmarked the "combined" filter (guavaBF Murmur3_128 & cuckoo sip24) and the results are not as good as I thought for 300M unique UUIDs and fpp=0.01:

guavaBF Murmur3_128 finished in 520.975 seconds with 412860 collisions and size 359439702 bytes
cuckooF Murmur3_128 finished in 564.212 seconds with 2454872 collisions and size 536872541 bytes
combinedF finished in 706.076 seconds with 5861 collisions and size 896312241 bytes ...

It is approximately the same result in all respects as using guavaBF Murmur3_128 with fpp=0.0001.

MGunlogson commented 7 years ago

Jakob, I'm excited to have my work used for something useful, so I'll do what I can to make the library competitive with Bloom filters. I will run my own tests this week for:

- serialization speed
- serialization space efficiency
- RAM use

If a low false positive rate is important, I would use my lib with the Sip24 or SHA setting. Most (maybe all) Bloom filters use Murmur3, which is good but does have some collisions.

At a false positive rate far below 1%, cuckoo filters are more space efficient. Above a 1% false positive rate, Bloom filters are more space efficient. Using both together won't actually help compared to a single cuckoo filter with a lower false positive rate, as long as you use a secure hash like SHA or Sip24.
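
A quick back-of-the-envelope check on the numbers above (my own arithmetic, assuming the two filters make errors independently): the combined false positive rate is roughly the product of the individual rates, 0.01 × 0.01 = 0.0001, which is why the combined run behaves about like a single filter built for fpp=0.0001 while paying the space and time cost of both structures.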

I attached a chart image from one of the academic papers on cuckoo filters. It shows the bits per item needed to achieve a given false positive rate. You can see that for a 0.001% FP rate a Bloom filter takes about 25 bits per item and a cuckoo filter takes about 20. The pink line is the theoretical limit; it is not possible to use less space than this. You can see from the chart why cuckoo filters are much better for low false positive rates. Internally, the lower your target false positive rate, the more bits the filter will use from the hash of each item. This is the reason there is no need to combine filters; it just takes more time and space.

[attached chart: space cost vs. false positive rate, from the cuckoo filter paper (cuckoo-conext2014.pdf)]

You can also see on the chart that cuckoo filters get more space-efficient the lower the false positive rate is: the overhead relative to the theoretical minimum shrinks. This isn't true of Bloom filters, which diverge from the theoretical ideal as the target rate drops. Basically, if you want really low false positive rates, cuckoo filters will use much less space (at least as soon as I fix the space calculation :) ).
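
For a rough sense of the numbers behind the chart, here is a small sketch using the approximate space formulas commonly quoted from the cuckoo filter paper; the constants (bucket size 4, ~95.5% load factor, the 1.44 factor for an optimal Bloom filter) are my assumptions, not values taken from this library:

```java
// Approximate bits per item at a target false positive rate `fpp`.
double fpp = 0.00001;                                   // 0.001%, as in the chart discussion
double lowerBound = Math.log(1 / fpp) / Math.log(2);    // theoretical minimum, ~16.6 bits
double bloomBits  = 1.44 * lowerBound;                  // optimal Bloom filter, ~23.9 bits
double cuckooBits = (lowerBound + 3) / 0.955;           // cuckoo without semi-sort, ~20.5 bits
```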

EDIT: also, the cuckoo filter size is much bigger than it needs to be right now. I will patch the filter size calculator to no longer allocate a bunch of extra space "just in case" and remove the restriction that the filter size is a power of two. This should make the cuckoo filter smaller than a Bloom filter in RAM.

MGunlogson commented 7 years ago

Just an update on this: I've got my own benchmarks running, and I'm actively working on increasing hashing speed and removing the power-of-two size limit using https://github.com/lemire/fastrange.
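
For context, the fastrange trick linked above is a multiply-and-shift that maps a hash onto an arbitrary range without a modulo; a minimal Java sketch (my own illustration, not this library's code):

```java
// Map a 32-bit hash onto [0, numBuckets) using Lemire's fastrange reduction.
// Works for any bucket count, so the table no longer has to be a power of two;
// the bias is negligible when numBuckets is far smaller than 2^32.
static int bucketIndex(int hash, int numBuckets) {
    return (int) ((Integer.toUnsignedLong(hash) * (long) numBuckets) >>> 32);
}
```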

Will keep you updated on results. I should be able to decrease the memory footprint by ~50% and hopefully roughly double single-thread performance.

MGunlogson commented 7 years ago

Hi again, I have a local build that decreases the size of the filter substantially. Working on speed benchmarks now. Will publish a new release as soon as the performance optimizations are finished and tested :smiley:. The size penalty for serialization is less than a 1% increase over the in-memory size. If you need better serialization speed, look into the Kryo or FST serializers; the library should be compatible with those.
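
A minimal Kryo sketch of how that could look; this is an assumption, not something verified against this filter (whether its internals round-trip cleanly with Kryo's default serializers is untested here), and the file name and variable are illustrative:

```java
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;
import java.io.FileInputStream;
import java.io.FileOutputStream;

Kryo kryo = new Kryo();
kryo.setRegistrationRequired(false); // or register the filter's classes explicitly

// Write the filter with Kryo...
try (Output out = new Output(new FileOutputStream("filter.kryo"))) {
    kryo.writeClassAndObject(out, filter);
}
// ...and read it back.
try (Input in = new Input(new FileInputStream("filter.kryo"))) {
    Object restored = kryo.readClassAndObject(in);
}
```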

Serialized sizes in bytes, by item count and false positive rate:
Cuckoo: 1000000 Fp Rate: 0.01 Serialized: 1047302
Bloom: 1000000 Fp Rate: 0.01 Serialized: 1198174
Cuckoo: 1000000 Fp Rate: 0.001 Serialized: 1439972
Bloom: 1000000 Fp Rate: 0.001 Serialized: 1797240
Cuckoo: 1000000 Fp Rate: 0.0001 Serialized: 1832643
Bloom: 1000000 Fp Rate: 0.0001 Serialized: 2396306
Cuckoo: 1000000 Fp Rate: 0.00001 Serialized: 2356204
Bloom: 1000000 Fp Rate: 0.00001 Serialized: 2995372
Cuckoo: 1000000 Fp Rate: 0.01 Serialized: 1048944
Bloom: 1000000 Fp Rate: 0.01 Serialized: 1198550
Cuckoo: 1000000 Fp Rate: 0.001 Serialized: 1441616
Bloom: 1000000 Fp Rate: 0.001 Serialized: 1797614
Cuckoo: 1000000 Fp Rate: 0.0001 Serialized: 1834288
Bloom: 1000000 Fp Rate: 0.0001 Serialized: 2396686
Cuckoo: 1000000 Fp Rate: 0.00001 Serialized: 2357848
Bloom: 1000000 Fp Rate: 0.00001 Serialized: 2995750
Cuckoo: 500000 Fp Rate: 0.01 Serialized: 525384
Bloom: 500000 Fp Rate: 0.01 Serialized: 599486
Cuckoo: 500000 Fp Rate: 0.001 Serialized: 721720
Bloom: 500000 Fp Rate: 0.001 Serialized: 899014
Cuckoo: 500000 Fp Rate: 0.0001 Serialized: 918056
Bloom: 500000 Fp Rate: 0.0001 Serialized: 1198550
Cuckoo: 500000 Fp Rate: 0.00001 Serialized: 1179840
Bloom: 500000 Fp Rate: 0.00001 Serialized: 1498086
Cuckoo: 100000 Fp Rate: 0.01 Serialized: 106536
Bloom: 100000 Fp Rate: 0.01 Serialized: 120230
Cuckoo: 100000 Fp Rate: 0.001 Serialized: 145808
Bloom: 100000 Fp Rate: 0.001 Serialized: 180134
Cuckoo: 100000 Fp Rate: 0.0001 Serialized: 185072
Bloom: 100000 Fp Rate: 0.0001 Serialized: 240046
Cuckoo: 100000 Fp Rate: 0.00001 Serialized: 237432
Bloom: 100000 Fp Rate: 0.00001 Serialized: 299950
anhldbk commented 6 years ago

@MGunlogson Great work! Thanks for the benchmark.