Baqend / Orestes-Bloomfilter

Library of different Bloom filters in Java with optional Redis-backing, counting and many hashing options.
Other
843 stars 245 forks source link

Bloom filter library

Changelog | Setup | Docs | Maven Repo

Please review the contribution guide for information on how to contribute to this project before supplying a merge request.

Getting Started

Rules

Version 1 is out with a complete rewrite of almost all functionalities and many new ones.

This is a set of Bloom filters we implemented as we found all existing open-source implementations to be lacking in various aspects. This library takes some inspiration from the simple Bloom filter implementation of Magnus Skjegstad and the Ruby Bloom filters by Ilya Grigorik.

The Bloom filter is a probabilistic set data structure which is very small. This is achieved by allowing false positives with some probability p. It has an add and contains operation which both are very fast (time complexity O(1)). The Counting Bloom filter is an extension of the Bloom filter with a remove operation at the cost of incurring an additional space overhead for counting. There are many good introductions to Bloom filters: the Wikipedia article is excellent, and even better is a survey by Broder and Mitzenmacher. Typical use cases of Bloom filters are content summaries and sets that would usually grow too large in fields such as networking, distributed systems, databases and analytics.

There are 4 types of Bloom filters in the Orestes Bloom filter library:

This library if written in and for Java 8. For a Java 7 backport see: https://github.com/Crystark/Orestes-Bloomfilter

Err, Bloom what?

Bloom filters are awesome data structures: fast and space efficient.

BloomFilter<String> bf = new FilterBuilder(10_000_000, 0.01).buildBloomFilter(); //Expect 10M URLs
urls.add("http://github.com"); //Add millions of URLs
urls.contains("http://twitter.com"); //Know in an instant which ones you have or have not seen before

So what's the catch? Bloom filters allow false positives (i.e. URL contained though never added) with some probability (0.01 in the example). If you can mitigate rare false positives (false negatives never happen) then Bloom filters are probably for you.

New in 2.0

New in 1.0

Features

There are a many things we addressed as we sorely missed them in other implementations:

Getting started

New: The Bloom filter repository is now hosted on JCenter.

Java 8 is required. The recommended way to include the Bloom filter is via the Maven repo (works for Gradle, Ivy, etc., too):

<dependencies>
   <dependency>
       <groupId>com.baqend</groupId>
       <artifactId>bloom-filter</artifactId>
       <version>1.0.7</version>
   </dependency>
</dependencies>
<repositories>
   <repository>
       <snapshots>
        <enabled>false</enabled>
       </snapshots>
       <id>central</id>
       <name>bintray</name>
       <url>http://jcenter.bintray.com</url>
   </repository>
</repositories>

or with Gradle:

repositories {
    jcenter()
}
dependencies {
    compile(
            'com.baqend:bloom-filter:1.0.7'
    )
}

For the normal Bloom filters it's even sufficient to only copy the source *.java files to your project (not recommended).

Usage

Regular Bloom Filter

The regular Bloom filter is very easy to use. It is the base class of all other Bloom filters. Figure out how many elements you expect to have in the Bloom filter ( n ) and then which false positive rate is tolerable ( p ).

//Create a Bloom filter that has a false positive rate of 0.1 when containing 1000 elements
BloomFilter<String> bf = new FilterBuilder(1000, 0.1).buildBloomFilter();

The Bloom filter class is generic and will work with any type that implements the toString() method in a sensible way, since that String is what the Bloom filter feeds into its hash functions. The hashCode() method is not used, since in Java it returns integers that normally do not satisfy a uniform distribution of outputs that is essential for the optimal performance of the Bloom filter. Now lets add something:

//Add a few elements
bf.add("Just");
bf.add("a");
bf.add("test.");

This can be done from different threads - all Bloom filters are now thread-safe. An element which was inserted in a Bloom filter will always be returned as being contained (no false negatives):

//Test if they are contained
print(bf.contains("Just")); //true
print(bf.contains("a")); //true
print(bf.contains("test.")); //true

Usually non-inserted elements will not be contained:

//Test with a non-existing element
print(bf.contains("WingDangDoodel")); //false

If we add enough elements, false positives will start occurring:

//Add 300 elements
for (int i = 0; i < 300; i++) {
    String element = "Element " + i;
    bf.add(element);
}   
//test for false positives
for (int i = 300; i < 1000; i++) {
    String element = "Element " + i;
    if(bf.contains(element)) {
        print(element); //two elements: 440, 669
    }
}

Let's compare this with the expected amount of false positives:

//Compare with the expected amount
print(bf.getFalsePositiveProbability(303) * 700); //1.74

So our two false positives are in line with the expected amount of 1.74.

and lets "estimate" how many elements are in the filter using statistically sound computations of the amount of bits that are one:

//Estimate cardinality/population
print(bf.getEstimatedPopulation()); //303.6759147801151

This estimation is very good, even though the estimation was performed on a "quite full" Bloom filter (remember, we allowed the false positive probability to be 10% for 1000 elements).

The Bloom filter can be cleared and cloned:

//Clone the Bloom filter
bf.clone();
//Reset it, i.e. delete all elements
bf.clear();

Also elements can be added and queried in bulk:

List<String> bulk = Arrays.asList(new String[] { "one", "two", "three" });
bf.addAll(bulk);
print(bf.containsAll(bulk)); //true

To get the best performance for a given use-case the parameters of the bloom filter must be chosen wisely. So for example we could choose the Bloom filter to use 1000 Bits and then use the best number of hash functions for an expected amount of 6666 inserted elements. We choose Murmur as our hash function which is faster than cryptographic hash functions like MD5:

//Create a more customized Bloom filter
BloomFilter<Integer> bf2 = new FilterBuilder()
                .expectedElements(6666) //elements
                .size(10000) //bits to use
                .hashFunction(HashMethod.Murmur3) //our hash
                .buildBloomFilter();
print("#Hashes:" + bf2.getHashes()); //2
print(FilterBuilder.optimalK(6666, 10000)); //you can also do these calculations yourself

Bloom filters allow other cool stuff too. Consider for instance that you collected two Bloom filters which are compatible in their parameters. Now you want to consolidate their elements. This is achieved by ORing the respective Bit-Arrays of the Bloom filters:

//Create two Bloom filters with equal parameters
BloomFilter<String> one = new FilterBuilder(100, 0.1).buildBloomFilter();
BloomFilter<String> other = new FilterBuilder(100, 0.1).buildBloomFilter();
one.add("this");
other.add("that");
one.union(other);
print(one.contains("this")); //true
print(one.contains("that")); //true

The good thing about the union() operation is, that it returns the exact Bloom filter which would have been created, if all elements were inserted in one Bloom filter from the get-go. There is a similar intersect operation that ANDs the Bit-Arrays. It does however behave slightly different as it does not return the Bloom filter that only contains the intersection. It guarantees to have all elements of the intersection but the false positive rate might be slightly higher than that of the pure intersection:

other.add("this");
other.add("boggles");
one.intersect(other);
print(one.contains("this")); //true
print(one.contains("boggles")); //false

The Filter Builder

The FilterBuilder is used to configure Bloom filters before constructing them. It will try to infer and compute any missing parameters optimally and preconfigured with sensible defaults (documented in its JavaDoc). For instance if you only specified the number of expected elements and the false positive probability, it will compute the optimal bit size and number of hash functions. To construct a filter, you can either call buildBloomFilter or buildCountingBloomFilter or you can pass the builder to a specific Bloom filter implementation to construct it.

Counting Bloom Filter

The Counting Bloom filter allows object removal. For this purpose it has binary counters instead of simple bits. The amount of bits c per counter can be set. If you expect to insert elements only once, the probability of a Bit overflow is very small for c = 4 : 1.37 10^-15 m for up to n inserted elements (details). For those use-cases 4 bits are usually the most space-efficient choice. The default however is 16 bits, so you don't have to worry about counter overflow with the downside of some space overhead. Keep in mind that if you insert items more than once you need larger counters, i.e. roughly Log(maximum_inserts)/Log(2) + 4 bits. And it is in fact useful to insert non-unique items since then you can perform frequency estimation ("how often have I seen this item?") using addAndEstimateCount and getEstimatedCount.

//Create a Counting Bloom filter that has a FP rate of 0.01 when 1000 are inserted
//and uses 4 bits for counting
CountingBloomFilter<String> cbf = new FilterBuilder(1000, 0.01).buildCountingBloomFilter();
cbf.add("http://google.com");
cbf.add("http://twitter.com");
print(cbf.contains("http://google.com")); //true
print(cbf.contains("http://twitter.com")); //true

If you insert one distinct item multiple times, the same counter always get updated so you should pick a higher c so that 2^c > inserted_copies. When 8, 16, 32, 64 bits are specified as the counter size, internally an optimized short-, byte-, int- resp. long-array will be used, whereas other sizes will use a custom bit vector build on the <a href="http://docs.oracle.com/javase/8/docs/api/java/util/BitSet.html">Java BitSet. For optimal performance in terms of time complexity you should therefore prefer 8, 16, 32, 64 counting bits.

The Counting Bloom Filter extends the normal Bloom Filter by remove and removeAll methods:

//What only the Counting Bloom filter can do:
cbf.remove("http://google.com");
print(cbf.contains("http://google.com")); //false

To handle overflows (which is unlikely to ever be an issue) you can set an overflow callback:

//Use the Memory Bloom filter explicitly (for the overflow method):
FilterBuilder fb = new FilterBuilder(1000, 0.01).countingBits(4);
CountingBloomFilterMemory<String> filter = new CountingBloomFilterMemory<>(fb);
filter.setOverflowHandler(() -> print("ups"));
for (int i = 1; i < 20; i++) {
        print("Round " + i);
        filter.add("http://example.com"); //Causes onOverflow() in Round >= 16
}

To understand the inner workings of the Counting Bloom filter lets actually look at the bits of a small filter:

CountingBloomFilter<String> small = new FilterBuilder(3, 0.2)
                .countingBits(4)
                .buildCountingBloomFilter();
small.add("One"); small.add("Two"); small.add("Three");
print(small.toString());

This gives:

Bloom Filter Parameters: size = 11, hashes = 3, Bits: {0, 2, 6, 8, 10}
1 0001
0 0000
1 0001
0 0000
0 0000
0 0000
1 0001
0 0000
1 0001
0 0000
1 0101

The Counting Bloom filter thus has a bit size of 11, uses 3 hash functions and 4 bits for counting. The first row is the materialized bit array of all counters > 0. Explicitly saving it makes contains calls fast and generation when transferring the Counting Bloom Filter flattened to a Bloom filter.

Redis Bloom Filters

Bloom filters are really interesting as they allow very high throughput and minimal latency for adding and querying (and removing). Therefore you might want to use them across the boundaries of a single machine. For instance imagine you run a large scale web site or web service. You have a load balancer distributing the request load over several front-end web servers. You now want to store some information with a natural set structure, say, you want to know if a source IP address has accessed the requested URL in the past. You could achieve that by either explicitly storing that information (probably in a database) which will soon be a bottleneck if you serve billions of requests a day. Or you employ a shared Bloom filter and accept a small possibility of false positives.

These kind of use-cases are ideal for the Redis-backed Bloom filters of this library. They have the same Java Interfaces as the normal and Counting Bloom filter but store the Bloom filter bits in the in-memory key-value store Redis.

Reasons to use these Redis-backed Bloom filters instead of their pure Java brothers are:

Using the Redis-backed Bloom filter is straightforward:

  1. Install Redis. This is extremely easy: see Redis Installation.
  2. Start Redis with $ redis-server. The server will listen on port 6379.
  3. In your application (might be on a different machine) instantiate a Redis-backed Bloom filter giving the IP or host name of Redis and its port and the number of concurrent connections to the FilterBuilder using redisHost, redisPort, redisConnections.

The Redis-backed Bloom filters have the same interface as the normal Bloom filters and can be constructed through the FilterBuilder, too:

String host = "localhost";
int port = 6379;
String filterName = "normalbloomfilter";
//Open a Redis-backed Bloom filter
BloomFilter<String> bfr = new FilterBuilder(1000, 0.01)
    .name(filterName) //use a distinct name
    .redisBacked(true)
    .redisHost(host) //Default is localhost
    .redisPort(port) //Default is standard 6379
    .buildBloomFilter();

bfr.add("cow");

//Open the same Bloom filter from anywhere else
BloomFilter<String> bfr2 = new FilterBuilder(1000, 0.01)
    .name(filterName) //load the same filter
    .redisBacked(true)
    .buildBloomFilter();
bfr2.add("bison");

print(bfr.contains("cow")); //true
print(bfr.contains("bison")); //true

The Redis-backed Bloom filters are concurrency/thread-safe at the backend as-well-as in Java. That means you can concurrently insert from any machine without running into anomalies, inconsistencies or lost data. The Redis-backed Bloom filters are implemented using efficient Redis bit arrays. They make heavy use of pipelining so that every add and contains call only needs one round-trip. This is the most performance critical aspect and usually not found in other implementations which need one round-trip for every Bit or worse. Moreover, Redis connections are pooled so they are reused, while profiting from concurrent use.

The Redis-backed Bloom filters save their metadata (like number and kind of hash functions) in Redis, too. Thus other clients can easily to connect to a Redis instance that already holds a Bloom filter with a given name and specify whether to use or overwrite it.

Redis Counting Bloom Filters

The Redis Counting Bloom filter saves the counters as separate counters in a compact Redis hash and keeps the materialized flat Bloom filter as a bit array. It is compatatible with Redis 2.4 or higher.

//Open a Redis-backed Counting Bloom filter
CountingBloomFilter<String> cbfr = new FilterBuilder(10000, 0.01)
                .name("myfilter")
                .overwriteIfExists(true) //instead of loading it, overwrite it if it's already there
                .redisBacked(true)
                .buildCountingBloomFilter();
        cbfr.add("cow");

        //Open a second Redis-backed Bloom filter with a new connection
        CountingBloomFilter<String> bfr2 = new FilterBuilder(10000, 0.01)
                .name("myfilter") //this time it will be load it
                .redisBacked(true)
                .buildCountingBloomFilter();
        bfr2.add("bison");
        bfr2.remove("cow");

        print(cbfr.contains("bison")); //true
        print(cbfr.contains("cow")); //false

Redis Bloom Filter Read Slaves

If your workloads on the Bloom filter are really high-throughput you can leverage read-slaves. They will be queried for any reading operations: contains, fetching of the bit set, estimation methods (population, count, etc.):

CountingBloomFilter<String> filter = new FilterBuilder(m,k)
                .name("slavetest")
                .redisBacked(true)
                .addReadSlave(host, port +1); //add slave
                .addReadSlave(host, port +2); //and another
                .buildCountingBloomFilter() //or a normal one

filter.containsAll(items); //directed to the slave
filter.getEstimatedPopulation(); //that one too
filter.getEstimatedCount("abc"); //dito
filter.getBitSet(); //and again

Redis Sentinel Bloom Filters

To configure a Bloom Filter to use Sentinel to find the master Redis node, when building the FilterBuilder explicitly define a Sentinel configuration and provide your own Pool.

In the following example the Sentinel Nodes are a simple Set of form "host:port" and the sentinelClusterName is the name of Sentinel you want to connect to

        return new BloomFilterRedis<>(new FilterBuilder(n, p).hashFunction(hm)
                .redisBacked(true)
                .name(name)
                .pool(RedisPool.sentinelBuilder()
                      .master(sentinelClusterName)
                      .sentinels(getSentinelNodes())
                      .database(database)
                      .redisConnections(connections)
                      .build())
                .overwriteIfExists(overwrite)
                .redisConnections(connections).complete());

JSON Representation

To easily transfer a Bloom filter to a client (for instance via an HTTP GET) there is a JSON Converter for the Bloom filters. All Bloom filters are implemented so that this generation option is very cheap (i.e. just sequentially reading it from memory). It works for all Bloom filters including the ones backed by Redis.

BloomFilter<String> bf = new FilterBuilder().expectedElements(50).falsePositiveProbability(0.1).buildBloomFilter();
bf.add("Ululu");
JsonElement json = BloomFilterConverter.toJson(bf);
print(json); //{"size":240,"hashes":4,"HashMethod":"MD5","bits":"AAAAEAAAAACAgAAAAAAAAAAAAAAQ"}
BloomFilter<String> otherBf = BloomFilterConverter.fromJson(json);
print(otherBf.contains("Ululu")); //true

JSON is not an ideal format for binary content (Base64 only uses 64 out of 94 possible characters) but it's highly interoperable and easy to read which outweighs the slight waste of space. Combining it with a Content-Encoding like gzip usually compensates the overhead.

Moreover, the Memory Counting Bloom filter can also be serialized and deserialized in the normal Java way.

Hash Functions

There is a detailed description of the available hash functions in the Javadocs of the HashMethod enum. Hash uniformity (i.e. all bits of the Bloom filter being equally likely) is of great importance for the false positive rate. But there is also an inherent trade-off between hash uniformity and speed of computation. For instance cryptographic hash functions have very good distribution properties but are very CPU intensive. Pseudorandom number generators like the linear congruential generator are easy to compute but do not have perfectly random outputs but rather certain distribution patterns which for some inputs are notable and for others are negligible. The implementations of all hash functions are part of the BloomFilter class and use tricks like rejection sampling to get the best possible distribution for the respective hash function type.

Here is a Box plot overview of how good the different hash functions perform (Intel i7 with 4 cores, 16 GB RAM). The configuration is 100000 hashes using k = 10, m = 1000 averaged over 20 runs.

Speed of computation doesn't tell anything about the quality of hash values. A good hash function is one, which has a discrete uniform distribution of outputs. That means that every bit of the Bloom filter's bit vector is equally likely to bet set. To measure if and how good the hash functions follow a uniform distribution goodness of fit Chi-Square hypothesis tests are the mathematical instrument of choice.

Here are some of the results. The inputs are random strings. The p-value is the probability of getting a statistical result that is at least as extreme as the obtained result. So the usual way of hypothesis testing would be rejecting the null hypothesis ("the hash hash function is uniformly distributed") if the p-value is smaller than 0.05. We did 100 Chi-Square Tests:

If about 5 runs fail the test an 95 pass it, we can be very confident that the hash function is indeed uniformly distributed. For random inputs it is relatively easy though, so we also tested other input distribution, e.g. increasing integers:

Here the LCG is too evenly distributed (due to its modulo arithmetics) which is a good thing here, but shows, that LCGs do not have a random uniform distribution.

The performance optimization of using two hash functions and combining them through the formula hash_i = hash1 + i * hash2 as suggested by Kirsch and Mitzenmacher is theoretically sound as asymptotically hash values are perfectly uniform given to perfect hash values. In practice however, the distribution grows uneven for some inputs (the Cassandra team which uses this trick should have a look at that).

Now a real example of inserting random words in the Bloom filter with the resulting false positive rate after 30000 inserted elements demanding a false positive probability of 0.01:

Hash functionSpeed  (ms)f during insert (%)f final (%)
RNG80.5292510,1381,024
CarterWegman1007.667430,1990,956
SecureRNG309.6195820,1850,902
CRC3267.4845890,1211,007
Adler3283.32796810,07422,539
Murmur2100.205180,1751,047
Murmur3100.2439020,1890,993
Murmur3KirschMitzenmacher45.9994540,1620,852
FNVWithLCG42.258810,1450,960
MD23635.4655650,1380,869
MD5207.8235320,1480,936
SHA1213.7559360,1890,923
SHA256223.0604220,1651,054
SHA384176.0033450,1650,832
SHA512172.4446480,1521,064

In summary, cryptographic hash functions offer the most consistent uniform distribution, but are slightly more expensive to compute. LCGs, for instance Java Random, perform quite well in most cases and are cheap to compute. The best compromise seems to be the Murmur 3 hash function, which has a good distribution and is quite fast to compute.

It's also possible to provide a custom hash function:

BloomFilter<String> bf = new FilterBuilder(1000, 0.01)
    .hashFunction((value, m, k) -> null)
    .buildBloomFilter();

Performance

To get meaningful results, the Bloom filters should be tested on machines where they are to be run. The test package contains a benchmark procedure (the test packages relies on the Apache Commons Math library):

//Test the performance of the in-memory Bloom filter
BloomFilter<String> bf = new FilterBuilder(100_000, 0.01).hashFunction(HashMethod.Murmur3).buildBloomFilter();
MemoryBFTest.benchmark(bf, "Normal Bloom Filter", 1_000_000);

This gives over 2 Mio operations per second (on my laptop):

Normal Bloom Filter
hashes = 7 falsePositiveProbability = 3.529780138533512E-281 expectedElements = 1000000 size = 958506
add(): 0.687s, 1455604.0757 elements/s
addAll(): 0.47s, 2127659.5745 elements/s
contains(), existing: 0.472s, 2118644.0678 elements/s
contains(), nonexisting: 0.445s, 2247191.0112 elements/s
100000 hash() calls: 0.008s, 1.25E7 elements/s
Hash Quality (Chi-Squared-Test): p-value = 0.8041807628127277 , Chi-Squared-Statistic = 957318.7388845441

The Redis-backed and Counting Bloom filters can be tested similarly.

Overview of Probabilistic Data Structures

Data Structure Set membership: „Have I seen this item before?“ Frequency estimation: “How many of this kind have I seen?” Cardinality estimation: “How many distinct items have I seen in total” Item removal Persistence and distributed access
Memory Bloom Filter Yes, with configurable false positive probability, O(1) No Yes, O(#bits) No No
Memory Counting Bloom Filter Yes, with configurable false positive probability, O(1) Yes (Minimum Selection Algorithm), O(1) Yes, O(#bits) Yes, O(1) No
Redis Bloom Filter Yes, with configurable false positive probability, O(1), single roundtrip, scalable through read slaves No Yes, O(#bits), single roundtrip, scalable through read slaves No Yes, configurable Redis persistence & replication
Redis Counting Bloom Filter Yes, with configurable false positive probability, O(1) , single roundtrip, scalable through read slaves Yes (Minimum Selection Algorithm), O(1) , single roundtrip, scalable through read slaves Yes, O(#bits), single roundtrip, scalable through read slaves Yes, O(1), in average 2 roundtrips Yes, configurable Redis persistence & replication
Other sketches (not part of this lib) Hashsets, Bitvectors, Cuckoo Filter Count-Min-Sketch, Count-Mean-Sketch K-Minimum-Values, HyperLogLog

Up next

License

This Bloom filter library is published under the very permissive MIT license, see LICENSE file.