Balzanka / guava-libraries

Automatically exported from code.google.com/p/guava-libraries
Apache License 2.0
0 stars 0 forks source link

BloomFilter broken when really big #1119

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
**BUG 1:
When the number of bits in a BloomFilter gets high, it's FPP is much worse than 
expected. The culprit is the modular arithmetic in 
BloomFilterStrategies.MURMUR128_MITZ_32.

You compute `x%N` where `N=bits.size()` and `x` is uniformly distributed in 
range `0..Integer.MAX_VALUE`. For big `N`, `x%N` is far from uniform in range 
`0..N-1`. For example, with `N=3<<29`, values below `1<<29` are twice as 
probable as the others.

This non-uniformity leads to results like this:
desiredFpp   0.000001000000
expectedFpp  0.000000610089
realFpp      0.000003000000

Here, `desiredFpp` is the value used in `BloomFilter.create`, `expectedFpp` was 
reported after exactly `expectedInsertions` were done. Obviously, much fewer 
bits than expected were set. If this happened once, it might be a good luck but 
here it's a sign of this bug, as the `realFpp` shows.

This problem is well reproducible, it's no glitch caused by bad luck with 
selected values. AFAIK it concerns all versions since the switch from powers of 
two.

**BUG 2:
With "31149b4 The number of bits can reach Integer.MAX_VALUE now, rather than 
Integer.MAX_VALUE/64" another bug was introduced. The commit message is 
obviously wrong, as there can be allocated up to `Integer.MAX_VALUE` longs, 
allowing nearly `2**37` bits. However, the arithmetic is still int-based and 
allows to address only `2**31` bits. So most of the allocated memory get wasted.

Even worse, `bits.size()` may overflow leading to all kinds of disaster, like 
"/ by zero" (e.g. for expectedInsertions=244412641 and desiredFpp=1e-11) or 
using only 64 bits.

**INEFFICIENCY:
In `MURMUR128_MITZ_32` there are one modulus operation and one unpredictable 
branch per hash function. This is quite wasteful, as it's enough to compute 
modulus for the basic two hashes and than use conditional subtraction.

**ENHANCEMENT 1:
As the filter may take up to 16 GB, there should be a method to find out the 
memory consumption.

**ENHANCEMENT 2:
Possibly there could be a strategy using a power of two table, which may be 
faster. In case the speed up is non-negligible, such a strategy makes a lot of 
sense, as the additional memory (assuming rounding up) is not wasted at all -- 
you get better FPP.

QUESTION:
I see no reason for limiting `numHashFunctions` to 25.5 In the `SerialForm`, 
there's an `int`, so why?

**PROPOSED SOLUTION:
Because of serialized form compatibility, I'd suggest to leave 
MURMUR128_MITZ_32 alone, and create MURMUR128_MITZ_64, which
- extracts two longs instead of two ints from the HashCode
- uses long arithmetic for everything

The `BitArray` must use long indexes, long `bitCount()` and `size()`, too. This 
works with both strategies and the `SerialForm` needs no change.

For small filters (up to few millions bits, before the non-uniformity starts to 
make problems) it's fine to use the old strategy,
for larger the new one must get used. I'd suggest to use the new one for all 
new filters.

In order to get maximum speed, the following comes in mind:
- create package-private `HashCode.asSecondLong()`
- compute hash2 only if `numHashFunctions>1`

The attached patch solves it all.

Original issue reported on code.google.com by Maaarti...@gmail.com on 24 Aug 2012 at 5:40

Attachments:

GoogleCodeExporter commented 9 years ago

Original comment by kurt.kluever on 24 Aug 2012 at 6:11

GoogleCodeExporter commented 9 years ago
I wouldn't say that an *arbitrary bad* FPP and "/ by zero" is a RFE. It happens 
for big sizes only, but nobody said they aren't allowed.

Original comment by Maaarti...@gmail.com on 19 Sep 2012 at 9:08

GoogleCodeExporter commented 9 years ago
Good suggestions, thanks for reading deeply the code! This didn't get nearly 
enough focused eyeballs yet, unfortunately.

Hmm, there was a version that I was extracting two longs instead of two ints 
from the hash, can't remember why I went with ints after all. I suspect I 
didn't see any fpp improvement from the longs, and went for simpler arithmetic, 
ignoring the case of really large BFs where longs are preferable anyway. 

And yes, the modulo operator is unnecessarily performed too many times, well 
spotted!

The other bugs related to very large BFs are straightforward as well.

Guava team, can someone pick this up? Pretty much the patch sums it up, though 
probably without the HashCode.asSecondLong(), and the for-loops are written in 
a slightly weird way. But since the serial form is designed for changes like 
this, it's easy. It's a bit unlikely that anyone is affected by any of these 
cases, so it's not quite severe, but it's still nice to have this in a better 
shape anticipating for such cases.

Maartin, could you file a separate request for the "figure out the memory size 
of a BF"? I wish we had that as well, especially since there is no constructor 
that is configured by that. 

Original comment by andr...@google.com on 19 Sep 2012 at 9:40

GoogleCodeExporter commented 9 years ago
Without `HashCode.asSecondLong()` you have to either use `HashCode.asBytes` 
which means cloning or cast and access `HashCode.bytes` directly. Neither 
alternative feels nicer than `asSecondLong`.

Agreed that this bug unprobably hits someone soon.

The separate RFE is the issue 1156.

Original comment by Maaarti...@gmail.com on 2 Oct 2012 at 10:19

GoogleCodeExporter commented 9 years ago
Kurt (Louis?), could someone take care of this? Inevitably more people are 
going to hit these bugs in very big bloomfilters

Original comment by andr...@google.com on 15 Oct 2012 at 7:33

GoogleCodeExporter commented 9 years ago

Original comment by kak@google.com on 23 Oct 2012 at 4:47

GoogleCodeExporter commented 9 years ago
Can we revisit this issue? Here is a version of the patch that is updated 
against the latest git master branch. The majority of the implementation is 
faithful to the original patch. One small bugfix: in BitArray#get() and 
BitArray#put() the right shift operations should be unsigned. I also cleaned up 
the looping constructs to make them easier to understand. I introduced 
HashCode#asLeastSignificantLong() and HashCode#asMostSignificantLong() methods. 
The names of the methods are intended to be similar to the names of the methods 
in java.util.UUID. With the introduction of a new enum in BloomFilterStrategies 
these patches are backwards compatible. This fixes 
https://code.google.com/p/guava-libraries/issues/detail?id=1656.

Original comment by spie...@addthis.com on 6 Mar 2014 at 11:50

Attachments:

GoogleCodeExporter commented 9 years ago

Original comment by kevinb@google.com on 12 Mar 2014 at 2:19

GoogleCodeExporter commented 9 years ago
OK, I think we've got a version we're going to check in. Can both spiegel@ and 
Maaartinus patch and test out this new BF strategy (patch against head, not 
Guava 16)? Also if you see anything obvious, please point that out as well.

  MURMUR128_MITZ_64() {
    @Override
    public <T> boolean put(T object, Funnel<? super T> funnel,
        int numHashFunctions, BitArray bits) {
      byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).asBytes();
      long bitSize = bits.bitSize();
      long hash1 = lowerEight(bytes);
      long hash2 = upperEight(bytes);

      boolean bitsChanged = false;
      long index = hash1 + hash2;
      for (int i = 0; i < numHashFunctions; i++) {
        index &= Long.MAX_VALUE; // make it positive
        index %= bitSize; // make it indexable
        bitsChanged |= bits.set(index);
        index += hash2;
      }
      return bitsChanged;
    }

    @Override
    public <T> boolean mightContain(T object, Funnel<? super T> funnel,
        int numHashFunctions, BitArray bits) {
      byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).asBytes();
      long bitSize = bits.bitSize();
      long hash1 = lowerEight(bytes);
      long hash2 = upperEight(bytes);

      long index = hash1 + hash2;
      for (int i = 0; i < numHashFunctions; i++) {
        index &= Long.MAX_VALUE; // make it positive
        index %= bitSize; // make it indexable
        if (!bits.get(index)) {
          return false;
        }
        index += hash2;
      }
      return true;
    }

    private /* static */ long lowerEight(byte[] bytes) {
      return Longs.fromBytes(
          bytes[7], bytes[6], bytes[5], bytes[4], bytes[3], bytes[2], bytes[1], bytes[0]);
    }

    private /* static */ long upperEight(byte[] bytes) {
      return Longs.fromBytes(
          bytes[15], bytes[14], bytes[13], bytes[12], bytes[11], bytes[10], bytes[9], bytes[8]);
    }

Please let me know.

Thanks,
-kak

Original comment by kak@google.com on 27 Mar 2014 at 8:50

GoogleCodeExporter commented 9 years ago
Also we're having a bit of internal debate over the methodology of:
  if (index < 0) {
    index = ~index;
  }
vs.
  index &= Long.MAX_VALUE;

They both accomplish similar goals, but should one be preferable to the other? 
The branchless approach is more attractive runtime-wise, but it's not flipping 
the bits (just making the result positive).

Any thoughts?

Original comment by kak@google.com on 27 Mar 2014 at 9:38

GoogleCodeExporter commented 9 years ago
Yes I don't think it matters what happens to the bits as long as the 
transformation doesn't mess up the distribution of values and that the result 
is always positive. Another thought I had was applying index >>>= 1. It removes 
the sign bit and drops the least significant bit. All three proposals are OK 
with me. I kind of prefer the two solutions that do not have a branch but for 
me that's not a big concern.

Original comment by spie...@addthis.com on 27 Mar 2014 at 9:43

GoogleCodeExporter commented 9 years ago
Running spiegel@' BloomFilterTest says
Observed false positive rate: 1.4666666666666666E-7
Expected false positive rate:9.196321345133148E-8
which is much better than it was but still wrong.

I see no point in `if (nextHash < 0) nextHash = ~nextHash;`. It gets compiled 
into a conditional move or maybe into `index ^= index >> 63`, but IMHO it's not 
enough mixing for the buck. If the last mixing operation was a multiplication, 
the right shift would be clearly better than masking, but given how 
`Murmur3_128Hasher.makeHash` ends, I guess that the LSb and MSb are about 
equally good. Given that masking needs a 64-bit operand, I'd go for the shift.

Concerning the speed, there's something wrong:

normal: 
https://microbenchmarks.appspot.com/runs/7f9093a6-52c7-463d-8438-19f3295b4723

scaled: 
https://microbenchmarks.appspot.com/runs/85f26479-000a-4c06-82c5-cd4254013411#r:
scenario.benchmarkSpec.parameters.desiredFpp,scenario.benchmarkSpec.parameters.e
xpectedInsertions,scenario.benchmarkSpec.parameters.strategy&c:scenario.benchmar
kSpec.methodName

It may be caused by the 64-bit division or by `asBytes`. Both could be 
eliminated and I'd bet that most of my optimizations for 32 bits would apply 
here as well. I guess I'll give it a try next week.

Original comment by Maaarti...@gmail.com on 28 Mar 2014 at 5:04

Attachments:

GoogleCodeExporter commented 9 years ago
There are some errors in my attached BloomFilterStrategies, especially many 
shifts must be unsigned. However, fixing it doesn't change the speed.

Original comment by Maaarti...@gmail.com on 28 Mar 2014 at 5:31

GoogleCodeExporter commented 9 years ago
FYI, in the current iteration of this we're avoiding the asBytes() call with a 
package-private backdoor on HashCode that (for a byte-array based HashCode such 
as this one) returns the internal array directly without cloning.

Original comment by cgdecker@google.com on 28 Mar 2014 at 4:27

GoogleCodeExporter commented 9 years ago
Nice. Still, my 64-bit version is by 30-50% faster than the original 
MURMUR128_MITZ_32 (which needs `asLong()` only). Currently it's too hacky, but 
I'll come back when I polish it a bit.

I found a nice modulus eliminating hack.

I think that the original BloomFilterTest was (slightly) wrong, with my updated 
version the false positive rates are fine for the 64-bit version.

Original comment by Maaarti...@gmail.com on 28 Mar 2014 at 5:03

Attachments:

GoogleCodeExporter commented 9 years ago
First, congrats on using nearly 30 minutes walltime of my 6-Core Xeon 
workstation with the GuavaBloomTest ;-)

Second, here's are my current results:

MURMUR128_MITZ_32
Desired false positive rate: 1.0E-75
Inserting 0% 8% 15% 23% 31% 38% 46% 53% 61% 69% 76% 84% 92% 99% done
Expected false positive rate: 9.83005825E-316
Testing 0% 8% 15% 23% 31% 38% 46% 53% 61% 69% 76% 84% 92% 99% done
Observed false positive rate: 0.9992650567524934
Factor: Infinity

MURMUR128_MITZ_64
Desired false positive rate: 1.0E-75
Inserting 0%^[ 8% 15% 23% 31% 38% 46% 53% 61% 69% 76% 84% 92% 99% done
Expected false positive rate: 1.0003117180465422E-75
Testing 0% 8% 15% 23% 31% 38% 46% 53% 61% 69% 76% 84% 92% 99% done
Observed false positive rate: 0.0
Factor: 0.00

These numbers seem wrong. Is there something wrong with the tool?

Original comment by kak@google.com on 28 Mar 2014 at 6:51

GoogleCodeExporter commented 9 years ago
Oh sorry.... I fixed the formula but blew the inputs. Change it to something 
sane (even 1e-10 is fine, but 1e-75 was only an exercise with maximum memory).

I believe that the numbers are correct: MURMUR128_MITZ_32 is completely broken 
because of overflow. It sets only a few bits, so it reports unbelievable small 
expected fpp and end up with an terrible fpp.

For MURMUR128_MITZ_64 the expected fpp was 1e-75 and there was no single false 
positive. That's fine. With proper figures you'll see proper data. Sorry again.

Original comment by Maaarti...@gmail.com on 28 Mar 2014 at 7:23

GoogleCodeExporter commented 9 years ago
OK, I re-ran the tests with 1e-10. Glad to see we have a reproducible failure 
case for MITZ_32.

MURMUR128_MITZ_32 (what's currently committed in Guava):
Desired false positive rate: 1.0E-10
Inserting 0% 8% 15% 23% 31% 38% 46% 53% 61% 69% 76% 84% 92% 99% done
Expected false positive rate: 1.6110489235558792E-16
Testing 0% 8% 15% 23% 31% 38% 46% 53% 61% 69% 76% 84% 92% 99% done
Observed false positive rate: 0.0011919304361762886
Factor: 7398474489188.57

MURMUR128_MITZ_64 (my proposed fix, source is attached below):
Desired false positive rate: 1.0E-10
Inserting 0% 8% 15% 23% 31% 38% 46% 53% 61% 69% 76% 84% 92% 99% done
Expected false positive rate: 9.998270307088567E-11
Testing 0% 8% 15% 23% 31% 38% 46% 53% 61% 69% 76% 84% 92% 99% done
Observed false positive rate: 0.0
Factor: 0.00

  MURMUR128_MITZ_64() {
    @Override
    public <T> boolean put(T object, Funnel<? super T> funnel,
        int numHashFunctions, BitArray bits) {
      long bitSize = bits.bitSize();
      byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal(); // no byte[] clone!
      long hash1 = lowerEight(bytes);
      long hash2 = upperEight(bytes);

      boolean bitsChanged = false;
      long combinedHash = hash1 + hash2;
      for (int i = 0; i < numHashFunctions; i++) {
        // Make the combined hash positive and indexable
        bitsChanged |= bits.set((combinedHash & Long.MAX_VALUE) % bitSize);
        combinedHash += hash2;
      }
      return bitsChanged;
    }

    @Override
    public <T> boolean mightContain(T object, Funnel<? super T> funnel,
        int numHashFunctions, BitArray bits) {
      long bitSize = bits.bitSize();
      byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal(); // no byte[] clone!
      long hash1 = lowerEight(bytes);
      long hash2 = upperEight(bytes);

      long combinedHash = hash1 + hash2;
      for (int i = 0; i < numHashFunctions; i++) {
        // Make the combined hash positive and indexable
        if (!bits.get((combinedHash & Long.MAX_VALUE) % bitSize)) {
          return false;
        }
        combinedHash += hash2;
      }
      return true;
    }

    private /* static */ long lowerEight(byte[] bytes) {
      return Longs.fromBytes(
          bytes[7], bytes[6], bytes[5], bytes[4], bytes[3], bytes[2], bytes[1], bytes[0]);
    }

    private /* static */ long upperEight(byte[] bytes) {
      return Longs.fromBytes(
          bytes[15], bytes[14], bytes[13], bytes[12], bytes[11], bytes[10], bytes[9], bytes[8]);
    }
  };

I realize yours may contain a few more small optimizations, but I'd like to 
prefer clarity and correctness over speed. If you can confirm that this fix is 
correct, we can discuss the tradeoffs between clarity and speed.

Original comment by kak@google.com on 28 Mar 2014 at 7:37

GoogleCodeExporter commented 9 years ago
+1 this looks good. I'll rerun the tests and I expect them to pass. I'm going 
to run the latest version (#18) in a profiler and look for any bottlenecks.

Original comment by spie...@addthis.com on 28 Mar 2014 at 7:54

GoogleCodeExporter commented 9 years ago
Looks good, maybe except for `long combinedHash = hash1 + hash2;`, where `long 
combinedHash = hash1` would be both clearer and (a very tiny bit) faster 
(there's really no point in combining the longs as the Murmur's last two steps 
are `h1 += h2; h2 += h1;`).

The problem with *later* optimizations is serialization. When people start 
saving their filter, you can't really change it anymore. You can create a new 
strategy, but you should maintain them all forever. No nice 
implementation-independent SerialForm works here.

Unfortunately, there's no single big optimization here, just a couple of 10% 
speedup for 10 minutes headache trade-offs. The good news is that the new 
version is faster than the broken one: 
https://microbenchmarks.appspot.com/runs/5d98efb8-9013-460b-9c0f-a576e6d158ba 

Original comment by Maaarti...@gmail.com on 28 Mar 2014 at 8:33

GoogleCodeExporter commented 9 years ago
I think I'm going to leave the extra addition in. The paper that's referenced 
in the Javadoc, "Less Hashing, Same Performance: Building a Better Bloom 
Filter" (http://www.eecs.harvard.edu/~kirsch/pubs/bbbf/rsa.pdf) uses this 
formula:

gi(x) = h1(x) + i*h2(x)

Which is what we've got now. I'd rather leave it as is to match the paper (for 
clarity sake), than to tinker with it more.

Thanks for all the help Martin + Spiegel. Submitting shortly!

Original comment by kak@google.com on 28 Mar 2014 at 8:41

GoogleCodeExporter commented 9 years ago
Right, but they also write "In our context i will range from 0 up to some 
number k − 1 to give k hash functions, and ...", so their very first `gi` is 
`h1` rather than `h1+h2`.

Original comment by Maaarti...@gmail.com on 28 Mar 2014 at 9:26

GoogleCodeExporter commented 9 years ago
Ugh, not sure how I missed that...made the appropriate change and re-ran the 
GuavaBloomTest.  Results:

Desired false positive rate: 1.0E-10
Inserting 0% 8% 15% 23% 31% 38% 46% 53% 61% 69% 76% 84% 92% 99% done
Expected false positive rate: 9.998725915640892E-11
Testing 0% 8% 15% 23% 31% 38% 46% 53% 61% 69% 76% 84% 92% 99% done
Observed false positive rate: 0.0
Factor: 0.00

Original comment by kak@google.com on 28 Mar 2014 at 9:42

GoogleCodeExporter commented 9 years ago
This issue has been migrated to GitHub.

It can be found at https://github.com/google/guava/issues/<id>

Original comment by cgdecker@google.com on 1 Nov 2014 at 4:13

GoogleCodeExporter commented 9 years ago

Original comment by cgdecker@google.com on 3 Nov 2014 at 9:08