Closed GoogleCodeExporter closed 9 years ago
Original comment by kurt.kluever
on 24 Aug 2012 at 6:11
I wouldn't say that an *arbitrary bad* FPP and "/ by zero" is a RFE. It happens
for big sizes only, but nobody said they aren't allowed.
Original comment by Maaarti...@gmail.com
on 19 Sep 2012 at 9:08
Good suggestions, thanks for reading deeply the code! This didn't get nearly
enough focused eyeballs yet, unfortunately.
Hmm, there was a version that I was extracting two longs instead of two ints
from the hash, can't remember why I went with ints after all. I suspect I
didn't see any fpp improvement from the longs, and went for simpler arithmetic,
ignoring the case of really large BFs where longs are preferable anyway.
And yes, the modulo operator is unnecessarily performed too many times, well
spotted!
The other bugs related to very large BFs are straightforward as well.
Guava team, can someone pick this up? Pretty much the patch sums it up, though
probably without the HashCode.asSecondLong(), and the for-loops are written in
a slightly weird way. But since the serial form is designed for changes like
this, it's easy. It's a bit unlikely that anyone is affected by any of these
cases, so it's not quite severe, but it's still nice to have this in a better
shape anticipating for such cases.
Maartin, could you file a separate request for the "figure out the memory size
of a BF"? I wish we had that as well, especially since there is no constructor
that is configured by that.
Original comment by andr...@google.com
on 19 Sep 2012 at 9:40
Without `HashCode.asSecondLong()` you have to either use `HashCode.asBytes`
which means cloning or cast and access `HashCode.bytes` directly. Neither
alternative feels nicer than `asSecondLong`.
Agreed that this bug unprobably hits someone soon.
The separate RFE is the issue 1156.
Original comment by Maaarti...@gmail.com
on 2 Oct 2012 at 10:19
Kurt (Louis?), could someone take care of this? Inevitably more people are
going to hit these bugs in very big bloomfilters
Original comment by andr...@google.com
on 15 Oct 2012 at 7:33
Original comment by kak@google.com
on 23 Oct 2012 at 4:47
Can we revisit this issue? Here is a version of the patch that is updated
against the latest git master branch. The majority of the implementation is
faithful to the original patch. One small bugfix: in BitArray#get() and
BitArray#put() the right shift operations should be unsigned. I also cleaned up
the looping constructs to make them easier to understand. I introduced
HashCode#asLeastSignificantLong() and HashCode#asMostSignificantLong() methods.
The names of the methods are intended to be similar to the names of the methods
in java.util.UUID. With the introduction of a new enum in BloomFilterStrategies
these patches are backwards compatible. This fixes
https://code.google.com/p/guava-libraries/issues/detail?id=1656.
Original comment by spie...@addthis.com
on 6 Mar 2014 at 11:50
Attachments:
Original comment by kevinb@google.com
on 12 Mar 2014 at 2:19
OK, I think we've got a version we're going to check in. Can both spiegel@ and
Maaartinus patch and test out this new BF strategy (patch against head, not
Guava 16)? Also if you see anything obvious, please point that out as well.
MURMUR128_MITZ_64() {
@Override
public <T> boolean put(T object, Funnel<? super T> funnel,
int numHashFunctions, BitArray bits) {
byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).asBytes();
long bitSize = bits.bitSize();
long hash1 = lowerEight(bytes);
long hash2 = upperEight(bytes);
boolean bitsChanged = false;
long index = hash1 + hash2;
for (int i = 0; i < numHashFunctions; i++) {
index &= Long.MAX_VALUE; // make it positive
index %= bitSize; // make it indexable
bitsChanged |= bits.set(index);
index += hash2;
}
return bitsChanged;
}
@Override
public <T> boolean mightContain(T object, Funnel<? super T> funnel,
int numHashFunctions, BitArray bits) {
byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).asBytes();
long bitSize = bits.bitSize();
long hash1 = lowerEight(bytes);
long hash2 = upperEight(bytes);
long index = hash1 + hash2;
for (int i = 0; i < numHashFunctions; i++) {
index &= Long.MAX_VALUE; // make it positive
index %= bitSize; // make it indexable
if (!bits.get(index)) {
return false;
}
index += hash2;
}
return true;
}
private /* static */ long lowerEight(byte[] bytes) {
return Longs.fromBytes(
bytes[7], bytes[6], bytes[5], bytes[4], bytes[3], bytes[2], bytes[1], bytes[0]);
}
private /* static */ long upperEight(byte[] bytes) {
return Longs.fromBytes(
bytes[15], bytes[14], bytes[13], bytes[12], bytes[11], bytes[10], bytes[9], bytes[8]);
}
Please let me know.
Thanks,
-kak
Original comment by kak@google.com
on 27 Mar 2014 at 8:50
Also we're having a bit of internal debate over the methodology of:
if (index < 0) {
index = ~index;
}
vs.
index &= Long.MAX_VALUE;
They both accomplish similar goals, but should one be preferable to the other?
The branchless approach is more attractive runtime-wise, but it's not flipping
the bits (just making the result positive).
Any thoughts?
Original comment by kak@google.com
on 27 Mar 2014 at 9:38
Yes I don't think it matters what happens to the bits as long as the
transformation doesn't mess up the distribution of values and that the result
is always positive. Another thought I had was applying index >>>= 1. It removes
the sign bit and drops the least significant bit. All three proposals are OK
with me. I kind of prefer the two solutions that do not have a branch but for
me that's not a big concern.
Original comment by spie...@addthis.com
on 27 Mar 2014 at 9:43
Running spiegel@' BloomFilterTest says
Observed false positive rate: 1.4666666666666666E-7
Expected false positive rate:9.196321345133148E-8
which is much better than it was but still wrong.
I see no point in `if (nextHash < 0) nextHash = ~nextHash;`. It gets compiled
into a conditional move or maybe into `index ^= index >> 63`, but IMHO it's not
enough mixing for the buck. If the last mixing operation was a multiplication,
the right shift would be clearly better than masking, but given how
`Murmur3_128Hasher.makeHash` ends, I guess that the LSb and MSb are about
equally good. Given that masking needs a 64-bit operand, I'd go for the shift.
Concerning the speed, there's something wrong:
normal:
https://microbenchmarks.appspot.com/runs/7f9093a6-52c7-463d-8438-19f3295b4723
scaled:
https://microbenchmarks.appspot.com/runs/85f26479-000a-4c06-82c5-cd4254013411#r:
scenario.benchmarkSpec.parameters.desiredFpp,scenario.benchmarkSpec.parameters.e
xpectedInsertions,scenario.benchmarkSpec.parameters.strategy&c:scenario.benchmar
kSpec.methodName
It may be caused by the 64-bit division or by `asBytes`. Both could be
eliminated and I'd bet that most of my optimizations for 32 bits would apply
here as well. I guess I'll give it a try next week.
Original comment by Maaarti...@gmail.com
on 28 Mar 2014 at 5:04
Attachments:
There are some errors in my attached BloomFilterStrategies, especially many
shifts must be unsigned. However, fixing it doesn't change the speed.
Original comment by Maaarti...@gmail.com
on 28 Mar 2014 at 5:31
FYI, in the current iteration of this we're avoiding the asBytes() call with a
package-private backdoor on HashCode that (for a byte-array based HashCode such
as this one) returns the internal array directly without cloning.
Original comment by cgdecker@google.com
on 28 Mar 2014 at 4:27
Nice. Still, my 64-bit version is by 30-50% faster than the original
MURMUR128_MITZ_32 (which needs `asLong()` only). Currently it's too hacky, but
I'll come back when I polish it a bit.
I found a nice modulus eliminating hack.
I think that the original BloomFilterTest was (slightly) wrong, with my updated
version the false positive rates are fine for the 64-bit version.
Original comment by Maaarti...@gmail.com
on 28 Mar 2014 at 5:03
Attachments:
First, congrats on using nearly 30 minutes walltime of my 6-Core Xeon
workstation with the GuavaBloomTest ;-)
Second, here's are my current results:
MURMUR128_MITZ_32
Desired false positive rate: 1.0E-75
Inserting 0% 8% 15% 23% 31% 38% 46% 53% 61% 69% 76% 84% 92% 99% done
Expected false positive rate: 9.83005825E-316
Testing 0% 8% 15% 23% 31% 38% 46% 53% 61% 69% 76% 84% 92% 99% done
Observed false positive rate: 0.9992650567524934
Factor: Infinity
MURMUR128_MITZ_64
Desired false positive rate: 1.0E-75
Inserting 0%^[ 8% 15% 23% 31% 38% 46% 53% 61% 69% 76% 84% 92% 99% done
Expected false positive rate: 1.0003117180465422E-75
Testing 0% 8% 15% 23% 31% 38% 46% 53% 61% 69% 76% 84% 92% 99% done
Observed false positive rate: 0.0
Factor: 0.00
These numbers seem wrong. Is there something wrong with the tool?
Original comment by kak@google.com
on 28 Mar 2014 at 6:51
Oh sorry.... I fixed the formula but blew the inputs. Change it to something
sane (even 1e-10 is fine, but 1e-75 was only an exercise with maximum memory).
I believe that the numbers are correct: MURMUR128_MITZ_32 is completely broken
because of overflow. It sets only a few bits, so it reports unbelievable small
expected fpp and end up with an terrible fpp.
For MURMUR128_MITZ_64 the expected fpp was 1e-75 and there was no single false
positive. That's fine. With proper figures you'll see proper data. Sorry again.
Original comment by Maaarti...@gmail.com
on 28 Mar 2014 at 7:23
OK, I re-ran the tests with 1e-10. Glad to see we have a reproducible failure
case for MITZ_32.
MURMUR128_MITZ_32 (what's currently committed in Guava):
Desired false positive rate: 1.0E-10
Inserting 0% 8% 15% 23% 31% 38% 46% 53% 61% 69% 76% 84% 92% 99% done
Expected false positive rate: 1.6110489235558792E-16
Testing 0% 8% 15% 23% 31% 38% 46% 53% 61% 69% 76% 84% 92% 99% done
Observed false positive rate: 0.0011919304361762886
Factor: 7398474489188.57
MURMUR128_MITZ_64 (my proposed fix, source is attached below):
Desired false positive rate: 1.0E-10
Inserting 0% 8% 15% 23% 31% 38% 46% 53% 61% 69% 76% 84% 92% 99% done
Expected false positive rate: 9.998270307088567E-11
Testing 0% 8% 15% 23% 31% 38% 46% 53% 61% 69% 76% 84% 92% 99% done
Observed false positive rate: 0.0
Factor: 0.00
MURMUR128_MITZ_64() {
@Override
public <T> boolean put(T object, Funnel<? super T> funnel,
int numHashFunctions, BitArray bits) {
long bitSize = bits.bitSize();
byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal(); // no byte[] clone!
long hash1 = lowerEight(bytes);
long hash2 = upperEight(bytes);
boolean bitsChanged = false;
long combinedHash = hash1 + hash2;
for (int i = 0; i < numHashFunctions; i++) {
// Make the combined hash positive and indexable
bitsChanged |= bits.set((combinedHash & Long.MAX_VALUE) % bitSize);
combinedHash += hash2;
}
return bitsChanged;
}
@Override
public <T> boolean mightContain(T object, Funnel<? super T> funnel,
int numHashFunctions, BitArray bits) {
long bitSize = bits.bitSize();
byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal(); // no byte[] clone!
long hash1 = lowerEight(bytes);
long hash2 = upperEight(bytes);
long combinedHash = hash1 + hash2;
for (int i = 0; i < numHashFunctions; i++) {
// Make the combined hash positive and indexable
if (!bits.get((combinedHash & Long.MAX_VALUE) % bitSize)) {
return false;
}
combinedHash += hash2;
}
return true;
}
private /* static */ long lowerEight(byte[] bytes) {
return Longs.fromBytes(
bytes[7], bytes[6], bytes[5], bytes[4], bytes[3], bytes[2], bytes[1], bytes[0]);
}
private /* static */ long upperEight(byte[] bytes) {
return Longs.fromBytes(
bytes[15], bytes[14], bytes[13], bytes[12], bytes[11], bytes[10], bytes[9], bytes[8]);
}
};
I realize yours may contain a few more small optimizations, but I'd like to
prefer clarity and correctness over speed. If you can confirm that this fix is
correct, we can discuss the tradeoffs between clarity and speed.
Original comment by kak@google.com
on 28 Mar 2014 at 7:37
+1 this looks good. I'll rerun the tests and I expect them to pass. I'm going
to run the latest version (#18) in a profiler and look for any bottlenecks.
Original comment by spie...@addthis.com
on 28 Mar 2014 at 7:54
Looks good, maybe except for `long combinedHash = hash1 + hash2;`, where `long
combinedHash = hash1` would be both clearer and (a very tiny bit) faster
(there's really no point in combining the longs as the Murmur's last two steps
are `h1 += h2; h2 += h1;`).
The problem with *later* optimizations is serialization. When people start
saving their filter, you can't really change it anymore. You can create a new
strategy, but you should maintain them all forever. No nice
implementation-independent SerialForm works here.
Unfortunately, there's no single big optimization here, just a couple of 10%
speedup for 10 minutes headache trade-offs. The good news is that the new
version is faster than the broken one:
https://microbenchmarks.appspot.com/runs/5d98efb8-9013-460b-9c0f-a576e6d158ba
Original comment by Maaarti...@gmail.com
on 28 Mar 2014 at 8:33
I think I'm going to leave the extra addition in. The paper that's referenced
in the Javadoc, "Less Hashing, Same Performance: Building a Better Bloom
Filter" (http://www.eecs.harvard.edu/~kirsch/pubs/bbbf/rsa.pdf) uses this
formula:
gi(x) = h1(x) + i*h2(x)
Which is what we've got now. I'd rather leave it as is to match the paper (for
clarity sake), than to tinker with it more.
Thanks for all the help Martin + Spiegel. Submitting shortly!
Original comment by kak@google.com
on 28 Mar 2014 at 8:41
Right, but they also write "In our context i will range from 0 up to some
number k − 1 to give k hash functions, and ...", so their very first `gi` is
`h1` rather than `h1+h2`.
Original comment by Maaarti...@gmail.com
on 28 Mar 2014 at 9:26
Ugh, not sure how I missed that...made the appropriate change and re-ran the
GuavaBloomTest. Results:
Desired false positive rate: 1.0E-10
Inserting 0% 8% 15% 23% 31% 38% 46% 53% 61% 69% 76% 84% 92% 99% done
Expected false positive rate: 9.998725915640892E-11
Testing 0% 8% 15% 23% 31% 38% 46% 53% 61% 69% 76% 84% 92% 99% done
Observed false positive rate: 0.0
Factor: 0.00
Original comment by kak@google.com
on 28 Mar 2014 at 9:42
This issue has been migrated to GitHub.
It can be found at https://github.com/google/guava/issues/<id>
Original comment by cgdecker@google.com
on 1 Nov 2014 at 4:13
Original comment by cgdecker@google.com
on 3 Nov 2014 at 9:08
Original issue reported on code.google.com by
Maaarti...@gmail.com
on 24 Aug 2012 at 5:40Attachments: