FastFilter / fastfilter_java

Fast Approximate Membership Filters (Java)
Apache License 2.0
238 stars 27 forks source link

XorBinaryFuse8 failed to construct on some data, at around 11511 entries #32

Closed konghuarukhr closed 2 years ago

konghuarukhr commented 2 years ago

I find it is not fixed, this is another construction failure:

public static void main(String[] args) {
int n = 11511;
long[] keys = new long[n];
for (int i = 0; i < n; i++) {
keys[i] = i;
}
XorBinaryFuse8.construct(keys);
}

will report:


...

11511 keys; arrayLength 14336 reverseOrderPos 6371 Exception in thread "main" java.lang.IllegalStateException at org.fastfilter.xor.XorBinaryFuse8.addAll(XorBinaryFuse8.java:219)



_Originally posted by @konghuarukhr in https://github.com/FastFilter/fastfilter_java/issues/31#issuecomment-997417882_
thomasmueller commented 2 years ago

This very much looks like https://github.com/FastFilter/xorfilter/pull/23

thomasmueller commented 2 years ago

I also found other cases where construction requires more retries than I would expect... One way to resolve the issue is to change the constants in calculateSegmentLength and calculateSizeFactor; I'm trying that now.

thomasmueller commented 2 years ago

I found constants that are more reliable...

--- a/fastfilter/src/main/java/org/fastfilter/xor/XorBinaryFuse8.java
+++ b/fastfilter/src/main/java/org/fastfilter/xor/XorBinaryFuse8.java
@@ -42,7 +42,8 @@ public class XorBinaryFuse8 implements Filter {
     static int calculateSegmentLength(int arity, int size) {
         int segmentLength;
         if (arity == 3) {
-            segmentLength = 1 << (int) Math.floor(Math.log(size) / Math.log(3.33) + 2.25);
+             segmentLength = 1 << (int) Math.floor(Math.log(size) / Math.log(3.33) + 2.11);
@@ -55,7 +56,8 @@ public class XorBinaryFuse8 implements Filter {
     static double calculateSizeFactor(int arity, int size) {
         double sizeFactor;
         if (arity == 3) {
-            sizeFactor = Math.max(1.125, 0.875 + 0.25 * Math.log(1000000) / Math.log(size));
+             sizeFactor = Math.max(1.125, 0.88 + 0.25 * Math.log(1000000) / Math.log(size));

Test case:

    @Test
    public void smallSizes() {
        for (int n = 1; n < 20000; n++) {
            long[] keys = new long[n];
            for (int i = 0; i < n; i++) {
                keys[i] = i;
            }
            XorBinaryFuse8 f = XorBinaryFuse8.construct(keys);
            if (n % 100 == 0) {
                System.out.println("n=" + n + " " + f.toString());
            }            
        }
        for (int n = 20000, x = 0; n < 1000000; n = (int) ((n * 1.0005) + 10), x++) {
            long[] keys = new long[n];
            for (int i = 0; i < n; i++) {
                keys[i] = i;
            }
            XorBinaryFuse8 f = XorBinaryFuse8.construct(keys);
            if (x % 100 == 0) {
                System.out.println("n=" + n + " " + f.toString());
            }
        }
    }

We would then have a maximum of around 30 re-tries at size 2100 .. 2116 (segment length), but no more than that. This is roughly in the middle of this segment length; we have 9 segments.

@lemire I wonder what is your opinion here...

lemire commented 2 years ago

@thomasmueller I tested both the C version as well as the Go version with n = 11511 and it succeeds but there are many more than 30 re-tries. I am not sure it is a source of concern.

We could update our constants.

thomasmueller commented 2 years ago

Ah, I think the Java version is worse because it uses a different "reduce" function (which uses only 32 bits, instead of 64 bits like in the C / C++ version). I will see if this can be fixed. And then I will to reduce the number of retries.

thomasmueller commented 2 years ago

One of the challenges is that Java still doesn't support an "unsigned 64-bit multiply high" -- see https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8188044 -- so we can't have exactly the same code in Java and C / C++, if we want to keep it at the current (high) performance.

However, I think I found constants that work well. Now I need to write a proper test case and clean this up.

lemire commented 2 years ago

@thomasmueller The link you offer suggest that this was fixed in Java 18.

thomasmueller commented 2 years ago

@lemire you are right it is fixed in Java 18 (I initially didn't understand this part sorry), but I think we should support older versions of Java as well.

I made a mistake and used a different hash function to search better constants... so I need to test again, and it will take a bit longer.

thomasmueller commented 2 years ago

OK, I believe it is now fixed. The main problem was actually this:

    // use a new random numbers
    seed++;

That was in no way correct... I replaced it with seed = Hash.randomSeed(); now. It was specially a problem if you use incremental keys, like you did in your test!

But I also changed the segment sizes a bit, because I found that construction requires less retries. Now segments tend to be a bit smaller, specially for small sets. And now a IllegalArgumentException is thrown if construction fails.

thomasmueller commented 2 years ago

@konghuarukhr Let me know if this resolves the problem!

konghuarukhr commented 2 years ago

I don't find any bad case now. 👍