FastFilter / xor_singleheader

Header-only binary fuse and xor filter library
Apache License 2.0
348 stars 32 forks source link

Probability of filter construction failure #38

Closed BaiqingL closed 2 years ago

BaiqingL commented 2 years ago

Has there been profiling done to estimate the probability of filter construction failure? I'm not an expert so I'd thought to ask :)

BaiqingL commented 2 years ago

Actually, the reason I ask that is when I attempt to construct a filter with ~5000 keys, it hangs in the binary_fuse16_populate function and gets stuck somewhere within the for (int loop = 0; true; ++loop) part.

lemire commented 2 years ago

As the README states, it should not fail in practice unless you have many duplicated values.

You can try to construct 1000000 5000-long filters and see how often binary_fuse16_populate succeeds. It should be close to 100%.

What happens if you run the following function...?

  // we construct many 5000-long input cases and check the probability of failure.
  size_t size = 5000;
  uint64_t *big_set = (uint64_t *)malloc(sizeof(uint64_t) * size);
  binary_fuse16_t filter;
  binary_fuse16_allocate(size, &filter);
  size_t failure = 0;
  size_t total_trials = 1000000;

  for(size_t trial = 0; trial <= 1000; trial++) {
    for (size_t i = 0; i < size; i++) {
      big_set[i] = rand() + (((uint64_t) rand()) << 32);
    }
    if(!binary_fuse16_populate(big_set, size, &filter)) {
      failure++;
    }
  }
  printf("failures %zu out of %zu\n\n", failure, total_trials);
  free(big_set);

gets stuck somewhere within the for (int loop = 0; true; ++loop) part.

It should not get stuck because the loop variable breaks the loop once it reaches 100. In the worst case, it should end with the stderr message "Too many iterations. Are all your keys unique?"

https://github.com/FastFilter/xor_singleheader/blob/34f91c9a721fc7c494b5dd3bc1c51d3364e3a90a/include/binaryfusefilter.h#L524

Can dump the 5000 keys and propose a reproducible test case?

Please make sure that your 5000 keys are distinct. That is, you should provide a set. We tolerate a few repeated values, but if you have too many then you should go back and prune the duplicates.

The expectation is that you have a set of values, you construct 64-bit hash values, and then you want to construct a filter out of the result.

BaiqingL commented 2 years ago

After some testing I think it's actually due to my improper use of this code in a C++ project that is creating some issues. I want to quickly ask if given C++ capabilities like smart pointers and vectors etc, what would the following variables look like in binary_fuse16_populate look like?

uint64_t *reverseOrder = (uint64_t *)calloc((size + 1), sizeof(uint64_t));
uint32_t capacity = filter->ArrayLength;
uint32_t *alone = (uint32_t *)malloc(capacity * sizeof(uint32_t));
uint8_t *t2count = (uint8_t *)calloc(capacity, sizeof(uint8_t));
uint8_t *reverseH = (uint8_t *)malloc(size * sizeof(uint8_t));
uint64_t *t2hash = (uint64_t *)calloc(capacity, sizeof(uint64_t));
lemire commented 2 years ago

The xor_singleheader library is a C library (which will work in C++). It is not clear how you can just make C++ substitution under the hood. I would discourage it.

Maybe you are under the impression that C++ code is superior to C. I would disagree with such a sentiment.

You can, of course, wrap up our C code in C++...


class BinaryFuseSingle {
public:
    binary_fuse8_t filter; // let us expose the struct. to avoid indirection
    explicit BinaryFuseSingle(const size_t size) {
        if (!binary_fuse8_allocate(size, &filter)) {
            throw ::std::runtime_error("Allocation failed");
        }
    }
    ~BinaryFuseSingle() {
        binary_fuse8_free(&filter);
    }
    bool AddAll(const uint64_t* data, const size_t start, const size_t end) {
        return binary_fuse8_populate(data + start, end - start, &filter);
    }
    inline bool Contain(uint64_t &item) const {
        return binary_fuse8_contain(item, &filter);
    }
    inline size_t SizeInBytes() const {
        return binary_fuse8_size_in_bytes(&filter);
    }
    BinaryFuseSingle(BinaryFuseSingle && o) : filter(o.filter)  {
        o.filter.Fingerprints = nullptr; // we take ownership for the data
    }
private:
    BinaryFuseSingle(const BinaryFuseSingle & o) = delete;
};
BaiqingL commented 2 years ago

It's just that the previous filter implementation is a bloom filter and the entire project was already written in C++, so it's easier for me to integrate the filter code into something that looks more like it using vectors and smart pointers rather than keeping everything strictly in C. I will take a look at the wrapper, thanks!

BaiqingL commented 2 years ago

I actually found where it hangs, it hangs during while (reverseOrder[startPos[segment_index]] != 0) within the populate function. @lemire Could you please help me explain what does this part of the code does? Thanks!

lemire commented 2 years ago

Please see page 5 in the paper...

https://arxiv.org/pdf/2201.01174.pdf

Bottom of the page, first element of the enumeration.

lemire commented 2 years ago

Note that we have published a C++ version of this code… https://github.com/FastFilter/fastfilter_cpp/blob/master/src/xorfilter/3wise_xor_binary_fuse_filter_lowmem.h

This being said, you are probably failing to properly zero your arrays.

BaiqingL commented 2 years ago

Note that we have published a C++ version of this code… https://github.com/FastFilter/fastfilter_cpp/blob/master/src/xorfilter/3wise_xor_binary_fuse_filter_lowmem.h

This being said, you are probably failing to properly zero your arrays.

You're completely right, I found out after the code loops once reverseOrder seems to still have results.

lemire commented 2 years ago

You might want to zero it.

BaiqingL commented 2 years ago

Yup! I was able to fix it. Thanks!