FastFilter / xor_singleheader

Header-only binary fuse and xor filter library
Apache License 2.0

SegmentLength max of 2^18 #39

Closed BaiqingL closed 2 years ago

BaiqingL commented 2 years ago

From the paper, I understood that the filter requires a segment length that is a power of two, but the following segment defines a maximum: https://github.com/FastFilter/xor_singleheader/blob/177cf03e8573519a86f4fb34a626b619947d431d/include/binaryfusefilter.h#L439-L441 Why define a max? Are there problems that arise when we go over this amount?

thomasmueller commented 2 years ago

I think the hash function that is used wouldn't work correctly with a larger segment length. The hash function produces 64 bits, and we use some of those bits for the segment. Larger segments would be possible, but then a different hash function (with more bits) would be needed.
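
For illustration, a rough sketch (not the library's exact code) of why the 64-bit hash caps the segment length: all three slot locations for a key are carved out of the same 64-bit hash, and with segments of $2^{18}$ entries each offset already consumes 18 of those bits.

    #include <stdint.h>

    // Illustrative sketch only: deriving three slot indices from one 64-bit hash.
    // With a segment length of 2^18, each offset consumes 18 bits (3 x 18 = 54),
    // and the remaining bits select the starting segment. A larger segment length
    // would force the offsets to reuse overlapping bits of the same hash, so a
    // wider hash function would be needed.
    typedef struct {
      uint32_t h0, h1, h2;
    } slot_indices;

    static slot_indices derive_indices(uint64_t hash, uint32_t segment_length,
                                       uint32_t segment_start) {
      const uint32_t mask = segment_length - 1; // segment_length is a power of two
      slot_indices s;
      s.h0 = segment_start + ((uint32_t)(hash >> 36) & mask);
      s.h1 = segment_start + segment_length + ((uint32_t)(hash >> 18) & mask);
      s.h2 = segment_start + 2 * segment_length + ((uint32_t)hash & mask);
      return s;
    }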

BaiqingL commented 2 years ago

Thanks. Would that mean there is a theoretical upper bound on the number of elements this implementation can support, before we pack in so much information that the filter becomes too "dense"?

lemire commented 2 years ago

The current implementation scales to billions of entries but it will eventually break down. How big is your input?

BaiqingL commented 2 years ago

A few million at most, so this wouldn't be an issue for me :)

But just out of curiosity, would there be a way for it to support a potentially infinite key space?

lemire commented 2 years ago

But just out of curiosity, would there be a way for it to support a potentially infinite key space?

First, the current implementation stores 64-bit input keys; that would need to change to support an arbitrary number of bits. Second, we use a simple and fast hash function. As @thomasmueller wrote... "Larger segments would be possible, but then a different hash function (with more bits) would be needed."

It is easy enough to produce a fast hash function that produces many, many bits... but it is best done by assuming that the underlying hardware has some cryptographic instructions (which is hard to do with generic C++ code). It is also easy enough to scale from 64 bits to a higher number of bits...

All these things are relatively easy...

This being said, there are practical reasons for not wanting arbitrarily large arrays on actual systems. It is likely that even if you can afford the RAM, you are going to need to configure your system for very large pages, otherwise your performance will be really bad (this is true of all filters, not just ours). And with NUMA... well...

So in practice, you'd almost surely want to break down the filters into small filters.

It is true that you can spin up a server with massive disk space and massive memory, but nobody uses these beasts for one monolithic data structure.

lemire commented 2 years ago

(Well, some people do silly things like this, but they get bad performance. They hire consultants. These consultants tell them to redesign.)

BaiqingL commented 2 years ago

Thanks for the explanation, appreciate all the help!

7PintsOfCherryGarcia commented 10 months ago

Hi, I didn't want to raise a new issue as this one may cover my question. I have a problem where I want to construct filters from very large sets of keys (> $2^{32}$). The published implementation limits the filter size to 32-bit integers. So my best option is to split my keys into multiple filters? Thanks for any help.

lemire commented 10 months ago

@7PintsOfCherryGarcia

The published implementation limits the filter size to 32-bit integers.

That's sufficient to take in 32 gigabytes of data, producing a filter that uses between 4 GB and 8 GB of storage.
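
As a rough back-of-the-envelope (assuming about 9 bits per key for the 8-bit binary fuse variant): $2^{32}$ keys $\times$ 8 bytes $=$ 32 GiB of input, while the filter itself needs roughly $2^{32} \times 9 / 8$ bytes $\approx$ 4.8 GB.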

We do not currently support larger filters... It is doable, but we would need use cases to motivate the work.

So my best option is to split my keys into multiple filters?

If you don't want to work on the implementation, then yes...

7PintsOfCherryGarcia commented 10 months ago

@lemire

We do not currently support larger filters... It is doable, but we would need use cases to motivate the work.

I am working on speeding up short read alignment of DNA sequences. My intention is to use binary fuse filters (bffs) to quickly query whether a given DNA sequence of length K (a kmer) is present in a set of kmers built from a large collection of sequences. Currently, for alignment of short DNA sequences (aka reads, ~100 DNA characters), a Burrows-Wheeler transform is constructed from the "reference" sequence one wants to align against. Afterwards, the positions where a DNA read might align in the reference sequence are computed using the BWT (because the short DNA sequence is (almost) a substring of the reference sequence).

In my specific case (environmental ancient DNA sequences), we want to use the entire collection of published genomes (~10 Tbp, $10\times10^{12}$ characters) as our "reference" sequence. In this case, most DNA reads do not align to most reference genomes. Querying a kmer from a read using bffs is much faster than trying (and failing) to find substring positions with BWTs. Thus, the vast majority of reads can be prefiltered before alignment.

If you don't want to work on the implementation, then yes...

I naively tried to adapt the published code to build larger filters by changing the necessary types to 64-bit integers. That did not work :). For now I will be splitting my data into several filters, although I will give it another try at some point to support larger filters.

lemire commented 10 months ago

That did not work :)

How many keys do you have?

My experience suggests that having just one enormous data structure is an engineering anti-pattern.

Breaking down the problem into partitions has several benefits. An obvious one is that construction is trivially parallelized… But it is also much easier to store and ship modest data structures. Debugging is likely easier too.

7PintsOfCherryGarcia commented 10 months ago

How many keys do you have?

For every N DNA characters I generate ~(N - K + 1) kmers (subsequences of length K; technically it is a bit less because of biology stuff), and each kmer is my key. I am trying to generate filters for at least 10 billion kmers.
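
For illustration, a minimal sketch of turning such kmers into 64-bit keys for a filter (base_code and kmers_to_keys are hypothetical names; it assumes K <= 32 so a kmer packs into 64 bits at 2 bits per base):

    #include <stddef.h>
    #include <stdint.h>

    // Map a DNA base to a 2-bit code; anything else (e.g. 'N') is rejected.
    static int base_code(char c) {
      switch (c) {
      case 'A': return 0;
      case 'C': return 1;
      case 'G': return 2;
      case 'T': return 3;
      default:  return -1;
      }
    }

    // Slide a window of length k over seq, packing each kmer into a 64-bit key.
    // Writes up to (len - k + 1) keys into out and returns how many were
    // produced; the window restarts after an ambiguous base.
    static size_t kmers_to_keys(const char *seq, size_t len, int k, uint64_t *out) {
      const uint64_t mask = (k < 32) ? ((1ULL << (2 * k)) - 1) : ~0ULL;
      uint64_t key = 0;
      size_t filled = 0, count = 0;
      for (size_t i = 0; i < len; i++) {
        int code = base_code(seq[i]);
        if (code < 0) { filled = 0; key = 0; continue; }
        key = ((key << 2) | (uint64_t)code) & mask;
        if (++filled >= (size_t)k) {
          out[count++] = key;
        }
      }
      return count;
    }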

I have implemented a chunked filter construction. Works ok I guess but querying takes quite long.

Thanks a lot for the time and the help :)

lemire commented 10 months ago

Works ok I guess but querying takes quite long.

I think you query n filters?

    for (uint32_t j = 0; j < n; j++) {
        if ( binary_fuse32_contain( key, &filters[j]) ) {
            found++;
            break;
        }
    }

I recommend that you create some reasonable hash function that maps each entry into one of n possible values and do...

    if ( binary_fuse32_contain( key, &filters[special_hash(key)]) ) {
        found++;
    }

The special_hash function only needs to be fast and to split your input roughly into n sets of the same size.
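
One way to write such a special_hash (a sketch, not part of the library; here it takes the number of filters n as a second argument): mix the 64-bit key, then reduce the result to one of n filters with a multiply-shift range reduction.

    #include <stdint.h>

    // Sketch of a special_hash: mix the 64-bit key (MurmurHash3 finalizer
    // constants), then map the mixed value to [0, n) with a multiply-shift
    // range reduction. Any fast mixer that spreads keys evenly would do.
    static inline uint32_t special_hash(uint64_t key, uint32_t n) {
      key ^= key >> 33;
      key *= 0xff51afd7ed558ccdULL;
      key ^= key >> 33;
      key *= 0xc4ceb9fe1a85ec53ULL;
      key ^= key >> 33;
      return (uint32_t)(((key & 0xffffffffULL) * (uint64_t)n) >> 32);
    }

The lookup then indexes filters[special_hash(key, n)] directly, and the same function routes each key to its filter at construction time, so every query touches exactly one filter.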

lemire commented 10 months ago

In case my comment is not clear, I mean that for a given key, you should query a single filter.