Callidon / bloom-filters

JS implementation of probabilistic data structures: Bloom Filter (and its derivatives), HyperLogLog, Count-Min Sketch, Top-K and MinHash
https://callidon.github.io/bloom-filters/
MIT License

Import/export from bytes #31

Closed dholms closed 3 years ago

dholms commented 3 years ago

Thank you for this library!

In our use case, we need to export & store bloom filters as byte arrays as well as re-import them as bloom filters to add additional elements.

I've added functions/tests for doing both.

This does change the API of the constructor, as it takes the filter as an array of bits instead of a size. However, I added an additional static function empty that replicates the functionality of the existing constructor.

I also had to remove the _length check from equals since if we import from a byte array, we don't know how many objects are in a bloom filter.

We're currently only using the bloom-filter implementation, so I only made the changes in that location. However, if you are interested, I can make the changes to the rest of the implementations so that the APIs match. Let me know! :grin:

folkvir commented 3 years ago

Hi, thank you for the PR; it's interesting, but you should not modify the original constructor. Otherwise you create breaking changes, which would require a new major version. Secondly, we want to keep each filter's original parameters in its own constructor. @Callidon, do you agree? For this to be accepted, specific import/export methods need to be created or modified instead.

dholms commented 3 years ago

I would also like to avoid adjusting the constructor.

With the current code, the proposed changes are impossible. I believe we need to do one of the following:

1) change the constructor as I proposed
2) add a setFilter method that overrides the _filter field (or make that field public)
3) create a new class (PersistedBloomFilter, for instance) that extends BloomFilter and takes a filter in the constructor

Downsides of each:

1) breaking changes, major version bump
2) leaky abstraction: the internal state of the filter can get out of sync
3) class proliferation, which seems a bit like overkill

If you have another idea, please do let me know!

How do you feel about removing the _length field on the filter? From a data structure perspective, a bloom filter does not track the number of items within it.


Right now, the package only works if the filter is kept in memory and never stored anywhere. These changes allow the filter to be stored as a byte array and then reloaded into memory. Perhaps that added feature, which is actually fairly substantial for what it allows, is worth a major version bump?
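As a rough sketch of the round trip we have in mind (the helper names bitsToBytes / bytesToBits are hypothetical, not part of this PR or the library's API):

```javascript
// Pack an array of bits (0/1 numbers) into a Uint8Array, and unpack it back.
// This is the kind of transformation needed to store a filter as raw bytes.
function bitsToBytes(bits) {
  const bytes = new Uint8Array(Math.ceil(bits.length / 8))
  bits.forEach((bit, i) => {
    if (bit) bytes[i >> 3] |= 1 << (i & 7) // set bit i in byte i/8
  })
  return bytes
}

function bytesToBits(bytes, size) {
  // size must be supplied: the byte array alone does not record
  // how many bits the filter originally had.
  const bits = new Array(size)
  for (let i = 0; i < size; i++) {
    bits[i] = (bytes[i >> 3] >> (i & 7)) & 1
  }
  return bits
}
```

Note that the unpacking step already needs out-of-band information (the bit size), which is part of why the constructor or a static factory has to be involved.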

Let me know your thoughts. I am happy to contribute & get these changes in. But if not, we can maintain the fork for our purposes. Thanks :slightly_smiling_face:

Callidon commented 3 years ago

Hi,

Thanks, @dholms for your contribution. It's an interesting feature.

However, I share the same opinion as @folkvir on this PR. You should not modify the original constructor, because people already using the framework would experience a major breaking change when upgrading. And since we now have a significant user base, we need to be very careful with these changes.

Additionally, I really don't like the changes you propose to the default constructor. To me, a default constructor exists to create an empty data structure that is later populated by the application; that is how most probabilistic data structures are designed. With your changes, the constructor creates a filter from an array of numbers, which is not the standard way of creating a Bloom Filter. For this exact purpose, we created import methods for various classes across the package.

Right now, the package only works if the filter is kept in memory and never stored anywhere.

This is incorrect because the package provides a way for exporting/importing all filters to JSON. And several users rely on this feature since it was added following a feature request from our users.

I don't understand why you need these changes to implement an export/import feature. Please, can you elaborate on that point?

dholms commented 3 years ago

Hi @Callidon thanks for the response!

Understood re: breaking changes. Certainly good to avoid.

It appears that to/from JSON is not implemented for bloom filters yet. But after looking at the implementations for the other filter types, I discovered that private variables can be modified within a static method, which I hadn't realized :sweat_smile:

That means we can add the fromBytes static method without changing the constructor. I've pushed the changes that demonstrate this.

The only change this still requires is we have to remove length from the equals check. This library originally did not track length, and this was added in the last few months. Personally I think it is okay to not include it on an equals check, but I'm interested to hear your perspective.

A bloom filter does not track the number of elements within it (although that can be helpful information to track alongside the filter itself, so I understand why a library might do it). But two bloom filters that have the same bits set, with a different number of elements added, are technically "equal".
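To illustrate with a toy model (a hand-rolled stand-in, not the library's implementation, and with made-up hash positions): adding a duplicate element changes a tracked count but leaves the bit array untouched.

```javascript
// Minimal toy filter: a bit array plus a tracked count of add() calls.
function makeToyFilter(size) {
  return { bits: new Array(size).fill(0), count: 0 }
}

// "Add" an element by setting the bits at its (pretend) hash positions.
function toyAdd(filter, indices) {
  indices.forEach(i => { filter.bits[i] = 1 })
  filter.count += 1
}

const a = makeToyFilter(8)
const b = makeToyFilter(8)
toyAdd(a, [1, 4])  // pretend 'meow' hashes to positions 1 and 4
toyAdd(b, [1, 4])
toyAdd(b, [1, 4])  // duplicate add: same bits set again, but count differs
// a.bits and b.bits are now identical, yet a.count = 1 and b.count = 2,
// so a count-based equals() would call these two filters unequal.
```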

A bit about our use case: we pass around bloom filters as base64 strings. These filters (or their hash) serve as "names" for objects in our system. So we need to be able to go to & from bytes, and we don't want to serialize additional information (such as the number of elements in the filter) into that bytestring.
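For example, in Node.js the packed bytes can round-trip through base64 like this (a sketch with hypothetical helper names, not code from this PR):

```javascript
// Encode a filter's packed bytes as a base64 string, so the string can
// serve as a compact, self-contained "name" for an object.
function toBase64(bytes) {
  return Buffer.from(bytes).toString('base64')
}

// Decode the base64 string back into raw bytes for re-import.
function fromBase64(str) {
  return new Uint8Array(Buffer.from(str, 'base64'))
}
```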

Let me know if that makes sense & if these changes are interesting to you now that we do not have to change the filter constructor.

folkvir commented 3 years ago

Tracking of length was implemented 4 years ago (https://github.com/Callidon/bloom-filters/commit/2f74a5c3ef1e391ef6d05c2abb0b88401d6ccf25), not added a few months ago.

And the import/export methods work as expected and are tested:

const { BloomFilter } = require('bloom-filters')

const bf = BloomFilter.create(10, 0.01)
bf.add('meow')

console.log(bf)
const json = bf.saveAsJSON()
console.log(json)
const bf2 = BloomFilter.fromJSON(json)
console.log(bf2)
console.log(bf.equals(bf2))

Callidon commented 3 years ago

Thanks, @dholms for your feedback.

The only change this still requires is we have to remove length from the equals check. This library originally did not track length, and this was added in the last few months. Personally I think it is okay to not include it on an equals check, but I'm interested to hear your perspective.

The decision behind the length check is the following: since Bloom Filters are append-only data structures, if we track how many items have been added to each filter, then we can quickly compare two bloom filters without comparing their content. On large filters, this can save a huge amount of time.
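A minimal sketch of that rationale (hypothetical object shape, assuming each filter tracks a length alongside its bit array):

```javascript
// Compare two filters, using the tracked item count as an O(1) fast path
// before falling back to a full scan of the bit arrays.
function fastEquals(a, b) {
  // fast path: append-only filters with different item counts cannot be equal
  if (a.length !== b.length) return false
  // full scan only when the counts match
  return a.bits.length === b.bits.length &&
    a.bits.every((bit, i) => bit === b.bits[i])
}
```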

After looking at your code, it occurred to me that what you are proposing is not really an export of Bloom Filters in bytes format, but rather an export of their content in bytes format. If you were exporting the complete bloom filter, you could embed information about its length and its number of hash functions in the payload, using a custom format similar to B-encode. Indeed, in your fromBytes function, you need to pass the number of hash functions as a parameter, so it's not a pure export of the filter.
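A sketch of what such a self-describing export could look like (hypothetical format and helper names, loosely following the idea above, not code from this PR):

```javascript
// Prefix the packed bits with a small fixed header carrying the filter's
// bit size and number of hash functions, so import needs no extra parameters.
function exportFilter(size, nbHashes, packedBits) {
  const header = new Uint8Array(8)
  const view = new DataView(header.buffer)
  view.setUint32(0, size)      // bytes 0-3: bit size of the filter
  view.setUint32(4, nbHashes)  // bytes 4-7: number of hash functions
  const out = new Uint8Array(8 + packedBits.length)
  out.set(header, 0)
  out.set(packedBits, 8)
  return out
}

function importFilter(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset)
  return {
    size: view.getUint32(0),
    nbHashes: view.getUint32(4),
    packedBits: bytes.slice(8),
  }
}
```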

For me, I'm okay with the proposed changes, but with the following modification:

This way, you will first need to instantiate an empty Bloom Filter, then load its content from a bytes export. It does not change how you use the functionality as currently implemented in the pull request, since you need to know the size of the filter and the number of hash functions anyway.

I also have a small remark on the test: you only put the actual assertion in the it block, so if the export/import methods throw, the exception occurs in the describe block. I don't think mocha will appreciate this situation, and the error reporting might be odd.

Sorry if this appears a bit picky, but I want to keep the API clear and concise, and avoid breaking changes.

What do you think about it? @dholms @folkvir

matheus23 commented 3 years ago

Hello @Callidon and @folkvir, colleague of @dholms here.

Thanks for looking at this PR and having a conversation about the API.

I think there's some discrepancy between the intended use case for this library and the way we use bloom filters.

For the use cases you intended bloom-filters for, performance is of the greatest importance, so having the equality check short-circuit when the number of added items doesn't match is desirable.

For us, we're using bloom filters in a context where privacy is the highest priority. We want to leak as little information as possible about what was put into the bloom filters. Storing the number of elements that were put into the bloom filter would leak more information than we'd be comfortable sharing.

Apart from that, we're implementing a specification that only specifies storing the bloom filter's byte contents, not the number of elements added. When we import bloom filters from a source we have no control over, which doesn't provide any information about the number of elements in them, we simply can't use the API this library exposes right now.


I think at this point it makes the most sense for us to maintain a fork of your library with a couple of hacks to ignore the length parameter in equality checks. We'll most likely trade some equality-check performance (in the case where the compared filters have had different numbers of elements added) for an API that doesn't require exporting and importing element counts.


I don't actually have the rights to close this PR, so feel free to do that. I hope you understand where we're coming from, but if you have any more questions about our use case, I'm subscribed to this PR and happy to answer them. :)

folkvir commented 3 years ago

Well, @dholms @matheus23, thank you for the details. After reflection, the following should be enough for you to override the structure as you need:

/**
 * Set the length of the bloom filter
 * @param length the length to set
 */
set length(length: number) {
  this._length = length
}

/**
 * Set the filter used by this Bloom filter
 * @param filter the filter to set, of type number[]
 */
set filter(filter: number[]) {
  this._filter = filter
}

/**
 * Getter of the filter
 * @return the filter used by this bloom filter
 */
get filter(): number[] {
  return this._filter
}

The rest is application specific and should not be added to the package. As @dholms pointed out in option 2) of https://github.com/Callidon/bloom-filters/pull/31#issuecomment-848073335, this allows your application to override either the filter or the length, and therefore to change the filter to/from a byte array and set the length attribute to any convenient value. There is no problem with adding this, because anyone can already change the value of the array or the length stored inside the JSON returned by saveAsJSON before reimporting it.
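A self-contained sketch of how an application could use such setters (a standalone stand-in class for illustration, not the library's BloomFilter):

```javascript
// Stand-in class exposing the same setter pattern: the application can
// swap in a bit array decoded from raw bytes and reset the length.
class FilterLike {
  constructor() {
    this._filter = []
    this._length = 0
  }
  set filter(filter) { this._filter = filter }
  get filter() { return this._filter }
  set length(length) { this._length = length }
}

const f = new FilterLike()
f.filter = [1, 0, 1, 1]  // bits reconstructed from an external byte export
f.length = 0             // true count is unknown; pick any convenient value
```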

The problem is larger than this: if you want more fine-grained behavior, the current visibility of the properties limits what you can do, and extending the BloomFilter class won't currently work. We will discuss it later with @Callidon. I'll open an issue to track this and close the PR.