Callidon / bloom-filters

JS implementation of probabilistic data structures: Bloom Filter (and its derivatives), HyperLogLog, Count-Min Sketch, Top-K and MinHash
https://callidon.github.io/bloom-filters/
MIT License

Filters created with version 1.3.4 and exported to json do not work when imported to version 1.3.9 #49

Closed · stbrody closed this 2 years ago

stbrody commented 2 years ago

We use this library as part of the Ceramic Network. We publish batches of work done on the network along with a JSON-serialized bloom filter, to make it easier to look up what happened in each batch. Our hosted service that creates these batches and publishes the bloom filters appears to be using version 1.3.4 of the bloom-filters package. Importing those bloom filters with the current 1.3.9 version of the library causes the .has() method of the reconstituted bloom filter to always return false, even for entries that are definitely present in the filter. Reverting to version 1.3.4 and rebuilding the bloom filter from the same input data makes it behave correctly. It seems there was a format change somewhere between version 1.3.4 and version 1.3.9 that causes filters created on the old version, exported with saveAsJSON(), and then imported with fromJSON() on the new version to be reconstructed incorrectly.

Code that builds the filter: https://github.com/ceramicnetwork/ceramic-anchor-service/blob/9ff9e1a20e46c65036fcbed600a0abf1b09d0bec/src/merkle/merkle-objects.ts#L247-L249

stbrody commented 2 years ago

This is pretty severe and critical for us. We rely on the bloom filters we export to JSON remaining valid and usable for the foreseeable future. If you are changing the internal representation of the filters in a way that is not backwards-compatible with the JSON serialization from old versions, please add a versioning system to the JSON serialization and continue to support old formats!

folkvir commented 2 years ago

The problem comes from fix #36: hashes are computed differently, so backward compatibility is not maintained between the two versions. We should have bumped the major version. Sorry for the inconvenience.

@Callidon do you agree with this analysis?

folkvir commented 2 years ago

Because of the hashing, getDistinctIndices could, in your version, cause an infinite loop in some circumstances. That behavior was removed in version 1.3.7.

As a replacement we could add a BloomFilterV136 class containing all the old code, for backward compatibility. But such a solution would have to be maintained and carried across all new versions, with the attendant risk of bugs. Moreover, since getDistinctIndices and double hashing changed in later versions, every structure using them has the same problem you encountered, so we would have to ship *VXX classes for every breaking change. We surely won't do that.
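The incompatibility described here can be shown with a toy sketch. The two index functions below are illustrative stand-ins, not the library's real implementations: the point is simply that when the writer and the reader derive bit indexes differently, a lookup probes the wrong bits and `has()` returns false.

```javascript
// Toy illustration of the root cause: the filter is written with one
// index-derivation scheme and queried with another.
// These functions are stand-ins, NOT the library's real code.
const SIZE = 64;
const HASHES = 3;

// "Old" scheme: classic double hashing, h1 + i * h2.
function oldIndexes(h1, h2) {
  const out = [];
  for (let i = 0; i < HASHES; i++) out.push((h1 + i * h2) % SIZE);
  return out;
}

// "New" scheme: same inputs plus a quadratic term (in the spirit of
// enhanced double hashing), so it lands on different bits.
function newIndexes(h1, h2) {
  const out = [];
  for (let i = 0; i < HASHES; i++) out.push((h1 + i * h2 + i * i) % SIZE);
  return out;
}

// Fixed stand-ins for the two hash values of some element.
const h1 = 5;
const h2 = 9;

// A filter written with the old scheme sets bits {5, 14, 23}...
const bits = new Array(SIZE).fill(false);
for (const i of oldIndexes(h1, h2)) bits[i] = true;

// ...so querying with the old scheme succeeds, but the new scheme
// probes {5, 15, 27} and misses.
const hasOld = oldIndexes(h1, h2).every((i) => bits[i]); // true
const hasNew = newIndexes(h1, h2).every((i) => bits[i]); // false
console.log(hasOld, hasNew);
```

This is why no amount of fixing the serialization format alone can help: the stored bits themselves were placed by the old index function.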

For the moment I recommend staying on 1.3.4 in your situation. If you really want to use the newer versions, you will have to recreate all your filters from scratch.

I will add a note to the README about compatibility problems between versions before releasing 2.0.0.

stbrody commented 2 years ago

> For the moment I just recommend to use the 1.3.4 in your situation

Then we're stuck on 1.3.4 forever. We want to upgrade to the new version and take advantage of the bug fixes and perf improvements you all are making. We create new filters several times a day and would be happy for newly created filters to always be on the most up-to-date version with all the latest fixes and improvements. The problem is that we can't lose the ability to load previously created filters. These bloom filters are being written to a blockchain, so the history is immutable: there's no way to recreate and upgrade old filters after they are made and exported.

The ideal solution I'd like to see is to add a _version field to the JSON that is exported when calling filter.saveAsJSON() (alongside the _filter, _length, _seed, etc. fields), updated every time there is a breaking format change. Then, when loading a filter with BloomFilter.fromJSON, it would automatically select the correct implementation for that version. Yes, this would mean keeping support for old versions around in the code base, but since they would never change, I would expect the maintenance cost to be very low to nonexistent.
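A minimal sketch of that proposal, kept application-agnostic. All names here are hypothetical (neither `fromVersionedJSON` nor the `loaders` map exists in the bloom-filters library), and the per-version loaders are placeholders for real reconstruction logic:

```javascript
// Hypothetical version-aware import. loadV1/loadV2-style entries stand
// in for per-version reconstruction logic (old vs. new index function).
const loaders = {
  1: (json) => ({ version: 1, filter: json._filter /* old index fn */ }),
  2: (json) => ({ version: 2, filter: json._filter /* new index fn */ }),
};

function fromVersionedJSON(json) {
  // Filters exported before the scheme existed carry no _version field;
  // treat them as version 1.
  const version = json._version ?? 1;
  const loader = loaders[version];
  if (!loader) {
    throw new Error(`Unsupported bloom filter format version: ${version}`);
  }
  return loader(json);
}

// Exported JSON would gain a _version field alongside _filter, _seed, etc.
const exported = { _version: 2, _filter: [1, 0, 1], _seed: 78187493520 };
const restored = fromVersionedJSON(exported);
console.log(restored.version); // 2
```

Even if old versions are never fully supported, the unknown-version branch turns a silent wrong answer into a loud, actionable error.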

I'd be happy to put together a PR to do this if this approach is acceptable to you all.

Callidon commented 2 years ago

Hi 👋

After discussion with @folkvir, we've agreed that it was a mistake on our side to tag the changes to the hash algorithm as a minor version. It should have been tagged as a major version, due to the breaking changes to JSON serialization. I apologize for the error and the inconvenience it caused. Since the next release is going to be a major one, this particular issue will resolve itself.

However, I'm not in favor of implementing backward compatibility, because the changes we made to the hashing algorithm are important fixes for bugs that prevented some filters from working properly. So, while I agree that your case is problematic, I don't want to support a version of the library that uses an incorrect implementation.

Therefore, and I'm sorry for that, I strongly disagree with your proposed solution. You can implement it in your application if you wish, and that may be a good idea, but I am against implementing it in the bloom-filters library. However, if you have another idea for resolving this issue, feel free to propose it!

stbrody commented 2 years ago

Okay, we will work around this on our side then.

In the meantime, would you mind doing a 1.3.10 release that restores compatibility for importing data created with version 1.3.4, before you do the 2.0.0 release with the bug fix and breaking format change? That way the old major version sequence continues to be backwards-compatible, and we don't risk upgrading to a version that breaks compatibility.

folkvir commented 2 years ago

Let me try something, then I'll ask you to try the solution on your side. If it works, we will discuss a possible integration with @Callidon.

folkvir commented 2 years ago

@stbrody Back, I pushed something to the develop branch. The package will be 2.0.0 with or without the fix. As I said, we will discuss with @Callidon whether to integrate it. Meanwhile, you can try the develop branch.

The new BloomFilter uses a new internal structure, a bit set, to store the data.

See (https://github.com/Callidon/bloom-filters/blob/develop/test/compatibility-test.js) for a complete compatibility test.

stbrody commented 2 years ago

Hmm... @folkvir, if I'm understanding this right, it looks like I still need to know whether my filter was created with the old version, so I know whether to switch to the old implementation of getIndexes. If the library doesn't handle version detection and compatibility internally, then my application still needs to keep track of which version of the library was used each time a filter is created, and then at read time select the appropriate implementation based on the version in use when the filter was first created.

The fix you added does make things a little easier to work with, since I can switch filter versions on the fly without having to re-install the package, which is nice. But the bigger problem is having to do the version detection and handling of old versions ourselves.

folkvir commented 2 years ago

Yes, exactly: you need to handle the versioning yourself. The problem is that the changes to how indexes are generated cannot be backward compatible. This affects not only the BloomFilter but every class that used the old getDistinctIndices function in versions <1.3.7. Moreover, we have no clue which version of the package produced a given exported filter. So the only way to keep using the pre-1.3.7 system is to manually record the package version when the filter is created, and then use _deprecated_getDistinctIndices instead of _getIndexes right after the import.

@Callidon and I are against including a versioning system, because of this lack of information and because of the future code proliferation. But we are OK with pseudo-backward compatibility for the BloomFilter import, because of the optimizations made to the internal storage in the develop branch. With the old code added to the BaseFilter class, it should be enough for people to use the old index generation while benefiting from the new internal structure.

stbrody commented 2 years ago

I wonder if a reasonable compromise would be to include the version used in the JSON metadata when exporting a filter, without necessarily committing to automatically supporting all previous versions. You could then at least throw a useful error when someone tries to re-create a filter from a version that is no longer supported, rather than silently returning incorrect data.

That error could then be a signal to a dev to swap in _deprecated_getDistinctIndices.

Callidon commented 2 years ago

👋 Hi @stbrody We've finally released the 2.0.0 version of the package. You should now have backward compatibility for import/export. I invite you to check out the new version, and more precisely this test case, which should help you here. Let me know if it improves things in any way.

stbrody commented 2 years ago

Thanks for letting me know @Callidon. We've decided, though, to record the exact package version number alongside every bloom filter we create, so that when we need to read them later we can always use the exact version they were created with. I don't think the approach implemented here really helps us, as you still need to know which version the filter was created with so that you know whether to install the old index function. If we already have to track which version was used to create each filter and change behavior based on it, we might as well go all the way and use the exact version that matches how it was created.
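The application-side workaround described above can be sketched as a simple envelope that stores the package version next to each serialized filter. The names here (`wrapFilter`, `unwrapFilter`, `bloomFilterVersion`) are illustrative, not Ceramic's actual implementation:

```javascript
// Application-side envelope: record the exact bloom-filters package
// version used to create each filter next to its serialized form.
// All names here are hypothetical.
const PACKAGE_VERSION = '1.3.4'; // e.g. read from package.json at build time

function wrapFilter(serializedFilter) {
  return {
    bloomFilterVersion: PACKAGE_VERSION,
    filter: serializedFilter,
  };
}

function unwrapFilter(envelope) {
  // At read time, the recorded version tells the application which
  // library version (or index function) to use for reconstruction.
  return {
    version: envelope.bloomFilterVersion,
    filter: envelope.filter,
  };
}

const stored = wrapFilter({ _filter: [0, 1, 1], _seed: 78187493520 });
const { version } = unwrapFilter(stored);
console.log(version); // '1.3.4'
```

Because the envelope is written at creation time, it survives even when the payload itself is immutable, as with filters anchored on a blockchain.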

Callidon commented 2 years ago

Okay, thanks for your feedback. I'm closing the issue, since we have each fixed it on our own side. Feel free to re-open it if you find any more issues.