Closed arthurprs closed 3 years ago
Hey! Thank you for the pull request, this is awesome.
I'll be at work for the next ~8 hours, and then I'll be able to review properly. First a few ideas:
* This will need to be a major version bump; could you change the version to 2.0.0? And add yourself as an author on the library or somewhere in the README, if you want.
Sure!
* Changing the struct layout will likely break existing serde uses of the library. We should add some documentation about that near the top of the README. If you can modify the org-mode file that would be best, then export it. Otherwise I could do that, then export to markdown, then remove the org-mode file.
I'm not really following, so it's probably best if you do it.
* Would it be desirable to allow users to choose the seed?
That's probably not that useful, unless someone wants filters created from the same elements to differ for some reason.
* Would it be desirable to allow users to choose the hashing backend/strategy? (default to double hashing; trait + empty struct?)
I think these are doable; I'm just trying to keep the scope a bit more limited, so I picked something that should be fine for the vast majority of cases. We can optimize the serde output by not emitting the config values when they match the defaults.
* And would you mind adding some more documentation around the `TIGHTENING_RATIO`? Was that also referenced in the paper?
Yes, that's `r` in the paper.
* I don't mind changing the grow logic to be based on the true number of inserts, but when users insert a lot of items that hash similarly we may start growing the filter unnecessarily relative to the false-positive rate. Do you know if this strategy is popular with other bloom filters? I pulled `fill_ratio_gte` from the paper.
That won't happen, because we check whether the item exists before inserting. Theoretically you could check all the subfilters except the last and then blind-insert into the last one, but that's ultimately an implementation choice.
I think I addressed all your comments with the last couple of comments. But I realized there are some problems with determinism across platforms (e.g. endianness and 32/64-bit differences), which is bad when the filter is supposed to be (de)serializable.
I'll address that now.
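For cross-platform determinism, a common approach is to keep all index arithmetic in a fixed-width integer type (never `usize`) and pin the byte order when serializing. A minimal sketch of those two rules, with hypothetical helper names (not the crate's actual API):

```rust
// Sketch (assumed helpers, not the crate's code): two portability rules for a
// (de)serializable filter.
fn bit_index(digest: u64, num_bits: u64) -> u64 {
    // u64 arithmetic gives the same result on 32- and 64-bit targets,
    // unlike usize, whose width differs per platform.
    digest % num_bits
}

fn serialize_len(len: u64) -> [u8; 8] {
    // Explicit little-endian bytes are identical on any host endianness.
    len.to_le_bytes()
}

fn main() {
    assert_eq!(bit_index(0x0123_4567_89AB_CDEF, 1024), 495);
    assert_eq!(serialize_len(1024), [0, 4, 0, 0, 0, 0, 0, 0]);
    println!("index: {}", bit_index(0x0123_4567_89AB_CDEF, 1024));
}
```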
@arthurprs Published 2.0.0 :rocket:
Thank you so much for your incredible contribution! I really appreciated your patience, and it was neat going through my first real pull request. You have a great one!
Hello, I was using the crate in a project so I spotted and addressed a few shortcomings.
Problem: the `BitVec` serialization format is sub-optimal, which matters because the bit array dominates a bloom filter's serialized size. Solution: use a simple `Vec` and serialize it as bytes.
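A plain byte vector makes the payload trivially serializable as one compact byte string. A minimal, std-only sketch of the idea, with hypothetical names (not the crate's actual types):

```rust
// Sketch: a minimal bit set backed by Vec<u8>, so serializers can emit
// it as a contiguous byte string instead of a structured BitVec.
struct Bits {
    bytes: Vec<u8>,
}

impl Bits {
    fn new(num_bits: usize) -> Self {
        // Round up to whole bytes.
        Bits { bytes: vec![0u8; (num_bits + 7) / 8] }
    }
    fn set(&mut self, i: usize) {
        self.bytes[i / 8] |= 1 << (i % 8);
    }
    fn get(&self, i: usize) -> bool {
        self.bytes[i / 8] & (1 << (i % 8)) != 0
    }
}

fn main() {
    let mut b = Bits::new(20);
    b.set(3);
    b.set(17);
    assert!(b.get(3) && b.get(17) && !b.get(4));
    // The whole payload is just `b.bytes`, ready for byte-oriented formats.
    println!("{} bytes for 20 bits", b.bytes.len());
}
```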
Problem: Empty filter allocates Solution: Don't eagerly add a filter on new()
Problem: the growth and error-tightening behavior wasn't clear; it was sort of adding one slice per growth. Solution: use a 2x growth factor and a 0.8 tightening factor, which gives a ~5x upper bound on the false-positive ratio even after 1000x growth.
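The ~5x bound falls out of a geometric series: each growth tightens the per-slice error by r = 0.8, so the compound false-positive rate is at most e0 * (1 + r + r^2 + ...) <= e0 / (1 - r) = 5 * e0. A quick numeric check:

```rust
// Why tightening by 0.8 caps the compound false-positive rate at ~5x the
// initial target: the per-level error bounds form a geometric series.
fn main() {
    let r: f64 = 0.8;
    // Sum the first 1000 tightened error multipliers: 1 + r + r^2 + ...
    let bound: f64 = (0..1000).map(|i| r.powi(i)).sum();
    // Even after 1000 growths the multiplier is essentially 1/(1 - r) = 5.
    assert!((bound - 5.0).abs() < 1e-9);
    println!("fp multiplier after 1000 growths: {:.6}", bound);
}
```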
Problem: the hashing strategy was arbitrary. Solution: use the enhanced double hashing strategy, which is both simpler and mathematically correct.
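Enhanced double hashing (Dillinger and Manolios) derives the i-th probe as g_i = h1 + i*h2 + (i^3 - i)/6 mod m, computed incrementally from two base hashes. A sketch under assumed names, using std's `DefaultHasher` purely for illustration (its output is not guaranteed stable across Rust releases, so a real deterministic filter would pin a specific hash function):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Sketch of enhanced double hashing: k probe indices from two base hashes,
// equivalent to g_i = h1 + i*h2 + (i^3 - i)/6 mod m, computed incrementally.
fn probe_indices<T: Hash>(item: &T, k: u64, m: u64) -> Vec<u64> {
    let mut hasher = DefaultHasher::new();
    item.hash(&mut hasher);
    let h = hasher.finish();
    // Split one 64-bit digest into two base hashes.
    let mut x = (h >> 32) % m; // h1
    let mut y = (h & 0xFFFF_FFFF) % m; // h2
    let mut indices = Vec::with_capacity(k as usize);
    for i in 0..k {
        indices.push(x);
        x = (x + y) % m; // add the current increment
        y = (y + i + 1) % m; // increment grows by 1, 2, 3, ... -> cubic term
    }
    indices
}

fn main() {
    let idx = probe_indices(&"hello", 4, 1024);
    assert_eq!(idx.len(), 4);
    assert!(idx.iter().all(|&i| i < 1024));
    println!("{:?}", idx);
}
```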
Problem: a bit fill ratio of 0.8 results in a completely off false-positive ratio; bloom filters are built around a 0.5 fill factor. It's also not practical performance-wise. Solution: use an insert counter to estimate the fill factor, which is more or less the standard for performant implementations.
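With an insert counter, the expected fill after n inserts into m bits with k hashes is 1 - (1 - 1/m)^(kn), approximately 1 - e^(-kn/m), which crosses 0.5 at n = m * ln(2) / k; growth can therefore be triggered by a counter comparison instead of scanning bits. A sketch with hypothetical names (not the crate's actual functions):

```rust
// Sketch: estimate the bit fill ratio from an insert counter instead of
// scanning the bit array.
fn estimated_fill(n_inserts: u64, m_bits: u64, k_hashes: u64) -> f64 {
    // Expected fill after n inserts: 1 - e^(-k*n/m).
    1.0 - (-((k_hashes * n_inserts) as f64) / m_bits as f64).exp()
}

fn should_grow(n_inserts: u64, m_bits: u64, k_hashes: u64) -> bool {
    // Grow once the estimated fill reaches the 0.5 design point.
    estimated_fill(n_inserts, m_bits, k_hashes) >= 0.5
}

fn main() {
    let (m, k) = (1024, 4);
    // The 0.5 crossover: n = m * ln(2) / k, about 177 for these parameters.
    let n_half = (m as f64 * std::f64::consts::LN_2 / k as f64) as u64;
    assert!(!should_grow(n_half - 10, m, k));
    assert!(should_grow(n_half + 10, m, k));
    println!("grow after ~{} inserts", n_half);
}
```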
Not correctness related: