explosion / preshed

💥 Cython hash tables that assume keys are pre-hashed
MIT License
82 stars 19 forks source link

Fix bloom size issues with Windows serialization and in general #38

Closed polm closed 1 year ago

polm commented 1 year ago

Before this commit, serialization used the array module, but the size of units in that varies depending on the host OS. As a result data serialized on Windows would use 32-bit units instead of 64-bit ones.

This commit uses struct for serialization instead of array, eliminating the ambiguity issue. Additionally, it detects if the serialized data used the Windows size, and loads it in that size if necessary, so old data will still work. (Even if the data was serialized into smaller containers on Windows no data was lost due to a separate issue, which will be fixed in a further commit.)

Note that this issue only affected the serialization code, and in memory containers were consistent on platforms already.

This is a draft until the further bloom size issues are addressed.

polm commented 1 year ago

I believe Windows is working now, though I'm leaving this a draft while I verify the result of serializing data on Windows.

Going through backwards compatability, one other issue came up - you could create a BloomFilter with size/hash function count params of 0. If the hash count is 0 add/contains checks do nothing. If size is 0 and hcount is not, then it will crash if you try to add anything (unsurprisingly). I added asserts to check for this.

I had one question come up while doing this - what's the right way to declare a const value in Cython? For some of the arithmetic I need the size of a container type in bits, so I'm assigning that at the top of the file.

polm commented 1 year ago

I verified the Windows data is correct on real Windows.

The test failures here are unexpected - it looks like there is a segfault only on 3.6 on Linux. I'll look into that.

polm commented 1 year ago

I am unable to reproduce the segfault locally, even with the exact Python version used in the tests (3.6.15). I am not sure what else could explain it, I'll see if I can figure something out.

polm commented 1 year ago

There's a few things I'd like feedback on, but this is functional and has no issues I'm aware of at this point.

polm commented 1 year ago

Tests were failing because the Linux image no longer has Python 3.6 available, so I have removed Python 3.6 from testing.

polm commented 1 year ago

Thanks for the feedback, it's good to have this reviewed! I think I have addressed all the points that have come up at this point. The details of the repacking are definitely complicated, so let me know if that's still not clear, whether from my explanation or in code.

polm commented 1 year ago

Thanks for the feedback! I think my most recent commits should 1. clarify what I meant by "signficant" 2. raise an exception instead of using an assert for version mismatch.

polm commented 1 year ago

I think it would definitely make sense to make this a minor release of preshed.