droundy / tinyset

Compact sets in rust
Apache License 2.0
53 stars 8 forks source link

Questions about new APIs #8

Closed ryzhyk closed 4 years ago

ryzhyk commented 4 years ago

Hi @droundy, apologies about lack of progress on the issue in the internment crate. Like you, I am overwhelmed these days :(

I have two questions about recent changes to this crate.

First, I have so far been happily using 0.2.2 and only tried upgrading to the latest version today and ran into an API regression: SetU64 now contains a raw pointer and is no longer Sync, which is a problem for my code. Would it make sense (and be safe) to add unsafe impl Sync for it?

Also, I notice big changes since 0.2 in terms of lines of code, so I'm just curious what prompted the changes. Are you aware of any correctness issues in 0.2? Or was this refactoring mostly about performance/code quality?

Thanks!

droundy commented 4 years ago

I'm a little vague on the requirements for sync. I imagine it's sound to do so, but would require some research time to confirm that before adding the impl. I'll get to the second question some time after snack time.

droundy commented 4 years ago

Okay, I've read up on sync and send and am confident that Set64 and friends can implement both.

droundy commented 4 years ago

Regarding the major rewrite, it was originally motivated by finding a bug in the old implementation, and realizing that that implementation was so complicated that it would be hard to ensure that there were no similar bugs.

The new implementation is simpler in that there are fewer different cases. It also uses 1/3 the size (as I vaguely recall) for empty or small sets. It is generally as fast or more often faster. The memory use is a bit harder to predict, as it now optimizes for numbers being close together more than for them simply being small. So there are cases where heap use increases, but usually it'll be about the same or significantly decrease.

We now use bit vectors for a sufficiently dense sets, and use hash maps to small bit vectors in the normal (large set) case. This means that if you have sequential sequences in your set then you'll get much better memory use.

The raw pointer allows us to use pointer tagging to keep the minimum size down to a single pointer, while still enabling quite a few small sets to be stored in just that one word. If todayt were clever enough with enum optimization this could possibly be done in safe rust, but it's not.

ryzhyk commented 4 years ago

Thanks for the detailed explanation. 3x improvement is extremely impressive, given how space efficient (in my experience) the old implementation already was.

droundy commented 4 years ago

I think this is closed now.