Chaining hash table implementation

amari commented 5 years ago

Intrusive hash table implementation that supports:

collision resolution via chaining (Adapter<Link = LinkedListLink>)
run-time configurable load factor
~~bucket usage bitmap~~
Entry API
zero-alloc insertions
allocating/resizing insertions

There are a few utility traits:

Array for implementing the bucket index and the metadata.
DynamicArray: Array for implementing dynamic allocation / resizing.
~~Bits~~
~~BitsArray~~

Dynamic allocation is entirely optional and can be implemented via the DynamicArray trait.

Amanieu commented 5 years ago

Nice work! I'm quite busy this week and won't have time to perform a full review, however I will try to find some time next for it.

Amanieu commented 4 years ago

@amari Are you planning on continuing work on this? One of the major concerns (doubly vs singly linked list) is now resolved thanks to #46.

Amanieu commented 4 years ago

I would like to make one major change regarding the Array trait. I think we should only have a single Array trait, with these methods:

trait Array<T>: DerefMut<[T]> {
    /// Hint to the array that the hash table is about to insert additional elements.
    /// This function can return a new array to tell the hash table to rehash into the new array.
    fn reserve(&mut self, len: usize) -> Option<Self>;

    /// Hint to shrink the capacity of the array if possible.
    /// This function can return a new array to tell the hash table to rehash into the new array.
    fn shrink_to_fit(&mut self, len: usize) -> Option<Self>;
}

The key change here is that the Array is now responsible for tracking the load factor of the table and choosing when to rehash.

The implementations of Array for [T; N] will have no-op implementations for both of these methods: a fixed-size array will never re-hash and doesn't care about the load factor.

For dynamic arrays which track the load factor we will provide a DynamicArray<T> type (we can't just use Vec<T> or Box<[T]> because they don't have a load factor). This type will control the load factor and resize as needed.

amari commented 4 years ago

Yes, I plan on finishing this later this week.

amari commented 4 years ago

I've been ruminating about the design and came to the conclusion that it is much cleaner to treat HashTable as a "N+1 degree" data structure, whereas SinglyLinkedList would be a "N degree" data structure.

This means that HashTableAdapter does not implement Adapter because a HashTable never deals with link operations directly, and instead delegates to the bucket type.

I wanted to get the current design reviewed. Also, test coverage is 93.2%, so check the tests for example usage.

Outstanding issues

[ ] Create and document examples.
[x] HashOps implementations must be generic over their value argument, otherwise the compilation fails (the compiler claims that HashOps<A::KeyOps::Key> is not implemented for IntegerIdentityHashOps when A::KeyOps::Key = u32). This is due to the HRTB A::HashOps: for<'b> HashOps<<A::KeyOps as KeyOps<'b, <A::PointerOps as PointerOps>::Value>>::Key>. The solution is to remove the HRTB, but I am debating whether to add a lifetime parameter to the impl or to remove the lifetime from KeyOps.
[ ] A way to customize the "load factor target" through Array or HashTable. The array would still decide when to grow/shrink, if at all, but a use-case may require optimized time (more buckets, fewer elements per bucket, faster search within bucket) or optimized space.
[x] Extending BucketOps to support custom bucket emptiness tracking.
[ ] Consolidate the hash_table module into a single file.

Some notes

There are now two kinds of cursors: `BucketCursor` and `Cursor`.

Originally there was a single Cursor where move_next would advance to the next element within the bucket or the first element of the next non-empty bucket.

However the effects of move_next were dependent on the internal state of the hash table, making the meaning of this code ambiguous:

// ...
let mut cursor = hash_table.cursor_mut();
cursor.move_next();
cursor.insert_after(value);

Into which bucket is the value inserted? There's no way to tell without exposing the current index. Even that wouldn't enable the same degree of control offered by the cursor APIs of the other intrusive collections. move_next_bucket and move_prev_bucket would move the cursor to a hybrid "null" state where get returns None, but that now means that there is no longer a single "null object", but a multitude. What would is_null return if the cursor was pointing to the null object of a bucket? (E.g. When inserting a value into an empty bucket.)

`KeyOps` and `HashOps`

Since HashTableAdapter does not implement Adapter, we cannot implement KeyAdapter.

/// Key operations.
pub trait KeyOps<'a, T: ?Sized> {
    /// The key type.
    type Key;

    /// Returns the key of `value`.
    fn key(&self, value: &'a T) -> Self::Key;
}

/// Trait for hash operations.
pub trait HashOps<T: ?Sized> {
    /// Returns the hash value of `value`.
    fn hash(&self, value: &T) -> u64;
}

/// The default implementation of `HashOps`.
#[derive(Default)]
pub struct DefaultHashOps<S: BuildHasher> {
    hasher_builder: S,
}

impl<S: BuildHasher> DefaultHashOps<S> {
    /// Returns a new hash ops instance.
    #[inline]
    pub fn from_hasher_builder(hasher_builder: S) -> DefaultHashOps<S> {
        DefaultHashOps { hasher_builder }
    }
}

impl<S: BuildHasher, T: Hash> HashOps<T> for DefaultHashOps<S> {
    #[inline(never)]
    fn hash(&self, value: &T) -> u64 {
        let mut hasher = self.hasher_builder.build_hasher();
        value.hash(&mut hasher);
        hasher.finish()
    }
}

I need `cursor_mut_from_ptr(ptr).remove()` to have constant time complexity.

However, a singly linked list is sufficient for many use cases.

I've added a BucketOps trait for the following reasons:

Clean separation between the Adapter implementation of the bucket vs HashTableAdapter
User-specified time complexity (SinglyLinkedList vs LinkedList)
Custom emptiness tracking (e.g. bitset of non-empty buckets) (Planned but not yet implemented.)

Also, BucketOps::Cursor is not the same as a collection's Cursor. It is usually Option<NonNull<A::PointerOps::Value>>.

pub unsafe trait BucketOps {
    /// Collection-specific pointer conversions which allow an object to
    /// be inserted in an intrusive collection.
    type PointerOps: PointerOps;

    /// The hash table bucket type.
    type Bucket;

    /// The cursor over elements within a bucket.
    type Cursor: Clone;

    /// Returns `true` if the bucket is empty.
    unsafe fn is_empty(&self, bucket: &Self::Bucket) -> bool;

    /// Removes all elements from the bucket.
    ///
    /// This will unlink all objects currently in the bucket, which requires iterating through
    /// all elements in the bucket. Each element is converted back into an owned pointer and then dropped.
    unsafe fn clear(&mut self, bucket: &mut Self::Bucket);

    /// Empties the bucket without unlinking or freeing the objects within it.
    ///
    /// Since this does not unlink any objects, any attempts to link these objects into another intrusive collection will fail,
    /// but will not cause any memory unsafety.
    /// To unlink those objects manually, you must call the `force_unlink` function on them.
    unsafe fn fast_clear(&mut self, bucket: &mut Self::Bucket);

    /// Returns `true` if the cursor is pointing to the null object.
    unsafe fn is_null(&self, bucket: &Self::Bucket, cursor: &Self::Cursor) -> bool;

    /// Returns a reference to the object that the cursor is currently pointing to.
    ///
    /// This returns `None` if the cursor is pointing to the null object.
    unsafe fn get(
        &self,
        bucket: &Self::Bucket,
        cursor: &Self::Cursor,
    ) -> Option<&<Self::PointerOps as PointerOps>::Value>;

    /// Returns a null cursor for this bucket.
    unsafe fn cursor(&self, bucket: &Self::Bucket) -> Self::Cursor;

    /// Creates a cursor from a pointer to an element.
    ///
    /// # Safety
    /// `ptr` must be a pointer to an object that is a member of this bucket.
    unsafe fn cursor_from_ptr(
        &self,
        bucket: &Self::Bucket,
        ptr: *const <Self::PointerOps as PointerOps>::Value,
    ) -> Self::Cursor;

    /// Searches the bucket for the first element where `is_match` returns `true`.
    ///
    /// Returns `None` if no match was found, or the bucket was empty.
    unsafe fn find_prev<F>(&self, bucket: &Self::Bucket, is_match: F) -> Option<Self::Cursor>
    where
        for<'b> F: FnMut(&'b <Self::PointerOps as PointerOps>::Value) -> bool;

    /// Returns a cursor that points to the next element of the bucket.
    ///
    /// If `self` points to the null object,
    /// then it will return a cursor that points to the first element of the bucket.
    ///
    /// If `self` points to the last element of the bucket,
    /// then it will return a cursor that points to the null object.
    unsafe fn next(&self, bucket: &Self::Bucket, prev: &Self::Cursor) -> Self::Cursor;

    /// Inserts a new element into the bucket after the current one.
    ///
    /// If the cursor currently points to the null object,
    /// then the new element is inserted at the front of the bucket.
    ///
    /// # Panics
    /// Panics if the new element is already linked to a different intrusive collection.
    unsafe fn insert_after(
        &mut self,
        bucket: &mut Self::Bucket,
        prev: &Self::Cursor,
        value: <Self::PointerOps as PointerOps>::Pointer,
    );

    /// Removes the next element from the bucket.
    ///
    /// Returns a pointer to the element that was removed.
    /// The cursor is not moved.
    ///
    /// If the cursor currently points to the last element of the bucket, then no element is removed  and `None` is returned.
    unsafe fn remove_next(
        &mut self,
        bucket: &mut Self::Bucket,
        prev: &Self::Cursor,
    ) -> Option<<Self::PointerOps as PointerOps>::Pointer>;

    /// Removes the current element from the bucket.
    ///
    /// Returns a pointer to the element that was removed.
    /// The cursor is advanced to the next element of the bucket.
    ///
    /// If the cursor currently points to the null object, then no element is removed  and `None` is returned.
    unsafe fn remove(
        &mut self,
        bucket: &mut Self::Bucket,
        current: &mut Self::Cursor,
    ) -> Option<<Self::PointerOps as PointerOps>::Pointer>;

    /// Removes the next element from the bucket, and inserts another object in its place.
    ///
    /// Returns a pointer to the element that was removed.
    /// The cursor is not moved.
    ///
    /// If the cursor currently points to the last element of the bucket, then `Err(value)` is returned.
    ///
    /// # Panics
    /// Panics if the new element is already linked to a different intrusive collection.
    unsafe fn replace_next_with(
        &mut self,
        bucket: &mut Self::Bucket,
        prev: &Self::Cursor,
        value: <Self::PointerOps as PointerOps>::Pointer,
    ) -> Result<<Self::PointerOps as PointerOps>::Pointer, <Self::PointerOps as PointerOps>::Pointer>;

    /// Removes the current element from the bucket, and inserts another object in its place.
    ///
    /// Returns a pointer to the element that was removed.
    /// The cursor points to the newly added element.
    ///
    /// If the cursor currently points to the null object, then `Err(value)` is returned.
    ///
    /// # Panics
    /// Panics if the new element is already linked to a different intrusive collection.
    unsafe fn replace_with(
        &mut self,
        bucket: &mut Self::Bucket,
        current: &mut Self::Cursor,
        value: <Self::PointerOps as PointerOps>::Pointer,
    ) -> Result<<Self::PointerOps as PointerOps>::Pointer, <Self::PointerOps as PointerOps>::Pointer>;
}

Amanieu / intrusive-rs