Cydhra / vers

very efficient rank and select
Apache License 2.0
56 stars 3 forks source link

`indexset` #3

Closed brurucy closed 4 months ago

brurucy commented 4 months ago

Hey there!

Thanks a lot for the PRs that made range iteration work better.

How does indexset's rank and select fare against what you have here?

Cydhra commented 4 months ago

While the operations are called the same and do a similar operation in theory, I specifically only implemented and tested rank/select data structures on bit vectors.

Those can be heavily optimized by exploiting bitwise operations on the x86 architecture and exploiting the fact that only ones and zeros are part of the data structure. So comparing this library would be pretty pointless, as indexset is a non-duplicate set (I think?), and supports all types of ordered elements.

The rank of an element in indexset is its position in the sorted set, while the rank of a bit in a bit vector is the number of equal bits before that element.

There might be a way to implement a bit vector using your fenwick tree, but I never worked with those, so I am not sure.

brurucy commented 4 months ago

Hmm.

In this case, would it not be possible to generalise the indexset container/Node to support that?

I had this idea for a while. To make indexset generic on the Node.

Cydhra commented 4 months ago

Maybe? I don't exactly know how deep the exclusivity of equal elements is rooted in the datastructure and if it could support unsorted sets.

Keep in mind though, that the rank implementation of BitVectors takes an index as parameter and returns the amount of 1 before that index for rank1 and the amount of zeros before that index for rank0, regardless of what element is at that index. So there are possibly multiple indices with the same rankX.

brurucy commented 4 months ago

If we look at Node:

struct Node<T>
where
    T: PartialOrd + Clone,
{
    pub inner: Vec<T>,
    pub max: Option<T>,
    pub iterations: usize,
}

impl<T: PartialOrd + Clone> PartialOrd for Node<T> {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        self.max.partial_cmp(&other.max)
    }
}

impl<T: PartialOrd + Clone> Default for Node<T> {
    fn default() -> Self {
        Self {
            inner: Vec::with_capacity(DEFAULT_INNER_SIZE),
            max: None,
            iterations: 10,
        }
    }
}

fn search<T: PartialOrd>(haystack: &[T], needle: &T, iterations: usize) -> Result<usize, usize> {
    let mut left = 0;
    let mut right = haystack.len();
    for _ in 0..iterations {
        if left >= right {
            break;
        }

        let mid = left + (right - left) / 2;

        let mid_value = unsafe { haystack.get_unchecked(mid) };

        if mid_value < needle {
            left = mid + 1;
        } else if mid_value > needle {
            right = mid;
        } else {
            return Ok(mid);
        }
    }

    Err(left)
}

impl<T: Ord + Clone> Node<T> {
    pub fn new(capacity: usize) -> Self {
        Self {
            inner: Vec::with_capacity(capacity),
            iterations: capacity.ilog2() as usize,
            ..Default::default()
        }
    }
    pub fn get(&self, index: usize) -> Option<&T> {
        self.inner.get(index)
    }
    pub fn split_off(&mut self, cutoff: usize) -> Self {
        let latter_inner = self.inner.split_off(cutoff);

        self.max = self.inner.last().cloned();

        let latter_inner_max = latter_inner.last().cloned();
        Self {
            inner: latter_inner,
            max: latter_inner_max,
            iterations: self.iterations,
        }
    }
    pub fn halve(&mut self) -> Self {
        self.split_off(DEFAULT_CUTOFF)
    }
    pub fn len(&self) -> usize {
        self.inner.len()
    }
    pub fn insert(&mut self, value: T) -> bool {
        match search(&self.inner, &value, self.iterations) {
            Ok(_) => return false,
            Err(idx) => {
                let some_value = Some(&value);
                if some_value > self.max.as_ref() {
                    self.max = some_value.cloned()
                }

                self.inner.insert(idx, value);
            }
        }

        true
    }
    pub fn delete(&mut self, index: usize) -> T {
        self.inner.remove(index)
    }
}

It could be a trait requiring delete, insert, get and halve.

Cydhra commented 4 months ago

But wouldn't functions like search get extremely slow if you gave up sorted order within nodes?