RagnarGrootKoerkamp / PTRHash

PTRHash minimal perfect hash function, based of PTHash
33 stars 6 forks source link

Usage as Hasher in std::collections::HashMap? #7

Closed phip1611 closed 1 day ago

phip1611 commented 6 months ago

Hi there! I've come here because I found its usage in [0]. I'm not an expert in hashing and hashmaps. I'm wondering if this perfect hashing algorithm can be combined with std::collections::HashMap or std::collections::BTreeMap?

My "request"/suggestion is: Can you please add a few sentences in the README and documentation that classify this library against the standard library and the hashing implementations used there, including an example? This will help to understand people outside the loop what this actually is. :)


In the past days, I've jumped into the 1BRC challenge w/ Rust myself [1], that's how I ended up here (jup, I'm late to the party).

[0] https://github.com/RagnarGrootKoerkamp/1brc/blob/master/src/main.rs#L390 [1] https://github.com/phip1611/1brc-rust

RagnarGrootKoerkamp commented 6 months ago

Hey,

Perfect hashes don't really make sense to be used as hashers for a standard hashmap. Unlike normal hashes, constructing a perfect hash function requires knowing up front which keys you'll insert. And given a perfect hash function, there is typically no need for the hashmap to store keys. But that makes it impossible to detect duplicate or unexpected keys. So the typical case is to simply take the integer returned by the minimal perfect hash function and then to store an array that is directly indexed by this integer.

Hope this helps! I'll leave this open to update the readme later.

phip1611 commented 6 months ago

Got it, thanks!

And given a perfect hash function, there is typically no need for the hashmap to store keys.

I think I figured out how to use it. Is this correct?

use ptr_hash::{PtrHash, PtrHashParams};

#[derive(Debug, Default, Clone)]
struct AggregatedData {
    name: String,
    sum_values: f32,
}

impl AggregatedData {
    fn init(&mut self, name: String)  {
        self.name = name;
    }
}

fn main() {
    let cities = ["Berlin", "Dresden", "Hamburg"];

    // Build the data structure.
    let mphf = <PtrHash<_>>::new(cities.as_slice(), PtrHashParams::default());

    // Map-like vector. Indices map to elements.
    let mut map = vec![AggregatedData::default(); cities.len()];

    // Init all entries with data that is already known.
    for city in cities {
        let index = mphf.index_minimal(&city);
        map[index].init(city.to_string());
    }

    let index = mphf.index_minimal(&"Berlin");
    map[index].sum_values += 5.0;

    let index = mphf.index_minimal(&"Dresden");
    map[index].sum_values += 1.0;

    let index = mphf.index_minimal(&"Hamburg");
    map[index].sum_values += -5.0;

    let index = mphf.index_minimal(&"Hamburg");
    map[index].sum_values += -2.0;

    dbg!(map);
}
RagnarGrootKoerkamp commented 6 months ago

Yes, that looks like how I'm using it. Just note that you could get away with not storing the name of the city in the AggregateData struct.