attractivechaos / khashl

Generic hash table library in C
19 stars 0 forks source link

Table of Contents

Introduction

Khashl is a single-header macro-based generic hash table library in C. It is an improved version of khash from klib and is one of the faster hash table implementations in C/C++. Klib also has a copy of khashl for historical reason. This repo provides more examples and better documentation.

Usage

Integer keys

Here is a small example for integer keys:

#include <stdio.h>
#include <stdint.h>
#include "khashl.h"
// Instantiate
KHASHL_MAP_INIT(KH_LOCAL, map32_t, map32, uint32_t, int, kh_hash_uint32, kh_eq_generic)

int main(void) {
    int absent;
    khint_t k;
    map32_t *h = map32_init();
    k = map32_put(h, 20, &absent); // get iterator to the new bucket
    kh_val(h, k) = 2; // set value
    k = map32_get(h, 30); // query the hash table
    if (k < kh_end(h)) printf("found key '30'\n");
    kh_foreach(h, k) { // iterate
        printf("h[%u]=%d\n", kh_key(h, k), kh_val(h, k));
    }
    map32_destroy(h); // free
    return 0;
}

To use khashl, you need to instantiate functions specific to your types with

KHASHL_MAP_INIT(scope, table_type, prefix, key_type, val_type, hash_func, eq_func)

where:

After instantiation, you will be able to use the following functions:

In khashl, a position is like an iterator. prefix_get() and prefix_put() return iterators. Khashl additionally provides the following macros:

String keys

It is important to note that khashl only keeps the pointers to strings. You are responsible for managing the memory allocated to the strings.

Here is an example for counting the number of distinct words on the commnand line:

// To run this program: `./this_prog abc bc abc a bc`
#include <stdio.h>
#include <string.h>
#include "khashl.h"
KHASHL_SET_INIT(KH_LOCAL, strmap_t, strmap, const char*, kh_hash_str, kh_eq_str)

int main(int argc, char *argv[])
{
    strmap_t *h;
    int i, absent;
    h = strmap_init();
    for (i = 1; i < argc; ++i)
        strmap_put(h, argv[i], &absent);
    printf("# of distinct words: %d\n", kh_size(h));
    strmap_destroy(h);
    return 0;
}

In this example, the string contents are already stored in the argv[] array. You don't need to worry about memory management. The following demonstrates how to insert string pointers and their contents into a hash table.

// To run this program: `echo a bc a cd bc|./this_prog`
#include <stdio.h>
#include <string.h>
#include "khashl.h"
KHASHL_MAP_INIT(KH_LOCAL, strmap_t, strmap, const char*, int, kh_hash_str, kh_eq_str)

int main(int argc, char *argv[])
{
    char s[4096]; // max string length: 4095 characters
    strmap_t *h;
    khint_t k;
    h = strmap_init();
    while (scanf("%s", s) > 0) {
        int absent;
        k = strmap_put(h, s, &absent);
        if (absent) kh_key(h, k) = strdup(s), kh_val(h, k) = 0;
        // else, the key is not touched; we do nothing
        ++kh_val(h, k);
    }
    printf("# of distinct words: %d\n", kh_size(h));
    // IMPORTANT: free memory allocated by strdup() above
    kh_foreach(h, k) {
        printf("%s: %d\n", kh_key(h, k), kh_val(h, k));
        free((char*)kh_key(h, k));
    }
    strmap_destroy(h);
    return 0;
}

Custom keys

You can put C struct into a hash table as long as you provide a hash function and an equality function. You can use macro functions.

Algorithm

Khashl uses linear probing and power-of-2 capacity. It applies Fibonacci hashing to protect against bad hash functions and implements deletion without tombstones. Khashl uses one bit per bucket to indicate whether a bucket is empty. It has minimal memory overhead though this comes at the cost of one extra cache miss per query. Khashl does not use SIMD.

Ensemble of hash tables

Khashl uses 32-bit hashes, which means it cannot directly store more than 4 billion keys. Nonetheless, it has a special way to handle billions of keys: ensemble of hash tables.

Rationale

Suppose a hash table consists of n smaller sub hash tables. A key x is located in sub-table hash(x) % n. Because it is rare for all sub-tables to rehash at the same time, the peak memory can be reduced. You can find more explanation in this blog. In my opinion, using an ensemble of hash tables it the best strategy for huge hash tables.

We can implement a hash table ensemble in the user space for any libraries. I have been using the idea since 2015. Nonetheless, it is more convenient to hide the details behind the library code such that users can use familiar hash table APIs. phmap is perhaps the first library to do this. Now khashl has this functionality as well.

Use ensemble

The integer example above becomes:

#include <stdio.h>
#include <stdint.h>
#include "khashl.h"
// use "KHASHE" for instantiation
KHASHE_MAP_INIT(KH_LOCAL, map32_t, map32, uint32_t, int, kh_hash_uint32, kh_eq_generic)
int main(void) {
    int absent;
    kh_ensitr_t k; // use kh_ensitr_t instead of khint_t
    map32_t *h = map32_init(6); // use 2**6=64 sub hash tables
    k = map32_put(h, 20, &absent); // get iterator to the new bucket
    kh_ens_val(h, k) = 2; // use kh_ens_val() instead of kh_val()
    k = map32_get(h, 30); // query the hash table
    if (!kh_ens_is_end(k)) printf("found key '30'\n"); // use kh_ens_is_end()
    kh_ens_foreach(h, k) { // use kh_ens_foreach() instead of kh_foreach()
        printf("h[%u]=%d\n", kh_ens_key(h, k), kh_ens_val(h, k));
    }
    map32_destroy(h);
    return 0;
}

You will have to change most macros and iteration:

Performance

For details, see udb3.