efficient / cuckoofilter

Other
964 stars 167 forks source link

APIs to serialize/deserialize data? #28

Open anhldbk opened 6 years ago

anhldbk commented 6 years ago

Thanks for this nice implementation of Cuckoo filter. Would you please let me know how to serialize/deserialize the filter's state?

nextgens commented 6 years ago

I guess it's a feature request more than anything else... but I concur; it'd be super-convenient if it was part of the API

dnbaker commented 6 years ago

I used some dummy structs and manual memcpy to do so in a project a while back. It's gross and bad form but works. Just make sure the layout's the same. (Until it's actually supported, of course.)

nextgens commented 6 years ago

Would you mind sharing the code ?

dnbaker commented 6 years ago

I can pull that out of a private repo. I’ll look it over tomorrow and see if a pull request is easier.

If you’re doing sequence analysis, which I’m guessing based on your name, I could release the kmer database code around it if it meets your needs.

nextgens commented 6 years ago

Thanks!

No, this time it's not for sequence analysis :)

roniemartinez commented 6 years ago

👍 to this! Serializing/deserializing seems more efficient than rebuilding the filter. @dnbaker hope it is possible to share the code.

d4tocchini commented 5 years ago

@dnbaker, @efficient

this is viable, yes? if so, something to point me to?

d4tocchini commented 5 years ago

assuming a surgically placed memcpy, but not feeling confident whereabout

amiller-gh commented 2 years ago

@dnbaker resurrecting an old thread :)

I'm also interested in serialization. Any chance you can dig up that old code, or do you have recommendations for another cuckoo filter library?

dnbaker commented 2 years ago

Here's the code I had. It's only valid for POD (std::is_trivial and std::is_standard_layout). Sorry I took a while to respond - I had to go find and extract it.

First, there's a couple of dummy structs which are defined, and namespacing:

namespace {
typedef struct {
    size_t index;
    uint32_t tag;
    bool used;
} SerialVictim;
struct dummytable {
    void *p;
    size_t n;
};
struct dummyfilter {
    void *p;
    size_t n;
    SerialVictim vc;
};
}

using namespace cuckoofilter;

Writing to disk:

size_t write_cuckoo_filter(const CuckooFilter<uint64_t, bits_per_item, SingleTable> *filter,
                           const char *path) {
    dummyfilter dfilter;
    dummytable tbl;
    static const size_t tags_per_bucket(4);
    const size_t bktsz((bits_per_item * tags_per_bucket + 7) >> 3); // I haz a bkt.
    static_assert(sizeof(dfilter) == sizeof(*filter), "Dummy table is the wrong size!\n");
    static_assert(sizeof(tbl) == sizeof(SingleTable<bits_per_item>), "Dummy table is the wrong size!\n");
    // Copy info to dummy structs.
    memcpy(&dfilter, filter, sizeof(*filter));
    memcpy(&tbl, dfilter.p, sizeof(tbl));

    FILE *fp(fopen(path, "wb"));
    const int used(dfilter.vc.used);
    size_t ret(fwrite(&dfilter.n, 1, sizeof(dfilter.n), fp));
    ret += fwrite(&dfilter.vc.index, 1, sizeof(dfilter.vc.index), fp);
    ret += fwrite(&dfilter.vc.tag, 1, sizeof(dfilter.vc.tag), fp);
    ret += fwrite(&used, 1, sizeof(used), fp);
    ret += fwrite(&tbl.n, 1, sizeof(tbl.n), fp);
    ret += fwrite(tbl.p, 1, bktsz * tbl.n, fp);
    fclose(fp);
    return ret;
}

Reading:

template <size_t bits_per_item>
CuckooFilter<uint64_t, bits_per_item, SingleTable> *
load_cuckoo_filter(const char *path) {
    FILE *fp(fopen(path, "rb"));
    const size_t tags_per_bucket(4uL);
    const size_t bktsz((bits_per_item * tags_per_bucket + 7) >> 3); // I haz a bkt.
    fseek(fp, 0, SEEK_END);
    const size_t fsize = ftell(fp);
    fseek(fp, 0, SEEK_SET);
    size_t num_bktz, num_items, index;
    uint32_t tag;
    int used;
    fscanf(fp, "%llu", &num_items);
    fscanf(fp, "%llu", &index);
    fscanf(fp, "%u", &tag);
    fscanf(fp, "%i", &used);
    fscanf(fp, "%llu", &num_bktz);
    size_t num_bktz_found((fsize - (sizeof(size_t) * 3) - sizeof(uint32_t) - sizeof(int)) / bktsz);
    if((fsize - (sizeof(size_t) * 3) - sizeof(uint32_t) - sizeof(int)) % bktsz) {
       fprintf(stderr, "Found %" PRIu64 " buckets. Expected %" PRIu64 ". fsize: %" PRIu64 ". bktsz: %" PRIu64 ". They be stealin' my bkt :'(....\n",
               num_bktz_found, num_bktz, fsize, bktsz);
       exit(EXIT_FAILURE);
    }
    void *buf = (uint8_t *)malloc(fsize - 1);
    size_t nread;
    if((nread = fread(buf, 1, num_bktz * bktsz, fp)) != num_bktz * bktsz) {
       fprintf(stderr, "Couldn't read buffer from file '%s' Bytes read: %" PRIu64 ". Expected: %" PRIu64 ".\n", path, nread, num_bktz * bktsz);
       exit(EXIT_FAILURE);
    }
    dummytable *table((dummytable *)malloc(sizeof(SingleTable<bits_per_item>)));
    *table = {buf, num_bktz};
    static_assert(sizeof(*table) == sizeof(uint64_t[2]), "Wrong size!\n");

    dummyfilter *ret((dummyfilter *)malloc(sizeof(dummyfilter)));
    *ret = {table, num_items, {index, tag, !!used}};
    fclose(fp);
    return (CuckooFilter<uint64_t, bits_per_item, SingleTable> *)(ret);
}

At destruction, you'll need to manually call your destructor (since it's a pointer), and std::free(ret->table). It's pretty clunky, but it at least worked for my purposes. Let me know if you have any problems or need some help!