dnbaker / sketch

C++ Implementations of sketch data structures with SIMD Parallelism, including Python bindings
MIT License

Trouble including the library #27

Closed: seryrzu closed 3 years ago

seryrzu commented 3 years ago

I have trouble understanding the intended way to include the library. The README says #include <sketch/sketch.h>, but this file is in include/sketch/sketch.h. If I attempt to do #include "sketch/include/sketch/sketch.h", I get fatal error: aesctr/wy.h: No such file or directory (I did clone with submodules). Should I manually move all the directories like aesctr and vec next to sketch.h?

Are all submodules required? Blaze seems huge... I want to try out the count-min sketch; do I need the whole library for that?

Thanks!

dnbaker commented 3 years ago

Normally I would recommend cloning recursively. Blaze does take a while to copy, but it is not necessary for count-min. However, aesctr is. The git submodule system normally places it in the root directory of sketch; you can either place it there, or clone it separately and add an -I flag for the folder containing it.

For instance, the count-min/count-sketch test file compiles on my machine with this:

g++-10 -std=c++17 -O3 testsrc/mctest.cpp -march=native -I include/ -Iinclude/sketch -o mctest

Does that help?

I can also make it a little easier to extract out.

Thanks!

Daniel

dnbaker commented 3 years ago

Hi again,

Checking in -- I've removed some dependencies and restructured a little. The full library still requires submodules, but for a lot of cases, cloning without --recursive should still work as of this commit. I've currently tested it for the ccm.h header.
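For example (just a sketch, assuming the repository is cloned into ./sketch and you only need the count-min header), a translation unit like this:

// ccm_only.cpp (hypothetical file name)
#include "sketch/ccm.h"  // count-min / count-sketch header

int main() {
    sketch::cm::ccm_t cms(4, 16, 4); // illustrative parameters only
    cms.add(42);
    return cms.est_count(42) >= 1 ? 0 : 1;
}

should compile with something along these lines:

g++ -std=c++17 -O3 ccm_only.cpp -I sketch/include -I sketch/include/sketch -o ccm_only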

Thanks,

Daniel

seryrzu commented 3 years ago

Thank you so much for the prompt reply and fix! I confirm that I was able to access the library with this commit and am now playing around with ccm_t. I will reply to this thread if I have any further questions.

seryrzu commented 3 years ago

Hi Daniel, may I ask a couple of questions about the ccm_t class? From what I figured, the cms constructor has 3 parameters: the number of bits per counter, the log2 of the table size (l2sz), and the number of hash functions (nhashes).

Is that a correct understanding?
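Concretely (assuming I have the parameter order right), I mean something like:

// bits per counter = 4, log2(table size) = 20, number of hash functions = 8
sketch::cm::ccm_t cms(4, 20, 8);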

Just to put it into context - I am trying to figure out if putting approximately 250 million elements into the sketch to count rare elements is feasible. Specifically, I am putting in rolling hashes of all k-mers from, let's say, chromosome 1 of the human genome (default k > 50, but that doesn't matter too much since I work with rolling hashes). The goal I am pursuing is to figure out whether I can get reasonable memory usage and performance when checking the multiplicity of any k-mer from the chromosome once I put them all in the sketch. Most k-mers are unique and very few are frequent (appear more than 10 times).

I care about rare and unique k-mers the most and don't care too much about precise estimates above 16, so I should get away with using 4 bits per counter. I tried to play around with the other two parameters, but it seems like for large setups (l2sz > 25 and nhashes > 60; I would probably prefer to increase l2sz rather than the number of hash functions, right?) I can't allocate enough memory. I get either a segfault or terminate called after throwing an instance of 'std::bad_alloc'. I have quite a lot of RAM on the server (more than a terabyte). Does that mean that I need to use some non-default allocator, or is this use case not supported?

Just in case, this is what I am running:

#include "include/sketch/ccm.h"

int main() {
    sketch::cm::ccm_t cms(4, 30, 20);
    size_t N = 250000000;

    for (size_t i = 0; i < N; ++i) {
        cms.add(i);
    }

    int res;
    int incorrect {0};
    for (size_t i = 0; i < N; ++i) {
        res = cms.est_count(i);
        if (res != 1) {
            incorrect += 1;
        }
    }
    printf("%d %f\n", incorrect, (double) incorrect / N);
}

Thank you! Andrey

dnbaker commented 3 years ago

Hi Andrey,

Thanks for the bug report! The array was sized to nhashes << l2sz, where nhashes was an int32_t, so for large l2sz the shift overflowed and the computed size went to 0. I've fixed it now in main by changing it to a 64-bit integer via this pull request.
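If you're curious, the failure mode looks roughly like this (a standalone illustration, not the library's actual code):

#include <cstdint>
#include <cstdio>

int main() {
    int32_t nhashes = 20;
    int l2sz = 30;
    // In 32 bits, 20 << 30 wraps modulo 2^32 and comes out as 0.
    uint32_t bad = uint32_t(nhashes) << l2sz;
    // Widening to 64 bits first gives the intended 20 * 2^30.
    uint64_t good = uint64_t(nhashes) << l2sz;
    printf("%u %llu\n", bad, (unsigned long long)good);
}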

Parameter selection is going to depend a little on the count distributions you're expecting. Bigger tables make each estimator less likely to overestimate, but the probability of failure falls exponentially with the number of tables. You'll need a sufficiently large table to avoid over-counting, and then sufficiently many of them to provide a high probability of success.

There's guidance on setting the table size in section 4.1 of the original Count-Min paper to make it most space-efficient. That would tend you toward having more but smaller tables to get the lowest error rate for a given memory budget. On the other hand, each additional subtable means more cache misses, so using more subtables costs more runtime in exchange for potentially lower over-counting.
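For reference, the standard guarantee from that paper (not specific to this library's implementation): with w = ceil(e / eps) buckets per table and d = ceil(ln(1 / delta)) tables, an estimate exceeds the true count by more than eps * N with probability at most delta, where N is the total number of insertions. A rough sizing sketch along those lines (illustrative targets only; it ignores the 4-bit counter saturation):

#include <cmath>
#include <cstdio>

int main() {
    // Standard Count-Min sizing: w = ceil(e/eps) buckets, d = ceil(ln(1/delta)) tables.
    double N = 250e6;      // planned number of insertions
    double eps = 1.0 / N;  // target additive error around eps * N ~ 1
    double delta = 1e-3;   // allowed per-query failure probability
    double w = std::ceil(std::exp(1.0) / eps);
    double d = std::ceil(std::log(1.0 / delta));
    printf("w ~ %.3g buckets per table (log2 ~ %.1f), d = %.0f tables\n",
           w, std::log2(w), d);
}

That back-of-the-envelope lands in the same ballpark as the concrete numbers below.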

Using your numbers of 250,000,000 insertions, I find that with a log2 table size of 29 and 5 subtables, the error rate is 0.14%; with 10 subtables it drops to about 1 in 176,000, using about 700MB of memory if only storing 4 bits per counter. That's not too bad. Let me know how it goes!

Thanks,

Daniel

seryrzu commented 3 years ago

Hi Daniel,

Thank you very much - this is very helpful. I finally got around to trying the fix and everything seems fine. I will close the issue for now.

Andrey