flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0
167 stars 50 forks source link

idset: exponential slowdown of idset_decode(3) with large integers #5287

Open grondo opened 1 year ago

grondo commented 1 year ago

idset_decode(3) shows an exponential performance issue when encoding large integers. The following test program demonstrates the issue:

#define _GNU_SOURCE
#include <flux/idset.h>
#include <stdio.h>
#include <limits.h>
#include <stdlib.h>
#include <time.h>

static struct timespec ts_diff (struct timespec start, struct timespec end)
{
    struct timespec temp;
    if ((end.tv_nsec-start.tv_nsec)<0) {
        temp.tv_sec = end.tv_sec-start.tv_sec-1;
        temp.tv_nsec = 1000000000+end.tv_nsec-start.tv_nsec;
    } else {
        temp.tv_sec = end.tv_sec-start.tv_sec;
        temp.tv_nsec = end.tv_nsec-start.tv_nsec;
    }
    return temp;
}

double since (struct timespec t0)
{
    struct timespec ts, d;
    clock_gettime (CLOCK_MONOTONIC, &ts);
    d = ts_diff (t0, ts);
    return ((double) d.tv_sec + (double) d.tv_nsec / 1000000000UL);
}

void test (const char *range)
{
    struct timespec ts;

    clock_gettime (CLOCK_MONOTONIC, &ts);
    struct idset *idset = idset_decode (range);
    char *s = idset_encode (idset, IDSET_FLAG_RANGE);
    printf ("%.3fs: %s\n", since (ts), s);
    free (s);
    idset_destroy (idset);
}

int main (int ac, char **av)
{
    unsigned int i = 100;
    for (; i < IDSET_INVALID_ID - 1; i*=2) {
        char *range;
        asprintf (&range, "%u", i);
        test (range);
        free (range);
    }
    return 0;
}
$ ./tst
0.000s: 100
0.000s: 200
0.000s: 400
0.000s: 800
0.000s: 1600
0.000s: 3200
0.000s: 6400
0.000s: 12800
0.000s: 25600
0.000s: 51200
0.001s: 102400
0.002s: 204800
0.004s: 409600
0.004s: 819200
0.007s: 1638400
0.021s: 3276800
0.043s: 6553600
0.113s: 13107200
0.220s: 26214400
0.437s: 52428800
0.874s: 104857600
1.514s: 209715200
3.029s: 419430400
6.006s: 838860800
12.112s: 1677721600
tst: veb.c:88: encode: Assertion `8*(b-1) < WORD' failed.
Aborted

The program aborts before we can test up to IDSET_INVALID_ID - 1. Thus I guess the idset implementation can't store the full range of unsigned integers (a separate issue).

The issue is probably academic at this time, because there isn't a real use case for very large idsets. Even a 10,000 node cluster with 128 cores per node would have a max task rank of 1280000.

I ran across this because I was trying to make an idset that matched "all" tasks by creating a large range. For now I can use some other approach.

garlick commented 1 year ago

I think the inability to set the maximum id is addressed by this change, FWIW:

diff --git a/src/common/libidset/idset.c b/src/common/libidset/idset.c
index d7ea5538f..85975e713 100644
--- a/src/common/libidset/idset.c
+++ b/src/common/libidset/idset.c
@@ -117,6 +117,8 @@ static int idset_grow (struct idset *idset, size_t size)

     while (newsize < size)
         newsize <<= 1;
+    if (newsize > IDSET_INVALID_ID)
+        newsize = IDSET_INVALID_ID;

     if (newsize > idset->T.M) {
         if (!(idset->flags & IDSET_FLAG_AUTOGROW)) {
garlick commented 1 year ago

idset_encode() walks through comma-separated ranges, processing each one sequentially. For each one the veb trie is reallocated to the size of the max id of the range (if necessary), and then idset_range_set() iterates over the range setting bits. This turns out to be poorly optimized for the case described since idset_range_set() ends up interating over the entire trie.

One idea for an optimization might be to check all the ranges first, and if there are more bits set than clear, then instead of the above, create a trie with all bits initially set, and walk the holes in the ranges, clearing bits.

grondo commented 1 year ago

This turns out to be poorly optimized for the case described since idset_range_set() ends up interating over the entire trie.

I'm possibly misunderstanding, but the testcase is only setting a single, albeit large, integer in the idset, so only one bit should be set as a result.

Edit: still, your idea seems solid to me.

garlick commented 1 year ago

Oh jeez, how'd I miss that. Sorry for the confusion!

grondo commented 1 year ago

Part of the problem may be this documented property of the vEB tree implementation:

has space efficiency comparable to a bitmap

I'm seeing a lot of time spent in idset_grow().

Not that this is important now, but I wonder if looking into whether there is some other data structure with similar properties of a vEB tree but better space efficiency would be useful in the future.

The Similar Data Structures section of the van Emde Boas wikipedia article states:

The O(M) space usage of vEB trees is an enormous overhead unless a large fraction of the universe of keys is being stored. This is one reason why vEB trees are not popular in practice.

Though I assume the implementation in flux-core has some of the suggested mitigations since I don't see 2G required to store IDSET_INVALID_ID - 1 (though it is pretty large ~500K to store one id), a suggested alternative was the y-fast trie which uses O(n) space, though I could not find a C implementation of it (and it sounds complicated).

Another alternative might be a compressed bitmap (though I'm unsure how well those implement predecessor/successor). Here's a C implementation of RoaringBitmap (silly name): https://github.com/RoaringBitmap/CRoaring

garlick commented 1 year ago

CRoaring looks like it has a compatible license and can be pulled in via their "amalgomated release". On the surface, its api would seem to be a usable replacement for libveb so for fun we could switch the libidset implementation over and see how it goes.

FWIW I did take the first step and pulled it in in my issue#5287 branch, specifically this commit: efc0aa716f4126f4b06168c996d53971929428de.

Edit: hmm, it appears that CRoaring requires C11 but at this time Flux is only requiring C99

grondo commented 5 months ago

We might want to take another look at this.

With large resource sets flux resource now appears to be bound by the performance of libidset, particularly the idset set operations which seem to be quite slow because they use iteration.

I did some quick comparisons between the roaring bitmap implementation and the current idset implementation.

idset roaring
Set 16K individual bits 567.593ms 126.163ms
Set 16K bits as range 303.266ms 0.038ms
iterate 16K bits 1299.053ms 0.107ms
test bit 0.130ms 0.041ms
set difference (16K - 10K) 3575.379ms 0.270ms

(to clarify, the above benchmarks were taken by iterating the tested operation 100 or 1000 times, depending on speed of the op)

What I haven't done is compare memory usage of roaring bitmaps, however I'm guessing the memory efficiency is much better than a vEB tree.

Also interesting is that roaring has a 64bit implementation. This might be useful in the future, e.g. this would be a very fast way to keep a set of jobids..

Perhaps we can consider requiring C11 to build flux-core to get this improvement.

garlick commented 5 months ago

Oh wow, nice results! Hmm, let me peek at that old branch again and refresh my memory on where this is.