dragonflydb / dragonfly

A modern replacement for Redis and Memcached
https://www.dragonflydb.io/
Other
26.18k stars 964 forks source link

HyperLogLog takes always 12kb per key in Dragonfly (Sparse mode never used) #2679

Closed RafalBielickiIM closed 7 months ago

RafalBielickiIM commented 9 months ago

Describe the bug According to Redis docs HLL is supposed to use UP TO 12kb while dragonfly always uses 12kb. When running simple test Redis uses 20mb to store 100 000 HLL keys. Dragonfly on the other hand uses 1.5gb to store 100 000 HLL keys. Seems like the sparse mode is never used in DF.

To Reproduce Steps to reproduce the behaviuor:

  1. Insert 100 000 HLL keys to Dragonfly & Redis
  2. Compare memory used by both instances.

Example python script to insert keys

Expected behavior HLL should not take 12kb for small sets.

Screenshots If applicable, add screenshots to help explain your problem.

Environment (please complete the following information):

Reproducible Code Snippet

import redis
import random
import string

# Connect to Redis
r = redis.Redis(host='localhost', port=6379, db=1)

# Define the number of keys
N = 100_000

# Function to generate a random string
def random_string(length=10):
    letters = string.ascii_lowercase
    return ''.join(random.choice(letters) for i in range(length))

# Create N random keys
for i in range(1, N+1):
    key = random_string()
    value = random_string()
    r.pfadd(key, value)

print(redis.info()['maxmemory_human'])

result for Redis 6.2.6

11.78M

results for Dragonfly 1.14.6

1.34GiB
chakaz commented 9 months ago

Hi @RafalBielickiIM Indeed, Dragonfly never uses the sparse representation, it always uses the dense one. It's not a bug, but more of a missing feature we currently did not prioritize.

RafalBielickiIM commented 9 months ago

Looking into code there is a server flag in redis hll_sparse_max_bytes that can modify when to use sparse mode. I cannot seem to find usage of it in dragonfly.

RafalBielickiIM commented 9 months ago

Hi @RafalBielickiIM Indeed, Dragonfly never uses the sparse representation, it always uses the dense one. It's not a bug, but more of a missing feature we currently did not prioritize.

Is there any chance we get it in the near future?

Looking at the code seems like there is some implementation for it. https://github.com/dragonflydb/dragonfly/blob/7c443f3a15d0f267753e0d80dc1e45c226bc8bea/src/redis/hyperloglog.c#L371

There is even a test for that https://github.com/dragonflydb/dragonfly/blob/7c443f3a15d0f267753e0d80dc1e45c226bc8bea/src/server/rdb_test.cc#L441

RafalBielickiIM commented 8 months ago

@chakaz any ideas/notes to the above? For now we handle low cardinality with hash + SET.

chakaz commented 8 months ago

It may take some time I'm afraid. I added the 'help wanted' label, maybe an external contributor can assist here as well.

romange commented 8 months ago

@azuredream hey, in case you are still looking for an interesting task to work on :)

azuredream commented 8 months ago

@azuredream hey, in case you are still looking for an interesting task to work on :)

Glad to help. Looking into this.

azuredream commented 8 months ago

Research completed. Coding...

azuredream commented 8 months ago

Encountered a problem. SDS in Redis allows pre-allocating space for string. If that's not the case in Dragonfly DB, we probably end up with much more (almost for every insertion) reallocations.

/* sdsalloc() = sdsavail() + sdslen() */ When creating a new sparse hll obj, Redis allocates sparselen bytes, and set value for the first HLL_HDR_SIZE+2 bytes.

int sparselen = HLL_HDR_SIZE + (((HLL_REGISTERS+(HLL_SPARSE_XZERO_MAX_LEN-1)) /
HLL_SPARSE_XZERO_MAX_LEN)*2);

When inserting more keys to a sparse hll, Redis enlarge it if required:

if (sdsalloc(o->ptr) < server.hll_sparse_max_bytes && sdsavail(o->ptr) < 3) {
        size_t newlen = sdslen(o->ptr) + 3;
        newlen += min(newlen, 300); /* Greediness: double 'newlen' if it is smaller than 300, or add 300 to it when it exceeds 300 */
        if (newlen > server.hll_sparse_max_bytes)
            newlen = server.hll_sparse_max_bytes;
        o->ptr = sdsResize(o->ptr, newlen, 1);
    }

In Dragonfly DB, is there a db object with .allocated_size and .used_size ? If not, we probably end up with much more (almost for every insertion) reallocation.

An alternative is always allocating the max size of sparse hll (Or progressively. 10%,50%,100%). It's a configurable value(min: 0, max: LONG_MAX, default3000) in Redis: server.hll_sparse_max_bytes

romange commented 8 months ago

yes, CompactObj::MallocUsed() gives you the allocated_size and CompactObject::Size() provides you with the actual string length (less or equal than MallocUsed()),

so SdsAvail equivalent would be MallocUsed() - Size()

chakaz commented 8 months ago

so SdsAvail equivalent would be MallocUsed() - Size()

There are cases where MallocUsed() returns 0 (if the string is inline), so we should also handle that case as well.

An alternative is always allocating the max size of sparse hll

I'd try to avoid that if possible. Now we allocate 15k per entry, and it looks like Redis allocates a minimum of 200 bytes.

azuredream commented 8 months ago

yes, CompactObj::MallocUsed() gives you the allocated_size and CompactObject::Size() provides you with the actual string length (less or equal than MallocUsed()),

so SdsAvail equivalent would be MallocUsed() - Size()

Thanks. It seems that CompactObj is a robj wrapper and can be stored to db. I'll try that. I probably need to refactor the logic for dense hll to keep consistency.

azuredream commented 8 months ago

so SdsAvail equivalent would be MallocUsed() - Size()

There are cases where MallocUsed() returns 0 (if the string is inline), so we should also handle that case as well.

An alternative is always allocating the max size of sparse hll

I'd try to avoid that if possible. Now we allocate 15k per entry, and it looks like Redis allocates a minimum of 200 bytes.

Please correct me if I was wrong. A string shorter than or equals to 16 bytes will be an inline string. static constexpr unsigned kInlineLen = 16; In this case, we don't need to worry about that because hll header takes 16 bytes. registers[] takes at least 2 bytes.

struct hllhdr {
  char magic[4];       /* "HYLL" */
  uint8_t encoding;    /* HLL_DENSE or HLL_SPARSE. */
  uint8_t notused[3];  /* Reserved for future use, must be zero. */
  uint8_t card[8];     /* Cached cardinality, little endian. */
  uint8_t registers[]; /* Data bytes. */
};

18 is theoretical minimum. In Redis, a new sparse hll string is 20 bytes by default. sparselen =16+4 = 20

int sparselen = HLL_HDR_SIZE + (((HLL_REGISTERS+(HLL_SPARSE_XZERO_MAX_LEN-1)) /
HLL_SPARSE_XZERO_MAX_LEN)*2);

If, as mentioned above, Redis allocates at least 200 bytes, that's even better.

chakaz commented 8 months ago

Thanks. It seems that CompactObj is a robj wrapper and can be stored to db. I'll try that. I probably need to refactor the logic for dense hll to keep consistency.

I am not sure exactly what you're trying to do, but note that we can't use our robj directly as sds string with the Redis API (see here).

If, as mentioned above, Redis allocates at least 200 bytes, that's even better.

Sorry, I don't know what's the real minimum Redis allocates. I only deducted that from the first comment of this bug. If you see that the minimum is 20, then that must be it. Sorry about confusing you.

If you think that indeed is the case, please add a DCHECK() to make sure our assumption holds.

Thanks!

azuredream commented 8 months ago

Sorry, I don't know what's the real minimum Redis allocates. I only deducted that from the first comment of this bug. If you see that the minimum is 20, then that must be it. Sorry about confusing you.

If you think that indeed is the case, please add a DCHECK() to make sure our assumption holds.

20 is just a theoretical value. Will do some experiments to see the behavior. I just wanted to confirm that <=16 bytes will be a inline string. DCHECK is good idea. Thanks for your advice.

azuredream commented 8 months ago

Could anyone share more about how to use the memory usage test script above? I ran it on Redis server, returned 0B. On Dragonfly DB server, maxmemory_human was 10GB before I insert anything.

romange commented 8 months ago

maxmemory_human is the memory limit, not memory usage. Dragonfly determines the limit automatically based on the available memory on its host