Closed: jrots closed this issue 5 years ago.
Hello @jrots, thank you for checking this module out!
First of all I would like to say that the amount of users you are talking about seems a good fit for memory-optimized data structures such as Roaring Bitmaps.
However, I didn't quite understand what you are trying to index.
If you need to store arbitrary key/value pairs, I am not sure how you can do that with bitmaps, since they are naturally suited to storing integer/bool pairs (e.g. user 1024 is cached, user 3333 is not cached).
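To illustrate the integer/bool point, here is a minimal sketch of a plain bitmap in pure Python (no Roaring compression; the class and names are illustrative, not part of the module):

```python
# A plain (uncompressed) bitmap: one bit per integer ID, so it can
# only answer yes/no questions such as "is user N cached?".
class Bitmap:
    def __init__(self, size_bits):
        self.buf = bytearray((size_bits + 7) // 8)

    def set(self, n):
        self.buf[n // 8] |= 1 << (n % 8)

    def get(self, n):
        return bool(self.buf[n // 8] & (1 << (n % 8)))

bm = Bitmap(4096)
bm.set(1024)            # user 1024 is cached
print(bm.get(1024))     # True
print(bm.get(3333))     # False -- user 3333 is not cached
```

There is simply no slot for an arbitrary value: each ID maps to one bit.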
And I am also not sure how to use Redis' HASH data type with Roaring, since the secret sauce behind Instagram's optimization is in the zipmap implementation itself (see [1] and [2]), which stores small hashtables as key-value strings, not so much in the bucket partitioning strategy.
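For reference, the bucketing side of that scheme can be sketched as follows. This is a hypothetical simulation in plain Python dicts; the bucket width of 1000 and the `mediabucket:` prefix are taken from the Instagram post, and `hset` here only mimics the Redis command:

```python
# Sketch of Instagram's bucketing scheme: group IDs into buckets of
# 1000 and store each ID as a field of a HASH, which Redis keeps
# zipmap/ziplist-encoded while the hash stays small.
BUCKET_SIZE = 1000  # bucket width from the blog post

def bucket_key(user_id):
    return "mediabucket:%d" % (user_id // BUCKET_SIZE)

# Simulating HSET <bucket> <user_id> <value> with a dict of dicts:
store = {}

def hset(user_id, value):
    store.setdefault(bucket_key(user_id), {})[user_id] = value

hset(1155315, "939")
print(bucket_key(1155315))  # mediabucket:1155
```

The memory win comes from the compact encoding of each small hash, which is why the partitioning alone doesn't transfer directly to Roaring values.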
Hi,
What I want to index are sort of "skiplists".
Users that have been "seen / not voted" by other user IDs.
They can be small or large depending on the usage of the user. Given the numbers above, it would be on average a list of 72 user IDs per user (some users have very large lists, however).
key: skiplist:
Hey,
Now I understand what you need to index.
Roaring Bitmaps seem to be a good fit for your use case, since one user would generally be "seen/not voted" by a small number of other users, and their IDs would be very sparse, as in your example (1, 123123, etc.).
With regular bitmaps you'd need to store a great amount of empty bits representing the IDs not present in each user's set, while with compressed bitmaps you typically need much less memory.
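A back-of-envelope comparison makes the difference concrete. Assume a user is skipped by just two other users with the IDs from the example above (the ~2 bytes/member figure reflects Roaring's sorted 16-bit array containers for sparse chunks, ignoring container overhead):

```python
# Two sparse member IDs, as in the example above.
ids = [1, 123123]

# Plain bitmap: must allocate bits up to the largest ID present.
plain_bytes = (max(ids) + 1 + 7) // 8
print(plain_bytes)    # 15391 bytes (~15 KB) for just two members

# Compressed: sparse Roaring chunks are sorted 16-bit arrays,
# i.e. roughly 2 bytes per member.
sparse_bytes = 2 * len(ids)
print(sparse_bytes)   # 4 bytes
```

Almost all of the plain bitmap is zero bits spent representing absent IDs.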
If you try this module please give me your feedback. I'd like to know more about the usability of the API, missing commands, and perhaps performance benchmarks.
Hi, sorry for the late reply. I tried bulk loading some data into it using redis-roaring and observed some issues:
- used_memory_human seems to be incorrect, but there is probably already an open issue for this
- Restarting redis after a dump fails with: The RDB file contains module data for the module ' ' that is not terminated by the proper module value EOF marker
The file I want to load is +/- 239 GB and contains +/- 200M unique list keys, each with a list of users.
I stopped the loading when +/- 260K list keys were loaded in memory, and redis was using +/- 4GB at that point. If I extrapolate this to the whole dataset, it would probably require 200000000 / 260000 => ~770 x 4GB => ~3TB to load everything, so I suspect that there must be a memory leak somewhere.
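The extrapolation arithmetic checks out; as a quick sanity check:

```python
total_keys = 200_000_000   # keys in the full dataset
loaded_keys = 260_000      # keys loaded when loading was stopped
loaded_gb = 4              # memory in use at that point

factor = total_keys / loaded_keys        # ~769x more keys to go
projected_tb = factor * loaded_gb / 1024 # projected total memory
print(round(factor), round(projected_tb, 1))  # 769 3.0
```

That's ~16 KB per key at this rate, far above what a compressed bitmap of ~72 sparse members should cost.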
I tried the dump command and wanted to restart redis afterwards to see what the memory usage would be then, but that failed. So if that issue gets fixed, I can retest and see what memory usage I get.
# Memory
used_memory:23925760
used_memory_human:22.82M
used_memory_rss:4057612288
used_memory_rss_human:3.78G
used_memory_peak:27584048
used_memory_peak_human:26.31M
used_memory_peak_perc:86.74%
used_memory_overhead:15497798
used_memory_startup:765664
used_memory_dataset:8427962
used_memory_dataset_perc:36.39%
total_system_memory:50640334848
total_system_memory_human:47.16G
...
db0:keys=262205,expires=0,avg_ttl=0
total used free shared buffers cached
Mem: 48294 48060 233 1 136 14336
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 522 41.3 8.0 4009436 3962512 pts/3 Sl+ 08:17 1:02 redis-server *:6379
root@elastic30:~# sudo pmap -x 522
522: redis-server *:6379
Address Kbytes RSS Dirty Mode Mapping
0000000000400000 1280 720 0 r-x-- redis-server
000000000073f000 4 4 4 r---- redis-server
0000000000740000 20 20 20 rw--- redis-server
0000000000745000 96 96 96 rw--- [ anon ]
000000000159b000 3930744 3929848 3929848 rw--- [ anon ]
00007fd621c00000 34816 24536 24536 rw--- [ anon ]
00007fd623fc7000 200 80 0 r-x-- libredis-roaring.so
00007fd623ff9000 2048 0 0 ----- libredis-roaring.so
00007fd6241f9000 4 4 4 r---- libredis-roaring.so
00007fd6241fa000 12 8 8 rw--- libredis-roaring.so
00007fd6241fd000 4 0 0 ----- [ anon ]
00007fd6241fe000 8192 8 8 rw--- [ anon ]
00007fd6249fe000 4 0 0 ----- [ anon ]
00007fd6249ff000 8192 8 8 rw--- [ anon ]
00007fd6251ff000 4 0 0 ----- [ anon ]
00007fd625200000 12288 6144 6144 rw--- [ anon ]
00007fd625ebe000 1784 632 0 r-x-- libc-2.19.so
00007fd62607c000 2048 0 0 ----- libc-2.19.so
00007fd62627c000 16 16 16 r---- libc-2.19.so
00007fd626280000 8 8 8 rw--- libc-2.19.so
00007fd626282000 20 16 16 rw--- [ anon ]
00007fd626287000 100 76 0 r-x-- libpthread-2.19.so
00007fd6262a0000 2044 0 0 ----- libpthread-2.19.so
00007fd62649f000 4 4 4 r---- libpthread-2.19.so
00007fd6264a0000 4 4 4 rw--- libpthread-2.19.so
00007fd6264a1000 16 4 4 rw--- [ anon ]
00007fd6264a5000 12 8 0 r-x-- libdl-2.19.so
00007fd6264a8000 2044 0 0 ----- libdl-2.19.so
00007fd6266a7000 4 4 4 r---- libdl-2.19.so
00007fd6266a8000 4 4 4 rw--- libdl-2.19.so
00007fd6266a9000 1044 68 0 r-x-- libm-2.19.so
00007fd6267ae000 2044 0 0 ----- libm-2.19.so
00007fd6269ad000 4 4 4 r---- libm-2.19.so
00007fd6269ae000 4 4 4 rw--- libm-2.19.so
00007fd6269af000 140 124 0 r-x-- ld-2.19.so
00007fd626bc5000 16 16 16 rw--- [ anon ]
00007fd626bce000 12 12 12 rw--- [ anon ]
00007fd626bd1000 4 4 4 r---- ld-2.19.so
00007fd626bd2000 4 4 4 rw--- ld-2.19.so
00007fd626bd3000 4 4 4 rw--- [ anon ]
00007ffcc8df9000 132 20 20 rw--- [ stack ]
00007ffcc8f32000 8 4 0 r-x-- [ anon ]
ffffffffff600000 4 0 0 r-x-- [ anon ]
---------------- ------- ------- -------
total kB 4009436 3962516 3960804
After stopping redis:
root@elastic30:~/redis-4.0.1# free -m
total used free shared buffers cached
Mem: 48294 44076 4218 1 125 14363
-/+ buffers/cache: 29586 18707
Swap: 1999 144 1855
save
2791:M 29 Aug 08:26:30.981 # The RDB file contains module data for the module ' ' that is not terminated by the proper module value EOF marker
Ah, by the way, the size of the dump.rdb is:
620M -rw-r--r-- 1 root root 620M Aug 29 08:25 dump.rdb
Hey @jrots thank you for getting back!
Indeed, the used_memory_human statistic will be wrong due to known issue #2.
I didn't have time to implement the fix, but I will work on it soon.
Regarding the two other problems you noticed, I'll take a look at them.
The current unit and integration tests don't report any memory leaks, so I believe there must be something wrong with my implementation of the module in a path the tests don't cover. Even if CRoaring's memory usage is not taken into account by Redis, valgrind correctly identifies its potential leaks, from what I tested.
There are no integration tests for the SAVE command, thank you for noticing. I guess it should be easy to find out what's wrong.
I'll keep you posted.
Hey @jrots, can you please try loading those keys again?
I have just updated the master branch, correcting the SAVE command, together with a memory leak fix in the RDB load function (sorry, this scenario wasn't being tested before -- I took the opportunity to add AOF tests as well).
I am still looking at the used_memory_human issue, but I think it will take more time.
Closing this issue due to inactivity. Also, the bug with the SAVE command has been fixed.
Nice module! I might look into using it for a project that involves a lot of "skiplists" for a lot of users, where lookup/retrieval of these lists needs to be as fast as possible. With 250M users that have skipped 18B IDs in total, storing a Roaring list as a key/value pair for each of these users would require +/- 60GB in total (if my initial tests with some test data were right). But taking https://engineering.instagram.com/storing-hundreds-of-millions-of-simple-key-value-pairs-in-redis-1091ae80f74c into account, it would make sense to spread these 250M users into buckets and apply some HSET compression on top, to push that 60GB even lower 😇. It could be that compression on 10K Roaring lists is not that beneficial, but nonetheless it might be worth something to explore?
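A quick back-of-envelope check of these numbers, assuming the 60GB estimate from the initial tests is accurate:

```python
users = 250_000_000         # total users
total_ids = 18_000_000_000  # total skipped IDs across all users
est_gb = 60                 # estimated total memory from initial tests

avg_ids_per_user = total_ids / users         # 72.0 IDs per user
bytes_per_id = est_gb * 1024**3 / total_ids  # ~3.6 bytes per stored ID
print(avg_ids_per_user, round(bytes_per_id, 1))
```

At ~3.6 bytes per ID, most of the cost is likely per-key overhead rather than the bitmap payload itself, which is exactly what a bucketing scheme would attack.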