DiceDB / dice

A drop-in replacement of Redis with SQL-based realtime reactivity.
https://dicedb-docs.netlify.app/

Complexity with GC, mutex & storage engine layout #43

Open rohanverma94 opened 1 year ago

rohanverma94 commented 1 year ago

Is your feature request related to a problem? Please describe.
Atomics and mutexes are common solutions for locking. Behind the scenes, they are tuned for exclusive access, not for quick turnaround.

A common remedy to this problem is lock-free programming, but not everything can be turned into a lock-free construct. Turnaround time should be minimal if the primary use case is a key-value store; moreover, if more data types and eviction policies are introduced in the future, this turnaround time will become significant with the current storage engine based on map[string]*object.

For instance, with LFU, while inserting a key-value pair I have to keep track of the access frequency, and if the store is found to be near its capacity, it should evict the least frequently used value. This is a classic example of a chain where every single operation is blocked by a mutex, and the release of those mutexes itself depends on the response time of the storage layout.

Describe the solution you'd like
Not just in Golang but in C++ too, we have been using constructs like singleton, monostate, etc., which are designed around exclusivity. To address those problems we use "signals and slots".

So the solution I propose is to use buffered channels with a slot size of 1. The LFU problem above can easily be modeled using slots: a buffered channel essentially operates on a slot being filled or empty, and has the potential for a quick turnaround.
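The "slot" idea can be sketched as a buffered channel of capacity 1 acting as a binary semaphore: sending fills the slot, receiving empties it. This is an illustrative sketch only; Store, Put, and the frequency map are hypothetical stand-ins, not DiceDB code.

```go
package main

import "fmt"

// Store guards its maps with a single-slot buffered channel instead of a mutex:
// the channel holds one token; an operation takes the token to enter and
// puts it back to leave.
type Store struct {
	slot chan struct{}
	data map[string]string
	freq map[string]int // LFU access counts, maintained under the same slot
}

func NewStore() *Store {
	s := &Store{
		slot: make(chan struct{}, 1),
		data: make(map[string]string),
		freq: make(map[string]int),
	}
	s.slot <- struct{}{} // slot starts full: store is available
	return s
}

func (s *Store) Put(k, v string) {
	<-s.slot // take the slot (blocks if another op holds it)
	s.data[k] = v
	s.freq[k]++ // LFU bookkeeping happens inside the same critical section
	s.slot <- struct{}{} // release the slot
}

func (s *Store) Get(k string) (string, bool) {
	<-s.slot
	v, ok := s.data[k]
	if ok {
		s.freq[k]++
	}
	s.slot <- struct{}{}
	return v, ok
}

func main() {
	s := NewStore()
	s.Put("k1", "v1")
	s.Put("k1", "v1")
	v, ok := s.Get("k1")
	fmt.Println(v, ok, s.freq["k1"]) // v1 true 3
}
```

Semantically this is still mutual exclusion, but the channel composes with `select`, timeouts, and the rest of the scheduler in a way a bare `sync.Mutex` does not.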

As discussed earlier in https://github.com/DiceDB/dice/issues/31#issuecomment-1281692558 , a storage engine based on slices is GC-friendly, and we should use maps/hashtables only for the DB index (db 0, db 1, db 2, ...), since on top of the weird GC issues with maps, computing the hash function also causes a significant delay on an "ms" scale.
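A slice-backed layout along those lines could look roughly like this (a sketch under that comment's assumptions; Entry, DB, and Engine are illustrative names, and the linear lookup is for brevity only — a real engine would keep a compact index alongside):

```go
package main

import "fmt"

// Entry holds its key and value by value. Compared with map[string]*Object,
// this removes one pointer indirection per value and the map's bucket
// overhead (the strings still carry a pointer to their bytes).
type Entry struct {
	Key string
	Val string
}

// DB is a flat slice of entries: contiguous memory that the GC walks cheaply.
type DB []Entry

// Engine keeps a map only at the top level: db number -> slice store,
// matching the "maps only for the DB index" idea.
type Engine struct {
	dbs map[int]DB
}

func NewEngine(n int) *Engine {
	e := &Engine{dbs: make(map[int]DB, n)}
	for i := 0; i < n; i++ {
		e.dbs[i] = make(DB, 0, 1024)
	}
	return e
}

func (e *Engine) Set(db int, k, v string) {
	for i := range e.dbs[db] {
		if e.dbs[db][i].Key == k {
			e.dbs[db][i].Val = v // update in place
			return
		}
	}
	e.dbs[db] = append(e.dbs[db], Entry{k, v})
}

func (e *Engine) Get(db int, k string) (string, bool) {
	for _, en := range e.dbs[db] {
		if en.Key == k {
			return en.Val, true
		}
	}
	return "", false
}

func main() {
	e := NewEngine(4)
	e.Set(0, "key1", "val1")
	v, ok := e.Get(0, "key1")
	fmt.Println(v, ok) // val1 true
}
```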

Describe alternatives you've considered
N/A

Additional context
N/A

rohanverma94 commented 1 year ago

@arpitbbhayani If you approve of the proposed changes, I will start preparing design documents for the new data structure.

arpitbbhayani commented 1 year ago

I really think keeping DiceDB single-threaded for the evaluation engine will be a better choice.

LFU can be approximated; we don't need strict correctness. Approximation can be handled easily while staying single-threaded.
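For context, Redis-style approximated LFU works along these lines: an 8-bit logarithmic ("Morris-style") counter per key, incremented probabilistically, plus eviction by sampling a few keys and dropping the one with the lowest counter. A minimal sketch (the function names and the log factor of 10 are illustrative, not DiceDB or Redis code), which is trivially safe inside a single-threaded evaluation loop with no mutex at all:

```go
package main

import (
	"fmt"
	"math/rand"
)

// touch probabilistically increments an 8-bit logarithmic counter: the
// chance of an increment decays as the counter grows, so 8 bits can
// summarise millions of accesses.
func touch(c uint8, logFactor float64) uint8 {
	if c == 255 {
		return c // saturated
	}
	p := 1.0 / (float64(c)*logFactor + 1)
	if rand.Float64() < p {
		c++
	}
	return c
}

// evictCandidate samples up to `sample` keys (Go map iteration order is
// randomized) and returns the one with the lowest counter — approximate,
// but good enough for an eviction decision.
func evictCandidate(freq map[string]uint8, sample int) string {
	best, bestC := "", uint8(255)
	n := 0
	for k, c := range freq {
		if c <= bestC {
			best, bestC = k, c
		}
		n++
		if n == sample {
			break
		}
	}
	return best
}

func main() {
	freq := map[string]uint8{"hot": 0, "cold": 0}
	for i := 0; i < 10000; i++ {
		freq["hot"] = touch(freq["hot"], 10)
	}
	freq["cold"] = touch(freq["cold"], 10)
	fmt.Println(evictCandidate(freq, 5)) // almost surely prints "cold"
}
```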

My idea of Dice DB is to

This way we keep things simple and straightforward for most.

Happy to chat.

rohanverma94 commented 1 year ago

What about the storage engine? At the end of the day, you would still want it to run on a Linux/Windows server, and you had plans with jemalloc; I expect you would start on that once the DB reaches a significant developer audience.

arpitbbhayani commented 1 year ago

Right now I am thinking of keeping it in-memory with periodic backups. Multi-threading will help when we go for disk storage (given the slowness of disk IO). We can decide to make things persistent later.

rohanverma94 commented 1 year ago

Okay, so I'll hold my design for now. By the way, a lot of changes are coming in the Windows version. I just implemented an entire POSIX system call for Windows and it was a headache. An fsync implementation for Windows would be a lot harder; I can only hope it isn't going to be a huge tech debt for the Windows platform in the future. On a POSIX platform, life is much simpler, I must say!

image

arpitbbhayani commented 1 year ago

@rohanverma94 thanks a ton :) looking forward to it!

jain-yakshit-1 commented 1 year ago

> Okay, so I'll hold my design for now. By the way, a lot of changes are coming in the Windows version. I just implemented an entire POSIX system call for Windows and it was a headache. An fsync implementation for Windows would be a lot harder; I can only hope there won't be tech debt for the Windows platform in the future. On a POSIX platform, life is much simpler, I must say!
>
> image

Wow! So much to learn. Looking forward.

rohanverma94 commented 1 year ago

@arpitbbhayani In spite of the automatic GC here:

GODEBUG=gctrace=1 ./dice
2022/10/31 10:15:43 rolling the dice ⚄︎
2022/10/31 10:15:43 starting an asynchronous TCP server on 0.0.0.0 7379
gc 1 @126.086s 0%: 0.044+0.33+0.026 ms clock, 0.35+0.13/0.27/0+0.21 ms cpu, 3->3->0 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 8 P
GC forced
gc 2 @247.568s 0%: 0.072+0.25+0.004 ms clock, 0.58+0/0.42/0+0.036 ms cpu, 2->2->0 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 8 P
GC forced
gc 3 @375.271s 0%: 0.15+0.73+0.023 ms clock, 1.2+0/1.1/0+0.19 ms cpu, 0->0->0 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 8 P

my INFO command shows

127.0.0.1:7379> info
# Keyspace
db0:keys=680,expires=0,avg_ttl=0
db1:keys=0,expires=0,avg_ttl=0
db2:keys=0,expires=0,avg_ttl=0
db3:keys=0,expires=0,avg_ttl=0

Because of

KeyspaceStat[0]["keys"]++

This is wrong because I never inserted 680 keys in the first place. After eviction was initiated on crossing the KeysLimit threshold, some weird GC issues come into play. The code for this is here: ioredis testing

This gets more broken when I repeat the test a second time, given that on the first run the keys are already there. Because my store is basically a map[string]*Object with the weird GC behavior described in https://github.com/DiceDB/dice/issues/31#issuecomment-1281692558 , the page_size (the size of the store) fluctuates and drops beneath KeysLimit; however, page_size then climbs back to KeysLimit and initiates GC again.

If I keep KeysLimit well above my workload size, all the issues disappear, as it never triggers GC.

image

On the ioredis client, running the test a second time in succession led to a broken state in the DB, i.e., keys present earlier no longer exist; this was not broken during the first run:

Assertion failed: key25
Assertion failed: key156
Assertion failed: key195
Assertion failed: key269

In other words, these keys have become fodder for the GC. Once the GC eats them, it leaves useless voids in the map, and my entire logic around page_size >= config.KeysLimit becomes a liability.

with the following GC behavior:

2022/10/31 11:35:26 starting an asynchronous TCP server on 0.0.0.0 7379
gc 8 @18.489s 0%: 0.033+1.9+0.003 ms clock, 0.26+0.16/1.7/2.0+0.031 ms cpu, 8->8->5 MB, 8 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 9 @18.502s 0%: 0.031+1.8+0.028 ms clock, 0.25+0.18/3.3/2.5+0.22 ms cpu, 10->11->10 MB, 12 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 10 @18.520s 0%: 0.069+1.9+0.032 ms clock, 0.55+0.81/3.4/3.3+0.26 ms cpu, 20->21->15 MB, 20 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 11 @18.574s 0%: 0.041+2.2+0.022 ms clock, 0.32+0.14/3.2/8.2+0.18 ms cpu, 30->30->16 MB, 32 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 12 @18.636s 0%: 0.040+2.6+0.021 ms clock, 0.32+0.19/4.6/9.1+0.17 ms cpu, 33->33->17 MB, 34 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 13 @18.696s 0%: 0.043+3.1+0.019 ms clock, 0.34+0.13/5.3/11+0.15 ms cpu, 34->35->19 MB, 36 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 14 @18.766s 0%: 0.037+3.2+0.020 ms clock, 0.30+0.15/6.0/13+0.16 ms cpu, 37->38->20 MB, 39 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 15 @18.839s 0%: 0.040+3.6+0.022 ms clock, 0.32+0.12/6.9/16+0.17 ms cpu, 40->40->22 MB, 42 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 16 @18.924s 0%: 0.038+5.0+0.021 ms clock, 0.30+0.13/7.1/17+0.17 ms cpu, 44->44->24 MB, 45 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 17 @19.008s 0%: 0.056+6.5+0.021 ms clock, 0.45+0.13/11/18+0.17 ms cpu, 46->47->28 MB, 49 MB goal, 0 MB stacks, 0 MB globals, 8 P
GC forced
gc 18 @139.095s 0%: 0.21+13+0.008 ms clock, 1.7+0/24/54+0.065 ms cpu, 33->33->27 MB, 57 MB goal, 0 MB stacks, 0 MB globals, 8 P
GC forced
gc 19 @261.157s 0%: 0.13+12+0.007 ms clock, 1.0+0/21/48+0.058 ms cpu, 27->27->27 MB, 55 MB goal, 0 MB stacks, 0 MB globals, 8 P
GC forced
gc 20 @381.188s 0%: 0.038+5.2+0.003 ms clock, 0.30+0/9.0/21+0.028 ms cpu, 27->27->27 MB, 55 MB goal, 0 MB stacks, 0 MB globals, 8 P
GC forced
gc 21 @501.208s 0%: 0.074+12+0.010 ms clock, 0.59+0/22/51+0.083 ms cpu, 27->27->27 MB, 55 MB goal, 0 MB stacks, 0 MB globals, 8 P
GC forced
gc 22 @621.240s 0%: 0.085+43+0.005 ms clock, 0.68+0/44/2.9+0.042 ms cpu, 27->27->27 MB, 55 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 23 @741.301s 0%: 0.057+17+0.004 ms clock, 0.46+0/0.79/17+0.035 ms cpu, 27->27->27 MB, 55 MB goal, 0 MB stacks, 0 MB globals, 8 P