etaty / rediscala

Non-blocking, Reactive Redis driver for Scala (with Sentinel support)
Apache License 2.0

More proper way to use Redis pipeline #142

Closed wb14123 closed 8 years ago

wb14123 commented 8 years ago

Hi,

I've read the code and found that bufferWrite in RedisWorkerIO is only used to buffer writes between the moment a write is sent and the moment its ack arrives. Since the ack should come back very quickly (compared to receiving the response data), the buffer stays very small. So I'm wondering: what about adding a config option to set a minimum buffer size? When there are many writes, it should reduce the number of network packets and thus improve performance.

If you think this is a good idea, I can implement it and test whether it improves performance.
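To make the idea concrete, here is a minimal sketch of what such a write path could look like. This is not rediscala's actual code; BufferedWriter, minBufferSize and sendNow are made-up names for the example.

    import akka.util.{ByteString, ByteStringBuilder}

    // Hypothetical sketch (not rediscala code): flush to the socket only when no
    // write is in flight AND the buffer has reached a configured minimum size,
    // so that many small commands share a single TCP packet.
    class BufferedWriter(minBufferSize: Int, sendNow: ByteString => Unit) {
      private val buffer = new ByteStringBuilder
      private var waitingForAck = false

      def write(command: ByteString): Unit = {
        buffer.append(command)
        if (!waitingForAck && buffer.length >= minBufferSize) flush()
      }

      // Ack for the in-flight write arrived: flush whatever accumulated meanwhile
      // (this part matches the existing behaviour).
      def onAck(): Unit = {
        waitingForAck = false
        if (buffer.length > 0) flush()
      }

      private def flush(): Unit = {
        sendNow(buffer.result())
        buffer.clear()
        waitingForAck = true
      }
    }

A real implementation would also need a timer-based flush, otherwise a trickle of commands below the threshold could be delayed indefinitely.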

etaty commented 8 years ago

What happens is:

new commands -> buffer in bufferWrite (state is waiting)
connected (state is writing)
send data10s (state is waiting)
buffer more commands here
buffer more commands here
buffer more commands here
...
ack 10s later (state is writing)
send data100s (state is waiting)
buffer even more commands here
...
the buffer is huge now
ack 100s later (state is writing)
send data1000s

That's a problem when you have lots of commands to send (you need back-pressure, ...) but it's very efficient with small bursts.

etaty commented 8 years ago

Rediscala currently uses 2 actors (the writer + the decoder); the writer is the main actor, passing the read TCP packets to the decoder actor. I think it could be interesting to split the writer in 2: one actor buffering the commands (encoding), and one actor managing the Akka IO actor (TCP write / read).

I hope we would get more perf by avoiding the messages from the Akka IO actor being mixed in the mailbox with all the Redis commands.
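A rough sketch of what that split could look like (illustrative only, not an actual rediscala design; the actor classes and messages below are invented for the example):

    import akka.actor.{Actor, ActorRef}
    import akka.io.Tcp
    import akka.util.{ByteString, ByteStringBuilder}

    // Invented messages for the sketch.
    case class Command(encoded: ByteString)
    case object ReadyForNextBatch

    // Actor 1: buffers the encoded commands; never touches the socket itself.
    class CommandBuffer(io: ActorRef) extends Actor {
      private val buffer = new ByteStringBuilder
      def receive = {
        case Command(bytes) => buffer.append(bytes)
        case ReadyForNextBatch =>
          if (buffer.length > 0) {
            io ! buffer.result()
            buffer.clear()
          }
      }
    }

    // Actor 2: owns the Akka IO TCP connection. Its mailbox only ever contains
    // ByteString batches and TCP events, never individual Redis commands.
    // (Connection setup and bootstrapping the first write are omitted.)
    class TcpWriter(connection: ActorRef, commandBuffer: ActorRef) extends Actor {
      case object Ack extends Tcp.Event
      def receive = {
        case batch: ByteString => connection ! Tcp.Write(batch, Ack)
        case Ack               => commandBuffer ! ReadyForNextBatch
        case _: Tcp.Received   => // would be forwarded to the decoder actor
      }
    }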

wb14123 commented 8 years ago

Sorry, but I don't get why it takes so long (10s or 100s) to receive the ack? I thought it would be fast.

etaty commented 8 years ago

It's the Akka IO ACK (not the TCP ACK): http://doc.akka.io/docs/akka/current/scala/io-tcp.html#Throttling_Reads_and_Writes (it's the time the OS takes to copy the data into the kernel's network buffer).
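For reference, the write ACK from that Akka page is a user-defined event that Akka IO echoes back once the OS has accepted the bytes; a minimal illustration (not rediscala code):

    import akka.actor.{Actor, ActorRef}
    import akka.io.Tcp
    import akka.util.ByteString

    // Minimal illustration of the Akka IO write ACK (not rediscala code).
    class SingleWrite(connection: ActorRef, payload: ByteString) extends Actor {
      case object WriteDone extends Tcp.Event

      connection ! Tcp.Write(payload, WriteDone) // ask Akka IO to echo WriteDone back

      def receive = {
        case WriteDone =>
          // The OS has copied the bytes into its buffers; the next batch could go out now.
          context.stop(self)
      }
    }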

wb14123 commented 8 years ago

I tried adding some debug messages and found that this is not the case: the actor receives the ack so fast that the buffer is barely used.

You can see the added debug messages in this commit. Then I wrote a simple program to send commands in a loop: 10k commands per iteration, wait for them to complete, then start the next iteration. While running the program, almost all I got was "sent without buffer". And the perf I got is similar to what I get from redis-benchmark with -P 1.
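The test loop described here is roughly the following (a sketch of the described approach, not the exact code from the linked repo; redisClient stands for a connected rediscala client created elsewhere):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration._
    import scala.concurrent.ExecutionContext.Implicits.global

    // One iteration: fire 10k GETs, then block until all of them have completed.
    def runIteration(): Unit = {
      val futures = (1 to 10000).map(_ => redisClient.get("some_key"))
      Await.result(Future.sequence(futures), 30.seconds)
    }

    // Run the iterations back to back.
    (1 to 100).foreach(_ => runIteration())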

etaty commented 8 years ago

Try with 100k and 250k, you will see the buffering.

etaty commented 8 years ago

If you try to send 1000k, it should explode because the buffering will eat all the memory / garbage collection, as your JVM can create messages faster than you can send them through the localhost network.

wb14123 commented 8 years ago

Tried 1000k, still most of them are sent without buffering.

wb14123 commented 8 years ago

You can see the test program here: https://github.com/wb14123/redis-benchmark

etaty commented 8 years ago

Ah, but you use a redis pool (of 100 clients). Can you try with just 1 redis client?

etaty commented 8 years ago

1 redis client, 1,000,000 commands (in 1 batch).

wb14123 commented 8 years ago

Still without buffering, and it is much slower since processing the futures takes more time.

etaty commented 8 years ago

Hmm, weird. Are you sure the if in your buffer debug code doesn't hide your debug message?

wb14123 commented 8 years ago

Yes, with 100k I've seen the buffered debug messages, though very few.

etaty commented 8 years ago

println the size of the buffer, and you will see it's growing.

wb14123 commented 8 years ago

Oh, I've put the print statement in write in the wrong place. I'll fix it.

wb14123 commented 8 years ago

I've fixed it and it shows that most of the writes are buffered. I also printed the average buffer length: with 200k it's around 1200, and with 1000k it's around 3800. Do you think this buffer size is enough?

wb14123 commented 8 years ago

Since each command is very small in my case, I think that's about 1000 commands per batch, which is kind of enough, so I can close this issue. But I still don't know why I can't get the same perf as redis-benchmark. Maybe because of the time spent waiting for and processing the futures?

etaty commented 8 years ago

redis-benchmark is a bit different (look at its options to make it more similar), and there is no future management (Future.sequence with 1 million futures is a killer).

wb14123 commented 8 years ago

I've changed my test code a little bit so that it doesn't need to manage so many futures while still generating enough load (you can see the updated code in the repo I just mentioned). The code is like this:


  // Assumes a connected rediscala client and an implicit ExecutionContext in scope.
  // Each completed GET immediately fires the next one, so a fixed number of
  // requests is kept in flight without accumulating millions of futures.
  def get(): Unit = {
    val key = "some_key"
    val result = redisClient.get(key)
    result onSuccess { case _ => get() }
  }

  // Kick off ~20k independent request chains.
  def benchmark(): Unit = {
    (0 to 20000) foreach { _ => get() }
  }

With this code I can get about 400k qps on my MacBook Pro (4 cores). But it doesn't scale well: on a c4.8xlarge AWS instance, which has 36 cores, it still gets about 400k qps. And on the c4.8xlarge, the buffer size is always 243. I may modify the code to set a minimum buffer size and test it again.

So what is the difference between redis-benchmark and this code? I know about -t, -P and -c. I set -t to get and -c to 1, which matches my test code; however, I cannot control -P with my test code.

wb14123 commented 8 years ago

It's very strange that it just cannot scale. I can set parallelism-factor = 0.3 and get the same performance as with parallelism-factor = 1. If I set parallelism-factor = 0.3 and run two instances of the program, I can get about 1100k qps, which is almost the full capacity of Redis.
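For context, parallelism-factor and parallelism-max are the standard Akka fork-join-executor settings. A sketch of applying them programmatically (the dispatcher that rediscala's actors actually run on may be configured under a different key, so treat the path below as an example only):

    import akka.actor.ActorSystem
    import com.typesafe.config.ConfigFactory

    // Example: tune the default dispatcher's thread pool for the benchmark.
    val tuning = ConfigFactory.parseString(
      """
        |akka.actor.default-dispatcher.fork-join-executor {
        |  parallelism-factor = 0.3
        |  parallelism-min = 1
        |  parallelism-max = 4
        |}
      """.stripMargin)

    implicit val system: ActorSystem =
      ActorSystem("bench", tuning.withFallback(ConfigFactory.load()))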

etaty commented 8 years ago

Well, scaling up is limited by a single CPU core. But you can scale horizontally onto multiple cores with the redis pool.
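A sketch of that horizontal-scaling approach with a pool of clients (the constructor shapes below are from memory of the rediscala API, so double-check them against the README):

    import akka.actor.ActorSystem
    import redis.{RedisClientPool, RedisServer}

    // Several connections to the same Redis server behind one pool; commands are
    // spread over the underlying clients, and therefore over more actors/threads.
    implicit val system: ActorSystem = ActorSystem("bench")

    val pool = RedisClientPool(Seq(
      RedisServer("localhost", 6379),
      RedisServer("localhost", 6379),
      RedisServer("localhost", 6379),
      RedisServer("localhost", 6379)
    ))

    // Used like a single client; each request goes to one of the pooled clients.
    val reply = pool.get("some_key")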

etaty commented 8 years ago

A RedisClient uses 2 actors, so in a benchmark it's equivalent to 2 threads (+1 thread for the Akka IO actor).

wb14123 commented 8 years ago

I've used the redis pool to test, but it doesn't help a lot across multiple cores. I've tried limiting parallelism-max = 4 on a 36-core machine, and it performs a lot better than without this limit, even though it just uses 4 threads (without the limit I get 400k qps, with the limit I get 700k qps). If I start a second instance of the program with this config on the same machine, I can get more than 1300k qps, which is already the full capacity of Redis.

wb14123 commented 8 years ago

What I said above is with a minimum buffer size of 10000. If I don't modify it, I get better performance with one instance of the program with parallelism-max = 4 (about 800k), but only 1200k qps with two instances. And the CPU shows more system CPU usage on the Redis core compared to limiting the buffer size.