Load balancing between redis slaves

ondrejbartas commented 11 years ago

Hello,

I am working on a redis backup and failover setup where I have 3 frontend servers; every server has its own redis + frontend application.

1 redis server will be started as master, the others as slaves.

Redis Sentinel knows all servers in the cluster, and it would be very nice to send read commands to the nearest (fastest ping) redis server (every server would read from its own redis server and write only to the one master).

When one of the slaves goes down, the application using that server will switch to another redis server (it doesn't matter whether it is a slave or the master).

When the master goes down, all read commands will proceed without failure (slaves will be used); write commands will fail until Redis Sentinel promotes one of the slaves to master. Then all write commands will be switched to this new master.

Currently all requests go just to the master (if I read the code of redis-sentinel right) and the slaves are not used at all :(

What do you think about this approach?

If you like it, I will try to extend redis-sentinel to use this scenario.

Ondrej Bartas

flyerhzm commented 11 years ago

Sounds good, but I have some concerns about writing to the master and reading from a slave. Although redis is fast, there is still some delay between master and slave. e.g.

if someone uses redis as a global lock, separating read and write operations will break it.
it's also not good when redis is used as a message queue.
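
To make the lock concern concrete, here is a minimal sketch in plain Ruby (no redis involved). `ToyReplicatedStore` and its methods are hypothetical names made up for illustration; the class only models the fact that a replica sees a write some time after the master does.

```ruby
# Toy model (hypothetical, not part of redis-sentinel): the replica only
# sees writes after replicate! is called.
class ToyReplicatedStore
  def initialize
    @master = {}
    @replica = {}
  end

  def write(key, value)
    @master[key] = value
  end

  def read_from_master(key)
    @master[key]
  end

  def read_from_replica(key)
    @replica[key]
  end

  # Stands in for asynchronous replication catching up.
  def replicate!
    @replica = @master.dup
  end
end

store = ToyReplicatedStore.new
store.write("lock:job", "client-a") # client A takes the lock on the master

# Client B checks the lock on a replica before replication has caught up.
puts store.read_from_replica("lock:job").inspect # lock looks free here
puts store.read_from_master("lock:job").inspect  # but it is held on the master
```

Until `replicate!` runs, a client reading from the replica would wrongly conclude that the lock is free, which is the situation described above.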

What do you think?

ondrejbartas commented 11 years ago

For me it should be an option: use only the master, or use the nearest slave.

When you store, for example, user configs (sessions, emails etc.) in redis, you need redundancy and as fast a connection as possible, and you write to the master only a few times, whereas reads happen on every request to the application server.
The other option is to use only the master and switch the connection to a slave only while the master is not present; when a master is ready again (it doesn't matter if it is the old master or a slave that was promoted to master), all machines switch back to that new/old master for all requests.

BTW, there is a slight problem with sentinels. I tried starting two redis servers (master + 1 slave), then starting redis sentinels (2, one for every server), and then testing it like this: kill slave, start slave, kill master, wait for the slave to become master, start old master, kill new master. Up to that point redis sentinel worked well, but when you kill one sentinel while a slave is being switched to master, everything goes down. It is because redis sentinel doesn't verify its connection to the sentinel and doesn't reconnect :( This should be fixed first.

What do you think? :D

royaltm commented 11 years ago

From the API point of view it could become a :read_preference option settable on redis client instance. When the list of available redis nodes with latency information is obtained from sentinel it would be rather trivial to implement the following read preferences:

:master_only - read only from master
:slave_only - read only from nearest slave
:slave_first - read from nearest slave if available and from master if none of the slaves are online
:nearest - read from nearest available node either from master or slave
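
A rough sketch of how these four preferences could be resolved, assuming the node list with latency has already been obtained. The `Node` struct, its fields, and `pick_read_node` are hypothetical names used for illustration only; they are not part of redis-rb or redis-sentinel.

```ruby
# Hypothetical node record; in a real client this data would come from sentinel.
Node = Struct.new(:host, :port, :role, :latency_ms, :online)

# Returns the node to read from for the given preference,
# or nil if no online node qualifies.
def pick_read_node(nodes, preference)
  online  = nodes.select(&:online)
  master  = online.find { |n| n.role == :master }
  slaves  = online.select { |n| n.role == :slave }
  nearest = ->(list) { list.min_by(&:latency_ms) } # nil for an empty list

  case preference
  when :master_only then master
  when :slave_only  then nearest.call(slaves)
  when :slave_first then nearest.call(slaves) || master
  when :nearest     then nearest.call(online)
  else raise ArgumentError, "unknown read preference: #{preference.inspect}"
  end
end

# Made-up example data.
nodes = [
  Node.new("10.0.0.1", 6379, :master, 1.2, true),
  Node.new("127.0.0.1", 6379, :slave, 0.1, true),
  Node.new("10.0.0.3", 6379, :slave, 0.9, false)
]

puts pick_read_node(nodes, :slave_first).host
```

Writes would still always go to the master; only the read path consults the preference.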

As for the bug @ondrejbartas has encountered: could you create some scripts or an example scenario to test what exactly is wrong? Some time ago I tested many different fail-over scenarios and, except for the already fixed "too early master promotion" bug, there was no problem with redis_sentinel.

ondrejbartas commented 11 years ago

I will try to reproduce this error and write some script for it.

@royaltm your options (master_only, slave_only, slave_first, nearest) are good and they are covering all possible cases.

I will try to find out how to determine which node is nearest or usable. For now I am thinking about these strategies:

specify by an option in Redis.new() which server to use as default; when it is down use a random one, and when it is back use it again
by ping time (to server IP or connect to Redis server and test PING command) - measure needed time for ping
by ip address (sort it and use first one) - mostly localhost or 127.0.0.1 will be first one
round robin - cycle all slaves
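
A minimal sketch of the ping-time and round-robin strategies, assuming the probe itself is passed in as a block (in a real client it might be a `PING` sent over a redis connection, which is not shown here). `probe_latency_ms`, `sort_by_latency` and `round_robin` are hypothetical helper names.

```ruby
# Times one probe call in milliseconds using a monotonic clock.
# The block is expected to raise if the node is unreachable.
def probe_latency_ms
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000.0
rescue StandardError
  nil # unreachable nodes get no latency and are skipped below
end

# Returns the reachable nodes ordered from fastest to slowest probe.
def sort_by_latency(nodes, &probe)
  timed = nodes.map { |node| [node, probe_latency_ms { probe.call(node) }] }
  timed.reject { |_, ms| ms.nil? }.sort_by { |_, ms| ms }.map(&:first)
end

# Round robin: an endless enumerator over the slave list.
def round_robin(nodes)
  nodes.cycle
end

# Fake probe for demonstration: sleeps instead of talking to a server.
fake_delays = { "a:6379" => 0.02, "b:6379" => 0.001 }
ordered = sort_by_latency(fake_delays.keys + ["down:6379"]) do |node|
  raise IOError, "unreachable" unless fake_delays.key?(node)
  sleep fake_delays[node]
end
puts ordered.inspect

slaves = round_robin(%w[s1:6379 s2:6379])
3.times { puts slaves.next }
```

A single probe is noisy, so a real implementation would probably average a few measurements and repeat them from time to time.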

Maybe I am going the wrong way. I am not sure about this... When I thought about it a little bit more, it may be overkill for most use cases.

ondrejbartas commented 11 years ago

@royaltm about the bug: I found a problem with my config. I set:

sentinel monitor first_server 127.0.0.1 6379 2

and used only two redis servers & sentinels. According to the documentation, the last argument is the level of agreement needed to detect this master as failing, here 2 sentinels (if the agreement is not reached, the automatic failover does not start). Since my test kills one of the two sentinels together with the master, only one sentinel is left and an agreement of 2 can never be reached.

When I changed it to: sentinel monitor first_server 127.0.0.1 6379 1

it started to work, and the switch from slave to master went without problems.
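
The arithmetic behind this, as a tiny sketch. It only models the agreement count described in the documentation quote above and ignores every other condition a real sentinel checks; `agreement_reachable?` is a hypothetical helper name.

```ruby
# True when enough sentinels are still alive to reach the configured agreement.
# Simplified: a real sentinel applies more conditions than this count.
def agreement_reachable?(alive_sentinels, quorum)
  alive_sentinels >= quorum
end

# Two sentinels in total; the test kills one together with the master.
alive_after_kill = 2 - 1

puts agreement_reachable?(alive_after_kill, 2) # last argument 2 -> false
puts agreement_reachable?(alive_after_kill, 1) # last argument 1 -> true
```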

This is my test file

require 'fileutils'
require 'redis'
require 'redis-sentinel'

def pid_file port
  # make sure the tmp dir exists
  Dir.mkdir("tmp") unless File.exist?("tmp")

  "tmp/#{port}.pid"
end

# truthy when the pid file exists and its process is alive
def redis_running port
  File.exist?(pid_file(port)) && Process.kill(0, File.read(pid_file(port)).to_i)

rescue Errno::ESRCH
  # stale pid file: the process is gone
  FileUtils.rm pid_file(port)
  false
end

def start_redis port, slave_of = nil
  unless redis_running(port)
    command = "redis-server redis-config.conf --port #{port} --pidfile #{pid_file(port)} --dbfilename tmp/#{port}.rdb"
    command += " --slaveof #{slave_of}" if slave_of
    system command
    sleep(5) # wait for redis to start and write its pid file
    puts "redis started on port: #{port} with PID: #{File.read(pid_file(port)).to_i}"
  else
    puts "redis already running on port: #{port} and with pid: #{File.read(pid_file(port)).to_i}"
  end
end

def start_sentinel port
  sentinel_port = 10000+port

  unless redis_running(sentinel_port)
    #need to create config for sentinel (I couldn't find way to start sentinel with config from command line :( )
    sentinel_conf_file = "tmp/sentinel_#{sentinel_port}.conf"
    fw = File.open(sentinel_conf_file, "w:UTF-8")
    fw.puts "pidfile #{pid_file(sentinel_port)}
    daemonize yes
    port #{sentinel_port}
    sentinel monitor first_server 127.0.0.1 #{port} 1
    sentinel down-after-milliseconds first_server 5000
    sentinel failover-timeout first_server 9000
    sentinel can-failover first_server yes
    sentinel parallel-syncs first_server 1"

    fw.close

    command = "redis-server #{sentinel_conf_file} --sentinel "
    system command
    sleep(1)
    puts "redis sentinel started on port: #{sentinel_port} with PID: #{File.read(pid_file(sentinel_port)).to_i}"
  else
    puts "redis sentinel already running on port: #{sentinel_port} and with pid: #{File.read(pid_file(sentinel_port)).to_i}"
  end
end

def stop_redis port
  if File.exist?(pid_file(port))
    Process.kill "INT", File.read(pid_file(port)).to_i
    puts "redis stopped on port: #{port} with PID:#{File.read(pid_file(port)).to_i}"
    FileUtils.rm pid_file(port)
  end
end

def start_redis_with_sentinel port, slave_of = nil
  start_redis port, slave_of
  start_sentinel port
end

puts "Stopping all redis"
stop_redis 13340
stop_redis 13341

puts "Stopping all sentinels"
stop_redis 23340
stop_redis 23341

start_redis_with_sentinel 13340
start_redis_with_sentinel 13341, "127.0.0.1 13340"

redis = Redis.new(:master_name => "first_server",
                  :sentinels => [
                    {:host => "localhost", :port => 23340},
                    {:host => "localhost", :port => 23341}
                  ],
                  :failover_reconnect_timeout => 30,
                  :failover_reconnect_wait => 0.0001)

redis.set "foo", 1

count = 0
while true

  if count == 30
    puts "killing master redis & it's sentinel"
    stop_redis 13340
    stop_redis 23340
  end

  if count == 120
    puts "starting again old master redis & sentinel"
    start_redis_with_sentinel 13340
  end

  if count == 150
    puts "killing current master redis & it's sentinel"
    stop_redis 13341
    stop_redis 23341
  end

  if count == 200
    puts "starting slave redis & sentinel"
    #using same config as before!
    start_redis_with_sentinel 13341, "127.0.0.1 13340"
  end

  if count == 250
    puts "killing master redis & it's sentinel"
    stop_redis 13340
    stop_redis 23340
  end

  begin
    data = redis.incr "foo"
    puts "current redis port #{redis.client.port} -> INCR: #{data}"
  rescue Redis::CannotConnectError => e
    puts "failover took too long to recover", e
  end
  count += 1
  sleep 1
end

and my redis config:

daemonize yes
port 16379
bind 127.0.0.1
timeout 0
loglevel notice
logfile stdout
databases 16
save 900 1
save 300 10
save 60 10000
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
dir ./
slave-serve-stale-data yes
slave-read-only yes
slave-priority 100
appendonly no
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
lua-time-limit 5000
slowlog-log-slower-than 10000
slowlog-max-len 128
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-entries 512
list-max-ziplist-value 64
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 256mb 64mb 60
client-output-buffer-limit pubsub 32mb 8mb 60

and output of script:

redis started on port: 13340 with PID: 13710
redis sentinel started on port: 23340 with PID: 13712
redis started on port: 13341 with PID: 13714
redis sentinel started on port: 23341 with PID: 13719
current redis port 13340 -> INCR: 2
.
.
current redis port 13340 -> INCR: 31
killing master redis & it's sentinel
redis stopped on port: 13340 with PID:13710
redis stopped on port: 23340 with PID:13712
trying nex sentinel!localhost:23341
current redis port 13341 -> INCR: 32
.
.
current redis port 13341 -> INCR: 121
starting again old master redis & sentinel
redis started on port: 13340 with PID: 13748
redis sentinel started on port: 23340 with PID: 13751
current redis port 13341 -> INCR: 122
.
.
current redis port 13341 -> INCR: 151
killing current master redis & it's sentinel
redis stopped on port: 13341 with PID:13714
redis stopped on port: 23341 with PID:13719
trying nex sentinel!localhost:23340
current redis port 13340 -> INCR: 152
.
.
current redis port 13340 -> INCR: 201
starting slave redis & sentinel
redis started on port: 13341 with PID: 13764
redis sentinel started on port: 23341 with PID: 13767
current redis port 13340 -> INCR: 202
.
.
current redis port 13340 -> INCR: 251
killing master redis & it's sentinel
redis stopped on port: 13340 with PID:13748
redis stopped on port: 23340 with PID:13751
trying nex sentinel!localhost:23341
current redis port 13341 -> INCR: 252
.
.

v6 commented 9 years ago

// , Ondrej, did you meet the requirement "Redis Sentinel knows all servers in cluster and it would be very nice to use connection for read commands to nearest (fastest ping) redis server"?

flyerhzm / redis-sentinel

Load balancing between redis slaves #11