CodisLabs / codis

Proxy based Redis cluster solution supporting pipeline and scaling dynamically
MIT License

Problems encountered while load-testing the proxy #1323

Open huzilin opened 7 years ago

huzilin commented 7 years ago

Configuration file

##################################################
#                                                #
#                  Codis-Proxy                   #
#                                                #
##################################################

# Set Codis Product Name/Auth.
product_name = "codis_test3"
product_auth = ""

# Set auth for client session
#   1. product_auth is used for auth validation among codis-dashboard,
#      codis-proxy and codis-server.
#   2. session_auth is different from product_auth, it requires clients
#      to issue AUTH <PASSWORD> before processing any other commands.
session_auth = ""

# Set bind address for admin(rpc), tcp only.
admin_addr = "0.0.0.0:11082"

# Set bind address for proxy, proto_type can be "tcp", "tcp4", "tcp6", "unix" or "unixpacket".
proto_type = "tcp4"
proxy_addr = "0.0.0.0:19200"

# Set jodis address & session timeout
#   1. jodis_name is short for jodis_coordinator_name, only accept "zookeeper" & "etcd".
#   2. jodis_addr is short for jodis_coordinator_addr
#   3. proxy will be registered as node:
#        if jodis_compatible = true (not suggested):
#          /zk/codis/db_{PRODUCT_NAME}/proxy-{HASHID} (compatible with Codis2.0)
#        or else
#          /jodis/{PRODUCT_NAME}/proxy-{HASHID}
jodis_name = ""
jodis_addr = ""
jodis_timeout = "20s"
jodis_compatible = false

# Set datacenter of proxy.
proxy_datacenter = ""

# Set max number of alive sessions.
proxy_max_clients = 10000

# Set max offheap memory size. (0 to disable)
proxy_max_offheap_size = "1024mb"

# Set heap placeholder to reduce GC frequency.
proxy_heap_placeholder = "256mb"

# Proxy will ping backend redis (and clear 'MASTERDOWN' state) in a predefined interval. (0 to disable)
backend_ping_period = "5s"

# Set backend recv buffer size & timeout.
backend_recv_bufsize = "128kb"
backend_recv_timeout = "30s"

# Set backend send buffer & timeout.
backend_send_bufsize = "128kb"
backend_send_timeout = "30s"

# Set backend pipeline buffer size.
backend_max_pipeline = 1024

# Set backend never read replica groups, default is false
backend_primary_only = false

# Set backend parallel connections per server
backend_primary_parallel = 1
backend_replica_parallel = 1

# Set backend tcp keepalive period. (0 to disable)
backend_keepalive_period = "75s"

# Set number of databases of backend.
backend_number_databases = 16

# If there is no request from client for a long time, the connection will be closed. (0 to disable)
# Set session recv buffer size & timeout.
session_recv_bufsize = "128kb"
session_recv_timeout = "30m"

# Set session send buffer size & timeout.
session_send_bufsize = "64kb"
session_send_timeout = "30s"

# Make sure this is higher than the max number of requests for each pipeline request, or your client may be blocked.
# Set session pipeline buffer size.
session_max_pipeline = 10000

# Set session tcp keepalive period. (0 to disable)
session_keepalive_period = "75s"

# Set session to be sensitive to failures. Default is false, instead of closing socket, proxy will send an error response to client.
session_break_on_failure = false

# Set metrics server (such as http://localhost:28000), proxy will report json formatted metrics to specified server in a predefined period.
metrics_report_server = ""
metrics_report_period = "1s"

# Set influxdb server (such as http://localhost:8086), proxy will report metrics to influxdb.
# metrics_report_influxdb_server = ""
# metrics_report_influxdb_period = "1s"
# metrics_report_influxdb_username = ""
# metrics_report_influxdb_password = ""
# metrics_report_influxdb_database = ""

# Set statsd server (such as localhost:8125), proxy will report metrics to statsd.
metrics_report_statsd_server = ""
metrics_report_statsd_period = "1s"
metrics_report_statsd_prefix = ""
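
To confirm that a running proxy has actually picked up settings such as proxy_max_clients, its admin port can be queried over HTTP. The /proxy path below is an assumption based on the Codis 3.x admin API; verify the routes for your build:

curl http://10.100.90.20:11082/proxy
# the returned JSON overview should include the effective config and session counters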

Benchmark command

./redis-benchmark -p 19200 -h 10.100.90.20 -n 100000000 -r 10000000000 -c 512 -d 100 -t get,set,mset -q
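
For reference, the standard redis-benchmark flags used above:

# -n 100000000     total number of requests to issue
# -r 10000000000   pick keys at random from a keyspace of this size
# -c 512           number of parallel client connections
# -d 100           value size in bytes for SET/GET
# -t get,set,mset  run only these tests
# -q               quiet mode, print only the queries-per-second results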

Load testing

Scenario 1

Two redis-benchmark instances were started in parallel on two machines, both hitting the same proxy. One ran normally; the other reported the following errors:

Writing to socket: Connection reset by peer
Writing to socket: Connection reset by peer
Writing to socket: Connection reset by peer
Writing to socket: Connection reset by peer
Writing to socket: Connection reset by peer
Writing to socket: Connection reset by peer
Writing to socket: Connection reset by peer
Writing to socket: Connection reset by peer
Writing to socket: Connection reset by peer
Writing to socket: Connection reset by peer
Writing to socket: Connection reset by peer
Writing to socket: Connection reset by peer
Writing to socket: Connection reset by peer
All clients disconnected... aborting.

Scenario 2

Three redis-benchmark instances were started on the same machine (the one running codis-proxy). The first two ran without errors; the third reported the following error:

All clients disconnected... aborting.

Question

Is this caused by a configuration problem, or is it an issue introduced with the proxy upgrade? This problem did not occur when I previously tested Codis 2.0.

ghost commented 7 years ago

This looks like the connection limit being exceeded.

huzilin commented 7 years ago

proxy_max_clients is set to 10000; is there anywhere else that needs to be set? I established fewer than 1500 connections in total, and when benchmarking from two machines the count cannot even reach 1024. The backend Redis has no connection limit; when I benchmark the backend Redis directly, I can reach 1500+ connections.
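
One way to count the established client sessions on the proxy host is with ss from iproute2, filtered on the proxy port configured above:

ss -tn state established '( sport = :19200 )' | wc -l
# subtract 1 for the header line ss prints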

ghost commented 7 years ago

Check your system's maximum number of open file descriptors.

huzilin commented 7 years ago

open files (-n) 65535

I started it as the root user.

fancy-rabbit commented 7 years ago

Check the limits of the codis-proxy and codis-server processes: cat /proc/$pid/limits

huzilin commented 7 years ago

# cat /proc/10445/limits
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            10485760             unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             1549038              1549038              processes 
Max open files            65535                65535                files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       1549038              1549038              signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us

The log keeps printing read: connection reset by peer. The benchmark is still running and the testing tool has not reported any errors yet; there are only 512 client connections.

2017/08/09 18:59:35 session.go:78: [INFO] session [0xc42009cc00] create: {"ops":0,"create":1502276375,"remote":"10.100.90.20:54708"}
2017/08/09 18:59:35 session.go:85: [INFO] session [0xc42009cc00] closed: {"ops":0,"create":1502276375,"remote":"10.100.90.20:54708"}, error: read tcp4 10.100.90.20:19100->10.100.90.20:54708: read: connection reset by peer
2017/08/09 18:59:40 session.go:78: [INFO] session [0xc42009c000] create: {"ops":0,"create":1502276380,"remote":"10.100.90.21:47897"}
2017/08/09 18:59:40 session.go:85: [INFO] session [0xc42009c000] closed: {"ops":0,"create":1502276380,"remote":"10.100.90.21:47897"}, error: read tcp4 10.100.90.20:19100->10.100.90.21:47897: read: connection reset by peer
spinlock commented 7 years ago

Errors like connection reset by peer are generally the result of the proxy itself resetting the connection. Raise the proxy's log level; the proxy log should then contain more information, including the reason for the reset.
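
For example, the proxy can be restarted with a more verbose log level (flag names as in the Codis 3.x usage text; check codis-proxy --help for your build):

./codis-proxy --config=proxy.toml --log=proxy.log --log-level=DEBUG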

ghost commented 7 years ago

It looks like the slots have not been fully assigned.

spinlock commented 7 years ago

The error in your log is already quite clear: it happened because 10.100.90.21:47897 reset the connection. So look at the corresponding logs, and at the proxy logs before and after the error, for anything related, including anything concerning codis-server. Dig into the specific cause.

spinlock commented 7 years ago

@Umbraller That doesn't seem quite right either. If the slots were not fully assigned, the proxy log would not show reset by xxx. I suspect that xxx is a codis-server.

ghost commented 7 years ago

@huzilin Please post a screenshot of the slots.

huzilin commented 7 years ago

The slots screenshot is as follows: [slots screenshot]

ghost commented 7 years ago

Then my analysis was wrong; it must be what the others said above. Check the proxy log: set the log level to info or debug, run the benchmark again, and look at the log again.

huzilin commented 7 years ago

@spinlock That is not a codis-server; that connection is a short-lived one that comes in every few seconds. I have now switched to a different port and that error no longer appears, but the connection count still will not go up.

I ran the following on two servers at the same time:

./redis-benchmark -p 19200 -h 10.100.90.20 -n 1000000000 -r 100000000000 -c 512 -t set

There were no errors. Then on a third machine I started another ./redis-benchmark with only 1 connection:

# redis-benchmark -p 19200 -h 10.100.90.20 -n 1000000000 -r 100000000000 -c 1 -t set
Writing to socket: Connection reset by peer
All clients disconnected... aborting.

Log error:

2017/08/10 11:13:27 session.go:78: [INFO] session [0xc43f7a9b00] create: {"ops":0,"create":1502334807,"remote":"10.100.90.22:20406"}
2017/08/10 11:13:27 session.go:96: [INFO] session [0xc43f7a9b00] closed: {"ops":0,"create":1502334807,"remote":"10.100.90.22:20406"}, error: too many sessions

The log level is already set to debug, and this is all the output there is; nothing else.

spinlock commented 7 years ago

The error log states it quite clearly: too many sessions means proxy_max_clients has been exceeded. Check whether proxy.toml has actually taken effect; you can see this by visiting the proxy's admin port in a browser.
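
For illustration, here is a minimal sketch in Go (not Codis's actual session code) of how a proxy can enforce a max-clients limit in its accept loop; the immediate close on overflow is what a client observes as a disconnect or reset:

// Sketch: refuse TCP sessions beyond a configured limit.
package main

import (
	"log"
	"net"
	"sync/atomic"
)

const maxClients = 10000 // analogous to proxy_max_clients

var sessions int64

func main() {
	ln, err := net.Listen("tcp4", "0.0.0.0:19200")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Println("accept:", err)
			continue
		}
		if atomic.AddInt64(&sessions, 1) > maxClients {
			atomic.AddInt64(&sessions, -1)
			log.Println("session refused: too many sessions")
			conn.Close() // client sees an abrupt disconnect
			continue
		}
		go func(c net.Conn) {
			defer func() {
				atomic.AddInt64(&sessions, -1)
				c.Close()
			}()
			// ... handle/proxy the session here ...
		}(conn)
	}
}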

huzilin commented 7 years ago

It worked after I switched ports; the new configuration did not change that setting. The earlier flood of connection reset by peer may have been because redis-benchmark collapsed under load, or because some probing program kept connecting to the proxy port I had configured.

huzilin commented 7 years ago

The cause of the earlier read: connection reset by peer

keepalived was continuously performing liveness pings, so the log kept printing that message.
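
For context, a typical keepalived TCP health check looks like the following (illustrative config, not taken from this setup); it opens a connection to the proxy port and closes it without issuing a command, which the proxy logs as a zero-ops session ending in a reset:

virtual_server 10.100.90.20 19200 {
    delay_loop 5
    real_server 10.100.90.20 19200 {
        TCP_CHECK {
            connect_timeout 3
            connect_port 19200
        }
    }
}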

A new problem

During testing we took one server down, and the proxy on that machine then stayed in a timeout state for a long time: "ff87667b191d3fb7f1a86c126d229cb3": { "unixtime": 1502439539, "timeout": true }

During this period none of the groups could be promoted until the timeout cleared. Clicking promote, or clicking to remove the proxy, during that window would cause the dashboard to hang. Can this timeout value be configured?

huzilin commented 7 years ago

One more thing: during the timeout I wanted to remove that proxy. Since clicking remove failed, I tried deleting the proxy node in zk directly. After deleting it, I restarted the dashboard and found that the information could still be seen in the dashboard. So where is this information actually stored?
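
For reference, one way to inspect the proxy registrations with the stock zkCli.sh; the /codis3/{product} layout is an assumption based on Codis 3.x defaults and may differ by version and coordinator:

zkCli.sh -server 127.0.0.1:2181
ls /codis3/codis_test3/proxy
# each child node holds one proxy's registration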