RedisLabs / redis-cluster-proxy

A proxy for Redis clusters.
GNU Affero General Public License v3.0

(title changed) Problems with multiplexing and XREAD/XADD commands from 2 different connections #18

Closed bsergean closed 4 years ago

bsergean commented 4 years ago

Hi there,

I'm trying to debug a problem where things work on my Mac but not on Linux/OpenShift, i.e. on the 'real' thing. I have tried setting the log level to debug, but I don't see received commands being printed on the console. It looks like that would help quite a bit in troubleshooting.

Thanks!

bsergean commented 4 years ago

PS: for info, I have a Python service which uses 'manual' sharding, and I'm trying to simplify it by using Redis Cluster. But the library I'm using now is aioredis, and it does not support Redis Cluster.

https://github.com/machinezone/cobra

bsergean commented 4 years ago

(FYI, here are all the commands sent when things work, on my Mac.) I'm going to try to simplify and write a repro.

127.0.0.1:10002> MONITOR
OK
1579144211.634181 [0 127.0.0.1:61043] "EXISTS" "_health::sms_health_check_channel_ea5907c8"
1579144211.635006 [0 127.0.0.1:61245] "XREAD" "BLOCK" "0" "STREAMS" "_health::sms_health_check_channel_ea5907c8" "$"
1579144211.637348 [0 127.0.0.1:61049] "XADD" "_health::sms_health_check_channel_ea5907c8" "MAXLEN" "~" "10" "*" "json" "{\"action\":\"rtm/publish\",\"body\":{\"channel\":\"sms_health_check_channel_ea5907c8\",\"message\":{\"device\":{\"game\":\"test\",\"android_id\":\"4022b72746a5434db2b0a7797845d1b8\"},\"magic\":162}},\"id\":3}"
1579144211.638402 [0 127.0.0.1:61245] "XREAD" "BLOCK" "0" "STREAMS" "_health::sms_health_check_channel_ea5907c8" "1579144211637-0"
1579144211.642174 [0 127.0.0.1:61049] "DEL" "_health::sms_health_check_channel_ea5907c8"
1579144211.646631 [0 127.0.0.1:61049] "XADD" "_health::3083af3dbc46480db77344271de97b1a" "MAXLEN" "~" "10" "*" "json" "\"dd2e64a8e3fe43c9bad68e6a12a36c32\""
1579144211.649751 [0 127.0.0.1:61049] "XREVRANGE" "_health::3083af3dbc46480db77344271de97b1a" "+" "-" "COUNT" "1"
1579144211.652848 [0 127.0.0.1:61049] "DEL" "_health::3083af3dbc46480db77344271de97b1a"
1579144211.655631 [0 127.0.0.1:61049] "XREVRANGE" "_health::3083af3dbc46480db77344271de97b1a" "+" "-" "COUNT" "1"
bsergean commented 4 years ago

Cluster status, if that helps:

redis11:6379> cluster info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:8
cluster_size:4
cluster_current_epoch:49
cluster_my_epoch:48
cluster_stats_messages_ping_sent:186816
cluster_stats_messages_pong_sent:179333
cluster_stats_messages_meet_sent:7
cluster_stats_messages_update_sent:1
cluster_stats_messages_sent:366157
cluster_stats_messages_ping_received:179333
cluster_stats_messages_pong_received:186823
cluster_stats_messages_update_received:13
cluster_stats_messages_received:366169
artix75 commented 4 years ago

If you want to log received queries you can use the --dump-queries option (note: it still needs --log-level debug in order to work). This option will print the received query in the proxy's log, e.g.:

> set mykey 10

The proxy will print something like this in its log:

[2020-01-16 11:42:40] Req. 0:0:3 ARGV[0]: 'set'
[2020-01-16 11:42:40] Req. 0:0:3 ARGV[1]: 'mykey'
[2020-01-16 11:42:40] Req. 0:0:3 ARGV[2]: '10'
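
For context, any command sent through the proxy produces one ARGV log line per argument. A minimal way to trigger the output above from code, sketched with redis-py and assuming the proxy listens on port 7777 as in the logs later in this thread:

import redis

# Connect to the proxy rather than to a cluster node directly; 7777 is the
# proxy port used elsewhere in this thread, adjust to your setup.
proxy = redis.Redis(host="127.0.0.1", port=7777)

# With --log-level debug --dump-queries, this single command is logged by the
# proxy as three ARGV entries: 'set', 'mykey', '10'.
proxy.set("mykey", 10)
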
bsergean commented 4 years ago

Ah awesome let me try that.

Developers never read the docs. I wrote some docs myself and was complaining just yesterday about a user who asked a question without reading them :)

bsergean commented 4 years ago

Here's more debugging output. First I 'subscribe' to the stream with XREAD BLOCK, and then I write to the stream with XADD. I'm expecting something to come out, but nothing comes back. Maybe it's a timing issue. I'll try to add more debugging (logging buffers?).

[2020-01-16 18:02:30] Free cluster
[2020-01-16 18:02:38] Accepted connection from 172.24.176.6:35294
[2020-01-16 18:02:38] Client 1 connected from 172.24.176.6 (thread: 1)
[2020-01-16 18:02:38] Client 1 added to thread 1
[2020-01-16 18:02:38] Created Request 1:1:0
[2020-01-16 18:02:38] Parsing request 1:1:0, status: -1
[2020-01-16 18:02:38] Req. 1:1:0 ARGV[0]: 'PING'
[2020-01-16 18:02:38] Processing request 1:1:0
[2020-01-16 18:02:38] Created Request 1:1:1
[2020-01-16 18:02:38] Parsing request 1:1:1, status: -1
[2020-01-16 18:02:38] Req. 1:1:1 ARGV[0]: 'EXISTS'
[2020-01-16 18:02:38] Req. 1:1:1 ARGV[1]: '_health::sms_health_check_channel_cb71ff8b'
[2020-01-16 18:02:38] Processing request 1:1:1
[2020-01-16 18:02:38] Request 1:1:1 written to node 172.25.207.90:6379, adding it to pending requests
[2020-01-16 18:02:38] Still have 0 request(s) to send to node 172.25.207.90:6379  on thread 1
[2020-01-16 18:02:38] Reading reply from 172.25.207.90:6379 on thread 1...
[2020-01-16 18:02:38] Reply read complete for request 1:1:1, OK!
[2020-01-16 18:02:38] Writing reply for request 1:1:1 to client buffer...
[2020-01-16 18:02:38] Writing reply for request 1:1:1 to client buffer...
[2020-01-16 18:02:38] Free Request 1:1:1
[2020-01-16 18:02:38] Created Request 1:1:2
[2020-01-16 18:02:38] Parsing request 1:1:2, status: -1
[2020-01-16 18:02:38] Req. 1:1:2 ARGV[0]: 'XREAD'
[2020-01-16 18:02:38] Req. 1:1:2 ARGV[1]: 'BLOCK'
[2020-01-16 18:02:38] Req. 1:1:2 ARGV[2]: '0'
[2020-01-16 18:02:38] Req. 1:1:2 ARGV[3]: 'STREAMS'
[2020-01-16 18:02:38] Req. 1:1:2 ARGV[4]: '_health::sms_health_check_channel_cb71ff8b'
[2020-01-16 18:02:38] Req. 1:1:2 ARGV[5]: '$'
[2020-01-16 18:02:38] Processing request 1:1:2
[2020-01-16 18:02:38] Disabling multiplexing for client 1:1
[2020-01-16 18:02:38] Connecting to node 172.25.207.90:6379
[2020-01-16 18:02:38] Write handler installed into request 1:1:2 for node 172.25.207.90:6379
[2020-01-16 18:02:38] Created Request 1:0:1
[2020-01-16 18:02:38] Parsing request 1:0:1, status: -1
[2020-01-16 18:02:38] Req. 1:0:1 ARGV[0]: 'XADD'
[2020-01-16 18:02:38] Req. 1:0:1 ARGV[1]: '_health::sms_health_check_channel_cb71ff8b'
[2020-01-16 18:02:38] Req. 1:0:1 ARGV[2]: 'MAXLEN'
[2020-01-16 18:02:38] Req. 1:0:1 ARGV[3]: '~'
[2020-01-16 18:02:38] Req. 1:0:1 ARGV[4]: '1000'
[2020-01-16 18:02:38] Req. 1:0:1 ARGV[5]: '*'
[2020-01-16 18:02:38] Req. 1:0:1 ARGV[6]: 'json'
[2020-01-16 18:02:38] Req. 1:0:1 ARGV[7]: '{"action":"rtm/publish","body":{"channel":"sms_health_check_channel_cb71ff8b","message":{"device":{"game":"test","android_id":"a25f517bf2e34ee4b6f1900577a8396a"},"magic":723}},"id":3}'
[2020-01-16 18:02:38] Processing request 1:0:1
[2020-01-16 18:02:38] Request 1:0:1 written to node 172.25.207.90:6379, adding it to pending requests
[2020-01-16 18:02:38] Still have 0 request(s) to send to node 172.25.207.90:6379  on thread 1
[2020-01-16 18:02:38] Read reply handler installed for node 172.25.207.90:6379
[2020-01-16 18:02:38] Connected to node 172.25.207.90:6379 (private connection for client 1:1
[2020-01-16 18:02:38] Request 1:1:2 written to node 172.25.207.90:6379, adding it to pending requests
[2020-01-16 18:02:38] Still have 0 request(s) to send to node 172.25.207.90:6379 on private connection owned by 1:1
[2020-01-16 18:02:38] Reading reply from 172.25.207.90:6379 on thread 1...
[2020-01-16 18:02:38] Reply read complete for request 1:0:1, OK!
[2020-01-16 18:02:38] Writing reply for request 1:0:1 to client buffer...
[2020-01-16 18:02:38] Writing reply for request 1:0:1 to client buffer...
artix75 commented 4 years ago

@bsergean You should receive the reply; it's probably a bug. I'll take a look at it. Could you send me the complete log?

bsergean commented 4 years ago

If I restart the redis-cluster-proxy before I do my run, I actually see a different behavior (which might be easier to diagnose).

I swear I've seen that on my Mac before. I'll try to see if I can reproduce it and send you steps if possible.

Server log

Redis Cluster Proxy v0.0.1
[2020-01-16 18:25:49] The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
Cluster Address: 172.16.18.181:6379
Listening on port 7777
[2020-01-16 18:25:49] Starting 8 threads...
[2020-01-16 18:25:49] Creating thread 0...
Fetching cluster configuration...
[2020-01-16 18:25:49] Slot 0 -> node 6379
[2020-01-16 18:25:49] Slot 1 -> node 6379
[2020-01-16 18:25:49] Slot 2 -> node 6379
[2020-01-16 18:25:49] Slot 3 -> node 6379

.... (edited, lots of Slot printf)

[2020-01-16 18:25:49] All thread(s) started!
[2020-01-16 18:26:27] Accepted connection from 172.24.176.6:45352
[2020-01-16 18:26:27] Client 0 connected from 172.24.176.6 (thread: 0)
[2020-01-16 18:26:27] Client 0 added to thread 0
[2020-01-16 18:26:27] Created Request 0:0:0
[2020-01-16 18:26:27] Parsing request 0:0:0, status: -1
[2020-01-16 18:26:27] Request 0:0:0 buffer:
*1
$4
PING

[2020-01-16 18:26:27] Req. 0:0:0 ARGV[0]: 'PING'
[2020-01-16 18:26:27] Processing request 0:0:0
[2020-01-16 18:26:27] Created Request 0:0:1
[2020-01-16 18:26:27] Parsing request 0:0:1, status: -1
[2020-01-16 18:26:27] Request 0:0:1 buffer:
*2
$6
EXISTS
$42
_health::sms_health_check_channel_3c38692d

[2020-01-16 18:26:27] Req. 0:0:1 ARGV[0]: 'EXISTS'
[2020-01-16 18:26:27] Req. 0:0:1 ARGV[1]: '_health::sms_health_check_channel_3c38692d'
[2020-01-16 18:26:27] Processing request 0:0:1
[2020-01-16 18:26:27] Read reply handler installed for node 172.25.207.90:6379
[2020-01-16 18:26:27] Request 0:0:1 written to node 172.25.207.90:6379, adding it to pending requests
[2020-01-16 18:26:27] Node 172.25.207.90:6379[thread 0] -> requests pending: [0:0:1]
[2020-01-16 18:26:27] Still have 0 request(s) to send to node 172.25.207.90:6379  on thread 0
[2020-01-16 18:26:27] Reading reply from 172.25.207.90:6379 on thread 0...
[2020-01-16 18:26:27] Reply read complete for request 0:0:1, OK!
[2020-01-16 18:26:27] Writing reply for request 0:0:1 to client buffer...
[2020-01-16 18:26:27] 
Reply for request 0:0:1:
:0

[2020-01-16 18:26:27] Writing reply for request 0:0:1 to client buffer...
[2020-01-16 18:26:27] Node 172.25.207.90:6379[thread 0] -> requests pending: []
[2020-01-16 18:26:27] Free Request 0:0:1
[2020-01-16 18:26:27] Node 172.25.207.90:6379[thread 0] -> requests pending: []
[2020-01-16 18:26:27] Created Request 0:0:2
[2020-01-16 18:26:27] Parsing request 0:0:2, status: -1
[2020-01-16 18:26:27] Request 0:0:2 buffer:
*6
$5
XREAD
$5
BLOCK
$1
0
$7
STREAMS
$42
_health::sms_health_check_channel_3c38692d
$1
$

[2020-01-16 18:26:27] Req. 0:0:2 ARGV[0]: 'XREAD'
[2020-01-16 18:26:27] Req. 0:0:2 ARGV[1]: 'BLOCK'
[2020-01-16 18:26:27] Req. 0:0:2 ARGV[2]: '0'
[2020-01-16 18:26:27] Req. 0:0:2 ARGV[3]: 'STREAMS'
[2020-01-16 18:26:27] Req. 0:0:2 ARGV[4]: '_health::sms_health_check_channel_3c38692d'
[2020-01-16 18:26:27] Req. 0:0:2 ARGV[5]: '$'
[2020-01-16 18:26:27] Processing request 0:0:2
[2020-01-16 18:26:27] Disabling multiplexing for client 0:0
[2020-01-16 18:26:27] Accepted connection from 172.24.176.6:45354
[2020-01-16 18:26:27] Client 0 connected from 172.24.176.6 (thread: 1)
[2020-01-16 18:26:27] Client 0 added to thread 1
[2020-01-16 18:26:27] Created Request 1:0:0
[2020-01-16 18:26:27] Parsing request 1:0:0, status: -1
[2020-01-16 18:26:27] Request 1:0:0 buffer:
*2
$3
DEL
$42
_health::sms_health_check_channel_3c38692d

[2020-01-16 18:26:27] Req. 1:0:0 ARGV[0]: 'DEL'
[2020-01-16 18:26:27] Req. 1:0:0 ARGV[1]: '_health::sms_health_check_channel_3c38692d'
[2020-01-16 18:26:27] Processing request 1:0:0
[2020-01-16 18:26:27] Read reply handler installed for node 172.25.207.90:6379
[2020-01-16 18:26:27] Request 1:0:0 written to node 172.25.207.90:6379, adding it to pending requests
[2020-01-16 18:26:27] Node 172.25.207.90:6379[thread 1] -> requests pending: [1:0:0]
[2020-01-16 18:26:27] Still have 0 request(s) to send to node 172.25.207.90:6379  on thread 1
[2020-01-16 18:26:27] Reading reply from 172.25.207.90:6379 on thread 1...
[2020-01-16 18:26:27] Reply read complete for request 1:0:0, OK!
[2020-01-16 18:26:27] Writing reply for request 1:0:0 to client buffer...
[2020-01-16 18:26:27] 
Reply for request 1:0:0:
:0

[2020-01-16 18:26:27] Writing reply for request 1:0:0 to client buffer...
[2020-01-16 18:26:27] Node 172.25.207.90:6379[thread 1] -> requests pending: []
[2020-01-16 18:26:27] Free Request 1:0:0
[2020-01-16 18:26:27] Node 172.25.207.90:6379[thread 1] -> requests pending: []
[2020-01-16 18:26:27] Created Request 1:0:1
[2020-01-16 18:26:27] Client 0 from 172.24.176.6 closed connection (thread: 1)
[2020-01-16 18:26:27] Freeing client 0 (thread: 1)
[2020-01-16 18:26:27] Free Request 1:0:1
[2020-01-16 18:26:27] Node 172.16.18.181:6379[thread 1] -> requests pending: []
[2020-01-16 18:26:27] Node 172.27.240.190:6379[thread 1] -> requests pending: []
[2020-01-16 18:26:27] Node 172.29.163.161:6379[thread 1] -> requests pending: []
[2020-01-16 18:26:27] Node 172.30.217.227:6379[thread 1] -> requests pending: []
[2020-01-16 18:26:27] Node 172.25.241.10:6379[thread 1] -> requests pending: []
[2020-01-16 18:26:27] Node 172.26.135.104:6379[thread 1] -> requests pending: []
[2020-01-16 18:26:27] Node 172.26.145.207:6379[thread 1] -> requests pending: []
[2020-01-16 18:26:27] Node 172.25.207.90:6379[thread 1] -> requests pending: []
[2020-01-16 18:26:27] Connecting to node 172.25.207.90:6379
[2020-01-16 18:26:27] Write handler installed into request 0:0:2 for node 172.25.207.90:6379
[2020-01-16 18:26:27] Read reply handler installed for node 172.25.207.90:6379
[2020-01-16 18:26:27] Connected to node 172.25.207.90:6379 (private connection for client 0:0
[2020-01-16 18:26:27] Request 0:0:2 written to node 172.25.207.90:6379, adding it to pending requests
[2020-01-16 18:26:27] Node 172.25.207.90:6379[thread 0] -> requests pending: [0:0:2]
[2020-01-16 18:26:27] Still have 0 request(s) to send to node 172.25.207.90:6379 on private connection owned by 0:0
[2020-01-16 18:26:27] Created Request 0:0:3
[2020-01-16 18:26:27] Client 0 from 172.24.176.6 closed connection (thread: 0)
[2020-01-16 18:26:27] Freeing client 0 (thread: 0)
[2020-01-16 18:26:27] Free Request 0:0:3
[2020-01-16 18:26:27] Node 172.16.18.181:6379[thread 0] -> requests pending: []
[2020-01-16 18:26:27] Node 172.27.240.190:6379[thread 0] -> requests pending: []
[2020-01-16 18:26:27] Node 172.29.163.161:6379[thread 0] -> requests pending: []
[2020-01-16 18:26:27] Node 172.30.217.227:6379[thread 0] -> requests pending: []
[2020-01-16 18:26:27] Node 172.25.241.10:6379[thread 0] -> requests pending: []
[2020-01-16 18:26:27] Node 172.26.135.104:6379[thread 0] -> requests pending: []
[2020-01-16 18:26:27] Node 172.26.145.207:6379[thread 0] -> requests pending: []
[2020-01-16 18:26:27] Free Request 0:0:2
[2020-01-16 18:26:27] Node 172.25.207.90:6379[thread 0] -> requests pending: [NULL]
[2020-01-16 18:26:27] Node 172.25.207.90:6379[thread 0] -> requests pending: [NULL]

Client log

~ $ 
~ $ cobra health
2020-01-16 18:26:27 INFO client > {"action":"auth/handshake","body":{"data":{"role":"health"},"method":"role_secret"},"id":0}
2020-01-16 18:26:27 INFO client < {'action': 'auth/handshake/ok', 'id': 0, 'body': {'data': {'nonce': 'MTA2NTM4MzEzNTY3NTgzMTk1NjI=', 'version': '2.6.6', 'connection_id': '8735da6fc765', 'node': 'cobra-cluster-3-b22ph'}}}
2020-01-16 18:26:27 INFO client > {"action":"auth/authenticate","body":{"method":"role_secret","credentials":{"hash":"54VyxjWNWc60YOIg5p+EiQ=="}},"id":1}
2020-01-16 18:26:27 INFO client < {'action': 'auth/authenticate/ok', 'id': 1, 'body': {}}
2020-01-16 18:26:27 INFO client > {"action":"rtm/subscribe","body":{"subscription_id":"sms_health_check_channel_3c38692d","channel":"sms_health_check_channel_3c38692d","fast_forward":true,"filter":"select magic,device.android_id\n                     from `sms_health_check_channel_3c38692d` where\n                         device.game = 'test' AND\n                         device.android_id = '7f21637231474c679442965b369ced8d' AND\n                         magic = 832\n    ","batch_size":1},"id":2}
2020-01-16 18:26:27 INFO client < {'action': 'rtm/subscribe/ok', 'id': 2, 'body': {'position': '1519190184:559034812775', 'subscription_id': 'sms_health_check_channel_3c38692d', 'redis_node': 'redis-proxy', 'stream_exists': False, 'stream_length': 0}}
2020-01-16 18:26:27 INFO client > {"action":"rtm/publish","body":{"channel":"sms_health_check_channel_3c38692d","message":{"device":{"game":"test","android_id":"7f21637231474c679442965b369ced8d"},"magic":832}},"id":3}
2020-01-16 18:26:27 INFO client < {'action': 'rtm/publish/error', 'id': 3, 'body': {'error': "publish: cannot connect to redis 'NoneType' object has no attribute 'write'"}}
System is unhealthy !!: publish: cannot connect to redis 'NoneType' object has no attribute 'write'
2020-01-16 18:26:27 INFO client > {"action":"rtm/delete","body":{"channel":"sms_health_check_channel_3c38692d"},"id":4}
2020-01-16 18:26:27 INFO client < {'action': 'rtm/delete/ok', 'id': 4, 'body': {}}
~ $ 
artix75 commented 4 years ago

@bsergean Ok, I've found the issue. Since XREAD is used in its blocking form, the client that sends the query is removed from the multiplexing context (shared connection) and connects privately to the cluster. So it actually reconnects to the cluster. The other client sends the XADD query before the first client has connected and completely written the XREAD query, and since the first client is reading from the special ID '$' (messages added to the stream starting from the moment the node receives the query), it won't receive the message added by the other client (whose XADD was actually written before the XREAD was received). It should, however, receive messages added to the stream after the XREAD query has been delivered to the instance (you can easily verify this by sending further XADD queries).

In a "non-proxy" context, a similar situation could even happen when:

  1. Client A sends XREAD
  2. Client B sends XADD, but the query is written to Redis before XREAD sent from client A (ie. because it could connect faster than client A).
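
A minimal sketch of that race, using redis-py and two separate connections (the proxy port and the key name are reused from this thread and are only illustrative): whichever command reaches the node first wins, and an XREAD from '$' that arrives second never sees the entry.

import threading
import redis

STREAM = "_health::sms_health_check_channel_ea5907c8"  # example key from the logs above

client_a = redis.Redis(port=7777)  # reader (Client A)
client_b = redis.Redis(port=7777)  # writer (Client B)

def reader():
    # '$' means "only entries added after the node receives this XREAD".
    # If the XADD below is processed first, this call returns nothing
    # (block=5000 gives up after 5 seconds instead of hanging forever).
    print("reader got:", client_a.xread({STREAM: "$"}, block=5000))

t = threading.Thread(target=reader)
t.start()
# No synchronization with the reader: the XADD may well reach the node first.
client_b.xadd(STREAM, {"json": "{}"}, maxlen=10, approximate=True)
t.join()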

In the proxy's context, this effect could be mitigated by using a pool of already-established connections, so that when a client is moved to a private connection it can use an already-connected socket and deliver its query faster. Anyway, this would not completely solve your issue, since Client A and Client B still use different connections to the node.

I've already planned to implement a connection pool before the final release.

bsergean commented 4 years ago

I'm trying to run with only one thread to see if it makes a difference.

Now the second client can connect, and we're back to the previous behavior, where the XADD makes it through, but the XREAD does not pick it up.

[2020-01-16 18:39:51] Slot 16380 -> node 6379
[2020-01-16 18:39:51] Slot 16381 -> node 6379
[2020-01-16 18:39:51] Slot 16382 -> node 6379
[2020-01-16 18:39:51] Slot 16383 -> node 6379
Cluster has 4 masters and 4 replica(s)

[2020-01-16 18:39:51] All thread(s) started!
[2020-01-16 18:40:35] Accepted connection from 172.24.176.6:52308
[2020-01-16 18:40:35] Client 0 connected from 172.24.176.6 (thread: 0)
[2020-01-16 18:40:35] Client 0 added to thread 0
[2020-01-16 18:40:35] Created Request 0:0:0
[2020-01-16 18:40:35] Parsing request 0:0:0, status: -1
[2020-01-16 18:40:35] Request 0:0:0 buffer:
*1
$4
PING

[2020-01-16 18:40:35] Req. 0:0:0 ARGV[0]: 'PING'
[2020-01-16 18:40:35] Processing request 0:0:0
[2020-01-16 18:40:35] Created Request 0:0:1
[2020-01-16 18:40:35] Parsing request 0:0:1, status: -1
[2020-01-16 18:40:35] Request 0:0:1 buffer:
*2
$6
EXISTS
$42
_health::sms_health_check_channel_46051d42

[2020-01-16 18:40:35] Req. 0:0:1 ARGV[0]: 'EXISTS'
[2020-01-16 18:40:35] Req. 0:0:1 ARGV[1]: '_health::sms_health_check_channel_46051d42'
[2020-01-16 18:40:35] Processing request 0:0:1
[2020-01-16 18:40:35] Read reply handler installed for node 172.29.163.161:6379
[2020-01-16 18:40:35] Request 0:0:1 written to node 172.29.163.161:6379, adding it to pending requests
[2020-01-16 18:40:35] Node 172.29.163.161:6379[thread 0] -> requests pending: [0:0:1]
[2020-01-16 18:40:35] Still have 0 request(s) to send to node 172.29.163.161:6379  on thread 0
[2020-01-16 18:40:35] Reading reply from 172.29.163.161:6379 on thread 0...
[2020-01-16 18:40:35] Reply read complete for request 0:0:1, OK!
[2020-01-16 18:40:35] Writing reply for request 0:0:1 to client buffer...
[2020-01-16 18:40:35] 
Reply for request 0:0:1:
:0

[2020-01-16 18:40:35] Writing reply for request 0:0:1 to client buffer...
[2020-01-16 18:40:35] Node 172.29.163.161:6379[thread 0] -> requests pending: []
[2020-01-16 18:40:35] Free Request 0:0:1
[2020-01-16 18:40:35] Node 172.29.163.161:6379[thread 0] -> requests pending: []
[2020-01-16 18:40:35] Created Request 0:0:2
[2020-01-16 18:40:35] Parsing request 0:0:2, status: -1
[2020-01-16 18:40:35] Request 0:0:2 buffer:
*6
$5
XREAD
$5
BLOCK
$1
0
$7
STREAMS
$42
_health::sms_health_check_channel_46051d42
$1
$

[2020-01-16 18:40:35] Req. 0:0:2 ARGV[0]: 'XREAD'
[2020-01-16 18:40:35] Req. 0:0:2 ARGV[1]: 'BLOCK'
[2020-01-16 18:40:35] Req. 0:0:2 ARGV[2]: '0'
[2020-01-16 18:40:35] Req. 0:0:2 ARGV[3]: 'STREAMS'
[2020-01-16 18:40:35] Req. 0:0:2 ARGV[4]: '_health::sms_health_check_channel_46051d42'
[2020-01-16 18:40:35] Req. 0:0:2 ARGV[5]: '$'
[2020-01-16 18:40:35] Processing request 0:0:2
[2020-01-16 18:40:35] Disabling multiplexing for client 0:0
[2020-01-16 18:40:35] Accepted connection from 172.24.176.6:52310
[2020-01-16 18:40:35] Client 1 connected from 172.24.176.6 (thread: 0)
[2020-01-16 18:40:35] Connecting to node 172.29.163.161:6379
[2020-01-16 18:40:35] Write handler installed into request 0:0:2 for node 172.29.163.161:6379
[2020-01-16 18:40:35] Client 1 added to thread 0
[2020-01-16 18:40:35] Created Request 0:1:0
[2020-01-16 18:40:35] Parsing request 0:1:0, status: -1
[2020-01-16 18:40:35] Request 0:1:0 buffer:
*8
$4
XADD
$42
_health::sms_health_check_channel_46051d42
$6
MAXLEN
$1
~
$4
1000
$1
*
$4
json
$183
{"action":"rtm/publish","body":{"channel":"sms_health_check_channel_46051d42","message":{"device":{"game":"test","android_id":"5df8926f923e4d3c83790b2b260215bc"},"magic":224}},"id":3}

[2020-01-16 18:40:35] Req. 0:1:0 ARGV[0]: 'XADD'
[2020-01-16 18:40:35] Req. 0:1:0 ARGV[1]: '_health::sms_health_check_channel_46051d42'
[2020-01-16 18:40:35] Req. 0:1:0 ARGV[2]: 'MAXLEN'
[2020-01-16 18:40:35] Req. 0:1:0 ARGV[3]: '~'
[2020-01-16 18:40:35] Req. 0:1:0 ARGV[4]: '1000'
[2020-01-16 18:40:35] Req. 0:1:0 ARGV[5]: '*'
[2020-01-16 18:40:35] Req. 0:1:0 ARGV[6]: 'json'
[2020-01-16 18:40:35] Req. 0:1:0 ARGV[7]: '{"action":"rtm/publish","body":{"channel":"sms_health_check_channel_46051d42","message":{"device":{"game":"test","android_id":"5df8926f923e4d3c83790b2b260215bc"},"magic":224}},"id":3}'
[2020-01-16 18:40:35] Processing request 0:1:0
[2020-01-16 18:40:35] Request 0:1:0 written to node 172.29.163.161:6379, adding it to pending requests
[2020-01-16 18:40:35] Node 172.29.163.161:6379[thread 0] -> requests pending: [0:1:0]
[2020-01-16 18:40:35] Still have 0 request(s) to send to node 172.29.163.161:6379  on thread 0
[2020-01-16 18:40:35] Read reply handler installed for node 172.29.163.161:6379
[2020-01-16 18:40:35] Connected to node 172.29.163.161:6379 (private connection for client 0:0
[2020-01-16 18:40:35] Request 0:0:2 written to node 172.29.163.161:6379, adding it to pending requests
[2020-01-16 18:40:35] Node 172.29.163.161:6379[thread 0] -> requests pending: [0:0:2]
[2020-01-16 18:40:35] Still have 0 request(s) to send to node 172.29.163.161:6379 on private connection owned by 0:0
[2020-01-16 18:40:35] Reading reply from 172.29.163.161:6379 on thread 0...
[2020-01-16 18:40:35] Reply read complete for request 0:1:0, OK!
[2020-01-16 18:40:35] Writing reply for request 0:1:0 to client buffer...
[2020-01-16 18:40:35] 
Reply for request 0:1:0:
$15
1579200035194-0

[2020-01-16 18:40:35] Writing reply for request 0:1:0 to client buffer...
[2020-01-16 18:40:35] Node 172.29.163.161:6379[thread 0] -> requests pending: []
[2020-01-16 18:40:35] Free Request 0:1:0
bsergean commented 4 years ago

Cool, I just read your previous answer. You're right that even without a proxy it's possible to run into this problem, but in practice I haven't hit it yet.

I'll see what I can do to make the XREAD connection wait a bit, or change the ordering so that the message is published/XADD'ed first and then we XREAD it. That order should work fine, since we get an answer from XADD once it has completed.
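
A minimal sketch of that reordering with redis-py (proxy port and key name are just examples from this thread): because XADD only replies once the entry is stored, a follow-up XREAD from an explicit ID cannot miss it, unlike one from '$'.

import redis

r = redis.Redis(port=7777)  # going through the proxy
stream = "_health::sms_health_check_channel_ea5907c8"  # example key

# Publish first: XADD replies with the new entry's ID only after the entry is
# stored on the node, so there is nothing left to race against.
entry_id = r.xadd(stream, {"json": '{"magic": 162}'}, maxlen=10, approximate=True)

# Then read from an explicit ID ('0' = from the beginning, or the last ID we
# processed). Unlike '$', an explicit ID cannot skip entries added before the
# XREAD reached the node.
entries = r.xread({stream: "0"})
print(entry_id, entries)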

artix75 commented 4 years ago

@bsergean Yeah, the reason why it practically never happens when you're not using the proxy is that Redis Cluster Proxy actually reconnects the clients that call the blocking XREAD.

So this is what is happening:

  1. Client A is connected to the shared (multiplexed) connection and sends a blocking XREAD.
  2. The proxy removes Client A from the shared connection, since it issued a blocking command that would block all the other clients. The XREAD query is queued and Client A gets a private connection to the cluster, so Client A starts reconnecting.
  3. Client B sends the XADD query, which is delivered to the cluster.
  4. After Client A reconnects, it writes the XREAD query and starts reading messages added to the stream from then on, but the XADD query sent by Client B has already been written.

The connection pool that I will implement will give clients removed from the shared connection a private connection whose sockets are already connected to the cluster nodes: this will result in much faster query execution, closer to what happens when you execute those queries without the proxy.

For now, you can force every client that needs to send a blocking XREAD query to be removed from the shared connection before sending the XREAD, by using the special PROXY command:

PROXY MULTIPLEXING OFF

This will remove the calling client from the shared connection and give it its own private connection. Note that the new private connection currently works in a lazy way, so it will only actually connect to the node when the first query is received.

Anyway, try this:

# CLIENT A

PROXY MULTIPLEXING OFF
EXISTS _health::sms_health_check_channel_46051d42
XREAD BLOCK 0 STREAMS _health::sms_health_check_channel_46051d42 $

# CLIENT B

XADD _health::sms_health_check_channel_46051d42 ....

You can also disable multiplexing by default for all clients by launching the proxy with the --disable-multiplexing always option.
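
If you drive this from code rather than redis-cli, the CLIENT A / CLIENT B sequence above might look like the following sketch, which uses redis-py's generic execute_command (port and key name are reused from this thread and are only illustrative; the two functions are meant to run from two separate processes or terminals):

import redis

STREAM = "_health::sms_health_check_channel_46051d42"

def client_a():
    # Opt out of multiplexing first, then send a cheap non-blocking command
    # (EXISTS) so the lazy private connection is actually established before
    # the blocking XREAD is issued.
    r = redis.Redis(port=7777)
    r.execute_command("PROXY", "MULTIPLEXING", "OFF")
    r.exists(STREAM)
    return r.xread({STREAM: "$"}, block=0)

def client_b():
    # Run once client A is blocked on XREAD.
    r = redis.Redis(port=7777)
    return r.xadd(STREAM, {"json": "{}"})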

bsergean commented 4 years ago

Adding PROXY MULTIPLEXING OFF is slightly involved, so I tried the --disable-multiplexing always option instead. It 'almost' makes it work :)

The first request still yields a weird disconnection, but the subsequent requests work as expected, which is great. This allowed me to move on to other kinds of testing...

Given that this option can trigger behavior changes, I wonder if it should be opt-in instead of opt-out, to ease adoption of the proxy.

artix75 commented 4 years ago

@bsergean The --disable-multiplexing always option is only for debugging purposes. If multiplexing is always disabled, every client gets its own private connections, which could result in a huge number of open sockets (number of nodes * number of clients). The best way to handle issues like the one you described is the "connection pool" I was talking about in the previous comment. It will probably be implemented after the first beta is released in the next few days.

bsergean commented 4 years ago

Ok, got it. Thanks for the explanation.

artix75 commented 4 years ago

@bsergean I'm implementing the connections pool feature in the connections_pool branch. You can give it a try if you want.

bsergean commented 4 years ago

Great, I'll give it a try. (Found a typo in a header file btw: "clients with priovate connections".)

The program that was triggering the problem is an internal healthcheck, and based on your feedback I have changed it to do the XADD first and then the XREAD, so maybe the bug would have disappeared regardless of the connection pool. I have a branch with the old behavior and I'll see if this helps.

bsergean commented 4 years ago

Unfortunately I'm still getting the 'hang'... I removed the --disable-multiplexing option; the connections pool is on by default, right?

Redis Cluster Proxy v999.999.999 (unstable)
[2020-01-22 06:47:23] The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
[2020-01-22 06:47:23] Not listening to IPv6: unsupported
Cluster Address: 172.16.18.181:6379
Listening on *:7777
[2020-01-22 06:47:23] Starting 8 threads...
Fetching cluster configuration...
Cluster has 4 masters and 4 replica(s)

Full log

[2020-01-22 06:56:10] Node 172.30.217.227:6379[thread 1] -> requests pending: []
[2020-01-22 06:56:22] Created Request 1:2:3
[2020-01-22 06:56:22] Client 2 from 172.26.199.97 closed connection (thread: 1)
[2020-01-22 06:56:22] Freeing client 2 (thread: 1)
[2020-01-22 06:56:22] Free Request 1:2:3
[2020-01-22 06:56:22] Node 172.16.18.181:6379[thread 1] -> requests pending: []
[2020-01-22 06:56:22] Node 172.27.240.190:6379[thread 1] -> requests pending: []
[2020-01-22 06:56:22] Node 172.29.163.161:6379[thread 1] -> requests pending: []
[2020-01-22 06:56:22] Free Request 1:2:2
[2020-01-22 06:56:22] Node 172.30.217.227:6379[thread 1] -> requests pending: [NULL]
[2020-01-22 06:56:22] Node 172.30.217.227:6379[thread 1] -> requests pending: [NULL]
[2020-01-22 06:56:22] Node 172.25.241.10:6379[thread 1] -> requests pending: []
[2020-01-22 06:56:22] Node 172.26.135.104:6379[thread 1] -> requests pending: []
[2020-01-22 06:56:22] Node 172.26.145.207:6379[thread 1] -> requests pending: []
[2020-01-22 06:56:22] Node 172.25.207.90:6379[thread 1] -> requests pending: []
[2020-01-22 06:56:22] Free cluster
[2020-01-22 06:56:24] Accepted connection from 172.26.199.97:42182
[2020-01-22 06:56:24] Client 3 connected from 172.26.199.97 (thread: 1)
[2020-01-22 06:56:24] Client 3 added to thread 1
[2020-01-22 06:56:24] Created Request 1:3:0
[2020-01-22 06:56:24] Parsing request 1:3:0, status: -1
[2020-01-22 06:56:24] Request 1:3:0 buffer:
*1
$4
PING

[2020-01-22 06:56:24] Req. 1:3:0 ARGV[0]: 'PING'
[2020-01-22 06:56:24] Processing request 1:3:0
[2020-01-22 06:56:24] Created Request 1:3:1
[2020-01-22 06:56:24] Parsing request 1:3:1, status: -1
[2020-01-22 06:56:24] Request 1:3:1 buffer:
*2
$6
EXISTS
$42
_health::sms_health_check_channel_e25ec0a1

[2020-01-22 06:56:24] Req. 1:3:1 ARGV[0]: 'EXISTS'
[2020-01-22 06:56:24] Req. 1:3:1 ARGV[1]: '_health::sms_health_check_channel_e25ec0a1'
[2020-01-22 06:56:24] Processing request 1:3:1
[2020-01-22 06:56:24] Request 1:3:1 written to node 172.27.240.190:6379, adding it to pending requests
[2020-01-22 06:56:24] Node 172.27.240.190:6379[thread 1] -> requests pending: [1:3:1]
[2020-01-22 06:56:24] Still have 0 request(s) to send to node 172.27.240.190:6379  on thread 1
[2020-01-22 06:56:24] Reading reply from 172.27.240.190:6379 on thread 1...
[2020-01-22 06:56:24] Reply read complete for request 1:3:1, OK!
[2020-01-22 06:56:24] 
Reply for request 1:3:1:
:0

[2020-01-22 06:56:24] Writing reply for request 1:3:1 to client buffer...
[2020-01-22 06:56:24] Node 172.27.240.190:6379[thread 1] -> requests pending: []
[2020-01-22 06:56:24] Free Request 1:3:1
[2020-01-22 06:56:24] Node 172.27.240.190:6379[thread 1] -> requests pending: []
[2020-01-22 06:56:24] Created Request 1:3:2
[2020-01-22 06:56:24] Parsing request 1:3:2, status: -1
[2020-01-22 06:56:24] Request 1:3:2 buffer:
*6
$5
XREAD
$5
BLOCK
$1
0
$7
STREAMS
$42
_health::sms_health_check_channel_e25ec0a1
$1
$

[2020-01-22 06:56:24] Req. 1:3:2 ARGV[0]: 'XREAD'
[2020-01-22 06:56:24] Req. 1:3:2 ARGV[1]: 'BLOCK'
[2020-01-22 06:56:24] Req. 1:3:2 ARGV[2]: '0'
[2020-01-22 06:56:24] Req. 1:3:2 ARGV[3]: 'STREAMS'
[2020-01-22 06:56:24] Req. 1:3:2 ARGV[4]: '_health::sms_health_check_channel_e25ec0a1'
[2020-01-22 06:56:24] Req. 1:3:2 ARGV[5]: '$'
[2020-01-22 06:56:24] Processing request 1:3:2
[2020-01-22 06:56:24] Disabling multiplexing for client 1:3
[2020-01-22 06:56:24] Connecting to node 172.27.240.190:6379
[2020-01-22 06:56:24] Write handler installed into request 1:3:2 for node 172.27.240.190:6379
[2020-01-22 06:56:24] Created Request 1:1:2
[2020-01-22 06:56:24] Parsing request 1:1:2, status: -1
[2020-01-22 06:56:24] Request 1:1:2 buffer:
*8
$4
XADD
$42
_health::sms_health_check_channel_e25ec0a1
$6
MAXLEN
$1
~
$4
1000
$1
*
$4
json
$183
{"action":"rtm/publish","body":{"channel":"sms_health_check_channel_e25ec0a1","message":{"device":{"game":"test","android_id":"2276d68aa153445eb8ec2e5e45e6fe49"},"magic":805}},"id":3}

[2020-01-22 06:56:24] Req. 1:1:2 ARGV[0]: 'XADD'
[2020-01-22 06:56:24] Req. 1:1:2 ARGV[1]: '_health::sms_health_check_channel_e25ec0a1'
[2020-01-22 06:56:24] Req. 1:1:2 ARGV[2]: 'MAXLEN'
[2020-01-22 06:56:24] Req. 1:1:2 ARGV[3]: '~'
[2020-01-22 06:56:24] Req. 1:1:2 ARGV[4]: '1000'
[2020-01-22 06:56:24] Req. 1:1:2 ARGV[5]: '*'
[2020-01-22 06:56:24] Req. 1:1:2 ARGV[6]: 'json'
[2020-01-22 06:56:24] Req. 1:1:2 ARGV[7]: '{"action":"rtm/publish","body":{"channel":"sms_health_check_channel_e25ec0a1","message":{"device":{"game":"test","android_id":"2276d68aa153445eb8ec2e5e45e6fe49"},"magic":805}},"id":3}'
[2020-01-22 06:56:24] Processing request 1:1:2
[2020-01-22 06:56:24] Request 1:1:2 written to node 172.27.240.190:6379, adding it to pending requests
[2020-01-22 06:56:24] Node 172.27.240.190:6379[thread 1] -> requests pending: [1:1:2]
[2020-01-22 06:56:24] Still have 0 request(s) to send to node 172.27.240.190:6379  on thread 1
[2020-01-22 06:56:24] Read reply handler installed for node 172.27.240.190:6379
[2020-01-22 06:56:24] Connected to node 172.27.240.190:6379 (private connection for client 1:3
[2020-01-22 06:56:24] Request 1:3:2 written to node 172.27.240.190:6379, adding it to pending requests
[2020-01-22 06:56:24] Node 172.27.240.190:6379[thread 1] -> requests pending: [1:3:2]
[2020-01-22 06:56:24] Still have 0 request(s) to send to node 172.27.240.190:6379 on private connection owned by 1:3
[2020-01-22 06:56:24] Reading reply from 172.27.240.190:6379 on thread 1...
[2020-01-22 06:56:24] Reply read complete for request 1:1:2, OK!
[2020-01-22 06:56:24] 
Reply for request 1:1:2:
$15
1579676184684-0

[2020-01-22 06:56:24] Writing reply for request 1:1:2 to client buffer...
[2020-01-22 06:56:24] Node 172.27.240.190:6379[thread 1] -> requests pending: []
[2020-01-22 06:56:24] Free Request 1:1:2
artix75 commented 4 years ago

@bsergean Yes, the connections pool is enabled by default, but it doesn't seem to be active from what I read in your log.

Please ensure that you are really on the connections_pool branch and that the proxy has been rebuilt from it.

If the connections pool is active, you should be able to read its size with proxy config get connections-pool-size.
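
For example, from any client connected to the proxy; this is a sketch using redis-py's generic command interface (the exact reply type may differ):

import redis

proxy = redis.Redis(port=7777)  # the proxy port, not a cluster node
# On the connections_pool branch the pool size is exposed through the PROXY command.
print(proxy.execute_command("PROXY", "CONFIG", "GET", "connections-pool-size"))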

bsergean commented 4 years ago

Ok, I double-checked that I'm on the right branch, and since I'm using Docker I'm pretty sure my build instructions do a clean build.

Now I see this message in the 'incipit' (the startup output):

[2020-01-22 16:50:03] Connections pool size: 10 (respawn 2 every 50ms if below 10)

[2020-01-22 16:50:03] Redis Cluster Proxy v999.999.999 (unstable)
[2020-01-22 16:50:03] Commit: (00000000/0)
[2020-01-22 16:50:03] Cluster Address: 172.16.18.181:6379
[2020-01-22 16:50:03] PID: 1
[2020-01-22 16:50:03] OS: Linux 3.10.0-957.10.1.el7.x86_64 x86_64
[2020-01-22 16:50:03] Bits: 64
[2020-01-22 16:50:03] Log level: debug
[2020-01-22 16:50:03] Connections pool size: 10 (respawn 2 every 50ms if below 10)
[2020-01-22 16:50:03] The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
[2020-01-22 16:50:03] Not listening to IPv6: unsupported
[2020-01-22 16:50:03] Listening on *:7777
[2020-01-22 16:50:03] Starting 8 threads...
[2020-01-22 16:50:03] Creating thread 0...
[2020-01-22 16:50:03] Fetching cluster configuration...

The behavior is similar to when I was using the disable-multiplexing debugging option:

  1. The very first time, I get a hang.
  2. But if I cancel and retry, then things work, as if the cache is now 'ready'/'warm': the pool is ready to serve a 'hot' connection, and things work fine. So your branch definitely makes things better.

I wonder if something could be done to fix the 'first ever' case, or if this won't be a problem in practice.

artix75 commented 4 years ago

@bsergean I need to understand the 'first ever case' scenario and what really happens under the hood. Could you attach a complete log?

Since the buffer output is not needed for this kind of issue, and logs with --dump-buffer are usually very big, you can omit --dump-buffer to get a lighter log, e.g.:

redis-cluster-proxy --log-level debug --dump-queries --dump-queues

bsergean commented 4 years ago

I'm trying to get you repro steps using cobra, the Python server/client I'm working on.

In that process I was starting/shutting down my Redis cluster while the proxy was still running, and noticed that redis-cli hangs when that happens. (I'm using the connections pool branch.)

$ redis-cli -p 7777
127.0.0.1:7777> set foo bra
OK
127.0.0.1:7777>
127.0.0.1:7777>
127.0.0.1:7777> set foo bra
(error) ERR Cluster node disconnected: 127.0.0.1:10006
127.0.0.1:7777>
127.0.0.1:7777>
127.0.0.1:7777> set foo bra
(error) ERR Cluster node disconnected: 127.0.0.1:10006
127.0.0.1:7777>
127.0.0.1:7777>
127.0.0.1:7777> set foo bar

.... here I have restarted the cluster, so I expect the proxy to be able to reach it now .... but redis-cli is hung.

artix75 commented 4 years ago

@bsergean I ran some tests executing the blocking XREAD / XADD sequence from different clients. Using the unstable branch it often hangs during the blocking XREAD command, since the client calling XREAD actually reconnects with a private connection and the blocking XREAD is often executed after the XADD command.

Using the connections_pool branch the problem is highly mitigated; anyway, with many clients it sometimes hangs too, since the connections pool runs out of available connections.

It's possible to increase the pool's size and/or change the "respawn" criteria: when the number of connections in the pool drops below a defined level, the thread starts "respawning" new connections at the defined interval (in milliseconds) and the defined rate (e.g. two connections every 50ms).

By default, the pool's size is 10 connections per thread (created when the proxy starts), the minimum size is 10, the default spawning interval is 50ms and the default spawning rate is 2. That means that every thread creates 10 connections at start, and it starts respawning new connections after the number of connections drops below 10, spawning 2 new connections every 50ms, until the pool is full again (10 connections).

You can fine-tune these parameters with the --connections-pool-size, --connections-pool-min-size, --connections-pool-spawn-every and --connections-pool-spawn-rate options. So, e.g.:

redis-cluster-proxy --connections-pool-size 50 --connections-pool-min-size 40 --connections-pool-spawn-every 2 --connections-pool-spawn-rate 5

This means: "create a pool of 50 connections and start respawning 5 connections every 2ms when the number of connections drops below 40".

By fine-tuning those parameters, I finally managed to get all my tests executed without hangs on the blocking XREAD commands. You can give it a try if you want. Let me know if this improves the execution of your scripts.

bsergean commented 4 years ago

Thanks a lot for the very detailed explanation. The connection pool work highly mitigates this problem, which I hope will help adoption.

Based on your input I have changed my scripts to XADD first and then XREAD, which sounds way more reliable.