Netflix / dynomite

A generic dynamo implementation for different k-v storage engines
Apache License 2.0
4.2k stars 531 forks source link

Errors on read/write: -ERR Peer: Unknown #482

Open Joseph-R opened 7 years ago

Joseph-R commented 7 years ago

Issue: I see -ERR Peer: Unknown when writing to a Dynomite cluster.

Test method: Manual reads and writes via CLI to Dynomite ports, or testing sereis via redis-benchmark with random k:v pairs.

Versions:

Dynomite: dynomite-v0.5.9-2_ProxyCloseFix
Redis server:  3.0.6
Redis tools: 3.0.6

Redis-cli output (4-node cluster behind an AWS ELB):

~$ /usr/local/Cellar/redis/4.0.1/bin/redis-cli -h internal-jreid-redis-dev-949117426.us-east-1.elb.amazonaws.com -p 8102
internal-jreid-redis-dev-949117426.us-east-1.elb.amazonaws.com:8102> GET test
"testy"
internal-jreid-redis-dev-949117426.us-east-1.elb.amazonaws.com:8102> SET test test2
OK
internal-jreid-redis-dev-949117426.us-east-1.elb.amazonaws.com:8102> GET test
"test2"
internal-jreid-redis-dev-949117426.us-east-1.elb.amazonaws.com:8102> SET dog chewy
(error) ERR Peer: Unknown error -1
internal-jreid-redis-dev-949117426.us-east-1.elb.amazonaws.com:8102> SET dogtoy myglasses
(error) ERR Peer: Unknown error -1

Example logs - dynomite.log (log level 9):

00000000  2d 45 52 52 20 50 65 65  72 3a 20 55 6e 6b 6e 6f   |-ERR Peer: Unkno|
00000010  77 6e 20 65 72 72 6f 72  20 2d 31 0d 0a            |wn error -1..|
[2017-09-20 22:07:08.898] msg_send_chain:1133 About to dump out the content of msg
[2017-09-20 22:07:08.898] msg_dump:645 msg dump id 1387871 request 0 len 29 type 25 done 0 error 0 (err 0)
[2017-09-20 22:07:08.898] msg_dump:655 mbuf with 29 bytes of data
00000000  2d 45 52 52 20 50 65 65  72 3a 20 55 6e 6b 6e 6f   |-ERR Peer: Unkno|
00000010  77 6e 20 65 72 72 6f 72  20 2d 31 0d 0a            |wn error -1..|
[2017-09-20 22:07:08.906] msg_send_chain:1133 About to dump out the content of msg
[2017-09-20 22:07:08.906] msg_dump:645 msg dump id 1387875 request 0 len 29 type 25 done 0 error 0 (err 0)
[2017-09-20 22:07:08.906] msg_dump:655 mbuf with 29 bytes of data
00000000  2d 45 52 52 20 50 65 65  72 3a 20 55 6e 6b 6e 6f   |-ERR Peer: Unkno|
00000010  77 6e 20 65 72 72 6f 72  20 2d 31 0d 0a            |wn error -1..|
[2017-09-20 22:07:08.906] msg_send_chain:1133 About to dump out the content of msg
[2017-09-20 22:07:08.906] msg_dump:645 msg dump id 1387879 request 0 len 29 type 25 done 0 error 0 (err 0)
[2017-09-20 22:07:08.906] msg_dump:655 mbuf with 29 bytes of data
00000000  2d 45 52 52 20 50 65 65  72 3a 20 55 6e 6b 6e 6f   |-ERR Peer: Unkno|
00000010  77 6e 20 65 72 72 6f 72  20 2d 31 0d 0a            |wn error -1..|
[2017-09-20 22:07:08.906] msg_send_chain:1133 About to dump out the content of msg
[2017-09-20 22:07:08.907] msg_dump:645 msg dump id 1387883 request 0 len 29 type 25 done 0 error 0 (err 0)
[2017-09-20 22:07:08.907] msg_dump:655 mbuf with 29 bytes of data
00000000  2d 45 52 52 20 50 65 65  72 3a 20 55 6e 6b 6e 6f   |-ERR Peer: Unkno|
00000010  77 6e 20 65 72 72 6f 72  20 2d 31 0d 0a            |wn error -1..|
[2017-09-20 22:07:08.949] msg_send_chain:1133 About to dump out the content of msg
[2017-09-20 22:07:08.949] msg_dump:645 msg dump id 1387884 request 1 len 45 type 55 done 0 error 0 (err 0)
[2017-09-20 22:07:08.949] msg_dump:655 mbuf with 45 bytes of data

dynomite.yml:

root@ip-10-10-7-135:/mnt/dynomite-manager# cat dynomite.yml
dyn_o_mite:
  dyn_listen: 0.0.0.0:8101
  data_store: 0
  listen: 0.0.0.0:8102
  dyn_seed_provider: florida_provider
  servers:
  - 127.0.0.1:22122:1
  auto_eject_hosts: true
  distribution: vnode
  gos_interval: 10000
  hash: murmur
  preconnect: true
  server_retry_timeout: 30000
  timeout: 5000
  tokens: '2147483647'
  rack: us-east-1a
  secure_server_option: datacenter
  datacenter: us-east-1
  read_consistency: DC_ONE
  write_consistency: DC_ONE
  pem_key_file: /mnt/dynomite-manager/dynomite.pem
  recon_key_file: /mnt/dynomite-manager/recon_key.pem
  recon_iv_file: /mnt/dynomite-manager/recon_iv.pem
  dyn_seeds:
  - 10.10.2.187:8101:us-east-1c:us-east-1:0
  - 10.10.5.233:8101:us-east-1a:us-east-1:0
  - 10.10.2.87:8101:us-east-1c:us-east-1:2147483647

Additional notes:

shailesh33 commented 7 years ago

Do you have the dynomite logs? I see you have only Redis logs. Dynomite should output if there were errors in connection to its peers.

Joseph-R commented 7 years ago

Hey @shailesh33 - I posted some sample dynomite logs above, at log level 9. They weren't super clearly marked, which I have gone back and fixed.

But basically there's not much helpful in the dynomite logs that I'm seeing. Just that -ERR Peer: Unknown error -1.

shailesh33 commented 7 years ago

at log 9 you should have much details logs. For example take this:

Try to trace the message id 1387871 and check its history in the logs. Try to get other related messages. Best, do you have particular request that is failing and try to get the logs starting from that request till its failure?

Joseph-R commented 7 years ago

Hey @shailesh33 - Appreciate the questions, and sorry for the late response. I have been OOO this week.

Pretty sure that was log level 9, but I will wipe the logs and try to provide an example here tracing one specific message id through.

Joseph-R commented 7 years ago

Wiped log files and set up a new session, bumping the log level up to 11.

root@ip-10-10-2-87:/mnt/log# ls
core  dynomite.log  dynomite-manager.log  redis-sentinel.log  redis-server.log  redis-server.log.1.gz  redis-server.log.2.gz

# Search all redis, dynomite, and dynomite-manager logs for lines referencing message ID 552459 +/- four lines before/after.
root@ip-10-10-2-87:/mnt/log# ag -C4 552459
dynomite.log
11203-[2017-10-02 21:53:40.267] msg_dump:645 msg dump id 552458 request 1 len 14 type 87 done 0 error 0 (err 0)
11204-[2017-10-02 21:53:40.267] msg_dump:655 mbuf with 14 bytes of data
11205-00000000  2a 31 0d 0a 24 34 0d 0a  50 49 4e 47 0d 0a         |*1..$4..PING..|
11206-[2017-10-02 21:53:40.267] msg_send_chain:1133 About to dump out the content of msg
11207:[2017-10-02 21:53:40.267] msg_dump:645 msg dump id 552459 request 0 len 7 type 130 done 0 error 0 (err 0)
11208-[2017-10-02 21:53:40.267] msg_dump:655 mbuf with 7 bytes of data
11209-00000000  2b 50 4f 4e 47 0d 0a                               |+PONG..|
11210-[2017-10-02 21:53:40.267] client_unref_internal_try_put:95 <CLIENT 0x1ad3550 -1 from '127.0.0.1:49914'> unref owner <POOL 0x1ac2080 'dyn_o_mite'>
11211-[2017-10-02 21:53:46.782] client_unref_internal_try_put:95 <CLIENT 0x1ad3550 -1 from '10.10.2.236:14537'> unref owner <POOL 0x1ac2080 'dyn_o_mite'>

Entire log file from new session (one host): dynomite_log_export.txt

KowalczykBartek commented 7 years ago

@Joseph-R are you able to reproduce this problem locally ? can you provide us all of 4 dynomite.yml configuration files ?

Joseph-R commented 7 years ago

@KowalczykBartek - Please see the 4 dynomite.yml config files, attached.

10-10-7-135.dynomite.yml.txt 10-10-2-187.dynomite.yml.txt 10-10-2-87.dynomite.yml.txt 10-10-5-233.dynomite.yml.txt

are you able to reproduce this problem locally ?

Not sure what you mean by that? I can reproduce it while calling to the cluster from a local client through the ELB, and by commands executed directly on one host and directed at the ports dynomite is listening on.

drewbeer commented 7 years ago

so i just went though a full upgrade to 0.5.9 to see if it would solve the proxy close issue for me, however it created this issue. after a bunch of troubleshooting. I determined i had two issues, one replication was broken, and two, if any node was not using 0 as the starting token, it gave me

(error) ERR Peer: Unknown error -1

after i reverted back, the error went away. maybe that will help.

Joseph-R commented 7 years ago

and two, if any node was not using 0 as the starting token, it gave me (error) ERR Peer: Unknown error -1

after i reverted back, the error went away. maybe that will help.

@drewbeer - So you fixed the issue by reverting back to an earlier version of Dynomite?

Joseph-R commented 7 years ago

Still seeing this issue after wiping out the original test cluster and deploying a new cluster with the dynomite-v0.5.9-4_DecryptionCrash.

Versions:

Dynomite: dynomite-v0.5.9-2_ProxyCloseFix
Redis server:  3.0.6
Redis tools: 3.0.6

Redis-cli output (4-node cluster, behind ELB on AWS):

~$ /usr/local/Cellar/redis/4.0.2/bin/redis-cli -p 8102 -h internal-jreid-redis-dev-949117426.us-east-1.elb.amazonaws.com
internal-jreid-redis-dev-949117426.us-east-1.elb.amazonaws.com:8102> get test
(error) ERR Peer: Unknown error -1
internal-jreid-redis-dev-949117426.us-east-1.elb.amazonaws.com:8102> get test
(error) ERR Peer: Unknown error -1
internal-jreid-redis-dev-949117426.us-east-1.elb.amazonaws.com:8102> get test
(nil)
(0.51s)
internal-jreid-redis-dev-949117426.us-east-1.elb.amazonaws.com:8102> get test
(nil)
internal-jreid-redis-dev-949117426.us-east-1.elb.amazonaws.com:8102> get test
(nil)
internal-jreid-redis-dev-949117426.us-east-1.elb.amazonaws.com:8102> get test
(nil)
internal-jreid-redis-dev-949117426.us-east-1.elb.amazonaws.com:8102> set test testy
OK
internal-jreid-redis-dev-949117426.us-east-1.elb.amazonaws.com:8102> set booga boo
OK
internal-jreid-redis-dev-949117426.us-east-1.elb.amazonaws.com:8102> get test
"testy"
internal-jreid-redis-dev-949117426.us-east-1.elb.amazonaws.com:8102> get booga
"boo"
internal-jreid-redis-dev-949117426.us-east-1.elb.amazonaws.com:8102> set poke mon
OK
internal-jreid-redis-dev-949117426.us-east-1.elb.amazonaws.com:8102> get poke
"mon"
internal-jreid-redis-dev-949117426.us-east-1.elb.amazonaws.com:8102> set bowl poke
(error) ERR Peer: Unknown error -1
internal-jreid-redis-dev-949117426.us-east-1.elb.amazonaws.com:8102> get bowl
(error) ERR Peer: Unknown error -1
internal-jreid-redis-dev-949117426.us-east-1.elb.amazonaws.com:8102> set bowl poke
(error) ERR Peer: Unknown error -1
internal-jreid-redis-dev-949117426.us-east-1.elb.amazonaws.com:8102> set bowl poke
(error) ERR Peer: Unknown error -1

Further documentation:

shailesh33 commented 7 years ago

@Joseph-R So to confirm, this worked with dynomite-v0.5.9-2_ProxyCloseFix and not working with dynomite-v0.5.9-4_DecryptionCrash ?

Joseph-R commented 7 years ago

@shailesh33 - I wish. Unfortunately no, we haven't been able to get either dynomite-v0.5.9-2_ProxyCloseFix or dynomite-v0.5.9-4_DecryptionCrash to work for us.

shailesh33 commented 7 years ago

I tried to reproduce the issue but it is not helping. I created a single rack cluster with 2 nodes with the two token 0 & 2147483647. It seems straightforward to reproduce. I am using the dev branch.

Can you do this, take one of the dynomite nodes, keep log level 9 and then use redis-cli to run those commands that you saw failing. you should see logs like these:

[2017-11-02 14:26:26.124] req_forward:860 >>>>>>>>>>>>>>>>>>>>>>> <CONN_CLIENT 0x1d4e080 16 from '127.0.0.1:59996'> RECEIVED <REQ 0x1d4e2b0 1:0::0 REQ_REDIS_GET, len:23> key 'bowl' tagged key 'bowl'
[2017-11-02 14:26:26.125] _msg_get:319 alloc_msg_count: 2 caller: rsp_get <CONN_SERVER 0x1d3fd00 9 to '127.0.0.1:22122:1'>
[2017-11-02 14:26:26.126] server_rsp_forward:834 <CONN_CLIENT 0x1d4e080 16 from '127.0.0.1:59996'> <REQ 0x1d4e2b0 1:0::0 REQ_REDIS_GET, len:23> RECEIVED <RSP 0x1d4e540 2:0 RSP_REDIS_BULK len:10>
[2017-11-02 14:26:26.126] msg_local_one_rsp_handler:1007 <REQ 0x1d4e2b0 1:0::0 REQ_REDIS_GET, len:23> SELECTED <RSP 0x1d4e540 2:0 RSP_REDIS_BULK len:10>

heres how I interpret the logs: the request enters the coordinator node and first log it has is >>>>>> in it. There it says which client connection received which request. In the example above, <REQ 0x1d4e2b0 1:0::0 REQ_REDIS_GET, len:23>, in which 1:0::0 says request it is 1, parent id is 0 and fragment id is 0. If this message creates a clone to forward to a peer, that cloned message will have a new request id and parent ID would be 1 (the request id of its parent). if you follow this, you will get a flow of the messages for a given request. The logs that you uploaded till now do not have any of these log lines so I am wondering what is the issue.