cloudstax / firecamp

Serverless Platform for the stateful services
https://www.cloudstax.io
Apache License 2.0
208 stars 21 forks source link

zookeeper error #82

Closed jazzl0ver closed 5 years ago

jazzl0ver commented 5 years ago

Hi,

This happens sometimes w/o affecting the running systems (from the 1st glance), but I wanted to find the roots of the issue just in case. So, here is what's happening:

  1. An error appeared in the Zookeeper log (hope I haven't violated the recommended size of quote):

    2018-11-02 13:05:46,465 [myid:3] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@215] - Accepted socket connection from /172.31.22.67:58476
    2018-11-02 13:05:46,472 [myid:3] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@938] - Client attempting to establish new session at /172.31.22.67:58476
    2018-11-02 13:05:46,474 [myid:3] - INFO [CommitProcessor:3:ZooKeeperServer@683] - Established session 0x3000011b18c1134 with negotiated timeout 30000 for client /172.31.22.67:58476
    2018-11-02 13:05:50,212 [myid:3] - INFO [ProcessThread(sid:3 cport:-1)::PrepRequestProcessor@487] - Processed session termination for sessionid: 0x3000011b18c1134
    2018-11-02 13:05:50,213 [myid:3] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1040] - Closed socket connection for client /172.31.22.67:58476 which had sessionid 0x3000011b18c1134
    2018-11-02 13:06:47,297 [myid:3] - INFO [SessionTracker:ZooKeeperServer@354] - Expiring session 0x2000011a84d055a, timeout of 6000ms exceeded
    2018-11-02 13:06:47,297 [myid:3] - INFO [ProcessThread(sid:3 cport:-1)::PrepRequestProcessor@487] - Processed session termination for sessionid: 0x2000011a84d055a
    2018-11-02 13:06:51,089 [myid:3] - ERROR [LearnerHandler-/172.31.28.162:48280:LearnerHandler@648] - Unexpected exception causing shutdown while sock still open
    java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
    at java.io.DataInputStream.readInt(DataInputStream.java:387)
    at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
    at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:85)
    at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
    at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:559)
    2018-11-02 13:06:51,090 [myid:3] - WARN [LearnerHandler-/172.31.28.162:48280:LearnerHandler@661] - ******* GOODBYE /172.31.28.162:48280 ********
    2018-11-02 13:06:51,090 [myid:2] - WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@90] - Exception when following the leader
    java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
    at java.io.DataInputStream.readInt(DataInputStream.java:387)
    at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
    at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:85)
    at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
    at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
    at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:86)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:981)
    2018-11-02 13:06:51,091 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@169] - shutdown called
    java.lang.Exception: shutdown Follower
    at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:169)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:985)
    2018-11-02 13:06:51,091 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1040] - Closed socket connection for client /172.31.42.207:52628 which had sessionid 0x30000113b1e0001
    2018-11-02 13:06:51,091 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1040] - Closed socket connection for client /172.31.42.207:52632 which had sessionid 0x30000113b1e0000
    2018-11-02 13:06:51,091 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1040] - Closed socket connection for client /172.31.28.162:46312 which had sessionid 0x2000011a84d055a
    2018-11-02 13:06:51,092 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1040] - Closed socket connection for client /172.31.42.207:52630 which had sessionid 0x2000011a84d0000
    2018-11-02 13:06:51,092 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@140] - Shutting down
    2018-11-02 13:06:51,092 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@501] - shutting down
    2018-11-02 13:06:51,093 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FollowerRequestProcessor@107] - Shutting down
    2018-11-02 13:06:51,093 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:CommitProcessor@184] - Shutting down
    2018-11-02 13:06:51,093 [myid:2] - INFO [CommitProcessor:2:CommitProcessor@153] - CommitProcessor exited loop!
    2018-11-02 13:06:51,093 [myid:2] - INFO [FollowerRequestProcessor:2:FollowerRequestProcessor@97] - FollowerRequestProcessor exited loop!
    2018-11-02 13:06:51,093 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FinalRequestProcessor@403] - shutdown of request processor complete
    2018-11-02 13:06:51,093 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:SyncRequestProcessor@208] - Shutting down
    2018-11-02 13:06:51,094 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection@602] - Notification: 1 (message format version), 2 (n.leader), 0xd00009fac (n.zxid), 0xcb (n.round), LOOKING (n.state), 2 (n.sid), 0xd (n.peerEpoch) FOLLOWING (my state)
    2018-11-02 13:06:51,093 [myid:2] - INFO [SyncThread:2:SyncRequestProcessor@186] - SyncRequestProcessor exited!
    2018-11-02 13:06:51,094 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@909] - LOOKING
    2018-11-02 13:06:51,094 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@820] - New election. My id = 2, proposed zxid=0xd00009fac
    2018-11-02 13:06:51,095 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@602] - Notification: 1 (message format version), 2 (n.leader), 0xd00009fac (n.zxid), 0xcb (n.round), LOOKING (n.state), 2 (n.sid), 0xd (n.peerEpoch) LOOKING (my state)
    2018-11-02 13:06:51,295 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@854] - Notification time out: 400
    2018-11-02 13:06:51,295 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@602] - Notification: 1 (message format version), 2 (n.leader), 0xd00009fac (n.zxid), 0xcb (n.round), LOOKING (n.state), 2 (n.sid), 0xd (n.peerEpoch) LOOKING (my state)
    2018-11-02 13:06:51,296 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection@602] - Notification: 1 (message format version), 2 (n.leader), 0xd00009fac (n.zxid), 0xcb (n.round), LOOKING (n.state), 2 (n.sid), 0xd (n.peerEpoch) FOLLOWING (my state)
    2018-11-02 13:06:51,695 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@854] - Notification time out: 800
    2018-11-02 13:06:51,696 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@602] - Notification: 1 (message format version), 2 (n.leader), 0xd00009fac (n.zxid), 0xcb (n.round), LOOKING (n.state), 2 (n.sid), 0xd (n.peerEpoch) LOOKING (my state)
    2018-11-02 13:06:51,725 [myid:3] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@215] - Accepted socket connection from /172.31.42.207:38180
  2. I've opened /var/log/docker on the worker instance and found the following:

    time="2018-11-02T13:03:09Z" level=info msg=/var/log/firecamp/firecamp-dockervolume.ip-172-31-4-202.root.log.INFO.20181011-095304.7 plugin=6c185ae2653a3e8f634985b7cb71ddfb78b6739c5cd09c8683a4e7d0d569cb57
    time="2018-11-02T13:06:51Z" level=error msg="time=\"2018-11-02T13:06:50Z\" level=error msg=\"Failed to put log events\" errorCode=InvalidParameterException logGroupName=firecamp-prod-kafka-prod-7c0de3cd352f42174e1fbabe7a0da58f logStreamName=\"kafka-prod-2/ip-172-31-4-202/d57fcf1d24a7\" message=\"Upload too large: 1048738 bytes exceeds limit of 1048576\" origError=<nil> " plugin=201ccf795e4d47d35ca8d409d0f434e1ebc7a9c4e402a3008d3e9549d6b06c5e
    time="2018-11-02T13:06:51Z" level=error msg="time=\"2018-11-02T13:06:51Z\" level=error msg=\"InvalidParameterException: Upload too large: 1048738 bytes exceeds limit of 1048576\\n\\tstatus code: 400, requestid: 299e64b8-dea0-11e8-a50e-4f30b752b5a4\" " plugin=201ccf795e4d47d35ca8d409d0f434e1ebc7a9c4e402a3008d3e9549d6b06c5e
    time="2018-11-02T13:13:09Z" level=info msg=/var/log/firecamp/firecamp-dockervolume.ip-172-31-4-202.root.log.INFO.20181011-095304.7 plugin=6c185ae2653a3e8f634985b7cb71ddfb78b6739c5cd09c8683a4e7d0d569cb57

Do you think these two events are connected? Why the upload was too large? Can this be controlled somehow to avoid such errors in future?

JuniusLuo commented 5 years ago

Not sure if they are related or not. For the docker log, not sure why the log event was too large. Looks firecamp volume plugin is 6c18. The log event failure was from plugin 201c. The plugin 201c is probably the log plugin that pushes kafka logs to CloudWatch. Looks like kafka itself is recording some large log. You may check the kafka forum.

cloudstax commented 5 years ago

Close this issue, as it is not related with firecamp.