Could you please share more details about the environment? Which release is it running, and what is the node's instance type? It looks like the system has 3 nodes across 3 AZs, with ZooKeeper and Kafka running on the same nodes. What are the configured heap sizes for Kafka and ZooKeeper?
512MB is not enough for Kafka once you add some load. It is also too low for Cassandra.
There were lots of closed socket connections in the ZooKeeper logs and lots of leader elections for Kafka. This is likely caused by Java GC pauses, since 512MB is not enough for Kafka. Please increase the heap sizes for Kafka, ZooKeeper and Cassandra.
grep "Closed socket" zoo-uat.log | wc -l 4135 grep "leader elect" kafka-uat.log | wc -l 696
Please also don't run against `latest` unless it is just for a quick test. The latest image keeps getting updated and may not be stable, and upgrade from it is not supported.
I agree and I'll follow your suggestion - it's already in my plans to re-install everything with 0.9.5. My question was rather about the overall situation: the entire Kafka service stopped responding after a single worker failed. That is not the behavior people expect from a highly available service. In theory, a larger heap could be exhausted as well, right?
It was not just one node; it happened on all 3 nodes. You can check the ZooKeeper logs - the broken connections happened for all 3 nodes.
    2018-04-19T16:43:11.296Z 2018-04-19 16:43:08,807 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1040] - Closed socket connection for client /172.22.2.59:40612 (no session established for client)
    2018-04-19T16:43:11.296Z 2018-04-19 16:43:08,817 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1040] - Closed socket connection for client /172.22.1.89:40424 (no session established for client)
    2018-04-19T16:43:11.303Z 2018-04-19 16:43:08,929 [myid:3] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1040] - Closed socket connection for client /172.22.5.185:43768 (no session established for client)
grep "Closed socket" zoo-uat.log > 1 grep 2.59 1 | wc -l 705 grep 1.89 1 | wc -l 2618 grep 5.185 1 | wc -l 824
oh.. i see then.. thank you!!
You could also wait for the 0.9.6 release. We decided to do a refactoring before the v1 release, so the upgrade path will be broken. If you want to keep the old data, you will need to delete the existing services, create the new services, and use the volume-replace tool to replace the new volumes with the old volumes.
Thank you very much for the heads up! I'll wait for 0.9.6 release then.
Hi. Just wondering: will release 0.9.6 be upgradable to further releases without the need to recreate things from scratch?
Yes, release 0.9.6 will be upgradable to further releases for existing services; no need to recreate them.
Hello,
Some time ago an alert came from our monitoring system showing that the Kafka service was not available. I looked at the EC2 console and found that one of the 3 firecamp brokers had an Instance Status Checks alarm. I'm wondering why that led to a completely inaccessible Kafka service.
Here is how Kafka is checked from the monitoring host:
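(As a rough sketch of such a check - the broker address, port and topic name below are placeholders, not the actual ones used - producing a probe message and reading it back exercises the brokers end to end:)

```sh
# Hypothetical end-to-end probe; broker host/port and topic are placeholders.
echo "probe-$(date +%s)" | kafka-console-producer.sh \
    --broker-list kafka-uat-1.example.com:9092 --topic monitoring-probe

# Fails with a timeout if no broker serves the partition within 10 s.
kafka-console-consumer.sh \
    --bootstrap-server kafka-uat-1.example.com:9092 --topic monitoring-probe \
    --max-messages 1 --timeout-ms 10000
```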
ZooKeeper and Kafka logs are attached: kafka-uat.log.gz, zoo-uat.log.gz.
After some time everything got back to a working state, but the Kafka service did not work for about 15 minutes.
Please take a look and let me know if you need anything else.