cloudstax / firecamp

Serverless Platform for the stateful services
https://www.cloudstax.io
Apache License 2.0

Kafka issue during a worker instance outage #54

Closed jazzl0ver closed 6 years ago

jazzl0ver commented 6 years ago

Hello,

Some time ago an alert came from our monitoring system showing that the Kafka service was not available. I looked at the EC2 console and found that one of the 3 firecamp brokers had an Instance Status Checks alarm. I'm wondering why that led to a completely inaccessible Kafka service.

Here is how Kafka is checked from the monitoring host:

# /bin/docker run --rm harisekhon/cassandra-dev check_kafka.pl -B kafka-uat-0.firecamp-uat-firecamp.com:9092,kafka-uat-1.firecamp-uat-firecamp.com:9092,kafka-uat-2.firecamp-uat-firecamp.com:9092 -T testtopic -vvv
verbose mode on

check_kafka.pl version 0.3  =>  Hari Sekhon Utils version 1.18.9

broker host:              kafka-uat-0.firecamp-uat-firecamp.com
broker port:              9092
broker host:              kafka-uat-1.firecamp-uat-firecamp.com
broker port:              9092
broker host:              kafka-uat-2.firecamp-uat-firecamp.com
broker port:              9092
host:                     kafka-uat-0.firecamp-uat-firecamp.com
port:                     9092
topic:                    testtopic
required acks:            1
send-max-attempts:        1
receive-max-attempts:     1
retry-backoff:            200
sleep:                    0.5

setting timeout to 10 secs

connecting to Kafka brokers kafka-uat-0.firecamp-uat-firecamp.com:9092,kafka-uat-1.firecamp-uat-firecamp.com:9092,kafka-uat-2.firecamp-uat-firecamp.com:9092
CRITICAL: Error: Cannot get metadata: topic='<undef>'

Trace begun at /usr/local/share/perl5/site_perl/Kafka/Connection.pm line 1592
Kafka::Connection::_error('Kafka::Connection=HASH(0x55caf194e5a0)', -1007, 'topic=\'<undef>\'') called at /usr/local/share/perl5/site_perl/Kafka/Connection.pm line 693
Kafka::Connection::get_metadata('Kafka::Connection=HASH(0x55caf194e5a0)') called at /github/nagios-plugins/check_kafka.pl line 257
main::__ANON__ at /github/nagios-plugins/lib/HariSekhonUtils.pm line 565
eval {...} at /github/nagios-plugins/lib/HariSekhonUtils.pm line 565
HariSekhonUtils::try('CODE(0x55caf19559d8)') called at /github/nagios-plugins/check_kafka.pl line 383
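
For a cross-check outside the Nagios plugin, the same topic metadata can be queried with the stock Kafka CLI from a broker node. This is only a sketch: the /kafka/bin install path and the zk-uat-0.firecamp-uat-firecamp.com ZooKeeper address are assumptions for illustration, not values taken from this deployment.

# /kafka/bin/kafka-topics.sh --describe --topic testtopic --zookeeper zk-uat-0.firecamp-uat-firecamp.com:2181   # path and ZK host are assumed

A healthy cluster should print the topic's partitions with their leaders and ISR; hitting the same kind of metadata failure here would point at the brokers rather than at the monitoring plugin.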

Zookeeper and Kafka logs attached: kafka-uat.log.gz, zoo-uat.log.gz.

After some time everything got back to a working state, but the Kafka service was down for ~15 minutes.

Please take a look and let me know if you need anything else.

JuniusLuo commented 6 years ago

Could you please share more details about the environment? Which release is it running, and what is the node instance type? It looks like the system has 3 nodes across 3 AZs? Do ZooKeeper and Kafka run on the same node? What are the configured heap sizes for Kafka and ZooKeeper?

jazzl0ver commented 6 years ago
  1. I installed the "latest" image; I have no idea how to check which build date it is.
  2. Nodes are t2.medium.
  3. Yes, 3 nodes on 3 AZs.
  4. Correct, Zookeeper and Kafka are on the same node (as well as Cassandra and KafkaManager). All were limited to 512MB of heap.

JuniusLuo commented 6 years ago

512MB is not enough for Kafka once you add some load, and it is also too low for Cassandra.

There were a lot of closed socket connections in the ZooKeeper logs and a lot of leader elections for Kafka. This is likely caused by Java GC pauses, as 512MB is not enough for Kafka. Please increase the heap sizes for Kafka, ZooKeeper, and Cassandra.

grep "Closed socket" zoo-uat.log | wc -l 4135 grep "leader elect" kafka-uat.log | wc -l 696

JuniusLuo commented 6 years ago

Please also don't run the latest image again except for quick tests. The latest image keeps getting updated, may not be stable, and upgrade from it is not supported.

jazzl0ver commented 6 years ago

I agree and I'll follow your suggestion; it's already in my plans to re-install everything with 0.9.5. My question was rather about the overall situation: the entire Kafka service stopped responding after a single worker failed. This is not the behavior people expect from a highly available service. In theory, a larger heap size might be exhausted as well, right?

JuniusLuo commented 6 years ago

It was not just one node; it happened on all 3 nodes. You can check the zk logs: the broken connections happened on all 3 nodes.

2018-04-19T16:43:11.296Z 2018-04-19 16:43:08,807 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1040] - Closed socket connection for client /172.22.2.59:40612 (no session established for client)
2018-04-19T16:43:11.296Z 2018-04-19 16:43:08,817 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1040] - Closed socket connection for client /172.22.1.89:40424 (no session established for client)
2018-04-19T16:43:11.303Z 2018-04-19 16:43:08,929 [myid:3] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1040] - Closed socket connection for client /172.22.5.185:43768 (no session established for client)

grep "Closed socket" zoo-uat.log > 1 grep 2.59 1 | wc -l 705 grep 1.89 1 | wc -l 2618 grep 5.185 1 | wc -l 824

jazzl0ver commented 6 years ago

oh.. i see then.. thank you!!

JuniusLuo commented 6 years ago

You could also wait for the 0.9.6 release. We decided to do a refactoring before the v1 release, so upgrade will be broken. If you want to keep the old data, you will need to delete the existing services, create the new services, and use the volume-replace tool to replace the new volumes with the old volumes.

jazzl0ver commented 6 years ago

Thank you very much for the heads up! I'll wait for the 0.9.6 release then.

jazzl0ver commented 6 years ago

Hi. Just wondering whether release 0.9.6 will be upgradable to further releases without the need to recreate things from scratch?

cloudstax commented 6 years ago

Yes, release 0.9.6 will be upgradable to further releases for the existing services; no need to recreate.