HariSekhon / Nagios-Plugins

450+ AWS, Hadoop, Cloud, Kafka, Docker, Elasticsearch, RabbitMQ, Redis, HBase, Solr, Cassandra, ZooKeeper, HDFS, Yarn, Hive, Presto, Drill, Impala, Consul, Spark, Jenkins, Travis CI, Git, MySQL, Linux, DNS, Whois, SSL Certs, Yum Security Updates, Kubernetes, Cloudera etc...
https://www.linkedin.com/in/HariSekhon

Cannot receive topic='nagios' #211

Open psdhami09 opened 5 years ago

psdhami09 commented 5 years ago

Hi Hari,

We are using the Perl Nagios plugin in our environment, deployed recently. However, we have observed some blips in the Nagios trends for the Kafka brokers. We thought the plugin went critical because Kafka was having an issue, but that's not the case: these blips occur frequently and the plugin reports the error below:

State info: CRITICAL: Error: Cannot receive: topic='nagios'

Not sure if this is a known error in the plugin; could you please advise?

[screenshot]

psdhami09 commented 5 years ago

Below are some more details:

define command {
    command_name    check_kafka
    command_line    /usr/local/nagios/libexec/nagios-plugins/check_kafka.pl -H $HOSTADDRESS$ -P $ARG1$ -T $ARG2$ -R $ARG3$
}

check_command check_kafka!9092!nagios!ISR
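For anyone unfamiliar with Nagios macros: with the command definition above, Nagios substitutes $HOSTADDRESS$ and the positional $ARGn$ macros (split from the check_command string on `!`) before running the plugin. A minimal sketch of that substitution, for illustration only (not Nagios's actual implementation):

```python
# Sketch of Nagios positional macro expansion for the command definition
# above. Illustrative only; Nagios's own macro processing is more involved.

def expand_command(command_line: str, host: str, args: list) -> str:
    result = command_line.replace("$HOSTADDRESS$", host)
    for i, arg in enumerate(args, start=1):
        result = result.replace(f"$ARG{i}$", arg)
    return result

command_line = ("/usr/local/nagios/libexec/nagios-plugins/check_kafka.pl"
                " -H $HOSTADDRESS$ -P $ARG1$ -T $ARG2$ -R $ARG3$")

# check_kafka!9092!nagios!ISR -> arguments split on '!'
expanded = expand_command(command_line, "IBUS-ibus-1", ["9092", "nagios", "ISR"])
print(expanded)
```

Running the expanded command directly on the CLI is a good way to reproduce exactly what Nagios executes.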

psdhami09 commented 5 years ago

Zoom in on the screenshot, which displays the plugin error:

[screenshot]

HariSekhon commented 5 years ago

Have you tried the Python version for comparison? It may yield a different error message, as this one is generated by the underlying library. I personally prefer the Python version now.

psdhami09 commented 5 years ago

Hey Hari,

I have tried the Python version. It doesn't show the same error, but there is another status message, shown below:

Status Info: Initial Service Pseudo-State

[screenshot]

Could you please confirm whether we need to worry about this message? Some details on when this message appears would be a great help to us.

Thanks, Pritpal

HariSekhon commented 5 years ago

Please run it on the command line with the -v -v -v switches to get full debug output and paste the full output here. It might be worth doing this for both the Perl and Python versions of check_kafka, as they both support this level of debug logging.

You can use anonymize.py from the DevOps Python Tools repo if you want to redact your hostnames/IP addresses from the text before pasting it here.
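The real anonymize.py handles many token types; as a rough idea of what such redaction does, here is a minimal stand-in using regexes (the hostname pattern below is a site-specific assumption for this thread's `IBUS-*` hosts, not part of the actual tool):

```python
import re

# Minimal stand-in for redacting IPs and hostnames before pasting logs
# publicly. The real anonymize.py from the DevOps Python Tools repo is far
# more thorough; this is only an illustrative sketch.

IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
HOST_RE = re.compile(r"\bIBUS-[\w-]+\b")  # hypothetical site-specific hostname pattern

def redact(text: str) -> str:
    text = IP_RE.sub("<ip_x.x.x.x>", text)
    text = HOST_RE.sub("<hostname>", text)
    return text

print(redact("connecting to Kafka broker at IBUS-ibus-1:9092 (10.0.0.5)"))
# connecting to Kafka broker at <hostname>:9092 (<ip_x.x.x.x>)
```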

psdhami09 commented 5 years ago

Hi Hari,

Thanks for the response

Here is the output for Python version:

user@788252a7:/usr/local/nagios/libexec/nagios-plugins# ./check_kafka.py -v -v -v -H IBUS-ibus-1 -P 9092 -T nagios
2018-11-08 17:49:07,709 - cli.py__parse_timeout__:387 - DEBUG - getting $TIMEOUT value None
2018-11-08 17:49:07,709 - cli.py__parse_timeout__:397 - DEBUG - timeout not set, using default timeout 10
2018-11-08 17:49:07,710 - utils.pylog_option:2213 - INFO - timeout: 10
2018-11-08 17:49:07,710 - cli.pytimeout:254 - DEBUG - setting timeout to 10 secs
2018-11-08 17:49:07,710 - cli.pymain:159 - INFO - Hari Sekhon check_kafka.py version 0.5.2 => CLI version 0.3 => Utils version 0.11.5
2018-11-08 17:49:07,710 - cli.pymain:160 - INFO - https://github.com/harisekhon/nagios-plugins
2018-11-08 17:49:07,710 - cli.pymain:161 - INFO - verbose level: 3 (DEBUG)
2018-11-08 17:49:07,710 - utils.pylog_option:2213 - INFO - timeout: 10
2018-11-08 17:49:07,710 - cli.pymain:164 - DEBUG - setting timeout alarm (10)
2018-11-08 17:49:07,735 - utils.pylog_option:2213 - INFO - host:port: IBUS-ibus-1:9092
2018-11-08 17:49:07,735 - utils.pylog_option:2213 - INFO - brokers: IBUS-ibus-1:9092
2018-11-08 17:49:07,736 - utils.pylog_option:2213 - INFO - topic: nagios
2018-11-08 17:49:07,736 - check_kafka.pyprocess_partitions:207 - INFO - partition not specified, getting random partition
2018-11-08 17:49:08,843 - check_kafka.pyprocess_partitions:209 - INFO - selected partition 0
2018-11-08 17:49:08,843 - utils.pylog_option:2213 - INFO - partition: 0
2018-11-08 17:49:08,844 - utils.pylog_option:2213 - INFO - acks: 1
2018-11-08 17:49:08,844 - threshold.pyinit:50 - DEBUG - warning threshold simple = upper
2018-11-08 17:49:08,844 - threshold.pyinit:51 - DEBUG - warning threshold positive = True
2018-11-08 17:49:08,844 - threshold.pyinit:52 - DEBUG - warning threshold integer = True
2018-11-08 17:49:08,844 - threshold.pyinit:53 - DEBUG - warning threshold min = None
2018-11-08 17:49:08,844 - threshold.pyinit:54 - DEBUG - warning threshold max = None
2018-11-08 17:49:08,844 - threshold.py__parse_threshold__:72 - DEBUG - warning threshold given = '1'
2018-11-08 17:49:08,844 - threshold.py__parse_threshold__:106 - DEBUG - warning threshold upper boundary = 1.0
2018-11-08 17:49:08,845 - threshold.py__parse_threshold__:107 - DEBUG - warning threshold lower boundary = None
2018-11-08 17:49:08,845 - utils.pylog_option:2213 - INFO - warning: 1
2018-11-08 17:49:08,845 - threshold.pyinit:50 - DEBUG - critical threshold simple = upper
2018-11-08 17:49:08,845 - threshold.pyinit:51 - DEBUG - critical threshold positive = True
2018-11-08 17:49:08,845 - threshold.pyinit:52 - DEBUG - critical threshold integer = True
2018-11-08 17:49:08,845 - threshold.pyinit:53 - DEBUG - critical threshold min = None
2018-11-08 17:49:08,845 - threshold.pyinit:54 - DEBUG - critical threshold max = None
2018-11-08 17:49:08,845 - threshold.py__parse_threshold__:72 - DEBUG - critical threshold given = '2'
2018-11-08 17:49:08,845 - threshold.py__parse_threshold__:106 - DEBUG - critical threshold upper boundary = 2.0
2018-11-08 17:49:08,845 - threshold.py__parse_threshold__:107 - DEBUG - critical threshold lower boundary = None
2018-11-08 17:49:08,845 - utils.pylog_option:2213 - INFO - critical: 2
2018-11-08 17:49:08,845 - pubsub_nagiosplugin.pyrun:117 - INFO - subscribing
2018-11-08 17:49:09,263 - check_kafka.pysubscribe:273 - DEBUG - partition assignments: set([])
2018-11-08 17:49:09,263 - check_kafka.pysubscribe:279 - DEBUG - assigning partition 0 to consumer
2018-11-08 17:49:09,264 - check_kafka.pysubscribe:282 - DEBUG - partition assignments: set([TopicPartition(topic='nagios', partition=0)])
2018-11-08 17:49:09,264 - check_kafka.pysubscribe:284 - DEBUG - getting current offset
2018-11-08 17:49:09,320 - check_kafka.pysubscribe:292 - DEBUG - recorded starting offset '4576'
2018-11-08 17:49:09,320 - pubsub_nagiosplugin.pyrun:119 - INFO - publishing message "Test message from Hari Sekhon check_kafka.py on host 78128252a773 at epoch 1541699347.71 (Thu Nov 8 17:49:07 2018) with random token 'XW4uSVmnK6NUfeWd0Xid'"
2018-11-08 17:49:09,320 - check_kafka.pypublish:296 - DEBUG - creating producer
2018-11-08 17:49:09,722 - check_kafka.pypublish:308 - DEBUG - producer.send()
2018-11-08 17:49:09,722 - check_kafka.pypublish:315 - DEBUG - producer.flush()
2018-11-08 17:49:09,738 - pubsub_nagiosplugin.pyrun:124 - INFO - published in 0.418 secs
2018-11-08 17:49:09,738 - pubsub_nagiosplugin.pyrun:129 - INFO - consuming message
2018-11-08 17:49:09,738 - check_kafka.pyconsume:320 - DEBUG - consumer.seek(4576)
2018-11-08 17:49:09,738 - check_kafka.pyconsume:323 - DEBUG - consumer.poll(timeout_ms=4500.0)
2018-11-08 17:49:09,796 - check_kafka.pyconsume:325 - DEBUG - msg object returned: {TopicPartition(topic=u'nagios', partition=0): [ConsumerRecord(topic=u'nagios', partition=0, offset=4576, timestamp=1541699349722, timestamp_type=0, key='check_kafka.py-zUZ7gE8Gv6Sy2HrpXqnN', value="Test message from Hari Sekhon check_kafka.py on host 78128252a773 at epoch 1541699347.71 (Thu Nov 8 17:49:07 2018) with random token 'XW4uSVmnK6NUfeWd0Xid'", checksum=None, serialized_key_size=35, serialized_value_size=156)]}
2018-11-08 17:49:09,796 - pubsub_nagiosplugin.pyrun:133 - INFO - consumed in 0.058 secs
2018-11-08 17:49:09,796 - pubsub_nagiosplugin.pyrun:134 - INFO - consumed message = "Test message from Hari Sekhon check_kafka.py on host 78128252a773 at epoch 1541699347.71 (Thu Nov 8 17:49:07 2018) with random token 'XW4uSVmnK6NUfeWd0Xid'"
2018-11-08 17:49:09,796 - pubsub_nagiosplugin.pyend:156 - INFO - checking consumed message "Test message from Hari Sekhon check_kafka.py on host 78128252a773 at epoch 1541699347.71 (Thu Nov 8 17:49:07 2018) with random token 'XW4uSVmnK6NUfeWd0Xid'" == published message "Test message from Hari Sekhon check_kafka.py on host 78128252a773 at epoch 1541699347.71 (Thu Nov 8 17:49:07 2018) with random token 'XW4uSVmnK6NUfeWd0Xid'"
OK: Kafka message published and consumed back successfully, published in 0.418 secs, consumed in 0.058 secs, total time = 0.951 secs | publish_time=0.418s;1;2 consume_time=0.058s;1;2 total_time=0.951s
user@78152a7:/usr/local/nagios/libexec/nagios-plugins#
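As an aside for readers of this output: the trailing perfdata (publish_time=0.418s;1;2 etc.) follows the Nagios value;warn;crit convention, and each value is checked against simple upper boundaries. A sketch of that evaluation logic (assumed for illustration, not the plugin's actual code):

```python
# Sketch of Nagios-style upper-boundary threshold evaluation matching the
# perfdata format "value;warn;crit" seen in the plugin output above.
# Assumed logic for illustration, not the plugin's actual implementation.

def evaluate(value: float, warning: float, critical: float) -> str:
    if value > critical:
        return "CRITICAL"
    if value > warning:
        return "WARNING"
    return "OK"

def perfdata(label: str, value: float, warning: float, critical: float) -> str:
    return f"{label}={value}s;{warning};{critical}"

print(evaluate(0.418, 1, 2))                  # OK
print(perfdata("publish_time", 0.418, 1, 2))  # publish_time=0.418s;1;2
```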

And here is the output for the Perl version:

user@8122a7:/usr/local/nagios/libexec/nagios-plugins# ./check_kafka.pl -v -v -v -H IBUS-ibus-1 -P 9092 -T nagios
verbose mode on

check_kafka.pl version 0.3 => Hari Sekhon Utils version 1.19.2

host:                     IBUS-ibus-1
port:                     9092
topic:                    nagios
required acks:            1
send-max-attempts:        1
receive-max-attempts:     1
retry-backoff:            200
sleep:                    0.5

setting timeout to 10 secs

connecting to Kafka broker at IBUS-ibus-1:9092

Metadata:

Kafka topic 'AC_ADAPTER_COMMAND_RESPONSE_SMS_FINCH' partitions:
Partition: 0  Replicas: 1,2,3  ISR: 1,3,2  Leader: 1
Partition: 1  Replicas: 2,3,1  ISR: 3,2,1  Leader: 2
Partition: 2  Replicas: 3,1,2  ISR: 3,2,1  Leader: 3
Partition: 3  Replicas: 1,3,2  ISR: 1,3,2  Leader: 1
Partition: 4  Replicas: 2,1,3  ISR: 1,3,2  Leader: 2
Partition: 5  Replicas: 3,2,1  ISR: 3,2,1  Leader: 3
Partition: 6  Replicas: 1,2,3  ISR: 1,3,2  Leader: 1
Partition: 7  Replicas: 2,3,1  ISR: 3,2,1  Leader: 2

Kafka topic 'SP.CVPECUNOREQ' partitions:
UNKNOWN: 'SP.CVPECUNOREQ' 'SP' field not found. API may have changed. Please try latest version from https://github.com/harisekhon/nagios-plugins, re-run on command line with -vvv and if problem persists paste full output from -vvv mode in to a ticket requesting a fix/update at https://github.com/harisekhon/nagios-plugins/issues/new
user@78152a:/usr/local/nagios/libexec/nagios-plugins#
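For context on what the -R ISR mode is checking against metadata like the partition listing above: a partition is under-replicated when its in-sync replica set (ISR) is missing members of its full replica set. A sketch of that comparison (assumed logic for illustration, not check_kafka.pl's actual implementation):

```python
# Sketch of an ISR health check over topic metadata like that printed above:
# a partition is under-replicated when its in-sync replica set (ISR) does
# not cover all of its assigned replicas. Assumed logic for illustration.

def under_replicated(partitions: dict) -> list:
    """Return partition ids whose ISR does not cover all replicas."""
    return [pid for pid, (replicas, isr) in sorted(partitions.items())
            if set(replicas) - set(isr)]

# Partition id -> (replicas, ISR), loosely based on partitions 0 and 1 above
partitions = {
    0: ([1, 2, 3], [1, 3, 2]),
    1: ([2, 3, 1], [3, 2]),  # hypothetical: replica 1 dropped out of the ISR
}
print(under_replicated(partitions))  # [1]
```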

ethan-riskiq commented 5 years ago

Having the same issue with our production check, although it only appears to be happening on one host. I will paste debugging output here when I can reproduce it from the CLI.

ethan-riskiq commented 5 years ago

verbose mode on

check_kafka.pl version 0.2.6  =>  Hari Sekhon Utils version 1.18.6

host:                     br01
port:                     6667
topic:                    nagios
partition:                0
required acks:            1
send-max-attempts:        1
receive-max-attempts:     1
retry-backoff:            200
sleep:                    0.5

setting timeout to 60 secs

connecting to Kafka broker at br01:6667
CRITICAL: failed to get metadata, broker offline or wrong port? (some deployments use 9092, some such as Hortonworks use 6667)

real    0m8.236s
user    0m0.361s
sys 0m0.049s

[root@mon1 ~]# time /usr/lib64/nagios/nagios-plugins/check_kafka.pl -v -v -H br01 -P 6667 --topic nagios
verbose mode on

host:                     br01
port:                     6667
topic:                    nagios
partition:                0
required acks:            1
send-max-attempts:        1
receive-max-attempts:     1
retry-backoff:            200
sleep:                    0.5

setting timeout to 10 secs

connecting to Kafka broker at br01:6667
connecting producer
connecting consumer
CRITICAL: Error: Can't get metadata: topic = 'nagios'

real    0m7.478s
user    0m0.374s
sys 0m0.041s
ethan-riskiq commented 5 years ago

The port is definitely online and the service is responsive.
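One quick way to verify that the port accepts TCP connections independently of the Kafka client library (a minimal sketch; this confirms reachability only, not broker health, so it succeeding while the plugin fails points at the client/protocol layer rather than networking):

```python
import socket

# Minimal TCP reachability probe: confirms the port accepts connections,
# independently of the Kafka wire protocol. Illustrative sketch only.

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (hostname/port from this thread):
# port_open("br01", 6667)
```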

HariSekhon commented 5 years ago

@ethan-riskiq Did you try the python check_kafka.py plugin to see if it gives a more informative error than the Perl API is returning?

HariSekhon commented 5 years ago

Also, I think you should use one more -v switch; three levels of verbosity will enable debug output, including more output from the API.

ethan-riskiq commented 5 years ago

I have not been able to reproduce the issue via the Python version of the script. I've added a "-new" check that uses the Python version of the script and will see whether similar issues occur when the Perl script returns this error.

ethan-riskiq commented 5 years ago

Gist of the Python debug output: https://gist.github.com/ethan-riskiq/25a2168b8143c8a59c807c41344154dc
I got an "UnknownError: failed to find matching consumer record with key" error.

HariSekhon commented 5 years ago

That's an old version of check_kafka.py - 0.3.9, current is 0.5.3.

Can you please run "make update" and then try again with the latest version, so that the traceback matches the current code for debugging?