Burrow unreadiness when metrics sidecar gets OOMKilled

solsson commented 5 years ago

I'm logging this issue because there shouldn't be a relation between Burrow and the JMX exporter.

To reproduce:

Run Kafka with the metrics container, grant it only enough memory to start and run.
Run Burrow.
Hit the metrics endpoint on one broker so that the metrics container gets OOMKilled.
The broker pod will then be 1/2 ready.
Burrow typically shows unreadiness.

It's noteworthy that Burrow is configured to access brokers through headless service broker name resolution. That differs from the typical bootstrap process that Kafka clients use. However, bootstrap might also be affected, in particular if all metrics containers get OOMKilled at the same time. Until I read the librdkafka 1.0.0 release notes, I was unaware that bootstrap is a persistent connection.

solsson commented 5 years ago

Maybe we should use https://github.com/kubernetes/kubernetes/pull/63742 for the broker service. The current readiness probe on broker containers (any response on TCP port 9092) doesn't add any actual health checking.
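To illustrate why a plain TCP check says little about broker health, here is a minimal Python sketch. It is not the actual Kubernetes probe implementation; `tcp_probe` and `start_dummy_listener` are hypothetical names made up for this example. The probe only checks that a connection can be opened, so a socket that never speaks the Kafka protocol passes just as well.

```python
import socket


def tcp_probe(host, port, timeout=1.0):
    """Mimic a TCP-style readiness check: pass if a connection can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def start_dummy_listener():
    """Open a listening socket that speaks no protocol at all (hypothetical helper)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    return server, server.getsockname()[1]


if __name__ == "__main__":
    server, port = start_dummy_listener()
    # The listener is not Kafka, yet the TCP-only check is satisfied.
    print("listener up, probe passes:", tcp_probe("127.0.0.1", port))
    server.close()
    print("listener closed, probe passes:", tcp_probe("127.0.0.1", port))
```

A check that actually exercised the Kafka protocol (for example a metadata request) would be needed to tell a healthy broker from a merely open port.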

solsson commented 5 years ago

{"level":"info","ts":1553708082.5494027,"msg":"Recv loop terminated: err=read tcp 10.0.23.17:58414->10.3.255.250:2181: i/o timeout","type":"coordinator","name":"zookeeper"}
{"level":"info","ts":1553708082.5494552,"msg":"Send loop terminated: err=<nil>","type":"coordinator","name":"zookeeper"}
{"level":"info","ts":1553708083.6529753,"msg":"Connected to 10.3.255.250:2181","type":"coordinator","name":"zookeeper"}
{"level":"info","ts":1553708083.6588173,"msg":"Authenticated: id=245911530986602516, timeout=6000","type":"coordinator","name":"zookeeper"}
{"level":"info","ts":1553708083.6598747,"msg":"Re-submitting `0` credentials after reconnect","type":"coordinator","name":"zookeeper"}
{"level":"info","ts":1553709947.9208763,"msg":"cluster or consumer not found","type":"module","coordinator":"evaluator","class":"caching","name":"default","cluster":"local","consumer":"site-vcc-qa-kkv-userstate-5cf9d9d9d8-tpdn4-20190319t203919","showall":true}
{"level":"info","ts":1553710578.1879478,"msg":"cluster or consumer not found","type":"module","coordinator":"evaluator","class":"caching","name":"default","cluster":"local","consumer":"site-vcc-qa-kkv-userstate-5cf9d9d9d8-tpdn4-20190320t180610","showall":true}
{"level":"info","ts":1553712227.163901,"msg":"Recv loop terminated: err=read tcp 10.0.23.17:46212->10.3.255.250:2181: i/o timeout","type":"coordinator","name":"zookeeper"}
{"level":"info","ts":1553712227.1639652,"msg":"Send loop terminated: err=<nil>","type":"coordinator","name":"zookeeper"}
{"level":"info","ts":1553712256.1126459,"msg":"Shutdown triggered","type":"main","name":"burrow"}
{"level":"info","ts":1553712256.112681,"msg":"stopping","type":"coordinator","name":"consumer"}
{"level":"info","ts":1553712256.1126876,"msg":"stopping","type":"module","coordinator":"consumer","class":"kafka","name":"local"}
{"level":"info","ts":1553712258.7277691,"msg":"stopping","type":"module","coordinator":"consumer","class":"kafka_zk","name":"local_zk"}
{"level":"info","ts":1553712258.7414002,"msg":"Recv loop terminated: err=EOF","type":"module","coordinator":"consumer","class":"kafka_zk","name":"local_zk"}
{"level":"info","ts":1553712258.7414432,"msg":"Send loop terminated: err=<nil>","type":"module","coordinator":"consumer","class":"kafka_zk","name":"local_zk"}
{"level":"info","ts":1553712258.7414908,"msg":"stopping","type":"coordinator","name":"cluster"}
{"level":"info","ts":1553712258.7414985,"msg":"stopping","type":"module","coordinator":"cluster","class":"kafka","name":"local"}
{"level":"info","ts":1553712259.6486397,"msg":"Connected to 10.3.255.250:2181","type":"coordinator","name":"zookeeper"}
{"level":"info","ts":1553712259.6560361,"msg":"Authentication failed: zk: session has been expired by the server","type":"coordinator","name":"zookeeper"}
{"level":"error","ts":1553712259.6560826,"msg":"session expired","type":"coordinator","name":"zookeeper"}
{"level":"info","ts":1553712259.656111,"msg":"stopping evaluations","type":"coordinator","name":"notifier"}
{"level":"info","ts":1553712259.6727962,"msg":"stopping","type":"coordinator","name":"notifier"}
{"level":"info","ts":1553712259.672839,"msg":"shutdown","type":"coordinator","name":"httpserver"}
{"level":"info","ts":1553712259.6729908,"msg":"stopping","type":"coordinator","name":"evaluator"}
{"level":"info","ts":1553712259.6730008,"msg":"stopping","type":"module","coordinator":"evaluator","class":"caching","name":"default"}
{"level":"info","ts":1553712259.6730094,"msg":"stopping","type":"coordinator","name":"storage"}
{"level":"info","ts":1553712259.6730392,"msg":"stopping","type":"module","coordinator":"storage","class":"inmemory","name":"default"}
{"level":"info","ts":1553712259.6731503,"msg":"stopping","type":"coordinator","name":"zookeeper"}
Stopped Burrow at March 27, 2019 at 6:44pm (UTC)
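For reading the log above, a small Python sketch that converts the epoch `ts` field into UTC wall-clock time can help line the entries up with the "Stopped Burrow" timestamp. `parse_log_line` is a hypothetical helper written for this example; the sample lines are copied from the log.

```python
import json
from datetime import datetime, timezone


def parse_log_line(line):
    """Return (UTC datetime, level, message) for a JSON log line, else None."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None  # non-JSON lines such as "Stopped Burrow at ..."
    if not isinstance(record, dict) or "ts" not in record:
        return None
    when = datetime.fromtimestamp(record["ts"], tz=timezone.utc)
    return when, record.get("level", ""), record.get("msg", "")


SAMPLE = [
    '{"level":"info","ts":1553712227.163901,"msg":"Recv loop terminated: err=read tcp 10.0.23.17:46212->10.3.255.250:2181: i/o timeout","type":"coordinator","name":"zookeeper"}',
    '{"level":"info","ts":1553712256.1126459,"msg":"Shutdown triggered","type":"main","name":"burrow"}',
    '{"level":"error","ts":1553712259.6560826,"msg":"session expired","type":"coordinator","name":"zookeeper"}',
    "Stopped Burrow at March 27, 2019 at 6:44pm (UTC)",
]

if __name__ == "__main__":
    for line in SAMPLE:
        parsed = parse_log_line(line)
        if parsed is not None:
            when, level, msg = parsed
            print(when.strftime("%Y-%m-%d %H:%M:%S"), level, msg)
```

Read this way, "Shutdown triggered" falls at 18:44:16 UTC on March 27, 2019, about 29 seconds after the second ZooKeeper i/o timeout.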

weeco commented 5 years ago

@solsson I am hijacking this issue a bit, but since we had some challenges with Burrow, I wrote my own exporter for Kafka consumer group lags, which works similarly to Burrow. The benefits, however, are:

It exposes more valuable metrics (e.g. last commit timestamp for each consumergroup:topic:partition, partition and topic lag).
It was built to expose Prometheus metrics. That is an advantage: Burrow has to struggle with evaluation features, which I do not need to care about since Grafana alerting already offers them.
It has a couple more/other features which we needed in our environment (e.g. recognizing group versions in group names and adding this information as labels).
It only exposes Prometheus metrics once the __consumer_offsets topic has been initially consumed. If you expose metrics before you have initially consumed that topic (which may take hours, depending on the size of your topic), you'll get outdated consumer group offsets. This way you can run multiple instances of the Prometheus exporter and therefore provide highly available metrics which are correct (no more alerting because of high lags when Burrow restarts).
It consumes the __consumer_offsets topic faster than Burrow. I was able to process 150k messages / second with just 2 CPU cores.
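The idea of exposing metrics only after the offsets topic has been initially consumed could be sketched roughly like this in Python. This is an illustration under assumptions, not Kafka Minion's actual implementation; `OffsetsReadinessGate` and its methods are hypothetical names, and the sketch simplifies by treating a partition with end offset 0 as empty.

```python
class OffsetsReadinessGate:
    """Report ready only once every partition has been read up to the end
    offset that was recorded when consumption started (hypothetical sketch)."""

    def __init__(self, end_offsets_at_start):
        # partition -> offset the next written message would get, as seen at startup
        self._targets = dict(end_offsets_at_start)
        # Simplification: partitions with end offset 0 are empty, nothing to catch up on
        self._pending = {p for p, end in self._targets.items() if end > 0}

    def record(self, partition, offset):
        """Call for every consumed message with its partition and offset."""
        if partition in self._pending and offset + 1 >= self._targets[partition]:
            self._pending.discard(partition)

    def ready(self):
        """True once no partition is still catching up."""
        return not self._pending


if __name__ == "__main__":
    gate = OffsetsReadinessGate({0: 3, 1: 0})
    gate.record(0, 0)
    print("after first message:", gate.ready())
    gate.record(0, 2)
    print("after reaching the recorded end offset:", gate.ready())
```

An exporter built this way would answer its metrics endpoint (or a readiness check) only when `ready()` returns true, so a freshly restarted instance does not report stale offsets.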

I am actively developing it, and I plan to maintain it for a long time and thus I am happy to give support if you face any issues. Before writing articles about it / making it more public, I'd like to get more feedback about it. Are you interested in giving it a spin?

https://github.com/google-cloud-tools/kafka-minion

solsson commented 5 years ago

@weeco That's a most welcome initiative! I'm all for such ambitious hijacking of issues :) Yes, I'm/we're interested in giving it a spin. Is there a docker image and yamls to start from? If not, it sounds easy to set up, so I could probably create the PR.

weeco commented 5 years ago

Sure, we build docker containers using quay (it creates a docker tag for each release and builds "latest" every time we push something on master): https://quay.io/repository/google-cloud-tools/kafka-minion?tab=tags

docker pull quay.io/google-cloud-tools/kafka-minion:v0.1.1

Regarding deployment yamls: I am still working on Helm charts. They are missing some environment variables (primarily how to mount Kafka secrets): https://github.com/google-cloud-tools/kafka-minion-helm-chart , but they can give you a starting point for writing YAMLs. All configuration can be done via environment variables, and there is a table of all environment variables in Kafka Minion's readme.

Looking forward to your feedback :).

Yolean / kubernetes-kafka

Burrow unreadiness when metrics sidecar gets OOMKilled #255