SeisoLLC / zeek-kafka

A Zeek log writer plugin that publishes to Kafka.
Apache License 2.0

Logger crashes when sending large amounts of data through kafka #22

Closed: emnahum closed this issue 3 years ago

emnahum commented 3 years ago

Summary of the issue

Logger crashes when sending large amounts of data through kafka

Expected behavior

Logger should not crash, just work

Steps to reproduce

1. Compile Zeek.
2. zkg install zeek-kafka.
3. Set up a cluster (details below).
4. Wait about 5-10 minutes.
5. The logger crashes with the following output from zeekctl:

```
=== reporter.log
1619025854.806638 Reporter::ERROR conn/Log::WRITER_KAFKAWRITER: Kafka send failed: Local: Queue full (empty)
1619025854.806638 Reporter::ERROR conn/Log::WRITER_KAFKAWRITER: Kafka send failed: Local: Queue full (empty)
1619025854.806638 Reporter::ERROR conn/Log::WRITER_KAFKAWRITER: Kafka send failed: Local: Queue full (empty)
1619025854.806638 Reporter::ERROR conn/Log::WRITER_KAFKAWRITER: Kafka send failed: Local: Queue full (empty)
1619025854.806638 Reporter::ERROR conn/Log::WRITER_KAFKAWRITER: Kafka send failed: Local: Queue full (empty)
1619025854.806638 Reporter::ERROR conn/Log::WRITER_KAFKAWRITER: Kafka send failed: Local: Queue full (empty)
1619025854.806638 Reporter::ERROR conn/Log::WRITER_KAFKAWRITER: Kafka send failed: Local: Queue full (empty)
1619025854.806638 Reporter::ERROR conn/Log::WRITER_KAFKAWRITER: Kafka send failed: Local: Queue full (empty)
1619025854.806638 Reporter::ERROR conn/Log::WRITER_KAFKAWRITER: Kafka send failed: Local: Queue full (empty)
1619025854.806638 Reporter::ERROR conn/Log::WRITER_KAFKAWRITER: Kafka send failed: Local: Queue full (empty)
```

The "Kafka send failed" message comes from line 270 of KafkaWriter.cc in the zeek-kafka source.

node.cfg:

```
[logger]
type=logger
host=localhost

[manager]
type=manager
host=localhost

[proxy-1]
type=proxy
host=localhost

[worker-1]
type=worker
host=localhost
interface=ens1f0
lb_method=pf_ring
lb_procs=22
pin_cpus=1,2,3,4,5,6,7,8,9,10,11,12,14,15,16,17,18,19,20,21,23,24

[worker-2]
type=worker
host=localhost
interface=ens1f1
lb_method=pf_ring
lb_procs=22
pin_cpus=25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46
```

Where applicable, consider providing a patch that uses the end-to-end testing environment.

Logs, errors, etc.

...

Your environment

zeek 3.0.12
zeek-kafka 0.3.0 (whatever is installed via zkg)
librdkafka ~1.4.2
Ubuntu 20.04.02 (Linux 5.4.0-70-generic x86_64)
pf_ring 7.8

ottobackwards commented 3 years ago

This isn't crashing; it is handling the reply from RDKafka. Can you change the title? This error means that Zeek is writing more than RDKafka can handle, I believe.

emnahum commented 3 years ago

Well, the logger crashes; this is the output I get from "diag logger". What is a better title? Is the problem that Kafka can't handle the load, or that the receiver isn't pulling fast enough?

ottobackwards commented 3 years ago

Well, there is no "crash" here that I can see. The plugin is handling the error and logging a message. Can you add the crash information? I need to check to see if we are returning the correct value from the method based on what is happening API-wise. It is possible that we are, and that the 'logger' issue will have to be handled by zeek itself.

Kafka has an internal queue it uses. When you call produce, messages are put there and usually sent right away. That queue is also where delivery notifications are kept, which is why we call poll() as well as produce. Between those two things (what you are putting in the queue and the delivery notifications), that may be why the queue is filling up.
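
To make the produce/poll relationship concrete, here is a minimal sketch using the librdkafka C API. This is not the plugin's actual code: the handles, the helper name, and the 100 ms poll timeout are illustrative assumptions.

```cpp
// Minimal sketch of the produce/poll pattern described above, using the
// librdkafka C API. NOT the plugin's actual code; handles and helper name
// are illustrative.
#include <cstddef>
#include <librdkafka/rdkafka.h>

void send_one(rd_kafka_t* rk, rd_kafka_topic_t* rkt, void* payload, size_t len)
{
    if ( rd_kafka_produce(rkt, RD_KAFKA_PARTITION_UA, RD_KAFKA_MSG_F_COPY,
                          payload, len,
                          /* key */ nullptr, /* keylen */ 0,
                          /* msg_opaque */ nullptr) == -1 )
    {
        if ( rd_kafka_last_error() == RD_KAFKA_RESP_ERR__QUEUE_FULL )
        {
            // Local queue is full: serve delivery reports so it can drain,
            // then a caller could retry or report the failure (the latter is
            // roughly what the "Kafka send failed" message corresponds to).
            rd_kafka_poll(rk, 100 /* block up to 100 ms */);
        }
    }

    // Serve delivery-report callbacks without blocking; this is the poll()
    // that keeps delivery notifications from piling up in the internal queue.
    rd_kafka_poll(rk, 0);
}
```

Whether to block, retry, or drop when the queue is full is exactly the tuning question discussed below.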

I'm not sure why the queue is not emptying. Network issues? What is your plugin configuration? Are there other errors before this? How much data are you trying to send? Issues on the brokers?

This looks like a common Kafka problem. In the plugin, what we need to do is:

If all those things were true, the behavior would be better defined, and you would be able to tune it for your needs. But in the end, you will need to understand your Kafka system.

@JonZeolla ?

emnahum commented 3 years ago

Well, as far as the crash info goes, zeekctl says that the logger has crashed, but I don't find any core file, despite setting ulimit -c and kernel.core_pattern = core.%e-%t-%p. I suppose I should ask the Slack channel about that.

Plugins: I don't have any other plugins.
Network issues: maybe; the producer and the broker are on opposite coasts.
Other errors: none that I see.
Data: I'm capturing two 10 Gb/s links on a 48-core Intel E5-2650 with 128 GB RAM, 22 cores for each interface (see node.cfg above). The capturing machine (the producer) is running at a load of about 43. The broker is an 80-core Intel 6248 with 192 GB RAM, running at a load of about 3, so it is barely utilized. There is one broker on the 80-core machine. I've run the experiment with a consumer (logstash) and without, and it seems to make no difference.

So you're probably right that I will need to understand my kafka system, and the plugin is not the problem.

ottobackwards commented 3 years ago

Well, I'm not saying that there isn't some problem with the plugin. Please don't get me wrong, I'm not trying to brush this issue off.

I don't see that we do anything in the plugin to handle this situation, and it would make a lot of sense if we did handle it per best practices for Kafka producers.

emnahum commented 3 years ago

Justin Azoff pointed out on the Zeek Slack channel that the Linux OOM killer killed the logger process; the zeek-kafka plugin did not crash it. So this is not a bug (for zeek-kafka, at least).

I also ran the system with just 10 Gbps overnight and it handled the load just fine. So I need to do some load testing to figure out how far I can push it.

JonZeolla commented 3 years ago

If there is any librdkafka tuning that makes sense for your environment, you can pass it using https://github.com/SeisoLLC/zeek-kafka#kafka_conf. For example:
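
A minimal sketch of what that could look like, following the kafka_conf format from the project README. The broker address and the tuning values are placeholders to adapt, not recommendations; the property names are standard librdkafka producer options.

```zeek
# Illustrative only: raise the local producer queue limits and batch a little
# longer before sending. Values here are placeholders, not recommendations.
redef Kafka::kafka_conf = table(
    ["metadata.broker.list"]         = "broker.example.com:9092", # placeholder broker
    ["queue.buffering.max.messages"] = "500000",   # librdkafka default is 100000
    ["queue.buffering.max.kbytes"]   = "1048576",  # total size limit of the local queue, in KB
    ["queue.buffering.max.ms"]       = "100"       # wait up to 100 ms to batch messages
);
```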

JonZeolla commented 3 years ago

@ottobackwards I agree that it makes sense to look into a few things to confirm it's optimal.

ottobackwards commented 3 years ago

The errors are pretty common; suggestions abound on setting batching or other things to help it out, and those could be passed in the way that Jon describes.