emnahum closed this issue 3 years ago
This isn't a crash; the plugin is handling the reply from librdkafka. Can you change the title? This error means that Zeek is writing more than librdkafka can handle, I believe.
Well, the logger does crash; this is the output I get from "diag logger". What is a better title? Is the problem that Kafka can't handle the load, or that the receiver isn't pulling fast enough?
Well, there is no "crash" here that I can see. The plugin is handling the error and logging a message. Can you add the crash information? I need to check to see if we are returning the correct value from the method based on what is happening API-wise. It is possible that we are, and that the 'logger' issue will have to be handled by zeek itself.
Kafka has an internal queue it uses: when you call produce(), messages are put there and usually sent right away. That queue is also where delivery notifications are kept, which is why we call poll() as well as produce(). Between those two things (the messages you are enqueuing and the delivery notifications), that may be why the queue is filling up.
I'm not sure why the queue is not emptying. Network issues? What is your plugin configuration? Are there other errors before this? How much data are you trying to send? Issues on the brokers?
This looks like a common Kafka problem. In the plugin, what we need to do is:
If all those things were true, the behavior would be better defined, and you would be able to tune it for your needs. But in the end you will need to understand your Kafka system.
@JonZeolla ?
Well, as far as the crash info goes, zeekctl says that the logger has crashed, but I can't find any core file, despite ulimit -c and kernel.core_pattern = core.%e-%t-%p. I suppose I should ask the Slack channel about that.
Plugins: I don't have any other plugins. Network issues: maybe; the producer and the broker are on opposite coasts. No other errors that I see. Data: I'm capturing two 10 Gb/s links on a 48-core Intel E5-2650 with 128 GB RAM, 22 cores for each interface (see node.cfg). The capturing machine (producer) is running at a load of about 43. The broker is an 80-core Intel 6248 with 192 GB RAM, running at a load of about 3, thus barely utilized. There is one broker on the 80-core machine. I've run the experiment with a consumer (logstash) and without, and it seems to make no difference.
So you're probably right that I will need to understand my kafka system, and the plugin is not the problem.
Well, I'm not saying that there isn't some problem with the plugin. Please don't get me wrong, I'm not trying to brush this issue off.
I don't see that we do anything in the plugin to handle this situation, and it would make a lot of sense if we handled it per best practices for Kafka producers.
Justin Azoff pointed out on the Zeek Slack channel that the OOM killer on Linux killed the logger process, not the zeek-kafka plugin. So this is not a bug (for zeek-kafka, at least).
I also ran the system with just 10 Gbps overnight and it handled the load just fine. So I need to do some load testing to figure out how far I can push it.
If there is any librdkafka tuning that makes sense for your environment you can pass it using https://github.com/SeisoLLC/zeek-kafka#kafka_conf
@ottobackwards I agree that it makes sense to look into a few things to confirm it's optimal.
The errors are pretty common; there are plenty of suggestions about enabling batching or other settings to help, and those could be passed in the way Jon describes.
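For example, a sketch of passing librdkafka batching and queue settings through the `Kafka::kafka_conf` redef documented in the zeek-kafka README (the broker address and all values below are illustrative placeholders to be tuned, not recommendations):

```zeek
# Illustrative local.zeek fragment; adjust values for your environment.
redef Kafka::kafka_conf = table(
    ["metadata.broker.list"] = "broker.example.com:9092",
    # Allow a larger producer queue before "Local: Queue full" is returned
    # (the librdkafka default is 100000 messages).
    ["queue.buffering.max.messages"] = "1000000",
    # Batch for up to 500 ms before sending, to cut per-message overhead.
    ["queue.buffering.max.ms"] = "500",
    # Compress batches to reduce cross-country bandwidth.
    ["compression.codec"] = "snappy"
);
```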
Summary of the issue
Logger crashes when sending large amounts of data through kafka
Expected behavior
Logger should not crash, just work
Steps to reproduce
1. Compile Zeek
2. zkg install zeek-kafka
3. Set up a cluster (details below)
4. Wait about 5-10 minutes
5. The logger crashes with the following output from zeekctl:
=== reporter.log
1619025854.806638 Reporter::ERROR conn/Log::WRITER_KAFKAWRITER: Kafka send failed: Local: Queue full (empty)
(the same line repeated 10 times)
The "Kafka send failed" message comes from line 270 of KafkaWriter.cc in the zeek-kafka source.
node.cfg:
[logger]
type=logger
host=localhost

[manager]
type=manager
host=localhost

[proxy-1]
type=proxy
host=localhost

[worker-1]
type=worker
host=localhost
interface=ens1f0
lb_method=pf_ring
lb_procs=22
pin_cpus=1,2,3,4,5,6,7,8,9,10,11,12,14,15,16,17,18,19,20,21,23,24

[worker-2]
type=worker
host=localhost
interface=ens1f1
lb_method=pf_ring
lb_procs=22
pin_cpus=25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46
Logs, errors, etc.
...
Your environment
zeek-3.0.12
zeek-kafka version 0.3.0 (whatever is installed via zkg)
librdkafka ~1.4.2
Ubuntu 20.04.02 (Linux 5.4.0-70-generic x86_64)
pf_ring 7.8