Open PenelopeFudd opened 2 months ago
Thank you for this type of feedback, it helps improve the tool.
Here are a few points to investigate:
Ok, will test.
Er, how does one run Go profiling?
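For reference, a minimal sketch of heap profiling in a Go program using the standard `runtime/pprof` package. This is a generic illustration, not go-dnscollector's own profiling hook (whether and how the collector exposes one is not stated in this thread); the `heapProfile` helper and the `heap.pprof` file name are made up for the example.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
)

// heapProfile returns a snapshot of the current heap profile in the
// format understood by `go tool pprof`. (Hypothetical helper name.)
func heapProfile() ([]byte, error) {
	runtime.GC() // run a collection first so the statistics are up to date
	var buf bytes.Buffer
	if err := pprof.WriteHeapProfile(&buf); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	data, err := heapProfile()
	if err != nil {
		fmt.Println("heap profile failed:", err)
		return
	}
	// heap.pprof is an arbitrary file name chosen for this sketch.
	if err := os.WriteFile("heap.pprof", data, 0o644); err != nil {
		fmt.Println("write failed:", err)
		return
	}
	fmt.Printf("wrote %d bytes to heap.pprof\n", len(data))
}
```

The resulting file can then be inspected with `go tool pprof heap.pprof`, where the `top` command lists the allocation sites holding the most memory.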
It looks like making those changes has sharply curtailed the memory usage; it's now about 125-148MB, although it goes up and down.
At the moment, each dnsdist frontend is taking 5.7-6.6% of CPU, while each dnscollector backend is taking 3.0-3.7%. With 10 of each, that adds up to a lot. Filebeat is currently taking 1.2GB of memory and 3.3% of CPU.
Is there a way to increase the number of dnscollector threads, so I don't need to run ten copies? Alternatively, is there a way to split a dnstap stream into multiple dnscollectors with some kind of load balancer?
Running 10 dnsdists and dnscollectors uses 3.89GB (12%), so setting buffers to 4096 and commenting out the transforms really helped. Running 5 dnsdists and dnscollectors uses 3.32GB (11%), so not much change there.
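A rough sketch of why a smaller buffer bounds memory but can produce drops: a fixed-capacity Go channel holds at most `capacity` messages, and a non-blocking send discards anything that arrives while it is full. This illustrates the general technique only and is an assumption about the mechanism, not go-dnscollector's actual code; `offer` and `fill` are hypothetical names.

```go
package main

import "fmt"

// offer tries to enqueue msg without blocking. It reports false when the
// queue is full, in which case the message is dropped.
func offer(queue chan<- string, msg string) bool {
	select {
	case queue <- msg:
		return true
	default:
		return false
	}
}

// fill offers n messages to a queue of the given capacity that nobody is
// draining, and returns how many were queued and how many were dropped.
func fill(capacity, n int) (queued, dropped int) {
	queue := make(chan string, capacity)
	for i := 0; i < n; i++ {
		if offer(queue, "dnsmessage") {
			queued++
		} else {
			dropped++
		}
	}
	return queued, dropped
}

func main() {
	// 4096 matches the buffer size mentioned above; 10000 is arbitrary.
	queued, dropped := fill(4096, 10000)
	fmt.Printf("queued=%d dropped=%d\n", queued, dropped)
	// prints "queued=4096 dropped=5904"
}
```

With a stalled consumer, memory held by the queue never exceeds the configured capacity, at the price of dropped messages once it fills.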
Having only 5 dnsdists and dnscollectors changed the CPU numbers a bit:
Since the results are so close, I'll have to check the logs for discards and other warnings to see what to do next.
Thanks!
Oh, there are definitely errors:
$ journalctl -S '20 seconds ago' -u dnscollector@backend1 -o json | jq -r .MESSAGE
ERROR: 2024/09/15 22:58:02.848611 worker - [tap] dnstap - worker[dnstap-processor] buffer is full, 243841 dnsmessage(s) dropped
ERROR: 2024/09/15 22:58:03.830434 worker - [tap] (conn #1) dnstap processor - worker[filebeat] buffer is full, 108452 dnsmessage(s) dropped
ERROR: 2024/09/15 22:58:12.854694 worker - [tap] dnstap - worker[dnstap-processor] buffer is full, 282473 dnsmessage(s) dropped
ERROR: 2024/09/15 22:58:13.832142 worker - [tap] (conn #1) dnstap processor - worker[filebeat] buffer is full, 96872 dnsmessage(s) dropped
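To see which stage is losing the most, the dropped counts can be totalled per worker. The sketch below parses the four lines above with a regular expression written for this message format as it appears here; it assumes each line reports the drops since the previous report, which the thread does not confirm. `sumDrops` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// dropRe captures the worker name and the dropped count from lines like
// "... worker[filebeat] buffer is full, 108452 dnsmessage(s) dropped".
var dropRe = regexp.MustCompile(`worker\[([^\]]+)\] buffer is full, (\d+) dnsmessage\(s\) dropped`)

// sumDrops returns the total number of dropped messages per worker name.
// Lines that do not match the pattern are ignored.
func sumDrops(lines []string) map[string]int {
	totals := make(map[string]int)
	for _, line := range lines {
		m := dropRe.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		n, err := strconv.Atoi(m[2])
		if err != nil {
			continue
		}
		totals[m[1]] += n
	}
	return totals
}

func main() {
	lines := []string{
		"ERROR: 2024/09/15 22:58:02.848611 worker - [tap] dnstap - worker[dnstap-processor] buffer is full, 243841 dnsmessage(s) dropped",
		"ERROR: 2024/09/15 22:58:03.830434 worker - [tap] (conn #1) dnstap processor - worker[filebeat] buffer is full, 108452 dnsmessage(s) dropped",
		"ERROR: 2024/09/15 22:58:12.854694 worker - [tap] dnstap - worker[dnstap-processor] buffer is full, 282473 dnsmessage(s) dropped",
		"ERROR: 2024/09/15 22:58:13.832142 worker - [tap] (conn #1) dnstap processor - worker[filebeat] buffer is full, 96872 dnsmessage(s) dropped",
	}
	for worker, total := range sumDrops(lines) {
		fmt.Printf("%s: %d dropped\n", worker, total)
	}
}
```

For these four lines the totals are 526314 for dnstap-processor and 205324 for filebeat.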
Partial output of nethogs:
PID USER Program Device Sent Received
1338604 _dnsdi.. /usr/bin/dnsdist lo 8323.941 0.000 KB/sec
1338605 _dnsdi.. /usr/bin/dnsdist lo 8273.483 0.000 KB/sec
1338602 _dnsdi.. /usr/bin/dnsdist lo 8028.614 0.000 KB/sec
1338603 _dnsdi.. /usr/bin/dnsdist lo 7926.192 0.000 KB/sec
1338571 _dnsdi.. /usr/bin/dnsdist lo 7519.153 0.000 KB/sec
1338606 _dnsdi.. /usr/bin/dnsdist lo 6912.262 0.000 KB/sec
1338607 _dnsdi.. /usr/bin/dnsdist lo 6701.061 0.000 KB/sec
1338559 _dnsdi.. /usr/bin/dnsdist lo 6670.163 0.000 KB/sec
1338567 _dnsdi.. /usr/bin/dnsdist lo 5766.493 0.000 KB/sec
1338443 dnscol.. /usr/bin/go-dnscollector lo 10.648 0.000 KB/sec
1338420 dnscol.. /usr/bin/go-dnscollector lo 9.384 0.000 KB/sec
1338415 dnscol.. /usr/bin/go-dnscollector lo 9.371 0.000 KB/sec
1338425 dnscol.. /usr/bin/go-dnscollector lo 9.309 0.000 KB/sec
1338410 dnscol.. /usr/bin/go-dnscollector lo 8.985 0.000 KB/sec
1338426 dnscol.. /usr/bin/go-dnscollector lo 8.985 0.000 KB/sec
1338409 dnscol.. /usr/bin/go-dnscollector lo 8.366 0.000 KB/sec
1338465 dnscol.. /usr/bin/go-dnscollector lo 8.126 0.000 KB/sec
1338464 dnscol.. /usr/bin/go-dnscollector lo 8.031 0.000 KB/sec
1338447 dnscol.. /usr/bin/go-dnscollector lo 7.515 0.000 KB/sec
Given that dnsdist is sending 8000KB/sec of dnstap logs to dnscollector, and dnscollector is sending 10KB/sec of logs to filebeat, I'm thinking that filebeat is not accepting the logs fast enough, by a factor of about 800. And if dnstap is binary and the logs sent to filebeat are json, that factor is even bigger on a per-packet basis.
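The factor can be checked against the nethogs figures rather than the rounded 8000 and 10. The sketch below sums the Sent column for each program and divides; note that the partial output lists nine dnsdist processes and ten go-dnscollector processes, so the totals are only approximate. `sum` and `ratio` are helper names invented for this example.

```go
package main

import "fmt"

// sum adds up a slice of rates in KB/sec.
func sum(xs []float64) float64 {
	total := 0.0
	for _, x := range xs {
		total += x
	}
	return total
}

// ratio reports how many times larger in is than out.
func ratio(in, out float64) float64 {
	return in / out
}

func main() {
	// Sent column (KB/sec) copied from the nethogs output above.
	dnsdistSent := []float64{
		8323.941, 8273.483, 8028.614, 7926.192, 7519.153,
		6912.262, 6701.061, 6670.163, 5766.493,
	}
	collectorSent := []float64{
		10.648, 9.384, 9.371, 9.309, 8.985,
		8.985, 8.366, 8.126, 8.031, 7.515,
	}

	in, out := sum(dnsdistSent), sum(collectorSent)
	fmt.Printf("dnsdist total:        %.1f KB/sec\n", in)
	fmt.Printf("dnscollector total:   %.1f KB/sec\n", out)
	fmt.Printf("ratio (in/out):       about %.0fx\n", ratio(in, out))
	fmt.Printf("ratio, rounded rates: %.0fx\n", ratio(8000, 10))
}
```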
The JSON format can be resource-intensive. You can review a benchmark. As you can see, the basic text inline format is the most efficient in terms of CPU usage.
Is there a way to increase the number of dnscollector threads, so I don't need to run ten copies?
Goroutines are used for each incoming connection, so you can use a single binary and listen on multiple ports simultaneously, effectively handling multiple streams within the same instance.
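The general pattern described here, one process with several listeners and one goroutine per accepted connection, can be sketched as follows. This is a simplified illustration and not go-dnscollector's code: it counts newline-delimited records instead of decoding dnstap frames, it listens on ephemeral loopback ports, and `serve`, `handle`, and `runDemo` are hypothetical names.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
)

// handle reads newline-delimited records from one connection until EOF
// and reports how many it saw. Real dnstap traffic is framed differently;
// lines are a stand-in to keep the sketch short.
func handle(conn net.Conn, counts chan<- int) {
	defer conn.Close()
	n := 0
	sc := bufio.NewScanner(conn)
	for sc.Scan() {
		n++
	}
	counts <- n
}

// serve accepts connections on ln and handles each in its own goroutine.
// It returns when the listener is closed.
func serve(ln net.Listener, counts chan<- int) {
	for {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		go handle(conn, counts)
	}
}

// runDemo opens nListeners listeners in this one process, sends
// recordsPerConn records to each, and returns the total received.
func runDemo(nListeners, recordsPerConn int) (int, error) {
	counts := make(chan int, nListeners)
	for i := 0; i < nListeners; i++ {
		ln, err := net.Listen("tcp", "127.0.0.1:0") // port 0: any free port
		if err != nil {
			return 0, err
		}
		defer ln.Close()
		go serve(ln, counts)

		// Stand-in for one dnsdist sender connecting to this listener.
		conn, err := net.Dial("tcp", ln.Addr().String())
		if err != nil {
			return 0, err
		}
		for j := 0; j < recordsPerConn; j++ {
			fmt.Fprintf(conn, "record %d\n", j)
		}
		conn.Close()
	}
	total := 0
	for i := 0; i < nListeners; i++ {
		total += <-counts
	}
	return total, nil
}

func main() {
	total, err := runDemo(2, 3)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("records received:", total) // prints "records received: 6"
}
```

Because each connection gets its own goroutine, the Go scheduler can spread the streams across CPU cores without running separate processes.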
Alternatively, is there a way to split a dnstap stream into multiple dnscollectors with some kind of load balancer?
Currently, no. You can use the DNStap proxifier to duplicate the data flow, but it does not support splitting streams across multiple collectors. Feel free to open a feature request to track this enhancement.
Describe the bug
As we use go-dnscollector, it uses more and more memory until it's killed by the OOM killer.
We initially added LimitDATA=5500M and later MemoryMax=1500M to the systemd service unit file. Then we graphed MemoryCurrent over time. The process ran out of memory after 45 seconds.
To Reproduce
We have 10 copies of dnsdist running, each sending dnstap logs to a separate copy of dnscollector.
Excerpt from /etc/dnsdist/dnsdist-backend1.conf:
Contents of /etc/dnscollector/config-backend1.yml:
The reason we have 10 copies of dnsdist + dnscollector is because dnsdist started discarding dnstap records under high load, as dnscollector wasn't keeping up. If we could have configured dnscollector to start ten threads, or dnsdist to send dnstap records to ten different backends, we would have. The goal is to handle 200k DNS requests per second, which ends up being 400k records per second.
Expected behavior
No memory leaking. I can't imagine what dnscollector could be retaining records for apart from latency calculations, and those expire after 2 seconds.
Additional context
Version
bacdb535927b96f4c1fde0b753ba55f0189c2199
plus one patch: