dns-stats / compactor

Tools to capture DNS traffic and record it in C-DNS files.

compactor: Trying to find the bottleneck under high load #69

Closed: thibmac closed this issue 2 years ago

thibmac commented 3 years ago

Hello,

Not sure if a GitHub issue is the right place for my problem, but here it is:

I have recently been testing compactor under a high load of incoming queries. I am generating 120 kqps from 4 different sources towards a single destination where my DNS service is listening.

Here is my compactor config:

interface=eth1
log-network-stats-period=5
rotation-period=60
output=/var/log/cdns/%Y%m%d-%H%M%S.cdns
include=all
[rest are defaults]

For the first ~15-20 seconds, everything seems fine according to the logs and the integrated stats: the 120 kqps are being detected. After these few seconds, however, I notice the compactor is unable to keep up with the load, and the number of queries registered goes down:

Apr  1 07:38:09 my-server compactor[47345]: Total 601805 (120361 pkt/s), Dropped raw/ignored/C-DNS packets 0/0/0
Apr  1 07:38:09 my-server compactor[47345]: PCAP recv 884118 drop 13027 drop at iface 0
Apr  1 07:38:16 my-server compactor[47345]: Total 510311 (102062 pkt/s), Dropped raw/ignored/C-DNS packets 0/0/0
Apr  1 07:38:16 my-server compactor[47345]: PCAP recv 1780277 drop 363343 drop at iface 0
Apr  1 07:38:24 my-server compactor[47345]: Total 318789 (63757 pkt/s), Dropped raw/ignored/C-DNS packets 0/0/0
Apr  1 07:38:24 my-server compactor[47345]: PCAP recv 2691462 drop 952679 drop at iface 0
Apr  1 07:38:27 my-server compactor[47345]: Total 216812 (43362 pkt/s), Dropped raw/ignored/C-DNS packets 0/0/0
Apr  1 07:38:27 my-server compactor[47345]: PCAP recv 3058092 drop 1103025 drop at iface 0
Apr  1 07:38:33 my-server compactor[47345]: Total 315152 (63030 pkt/s), Dropped raw/ignored/C-DNS packets 0/0/0
Apr  1 07:38:33 my-server compactor[47345]: PCAP recv 3871527 drop 1603108 drop at iface 0

Overall, the drop rate with C-DNS capture is ~50%, but shouldn't we expect this kind of message at some point during the capture?

C-DNS overflow. Dropping query/response(s).

When setting the capture mode to pcap, I don't see more than 1% drops.

I am having trouble understanding how to avoid this when using C-DNS capture. Adding CPUs doesn't seem very effective, since I am seeing the same problem on a 6-CPU server. Disk I/O usage is very low, and CPU usage is sitting at around 25% usr on each CPU.

Playing with parameters other than the capture mode didn't help much either.

Is this a design limitation that we should expect here, or am I missing something?

Cheers.

banburybill commented 3 years ago

Did you build compactor yourself or install it from a package? If the latter, the packaging by default restricts compactor to 1 CPU when started as a service (the assumption being that it's running on the same server as the nameserver, and should not unduly take CPU from that). On 1 CPU, 120 kqps will overwhelm compactor, but I think it should be possible with more cores.

Internally compactor uses separate threads for sniffing the network, matching queries/responses and building C-DNS blocks, plus one thread for each active compression, if compression is being used. Data moves between the first three threads along Go-ish channels. By default (it's configurable) these are limited to 300k entries each. If a thread tries to add an entry to a channel and the channel is full, the overflow message is logged, but only on the first occurrence, so as not to explode the logs.

However, particularly on a single CPU with limited memory (the package also restricts memory use, for the same reasons as CPU), the overflow message doesn't get produced; I think this is because the CPU is fully committed and so pcap starts missing packets. That doesn't seem to match your observation that the CPU isn't more than 25% used, though.

So, if you did install from a package, check the systemd service file (/lib/systemd/system/dns-stats-compactor.service) and override the settings therein. Try running compactor by hand first.
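If the packaged unit does turn out to be the limiting factor, a systemd drop-in override is the cleanest way to relax it. As a sketch only - the directive names below (CPUAffinity, MemoryMax) are assumptions on my part, so check what the shipped unit file actually sets:

systemctl edit dns-stats-compactor
# then, in the override that opens:
[Service]
CPUAffinity=
MemoryMax=infinity

An empty CPUAffinity= clears any CPU pinning inherited from the packaged unit, and MemoryMax=infinity lifts a memory cap if one is set; then restart the service.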

Otherwise, the other performance tip is to experiment with the compression settings, if you're using compression. Our main deployment uses xz compression, but at level 2 rather than the default level 6; we found it makes a significant difference to the throughput we can sustain.
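For reference, dialling the preset down is just a couple of lines in the config file. I'm quoting the option names from memory, so treat them as an assumption and check compactor --help for the exact spelling:

xz-output=true
xz-preset=2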

Oh, and also if you're building compactor by hand, using tcmalloc rather than the stock glibc allocator makes a useful difference. I've not done much profiling, but what I have done suggests that compactor is spending significant time in heap management. But we've no pressing need at present to spend time addressing this...
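A quick way to try tcmalloc without rebuilding at all is to preload it; the library path and package name below are only an example and vary by distribution:

# Debian-ish example; adjust the path for your distro
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4 compactor <your usual options>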

thibmac commented 3 years ago

Thank you for the explanation.

We built compactor ourselves and it is packaged in our internal Debian repo.

Looking with pidstat, our compactor installation was using every available CPU by default, and that's where the 25% usage on each CPU of the machine comes from.

I have tried recompiling directly from source on the machine with tcmalloc, then using xz compression, and I am seeing similar results when launching the capture with the newly built compactor.

Testing the capture on multiple interfaces also gives the same exact behaviour.

Could it simply be a CPU speed problem here? I remain quite unsure what the problem is and what is causing the drops.

I'll keep testing different things and see what I can find.

Cheers

thibmac commented 3 years ago

Updating with more tests and a bit more understanding:

First of all, the 25% I mentioned was wrong: it is an average across all CPUs at the moment compactor is under high load. Each "active" CPU is running at 100%+ - see pidstat 1 -t -C comp below (average over 30 seconds):

Average:   UID      TGID       TID    %usr %system  %guest   %wait    %CPU   CPU  Command
Average:  5667    358500         -   73,91   67,09    0,00   14,10  141,00     -  compactor
Average:  5667         -    358500   37,79   32,78    0,00   14,10   70,57     -  |__compactor
Average:  5667         -    358501   18,27   15,20    0,00   23,96   33,48     -  |__comp:cdns-write
Average:  5667         -    358502   17,65   19,27    0,00   24,91   36,93     -  |__comp:sniffer

Here's the same command using the raw-pcap output instead of C-DNS:

Average:  5667    358400         -   33,67   52,18    0,00    8,71   85,84     -  compactor
Average:  5667         -    358400    7,35   14,61    0,00    8,71   21,96     -  |__compactor
Average:  5667         -    358401   11,43   13,97    0,00   13,34   25,41     -  |__comp:raw-pcap
Average:  5667         -    358402   14,52   23,96    0,00   20,96   38,48     -  |__comp:sniffer

Increasing the queue size is a good solution when the load comes in bursts of queries or packets, but under sustained load the queue eventually fills up anyway.

Cheers

banburybill commented 3 years ago

Since you're hitting queue overflow problems (is this with both raw-pcap and cdns?), I suspect that disc output simply isn't being written fast enough - it doesn't look like the write thread is under serious CPU pressure. Does iotop give any clues? In our main deployment we're writing to SSD.

thibmac commented 3 years ago

Overflow problems only seem to occur when writing to cdns here.

At 120k pkts/s, CPU usage of the compactor process with cdns is > 100%, while usage when writing pcap is around 85%. We are also writing to SSD and never noticed any I/O wait during the tests.

Both the cdns-write and raw-pcap processes have similar CPU usage, so I agree this does not look like a writing issue.

However, the CPU usage of the child "compactor" process shows a quite noticeable difference when switching from pcap to cdns:

raw-pcap -> 20%:

  Average:   UID      TGID       TID    %usr %system  %guest   %wait    %CPU   CPU  Command
  Average:  5667         -    358400    7,35   14,61    0,00    8,71   21,96    -  |__compactor

cdns -> 70%:

  Average:   UID      TGID       TID    %usr %system  %guest   %wait    %CPU   CPU  Command
  Average:  5667         -    358500   37,79   32,78    0,00   14,10   70,57     -  |__compactor

banburybill commented 3 years ago

I'd expect compactor to take more CPU writing C-DNS - it has to collect query/responses into blocks and common up addresses and other data.

What kind of throughput do you see on the same hardware if you use compactor to convert a PCAP capture file to C-DNS? That should give an upper bound, at least, on the speed with which compactor can convert DNS packets to C-DNS and write it out.

thibmac commented 3 years ago

So I generated exactly 10 minutes of pcap, with rotation-period 60, while still throwing 120k qps at the compactor. Each of the 10 pcaps generated weighs around 660 MB.

Then I converted these pcaps to C-DNS using this command: /opt/dns-stats-compactor/bin/compactor pcap_file.pcap -o cdns_output.cdns

It took exactly 7 min 30 s to convert the 10 minutes of pcap to C-DNS. During the conversion the machine was not receiving any traffic; the conversion was the only task generating load at that moment. Each C-DNS file weighs around 170 MB (no compression used).
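Rough arithmetic on those numbers, assuming the 10-minute capture at 120k qps was close to lossless:

600 s x 120,000 pkt/s ≈ 72 M packets in the capture
72 M packets / 450 s  ≈ 160,000 pkt/s offline pcap->cdns conversion rate

So even on an idle machine the conversion only runs about 1.3x faster than the live 120k qps stream, which doesn't leave much headroom.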

We can see similar CPU usage numbers during the pcap->cdns conversion:

Average:     0    563067         -  119,47   45,21    0,00    0,99  164,69     -  compactor
Average:     0         -    563067   67,00   22,44    0,00    0,99   89,44     -  |__compactor
Average:     0         -    563068   29,70    7,92    0,00   11,55   37,62     -  |__comp:cdns-write
Average:     0         -    563069   22,44   15,51    0,00   23,43   37,95     -  |__comp:sniffer

EDIT:

I've since done the same pcap->cdns conversion on a machine under high load, and the conversion took approx. 30 minutes.

banburybill commented 3 years ago

On Thursday, 22 April 2021 15:35:03 BST thibmac wrote:

> While trying to debug a bit of code and playing around for my own understanding, I have been unable to gather accurate drop stats with the stats.output_cbor_drop_count counter.
>
> My stats log always looks like this, even though ~half of the queries are actually dropped:
>
> Apr 22 13:23:52 my-server compactor[1331663]: Total 335660 (67132 pkt/s), Dropped raw/ignored/C-DNS packets 0/0/0
> Apr 22 13:23:52 my-server compactor[1331663]: PCAP recv 3546944 drop 1049637 drop at iface 0
> Apr 22 13:23:54 my-server compactor[1331663]: Total 297512 (59502 pkt/s), Dropped raw/ignored/C-DNS packets 0/0/0
> Apr 22 13:23:54 my-server compactor[1331663]: PCAP recv 3546954 drop 1049637 drop at iface 0

I just had to remind myself that while the Total/Dropped numbers are a count since the last report, the PCAP numbers are all cumulative. The other difference is that the reported PCAP numbers are based on calling pcap_stats() at the moment of reporting, so they reflect where libpcap is right now, rather than the point in the processing of the incoming data stream where compactor is. Also, it's not clear from the pcap_stats() documentation (https://www.tcpdump.org/manpages/pcap_stats.3pcap.html) exactly what those numbers may actually mean.

Looking only at the compactor counters for a moment (so ignoring the PCAP counts, which come straight from libpcap), they are curious. Each count is the count since the last report. The total packets per second figure is the total packets figure divided by the configured reporting period. 335660/67132 is 5, so it looks like we're reporting every 5 seconds. But the log entries have timestamps 2s apart. So the second entry seems to be saying that compactor received 297512 packets since the last report, while libpcap says it has received an additional 3546954-3546944=10 packets in the last 2 seconds. Confusing.
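To spell out the difference: pcap_stats() returns counters that are cumulative since the capture started, and compactor logs those figures as-is, while the Total/Dropped counters are per reporting interval. Getting per-interval PCAP figures would mean differencing successive snapshots, roughly like this (a sketch only, not compactor's actual reporting code):

#include <pcap/pcap.h>
#include <cstdio>

// Difference two cumulative pcap_stats() snapshots to get per-interval figures.
void report_interval(pcap_t* handle, pcap_stat& prev)
{
    pcap_stat now{};
    if (pcap_stats(handle, &now) == 0) {
        std::printf("PCAP recv +%u drop +%u (cumulative recv %u drop %u)\n",
                    now.ps_recv - prev.ps_recv,
                    now.ps_drop - prev.ps_drop,
                    now.ps_recv, now.ps_drop);
        prev = now;
    }
}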

> Heading into src/compactor.cpp now:
>
> In the QueryResponseMatcher matcher at line 519, the condition output.cbor->put(cbi, false) always ends up being true, so the counter stats.output_cbor_drop_count never increases.
>
> Shouldn't we expect this condition to fail when the query cannot be inserted into the queue?

Why do you say output.cbor->put(cbi, false) always ends up true? channel.hpp line 97 has it returning false if the queue is at its maximum size. I have seen non-zero counts here.
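For anyone reading along, the shape of that call is a bounded channel whose non-blocking put() reports failure when the queue is already full - something along these lines (a simplified sketch, not the actual channel.hpp):

#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

template <typename T>
class Channel
{
public:
    explicit Channel(std::size_t max_size) : max_size_(max_size) {}

    // Non-blocking callers (block == false) get 'false' back when the queue is
    // full; that is the point at which an overflow counter would be bumped.
    bool put(const T& item, bool block)
    {
        std::unique_lock<std::mutex> lock(mutex_);
        if (queue_.size() >= max_size_) {
            if (!block)
                return false;
            not_full_.wait(lock, [this]{ return queue_.size() < max_size_; });
        }
        queue_.push(item);
        not_empty_.notify_one();
        return true;
    }

    // Blocking get; wakes a producer that may be waiting for space.
    T get()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this]{ return !queue_.empty(); });
        T item = queue_.front();
        queue_.pop();
        not_full_.notify_one();
        return item;
    }

private:
    std::size_t max_size_;
    std::queue<T> queue_;
    std::mutex mutex_;
    std::condition_variable not_full_, not_empty_;
};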

> inspector reports also show that these queries weren't counted (we should expect ~5M drops here):
>
>   Total Packets processed                  : 2963593
>   Processed DNS messages           (C-DNS) : 2963343
>   Matched DNS query/response pairs (C-DNS) : 370106
>   Unmatched DNS queries            (C-DNS) : 1909032
>   Unmatched DNS responses          (C-DNS) : 284817
>   Discarded OPCODE DNS messages    (C-DNS) : 0
>   Malformed DNS messages           (C-DNS) : 0
>   Non-DNS packets                          : 249
>   Out-of-order DNS query/responses         : 21460
>   Dropped C-DNS items (overload)           : 0
>   Dropped raw PCAP packets (overload)      : 0
>   Dropped non-DNS packets (overload)       : 0

Compactor doesn't store counts of anything dropped at the PCAP level, so these numbers are consistent with the log messages above.

banburybill commented 3 years ago

Looking more closely at this issue, we've found that running compactor under a virtual machine rather than directly on the physical host makes a HUGE difference to the throughput rates compactor can sustain without dropping packets - the rates sustainable under a VM are far lower.

Profiling under a VM and on the physical host shows that a large amount of time when running under the VM is spent in Channel::put(), specifically in std::unique_lock/std::mutex/std::condition_variable::wait. We're looking into what we can do about this.

saradickinson commented 2 years ago

In addition to the issue identified above with VMs, we've added new handling in 1.2.1 so the internal queues cope better at high load, memory consumption is constrained under high load, and there are new options to enable sampling of packets with user-specified thresholds. We hope these help to address issues when compactor has to cope with high throughput. Please review. I'm going to close this issue for now, as our general recommendation is not to run compactor on a VM and to use one of the new options provided. Please feel free to open a new issue if you find specific problems with the behaviour of 1.2.1. Thanks!