Tantalor93 / dnspyre

CLI tool for a high QPS DNS benchmark
https://tantalor93.github.io/dnspyre/
MIT License
84 stars 10 forks source link

Packets lost before sending #257

Open PenelopeFudd opened 1 month ago

PenelopeFudd commented 1 month ago

What happened Ran pdns_server on port 91 with *.doh-test.com TXT "wildcard.doh-test.com!" in a doh-test.com zone. Ran tcpdump -w /tmp/a.pcap port 91 Ran dnspyre --no-distribution --no-progress --json -t TXT --server 172.16.200.25:91 --tcp -d 10s -c 1 -t TXT @infile.txt Infile.txt contains testit-1.doh-test.com to testit-10000000.doh-test.com (10 million records).

Json output:

{
  "totalRequests": 46420,
  "totalSuccessResponses": 42242,
  "totalNegativeResponses": 0,
  "totalErrorResponses": 0,
  "totalIOErrors": 4178,
  "TotalIDmismatch": 0,
  "totalTruncatedResponses": 0,
  "responseRcodes": {
    "NOERROR": 42242
  },
  "questionTypes": {
    "TXT": 46420
  },
  "queriesPerSecond": 3892.1,
  "benchmarkDurationSeconds": 11.93,
  "latencyStats": {
    "minMs": 0,
    "meanMs": 0,
    "stdMs": 0,
    "maxMs": 2,
    "p99Ms": 0,
    "p95Ms": 0,
    "p90Ms": 0,
    "p75Ms": 0,
    "p50Ms": 0
  }
}

Analyzed the capture with wireshark:

Conclusion 1: Dnspyre lost 4178 requests before they even got to the outgoing TCP stack.

Conclusion 2: Dnspyre is sending every hostname twice.

What you expected to happen Expected 0 errors, as lost packets would have been retransmitted in the middle of a single TCP conversation. Expected each hostname to be sent once.

dnspyre command

dnspyre -t TXT --server 172.16.200.25:91 --tcp -d 10s -c 1 -t TXT @infile.txt
PenelopeFudd commented 1 month ago

At least now I know why DNSdist always report a 50% cache hit rate even with unique hostnames. 😄

PenelopeFudd commented 1 month ago

When this is fixed, then --fail=ioerror will become much more useful. 😄

Tantalor93 commented 1 month ago

could you please also repeat the test with a standard output (no --json flag) and share the errors it reports? Also request logging I added recently might be useful for checking whether some duplication actually happens https://tantalor93.github.io/dnspyre/requestlog.html, can you try to check that also? 🙂

PenelopeFudd commented 1 month ago

Ok:

# dnspyre --no-distribution --no-progress -t TXT --server 172.16.200.25:91 --tcp -d 10s -c 1 -t TXT @infile.txt
(Start: 2024-05-16 13:44:32 PDT)
Using 10000000 hostnames
Benchmarking 172.16.200.25:91 via tcp with 1 concurrent requests

Total requests:     906944
Read/Write errors:  868453
DNS success responses:  38491

DNS response codes:
    NOERROR:    38491

DNS question types:
    TXT:    906944

Time taken for tests:    14.82s
Questions per second:    61192.7
DNS timings, 38491 datapoints
     min:        0s
     mean:       236.98µs
     [+/-sd]:    128.84µs
     max:        1.57ms
     p99:        524.29µs
     p95:        524.29µs
     p90:        524.29µs
     p75:        524.29µs
     p50:        262.14µs

Total Errors: 868453
Top errors:
dial tcp 172.16.200.25:91   868452 (100.00)%
write tcp 172.16.200.25:91  1 (0.00)%
PenelopeFudd commented 1 month ago

I'm including the tcpdump of the traffic, made with tcpdump -i wan0 -w c.pcap port 91: capture.zip

PenelopeFudd commented 1 month ago

Including the request log from a different run of the same command: requests.log

# dnspyre --no-distribution --no-progress -t TXT --server 172.16.200.25:91 --tcp -d 10s -c 1 -t TXT @infile.txt --log-requests --log-requests-path /tmp/requests.log
(Start: 2024-05-16 14:18:17 PDT)
Using 10000000 hostnames
Benchmarking 172.16.200.25:91 via tcp with 1 concurrent requests

Total requests:     39927
Read/Write errors:  2327
DNS success responses:  37600

DNS response codes:
    NOERROR:    37600

DNS question types:
    TXT:    39927

Time taken for tests:    12.39s
Questions per second:    3222.3
DNS timings, 37600 datapoints
     min:        0s
     mean:       218.58µs
     [+/-sd]:    129.13µs
     max:        3.93ms
     p99:        524.29µs
     p95:        524.29µs
     p90:        524.29µs
     p75:        524.29µs
     p50:        262.14µs

Total Errors: 2327
Top errors:
dial tcp 172.16.200.25:91   2326 (99.96)%
write tcp 172.16.200.25:91  1 (0.04)%
PenelopeFudd commented 1 month ago

I was looking at the request log, and it doesn't seem to have the responses included. Could that be added?

Tantalor93 commented 1 month ago

I was looking at the request log, and it doesn't seem to have the responses included. Could that be added?

request log contains also responses, each line represents single request sent and related response received (if any)

PenelopeFudd commented 1 month ago

Command:

dnspyre --no-distribution --no-progress --json -t TXT --server https://pdnsdist.example.com:444/dns-query \
   --doh-protocol=2 -d 15s -c 2000 --log-requests --log-requests-path=/tmp/dnspyre.2024-05-16_17:06:11 \
   -t TXT @infile.txt

Part of the log:

2024/05/16 23:31:43 worker:[1983] reqid:[2968] qname:[test1.doh-test.com.] qtype:[TXT] respid:[2968] rcode:[NOERROR] respflags:[qr aa rd] err:[<nil>] duration:[16.920544ms]
2024/05/16 23:31:43 worker:[310] reqid:[48940] qname:[test1.doh-test.com.] qtype:[TXT] respid:[48940] rcode:[NOERROR] respflags:[qr aa rd] err:[<nil>] duration:[56.896142ms]
2024/05/16 23:31:43 worker:[442] reqid:[39501] qname:[test1.doh-test.com.] qtype:[TXT] respid:[39501] rcode:[NOERROR] respflags:[qr aa rd] err:[<nil>] duration:[55.309157ms]
2024/05/16 23:31:43 worker:[429] reqid:[15894] qname:[test1.doh-test.com.] qtype:[TXT] respid:[15894] rcode:[NOERROR] respflags:[qr aa rd] err:[<nil>] duration:[56.137099ms]
2024/05/16 23:31:43 worker:[1930] reqid:[55268] qname:[test1.doh-test.com.] qtype:[TXT] respid:[55268] rcode:[NOERROR] respflags:[qr aa rd] err:[<nil>] duration:[17.788711ms]

The zone has * 60 IN TXT "wildcard.doh-test.com!", but that doesn't show up in the logfile, I'm afraid.

Also, I see that there's some repetition going on:

# grep -c 'test1.doh-test.com' dnspyre.2024-05-16_16:31:28

4000

# grep -v -c 'test1.doh-test.com' dnspyre.2024-05-16_16:31:28

1302933

I thing that's because I specified -c 2000, and it would explain why the cache hit ratio is 100% 😄 (Well, 99.99975%)

Tantalor93 commented 1 month ago

oh, I see the problem now and the source of duplication. You specified twice flag -t TXT and the -t flag is repeatable so basically you are always sending TXT for each hostname twice, I would suggest removing the repeated flag

dnspyre --no-distribution --no-progress -t TXT --server 172.16.200.25:91 --tcp -d 10s -c 1  @infile.txt
Tantalor93 commented 1 month ago

and regarding the IOErrors I suspect that these are caused due to cancellation of benchmark based on time limit (-d 10s), though I would expect so many of those, have to check the code, whether the cancellation is done properly

the request log shows that nicely, that all the errors happens at the end of the Benchmark

PenelopeFudd commented 1 month ago

D'oh!

Glad you spotted that, I wouldn't have. 😄

About the request log, is it supposed to show the contents of the response? I'm getting weird results from PowerDNS and really want to confirm that the answers I get match the questions I sent.

Tantalor93 commented 1 month ago

About the request log, is it supposed to show the contents of the response? I'm getting weird results from PowerDNS and really want to confirm that the answers I get match the questions I sent.

no, response content (answer sections) are not part of the request log. Only response flags, response code and response ID is logged

PenelopeFudd commented 1 month ago

Could they be added? Otherwise I have to run some other program to check the server isn't corrupting the answers under load.