TechnitiumSoftware / DnsServer

Technitium DNS Server
https://technitium.com/dns/
GNU General Public License v3.0

Docker deployment - 100% CPU Utilization and not responding to queries #668

Closed by jaydio 1 year ago

jaydio commented 1 year ago

Hi there,

I've got four authoritative DnsServer servers deployed using the official docker image.

This morning, at around 7:10am (UTC+8), all nodes simultaneously started consuming 100% of CPU cycles and stopped responding to DNS queries.

The following platforms are used (all on latest patch level):

  1. ns1 - Rocky Linux release 8.8 (Green Obsidian)
  2. ns2 - CentOS Linux release 7.9.2009 (Core)
  3. ns3 - Rocky Linux release 9.2 (Blue Onyx)
  4. ns4 - Ubuntu 18.04.6 LTS release (Bionic)

Docker version (identical across all platforms):

Client: Docker Engine - Community
 Version:           24.0.2
 API version:       1.43
 Go version:        go1.20.4
 Git commit:        cb74dfc
 Built:             Thu May 25 21:52:13 2023
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          24.0.2
  API version:      1.43 (minimum version 1.12)
  Go version:       go1.20.4
  Git commit:       659604f
  Built:            Thu May 25 21:52:13 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.21
  GitCommit:        3dce8eb055cbb6872793272b4f20ed16117344f8
 runc:
  Version:          1.1.7
  GitCommit:        v1.1.7-0-g860f061
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Here are some graphs I've pulled from netdata:

[five netdata graph images, not preserved]

PS output from the container itself:

%CPU CPU  NI S     TIME COMMAND
38.8   -   0 S 05:50:58 /usr/bin/dotnet /opt/technitium/dns/DnsServerApp.dll /etc/dns

I was able to pull a stack trace from the log file, but it was logged before the CPU utilization started spiking. I have since switched all logs to LOCAL and set the correct timezone for all containers as well.

[2023-06-20 03:57:28 UTC] DNS Server failed to resolve the request '116.241.94.184.in-addr.arpa. PTR IN'.

TechnitiumLibrary.Net.Dns.DnsClientNoResponseException: DnsClient failed to recursively resolve the request '116.241.94.184.in-addr.arpa. PTR IN': no response from name servers [adns3.ironport.com (184.94.240.167), adns4.ironport.com (184.94.240.168), adns1.ironport.com (208.90.58.36), adns2.ironport.com (208.90.58.37)].
 ---> TechnitiumLibrary.Net.Dns.DnsClientNoResponseException: DnsClient failed to resolve the request '116.241.94.184.in-addr.arpa. PTR IN': request timed out.
 ---> System.Net.Sockets.SocketException (110): Connection timed out
   at TechnitiumLibrary.Net.SocketExtensions.UdpQueryAsync(Socket socket, ArraySegment`1 request, ArraySegment`1 response, IPEndPoint remoteEP, Int32 timeout, Int32 retries, Boolean expBackoffTimeout, Func`2 isResponseValid, CancellationToken cancellationToken) in Z:\Technitium\Projects\TechnitiumLibrary\TechnitiumLibrary.Net\SocketExtensions.cs:line 144
   at TechnitiumLibrary.Net.Dns.ClientConnection.UdpClientConnection.QueryAsync(DnsDatagram request, Int32 timeout, Int32 retries, CancellationToken cancellationToken) in Z:\Technitium\Projects\TechnitiumLibrary\TechnitiumLibrary.Net\Dns\ClientConnection\UdpClientConnection.cs:line 235
   --- End of inner exception stack trace ---
   at TechnitiumLibrary.Net.Dns.ClientConnection.UdpClientConnection.QueryAsync(DnsDatagram request, Int32 timeout, Int32 retries, CancellationToken cancellationToken) in Z:\Technitium\Projects\TechnitiumLibrary\TechnitiumLibrary.Net\Dns\ClientConnection\UdpClientConnection.cs:line 235
   at TechnitiumLibrary.Net.Dns.DnsClient.<>c__DisplayClass72_0.<<InternalResolveAsync>g__DoResolveAsync|1>d.MoveNext() in Z:\Technitium\Projects\TechnitiumLibrary\TechnitiumLibrary.Net\Dns\DnsClient.cs:line 4092
--- End of stack trace from previous location ---
   at TechnitiumLibrary.Net.Dns.DnsClient.<>c__DisplayClass72_0.<<InternalResolveAsync>g__DoResolveAsync|1>d.MoveNext() in Z:\Technitium\Projects\TechnitiumLibrary\TechnitiumLibrary.Net\Dns\DnsClient.cs:line 4270
--- End of stack trace from previous location ---
   at TechnitiumLibrary.Net.Dns.DnsClient.<>c__DisplayClass72_0.<<InternalResolveAsync>g__DoResolveAsync|1>d.MoveNext() in Z:\Technitium\Projects\TechnitiumLibrary\TechnitiumLibrary.Net\Dns\DnsClient.cs:line 4020
--- End of stack trace from previous location ---
   at TechnitiumLibrary.Net.Dns.DnsClient.InternalResolveAsync(DnsDatagram request, CancellationToken cancellationToken) in Z:\Technitium\Projects\TechnitiumLibrary\TechnitiumLibrary.Net\Dns\DnsClient.cs:line 4371
   at TechnitiumLibrary.Net.Dns.DnsClient.RecursiveResolveAsync(DnsQuestionRecord question, IDnsCache cache, NetProxy proxy, Boolean preferIPv6, UInt16 udpPayloadSize, Boolean randomizeName, Boolean qnameMinimization, Boolean asyncNsRevalidation, Boolean dnssecValidation, NetworkAddress eDnsClientSubnet, Int32 retries, Int32 timeout, Int32 maxStackCount, Boolean cleanupResponse, Boolean asyncNsResolution, CancellationToken cancellationToken) in Z:\Technitium\Projects\TechnitiumLibrary\TechnitiumLibrary.Net\Dns\DnsClient.cs:line 1770
   --- End of inner exception stack trace ---
   at TechnitiumLibrary.Net.Dns.DnsClient.RecursiveResolveAsync(DnsQuestionRecord question, IDnsCache cache, NetProxy proxy, Boolean preferIPv6, UInt16 udpPayloadSize, Boolean randomizeName, Boolean qnameMinimization, Boolean asyncNsRevalidation, Boolean dnssecValidation, NetworkAddress eDnsClientSubnet, Int32 retries, Int32 timeout, Int32 maxStackCount, Boolean cleanupResponse, Boolean asyncNsResolution, CancellationToken cancellationToken) in Z:\Technitium\Projects\TechnitiumLibrary\TechnitiumLibrary.Net\Dns\DnsClient.cs:line 1770
   at DnsServerCore.Dns.DnsServer.RecursiveResolveAsync(DnsQuestionRecord question, NetworkAddress eDnsClientSubnet, Boolean conditionalForwardingClientSubnet, IReadOnlyList`1 conditionalForwarders, Boolean dnssecValidation, Boolean cachePrefetchOperation, Boolean cacheRefreshOperation, Boolean skipDnsAppAuthoritativeRequestHandlers, TaskCompletionSource`1 taskCompletionSource) in Z:\Technitium\Projects\DnsServer\DnsServerCore\Dns\DnsServer.cs:line 2929

Couple of additional notes:

ShreyasZare commented 1 year ago

Thanks for the feedback with details. I am not exactly sure why this could have happened. Did the issue stop on its own or did it require restarting the DNS servers? Any idea on available memory on the server during the issue?

The error log is unrelated to the issue: it's just a failed reverse lookup, which the server usually performs to show the domain name for Top Clients on the dashboard.
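For readers unfamiliar with reverse lookups, the name in the log entry is simply the client's IPv4 address with its octets reversed under in-addr.arpa. Below is a minimal sketch using Python's standard ipaddress module; the helper name ptr_name is hypothetical, and the address is the one implied by the log line above.

```python
import ipaddress


def ptr_name(ip: str) -> str:
    """Return the fully qualified PTR query name for an IP address."""
    # reverse_pointer yields the name without the trailing root dot,
    # so append it to match the form shown in the server log.
    return ipaddress.ip_address(ip).reverse_pointer + "."


if __name__ == "__main__":
    print(ptr_name("184.94.241.116"))
```

A failed lookup of such a name only affects how the client is labelled on the dashboard, which is consistent with the log entry being unrelated to the CPU spike.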

jaydio commented 1 year ago

> Thanks for the feedback with details. I am not exactly sure why this could have happened. Did the issue stop on its own or did it require restarting the DNS servers? Any idea on available memory on the server during the issue?

Yes, it required restarting each container in order to restore service.

Each box has ample memory.

I'm also seeing <= 200M memory utilization for the DnsServer container on every box.

Will update this issue if it happens again.
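Since the symptom was a node that keeps running but stops answering queries, a small external probe could help timestamp a recurrence. The following is only a sketch under assumptions: the function names, the SOA query type, the 2-second timeout, and the example zone and address are all hypothetical and not taken from this deployment. It sends one UDP query in RFC 1035 wire format and reports whether any matching reply arrives.

```python
import socket
import struct


def build_query(name: str, qtype: int = 6, query_id: int = 0x1234) -> bytes:
    """Encode a minimal DNS query for `name` (qtype 6 = SOA, class IN)."""
    # Header: ID, flags (all zero, no recursion desired), 1 question,
    # 0 answer / authority / additional records.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    labels = [part.encode("ascii") for part in name.rstrip(".").split(".") if part]
    # Each label is length-prefixed; a zero byte terminates the name.
    qname = b"".join(bytes([len(part)]) + part for part in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def responds(server: str, name: str, timeout: float = 2.0, port: int = 53) -> bool:
    """Return True if `server` sends back a DNS response to one UDP query."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, port))
            reply, _ = sock.recvfrom(4096)
        except OSError:
            # Covers timeouts as well as ICMP "port unreachable" errors.
            return False
    # A valid reply echoes the query ID and has the QR (response) bit set.
    return len(reply) >= 12 and reply[:2] == query[:2] and bool(reply[2] & 0x80)


if __name__ == "__main__":
    # Hypothetical server address and zone name; replace with real ones.
    print(responds("192.0.2.53", "example.com"))
```

Run from cron or a monitoring agent against each name server, a probe like this would record when a node stops answering, independent of the container's own logs.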

> The error log is unrelated to it since it's just a reverse lookup which failed to resolve that usually takes place to show the domain name for Top Clients on the dashboard.

Yeah, that's what I thought, was just the only stack trace I could find in the log file, so I included it.

jaydio commented 1 year ago

This kept on happening, but only when NSD was the master server. Had to switch platforms for this particular project and am unable to investigate this further. Thanks to @ShreyasZare for spending countless hours and also trying to reproduce this in the lab using my configs. Will close this ticket now.