Set up DNS caching in VMs (schedds and TW/Publisher)

belforte commented 2 years ago

We got this [1] from the automatic services in IT. We think the biggest problem for us is the number of HTTP queries, which mostly come from the schedds ( #7002 ), but it would still be good to have DNS caching in our VMs. The possibilities are the ones indicated by CERN below at http://service-dns.web.cern.ch/service-dns/faq.asp, but we can also look at https://coredns.io/plugins/cache/

@novicecpp please look at the various possibilities, determine which one is the simplest and safest that meets our needs (especially in terms of long-term support and maintenance), and make a proposal for deployment which we can review.

[1]

Dear cms-service-crab3htcondor@cern.ch

You are listed as responsible for crab-prod-tw01 (.cern.ch). 
Our DNS servers are warning that this host has been sending a VERY HIGH
rate of queries for the last hour (55.64333333333333 requests/sec).

Please, check the cause of this problem and sort it out
since it impacts the central DNS service performance. Please
also consult http://service-dns.web.cern.ch/service-dns/faq.asp
for information on setting up dns for high demanding clients (page accessible from CERN network only).

Should this problem continue, we will have to block this system
to avoid performance problems in the central DNS service.

        Thanks in advance,
                                    CERN Network Support

More info: crab-prod-tw01
| Host                   | IP            |                 On IP DNS servers | queries/sec during 15 min |
| crab-prod-tw01.cern.ch | 137.138.53.39 |                    --- TOTAL ---  |       55 queries/sec      |
| crab-prod-tw01.cern.ch | 137.138.53.39 |                    cmsweb.cern.ch |      54.5 queries/sec     |
| crab-prod-tw01.cern.ch | 137.138.53.39 |               cmsweb-prod.cern.ch |      0.3 queries/sec      |
| crab-prod-tw01.cern.ch | 137.138.53.39 |                        s3.cern.ch |      0.2 queries/sec      |
| crab-prod-tw01.cern.ch | 137.138.53.39 |  cmsgwms-collector-global.cern.ch |      0.1 queries/sec      |
| crab-prod-tw01.cern.ch | 137.138.53.39 | cmsgwms-collector-global.fnal.gov |      0.1 queries/sec      |
| crab-prod-tw01.cern.ch | 137.138.53.39 |                   myproxy.cern.ch |      0.1 queries/sec      |
| crab-prod-tw01.cern.ch | 137.138.53.39 |                 vocms0199.cern.ch |      0.0 queries/sec      |
| crab-prod-tw01.cern.ch | 137.138.53.39 |             monit-metrics.cern.ch |      0.0 queries/sec      |
| crab-prod-tw01.cern.ch | 137.138.53.39 |            nocontact-rest.cern.ch |      0.0 queries/sec      |
| crab-prod-tw01.cern.ch | 137.138.53.39 |               agileinf-mb.cern.ch |      0.0 queries/sec      |

novicecpp commented 2 years ago

After discussing with @mapellidario (thanks a lot!), we found out that the WMCore team already uses nscd to cache domain names (https://github.com/dmwm/WMCore/issues/9435), and every Puppet-managed VM (TaskWorker/Schedd) has nscd set up and running.

The problem is that on the TaskWorker VM we run our app inside a Docker container and we do not bind-mount the nscd socket into the container, so glibc cannot query the nscd that runs outside the container. This is why we got the alert from the TaskWorker machine only.

We only need to change the ./runContainer.sh script and append the -v /var/run/nscd/socket:/var/run/nscd/socket option to the docker run command.
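To confirm from inside the container that the bind mount took effect, something like the following could be run there. It is only a minimal sketch: the function name is made up for illustration, and finding a socket at that path shows that the mount is in place, not that nscd is actually answering requests.

```python
import os
import stat

# Path taken from the -v option above (same path on the host and in the container).
NSCD_SOCKET = "/var/run/nscd/socket"


def nscd_socket_visible(path: str = NSCD_SOCKET) -> bool:
    """Return True if `path` exists and is a Unix domain socket.

    Hypothetical helper, not part of CRABServer.
    """
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # Path is missing or not accessible: the bind mount is probably absent.
        return False
    return stat.S_ISSOCK(mode)


if __name__ == "__main__":
    print("nscd socket visible:", nscd_socket_visible())
```

If this prints False inside the container, glibc there has no way to reach the host's nscd and every lookup goes to the DNS servers.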

novicecpp commented 2 years ago

Done. runContainer.sh is deployed by Puppet.

mapellidario commented 2 years ago

We received another email about crab-prod-tw01, which is not caching DNS queries for the cmsweb.cern.ch domain. We should have a look. Should we re-open this issue?

belforte commented 2 years ago

yes

novicecpp commented 2 years ago

What I got from investigating this morning:

I ran systemctl status nscd to check whether the daemon had crashed or restarted, but the output reports it as running for 7 months.

[tseethon@crab-prod-tw01 ~]$ sudo systemctl status nscd -l                                   
● nscd.service - Name Service Cache Daemon                                                    
Loaded: loaded (/usr/lib/systemd/system/nscd.service; enabled; vendor preset: disabled)    
Active: active (running) since Thu 2021-10-21 08:49:22 CEST; 7 months 9 days ago           
Main PID: 8804 (nscd)                                                                        
Tasks: 9                                                                                  
Memory: 2.0M                                                                               
CGroup: /system.slice/nscd.service                                                         
       └─8804 /usr/sbin/nscd

I ran journalctl -eu nscd (as root) to see the full log. There is no restart handled by systemd, but it looks like nscd restarts itself about every hour (cross-checked with ps: the PID matches the latest log lines).

[tseethon@crab-prod-tw01 ~]$ journalctl -eu nscd
May 30 21:10:34 crab-prod-tw01.cern.ch nscd[9640]: 9640 Access Vector Cache (AVC) started
May 30 22:09:49 crab-prod-tw01.cern.ch nscd[9640]: 9640 monitored file `/etc/resolv.conf` was 
May 30 22:10:34 crab-prod-tw01.cern.ch nscd[8402]: 8402 monitoring file `/etc/hosts` (1)
May 30 22:10:34 crab-prod-tw01.cern.ch nscd[8402]: 8402 monitoring directory `/etc` (2)
May 30 22:10:34 crab-prod-tw01.cern.ch nscd[8402]: 8402 monitoring file `/etc/resolv.conf` (3)
May 30 22:10:34 crab-prod-tw01.cern.ch nscd[8402]: 8402 monitoring directory `/etc` (2)
May 30 22:10:34 crab-prod-tw01.cern.ch nscd[8402]: 8402 monitoring file `/etc/services` (4)
May 30 22:10:34 crab-prod-tw01.cern.ch nscd[8402]: 8402 monitoring directory `/etc` (2)
May 30 22:10:34 crab-prod-tw01.cern.ch nscd[8402]: 8402 Access Vector Cache (AVC) started
May 30 23:10:40 crab-prod-tw01.cern.ch nscd[14238]: 14238 monitoring file `/etc/hosts` (1)
May 30 23:10:40 crab-prod-tw01.cern.ch nscd[14238]: 14238 monitoring directory `/etc` (2)
May 30 23:10:40 crab-prod-tw01.cern.ch nscd[14238]: 14238 monitoring file `/etc/resolv.conf` (
May 30 23:10:40 crab-prod-tw01.cern.ch nscd[14238]: 14238 monitoring directory `/etc` (2)
May 30 23:10:40 crab-prod-tw01.cern.ch nscd[14238]: 14238 monitoring file `/etc/services` (4)
May 30 23:10:40 crab-prod-tw01.cern.ch nscd[14238]: 14238 monitoring directory `/etc` (2)
May 30 23:10:40 crab-prod-tw01.cern.ch nscd[14238]: 14238 Access Vector Cache (AVC) started
May 30 23:36:34 crab-prod-tw01.cern.ch nscd[14238]: 14238 monitored file `/etc/resolv.conf` wa
May 31 00:11:06 crab-prod-tw01.cern.ch nscd[9590]: 9590 monitoring file `/etc/hosts` (1)
May 31 00:11:06 crab-prod-tw01.cern.ch nscd[9590]: 9590 monitoring directory `/etc` (2)
May 31 00:11:06 crab-prod-tw01.cern.ch nscd[9590]: 9590 monitoring file `/etc/resolv.conf` (3)
...
...
...
May 31 09:29:41 crab-prod-tw01.cern.ch nscd[22997]: 22997 monitored file `/etc/resolv.conf` wa
May 31 10:12:20 crab-prod-tw01.cern.ch nscd[8804]: 8804 monitoring file `/etc/hosts` (1)
May 31 10:12:20 crab-prod-tw01.cern.ch nscd[8804]: 8804 monitoring directory `/etc` (2)
May 31 10:12:20 crab-prod-tw01.cern.ch nscd[8804]: 8804 monitoring file `/etc/resolv.conf` (3)
May 31 10:12:20 crab-prod-tw01.cern.ch nscd[8804]: 8804 monitoring directory `/etc` (2)
May 31 10:12:20 crab-prod-tw01.cern.ch nscd[8804]: 8804 monitoring file `/etc/services` (4)
May 31 10:12:20 crab-prod-tw01.cern.ch nscd[8804]: 8804 monitoring directory `/etc` (2)
May 31 10:12:20 crab-prod-tw01.cern.ch nscd[8804]: 8804 Access Vector Cache (AVC) started
[tseethon@crab-prod-tw01 ~]$ ps uax | grep nscd                                               
nscd      8804  0.0  0.0 714084  2068 ?        Ssl  10:12   0:00 /usr/sbin/nscd               
tseethon 14058  0.0  0.0 112812   980 pts/0    S+   10:28   0:00 grep --color=auto nscd
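The PID cross-check above can be automated. The sketch below assumes journalctl's default output format, where each line carries an "nscd[PID]:" tag, and counts how often that PID changes. The function name is hypothetical and the sample lines are copied from the log above.

```python
import re

# Matches the "nscd[PID]:" tag in journalctl's default output format.
PID_TAG = re.compile(r"\bnscd\[(\d+)\]:")


def nscd_pid_sequence(lines):
    """Return the nscd PIDs in log order, collapsing consecutive repeats."""
    pids = []
    for line in lines:
        match = PID_TAG.search(line)
        if match is None:
            continue  # not an nscd line
        pid = int(match.group(1))
        if not pids or pids[-1] != pid:
            pids.append(pid)
    return pids


# Lines copied from the journalctl output above.
sample = [
    "May 30 21:10:34 crab-prod-tw01.cern.ch nscd[9640]: 9640 Access Vector Cache (AVC) started",
    "May 30 22:10:34 crab-prod-tw01.cern.ch nscd[8402]: 8402 monitoring file `/etc/hosts` (1)",
    "May 30 22:10:34 crab-prod-tw01.cern.ch nscd[8402]: 8402 Access Vector Cache (AVC) started",
    "May 30 23:10:40 crab-prod-tw01.cern.ch nscd[14238]: 14238 monitoring file `/etc/hosts` (1)",
    "May 31 00:11:06 crab-prod-tw01.cern.ch nscd[9590]: 9590 monitoring file `/etc/hosts` (1)",
]

pids = nscd_pid_sequence(sample)
print("PIDs seen:", pids, "-> PID changes:", len(pids) - 1)
```

Each PID change corresponds to a new nscd process, which is what the hourly "Access Vector Cache (AVC) started" lines suggest.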

I monitored the rate of DNS queries to the DNS server with tcpdump -n -i eth0 'port 53', watching how fast the tcpdump log scrolled. I do not have an exact number such as queries/s, but right after I restarted nscd the log scrolled much faster than before the restart. About 5 minutes after the restart, the rate was back to normal. So I assume nscd was already working as usual before I restarted it.
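To get an actual queries/s number instead of eyeballing the scroll speed, the tcpdump text output could be counted per queried name. This is a rough sketch only. It assumes the usual `tcpdump -n` line layout for outgoing DNS queries ("... > <server>.53: <id>+ A? <name>. (<len>)"), which may differ between tcpdump versions. The sample lines are invented for illustration (192.0.2.x are documentation addresses, not the real CERN DNS servers), and the function name is hypothetical.

```python
import re
from collections import Counter

# Outgoing query: destination port 53, then a question such as "A? name.".
QUERY = re.compile(r"> \S+\.53: .*?\b(A|AAAA)\? (\S+)")


def query_rates(lines, window_seconds):
    """Return {queried name: queries per second} over a capture window."""
    counts = Counter()
    for line in lines:
        match = QUERY.search(line)
        if match:
            # Drop the trailing dot of the fully qualified name.
            counts[match.group(2).rstrip(".")] += 1
    return {name: n / window_seconds for name, n in counts.items()}


# Hypothetical capture covering 2 seconds.
sample = [
    "10:28:01.000001 IP 137.138.53.39.40001 > 192.0.2.53.53: 1001+ A? cmsweb.cern.ch. (32)",
    "10:28:01.000900 IP 192.0.2.53.53 > 137.138.53.39.40001: 1001 1/0/0 A 192.0.2.10 (48)",
    "10:28:01.500000 IP 137.138.53.39.40002 > 192.0.2.53.53: 1002+ AAAA? cmsweb.cern.ch. (32)",
    "10:28:02.000000 IP 137.138.53.39.40003 > 192.0.2.53.53: 1003+ A? s3.cern.ch. (28)",
]

for name, rate in sorted(query_rates(sample, window_seconds=2).items()):
    print(f"{name}: {rate:.1f} queries/sec")
```

Feeding it a capture taken before and after an nscd restart would give numbers directly comparable to the per-name table in the CERN email.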

My conclusion:

I believe it is just a temporary issue with nscd: sometimes the daemon crashes and/or restarts.
It happened many times while I was looking for a solution to this problem: the daemon was still there, but the app could not connect to the nscd socket.
We can ignore this for now.
If we get DNS alerts again or frequently, I would change the VM to use a local DNS server (BIND9/CoreDNS) instead of nscd, because it is more reliable and we can monitor/test it if something goes wrong.

Note from meeting minutes:

CRAB is working fine, but we seem to have lost DNS caching on crab-prod-tw01 (TW container). Wa thinks this can be explained as a temporary issue with nscd, which at times crashes and restarts by itself. No action needed on our side. Use “systemctl status nscd” to check whether the daemon has crashed/restarted, and “journalctl -feu nscd” (run as root) to see the full log. (wa: I don’t know how to check whether nscd is really working, or how to query nscd directly without querying the DNS server.)
Bottom line: do not worry unless messages from CERN firewall monitoring keep coming!

belforte commented 2 years ago

Thanks @novicecpp, I believe we can close (again). If the problem comes back, we will have this to use as a reference. I leave the final decision on closing to you and Dario.

novicecpp commented 2 years ago

Closing

dmwm/CRABServer #7003