dmwm / CRABServer


Set up DNS caching in VMs (schedds and TW/Publisher) #7003

Closed belforte closed 2 years ago

belforte commented 2 years ago

We got this [1] from automatic services in IT. We think that the biggest problem for us is the number of HTTP queries, which mostly come from the schedds ( #7002 ), but it would still be good to have DNS caching in our VMs. The possibilities are the ones indicated by CERN below at http://service-dns.web.cern.ch/service-dns/faq.asp but we can also look at https://coredns.io/plugins/cache/

@novicecpp please look at the various possibilities, determine which one is the simplest and safest that meets our needs (especially in terms of long-term support and maintenance), and make a proposal for deployment which we can review.

[1]

Dear cms-service-crab3htcondor@cern.ch

You are listed as responsible for crab-prod-tw01 (.cern.ch). 
Our DNS servers are warning that this host has been sending a VERY HIGH
rate of queries for the last hour (55.64333333333333 requests/sec).

Please, check the cause of this problem and sort it out
since it impacts the central DNS service performance. Please
also consult http://service-dns.web.cern.ch/service-dns/faq.asp
for information on setting up dns for high demanding clients (page accessible from CERN network only).

Should this problem continue, we will have to block this system
to avoid performance problems in the central DNS service.

        Thanks in advance,
                                    CERN Network Support

More info: crab-prod-tw01
|                        |               |                 On IP DNS servers | queries/sec during 15 min |
| crab-prod-tw01.cern.ch | 137.138.53.39 |                    --- TOTAL ---  |       55 queries/sec      |
| crab-prod-tw01.cern.ch | 137.138.53.39 |                    cmsweb.cern.ch |      54.5 queries/sec     |
| crab-prod-tw01.cern.ch | 137.138.53.39 |               cmsweb-prod.cern.ch |      0.3 queries/sec      |
| crab-prod-tw01.cern.ch | 137.138.53.39 |                        s3.cern.ch |      0.2 queries/sec      |
| crab-prod-tw01.cern.ch | 137.138.53.39 |  cmsgwms-collector-global.cern.ch |      0.1 queries/sec      |
| crab-prod-tw01.cern.ch | 137.138.53.39 | cmsgwms-collector-global.fnal.gov |      0.1 queries/sec      |
| crab-prod-tw01.cern.ch | 137.138.53.39 |                   myproxy.cern.ch |      0.1 queries/sec      |
| crab-prod-tw01.cern.ch | 137.138.53.39 |                 vocms0199.cern.ch |      0.0 queries/sec      |
| crab-prod-tw01.cern.ch | 137.138.53.39 |             monit-metrics.cern.ch |      0.0 queries/sec      |
| crab-prod-tw01.cern.ch | 137.138.53.39 |            nocontact-rest.cern.ch |      0.0 queries/sec      |
| crab-prod-tw01.cern.ch | 137.138.53.39 |               agileinf-mb.cern.ch |      0.0 queries/sec      |
novicecpp commented 2 years ago

After discussing with @mapellidario (thanks a lot!), we found out that the WMCore team already uses nscd to cache domain names (https://github.com/dmwm/WMCore/issues/9435), and every puppet-managed VM (TaskWorker/Schedd) has nscd set up and running.

The problem is that on the TaskWorker VM we run our app inside a docker container and we do not bind mount the nscd socket into the container, so glibc cannot query the nscd that runs outside the container. This is why we got the alert from the TaskWorker machine only.

We only need to change the ./runContainer.sh scripts and append the -v /var/run/nscd/socket:/var/run/nscd/socket option to the docker run command.
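
A minimal sketch of that change, for reference. The helper name docker_run_cmd and the image name are placeholders of mine, not taken from the real runContainer.sh, which passes more options than shown. The function only prints the command line, so it is safe to try anywhere:

```shell
#!/bin/sh
# Sketch only: docker_run_cmd and the image name are hypothetical;
# the real runContainer.sh passes more options than shown here.
NSCD_SOCKET=/var/run/nscd/socket

# Print a docker run command line that bind mounts the host's nscd socket
# at the same path inside the container, so that glibc in the container
# can reach the nscd running on the host.
docker_run_cmd() {
    image="$1"
    # If nscd is not running on the host the socket is missing; with -v,
    # docker would then create a directory at that path instead.
    [ -S "$NSCD_SOCKET" ] || echo "warning: $NSCD_SOCKET is not a socket" >&2
    echo "docker run -d -v ${NSCD_SOCKET}:${NSCD_SOCKET} ${image}"
}

# Dry run with a placeholder image name.
docker_run_cmd "example/taskworker:latest"
```

Using the same path on both sides of the -v option matters, because glibc looks for the socket at that fixed location.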

novicecpp commented 2 years ago

Done. The updated runContainer.sh scripts are deployed by puppet.

mapellidario commented 2 years ago

We received another email about crab-prod-tw01, which is not caching DNS queries for the cmsweb.cern.ch domain. We should have a look. Should we re-open this issue?

belforte commented 2 years ago

yes

novicecpp commented 2 years ago

What I got from investigating this morning:

My conclusion:

Note from meeting minutes:

CRAB is working fine, but we seem to have lost DNS caching on crab-prod-tw01 (TW container). Wa thinks this can be explained as a temporary issue with nscd, which at times crashes and restarts by itself. No action needed on our side. Use “systemctl status nscd” to check whether the daemon has crashed/restarted, and “journalctl -feu nscd” (run as root) to see the full log. (wa: I don’t know how to check whether nscd is really working, or how to query nscd directly without querying the DNS server.)
Bottom line: do not worry unless messages from CERN firewall monitoring keep coming!
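
On the open question of checking whether nscd is really serving lookups, one possible approach is to compare the hosts cache hit counter that "nscd -g" prints before and after a lookup. This is an untested sketch, not something verified on the CRAB machines: it assumes the "hosts cache:" section layout printed by glibc's nscd, the helper names are mine, and "nscd -g" probably has to be run as root.

```shell
#!/bin/sh
# Sketch only: assumes the "hosts cache:" section layout of glibc's
# `nscd -g`; hosts_cache_hits and check_nscd_hit are hypothetical helpers.

# Read `nscd -g` output on stdin and print the number of
# "cache hits on positive entries" from the hosts cache section.
hosts_cache_hits() {
    awk '
        /^[a-z]+ cache:/ { in_hosts = ($1 == "hosts") }
        in_hosts && /cache hits on positive entries/ { print $1; exit }
    '
}

# Look up a host twice and report whether the hit counter increased
# between the first and the second lookup.
check_nscd_hit() {
    host="$1"
    getent hosts "$host" >/dev/null     # first lookup fills the cache
    before=$(nscd -g | hosts_cache_hits)
    getent hosts "$host" >/dev/null     # second lookup should be a hit
    after=$(nscd -g | hosts_cache_hits)
    if [ "$after" -gt "$before" ]; then
        echo "hosts cache hits went from $before to $after"
    else
        echo "no new hosts cache hits for $host"
    fi
}
```

On the host this would be called as check_nscd_hit cmsweb.cern.ch. Other processes also use the cache, so an increasing counter is only a rough indication, not proof that the container's lookups are being cached.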
belforte commented 2 years ago

Thanks @novicecpp, I believe that we can close (again). If the problem comes back, we will find this issue to use as a reference. I leave the final decision on closing to you and Dario.

novicecpp commented 2 years ago

Closing