eBayClassifiedsGroup / PanteraS

PanteraS - PaaS - Platform as a Service in a box
GNU General Public License v2.0
200 stars 61 forks source link

dnsmasq: Maximum number of concurrent DNS queries reached (max: 150) #166

Closed cookandy closed 8 years ago

cookandy commented 8 years ago

Hi,

For some reason, recently, I've started experiencing the following error:

dnsmasq: Maximum number of concurrent DNS queries reached (max: 150)

As a result, my lookups are failing for my marathon services. Do you have any idea what could be causing this, or if there's a way to look at the dnsmasq logs to determine where the queries are coming from?

Thanks!

sielaq commented 8 years ago

you can raise default (150) value of concurrent queries: -0 300 or --dns-forward-max=300 you can add it by:

echo 'DNSMASQ_PARAMS="--dns-forward-max=300"' >> restricted/host
sielaq commented 8 years ago

you can also use consul as your default DNS instead of dnsmasq by:

export START_DNSMASQ=false
export CONSUL_PARAMS="-config-dir=/etc/consul.d/"
./generate_yml.sh
cookandy commented 8 years ago

Ok, thanks @sielaq! I will try to update that change. The weird thing is, I am not sure why I am suddenly getting this error. I haven't added any new servers to the cluster, so I wouldn't expect any more DNS traffic than usual. Is there anything inside of PanteraS that would be flooding dnsmasq? Lastly, do you know if there's a way to look closer at the dnsmasq logs to see which host is causing the problem?

sielaq commented 8 years ago

if this is caused by one of your service then yes, it looks like your service is flooding. PanteraS infrastructure itself is not using much DNS.

sielaq commented 8 years ago

can you trace what exactly is flooding ?

cookandy commented 8 years ago

Strange because I don't actually have any services running and I still get the error.

cookandy commented 8 years ago

Can I look at the dnsmasq logs to determine which server is the culprit? I've only got 2 slaves and 1 master and they're all configured to use the master as the DNS server... so I wouldn't expect it to be flooded...

sielaq commented 8 years ago

apt-get install ngrep tcpdump check whats going on: ngrep -l -q udp and port 53 or tcpdump -vvv -s 0 -l -n port 53

sielaq commented 8 years ago

yea you can log queries too, additional dnsmasq option: -q you should see it in docker-compose logs

cookandy commented 8 years ago

Thanks! I ran the commands and originally saw a bunch of traffic (probably outbound) going to my corporate DNS servers. So I removed the corp DNS servers from /etc/resolv.conf (so my master node is the only nameserver listed) and re-ran the commands.

I never saw any traffic (using either of the commands), yet, I'm still experiencing the error. I have stopped and removed the PanteraS container, but I'm still seeing the issue.

cookandy commented 8 years ago

I'm actually seeing the error immediately after startup:

panteras_1 | dnsmasq stderr | dnsmasq:  dnsmasq stderr | started, version 2.68 cachesize 150 dnsmasq stderr |
panteras_1 | dnsmasq stderr | dnsmasq:  dnsmasq stderr | compile time options: IPv6 GNU-getopt DBus i18n IDN DHCP DHCPv6 no-Lua TFTP conntrack ipset auth dnsmasq stderr |
panteras_1 | dnsmasq stderr | dnsmasq:  dnsmasq stderr | using nameserver 10.7.112.99#8600 for domain consul dnsmasq stderr |
panteras_1 | dnsmasq stderr | dnsmasq:  mesos-master stderr | I0107 20:41:21.575283    66 contender.cpp:149] Joining the ZK group
panteras_1 | dnsmasq stderr | reading /etc/resolv.conf.orig dnsmasq stderr |
panteras_1 | dnsmasq stderr | dnsmasq:  dnsmasq stderr | using nameserver 10.7.9.12#53 dnsmasq stderr |
panteras_1 | dnsmasq stderr | dnsmasq:  dnsmasq stderr | using nameserver 10.7.9.11#53 dnsmasq stderr |
panteras_1 | dnsmasq: using nameserver 10.7.112.99#53
panteras_1 | dnsmasq: using nameserver 10.7.112.99#8600 for domain consul
panteras_1 | dnsmasq: read /etc/hosts - 9 addresses
panteras_1 | chronos stdout | [2016-01-07 20:41:22,283] INFO --------------------- (org.apache.mesos.chronos.scheduler.Main$:26)
panteras_1 | chronos stdout | [2016-01-07 20:41:22,287] INFO Initializing chronos. (org.apache.mesos.chronos.scheduler.Main$:27)
panteras_1 | chronos stdout | [2016-01-07 20:41:22,290] INFO --------------------- (org.apache.mesos.chronos.scheduler.Main$:28)
panteras_1 | marathon stdout | [2016-01-07 20:41:22,901] INFO Starting Marathon 0.13.0 with --master zk://C01NHVD652:2181/mesos --zk zk://C01NHVD652:2181/marathon --hostname c01nhvd652.nh.corp --http_address 0.0.0.0 --https_address 0.0.0.0 (mesosphere.marathon.Main$:main)
panteras_1 | consul stdout | ==> WARNING: BootstrapExpect Mode is specified as 1; this is the same as Bootstrap mode.
panteras_1 | ==> WARNING: Bootstrap mode enabled! Do not enable unless necessary
panteras_1 | ==> Starting Consul agent...
panteras_1 | consul stdout | ==> Starting Consul agent RPC...
panteras_1 | ==> Joining cluster...
panteras_1 | dnsmasq stderr | dnsmasq: Maximum number of concurrent DNS queries reached (max: 150)

How can dnsmasq be saturated after it has only been running for less than 5 seconds?

sielaq commented 8 years ago

" I have stopped and removed the PanteraS container, but I'm still seeing the issue." I'm confused now, is it related to PanteraS or not ?

cookandy commented 8 years ago

Yes, I believe so. Right now I've only got a single server (master) running. When I start fresh with a new docker-compose up, I see the error during startup (as shown above). I don't understand how dnsmasq can reach its max connections immediately during startup.

sielaq commented 8 years ago

login into the PaaS container like docker exec -ti panteras_panteras_1 bash (name of container might be different ) check supervisorctl status multiple times if some services are running or re spawning all the time (PID should change) - might be one of the service have problems.

if all is stable, stop everything except dnsmasq , (supervisorctl stop <service>) and check if you still have problems. I suspect that this is some external application.

cookandy commented 8 years ago

Can you confirm the following requests are showing inbound to 10.7.112.99?

U 10.7.9.12:53 -> 10.7.112.99:19762
  v............MASTER-NODE.....
U 10.7.9.11:53 -> 10.7.112.99:19762
  .............MASTER-NODE.....
U 10.7.9.12:53 -> 10.7.112.99:19762
  .............MASTER-NODE.....
U 10.7.9.11:53 -> 10.7.112.99:19762

My master node is 10.7.112.99, and my corp DNS servers are 10.7.9.11 and 10.7.9.12. I only see these requests coming in when PanteraS is running, but it appears they're coming into port 19762.

sielaq commented 8 years ago

This is rather the response, 53 answer to port 19762. Something is running on 10.7.112.99 that asks 10.7.9.11:53:53

sielaq commented 8 years ago

ahh 10.7.112.99 is your PanetraS. Did you try to stop mesos/marathon/zookeper with supervisorctl as I asked ?

cookandy commented 8 years ago

Yes, 10.7.112.99 is my only PanteraS. I tried what you suggested with supervisord and I think the problem is with consul. First, the services don't appear to be restarting on their own. You can tell by the uptime:

chronos                          RUNNING   pid 23, uptime 0:25:02
consul                           RUNNING   pid 122, uptime 0:25:00
consul-template_haproxy          STOPPED   Not started
dnsmasq                          RUNNING   pid 15, uptime 0:25:02
marathon                         RUNNING   pid 19, uptime 0:25:02
mesos-master                     RUNNING   pid 18, uptime 0:25:02
mesos-slave                      STOPPED   Not started
registrator                      STOPPED   Not started
stdout                           RUNNING   pid 14, uptime 0:25:02
zookeeper                        RUNNING   pid 17, uptime 0:25:02

I stopped each of the services, one by one, until only dnsmasq was left running:

chronos                          STOPPED   Jan 07 09:25 PM
consul                           STOPPED   Jan 07 09:25 PM
consul-template_haproxy          STOPPED   Not started
dnsmasq                          RUNNING   pid 15, uptime 0:29:25
marathon                         STOPPED   Jan 07 09:25 PM
mesos-master                     STOPPED   Jan 07 09:26 PM
mesos-slave                      STOPPED   Not started
registrator                      STOPPED   Not started
stdout                           STOPPED   Jan 07 09:26 PM
zookeeper                        STOPPED   Jan 07 09:26 PM

At this point, I was still receiving requests on port 53 as shown in my previous comment. So I stopped dnsmasq, and then the requests stopped.

When I started dnsmasq again, I did not see the requests coming in. So I started each of the services one by one, and started noticing the flood immediately after starting consul. Could this have to do with the consul LISTEN_IP changes recently?

cookandy commented 8 years ago

I stopped just consul and dnsmasq and the flooding stopped. I then started dnsmasq and didn't see the flooding. Only after starting consul do I see the flooding...

cookandy commented 8 years ago

I went back to a build before the consul upgrade to 0.6 and the problem has gone away. So I'm thinking it's either something to do with the new version of consul, or the new LISTEN arguments in the docker-compose.

cookandy commented 8 years ago

I reverted to commit eec2c7673ebd49aa6a240234725d07aa79328e3d (before consul was updated to 0.6), and everything works as expected. Are you experiencing this issue?

cookandy commented 8 years ago

Ok, after hours of switching between commits, I finally got it working on the latest build. I haven't changed anything, so I'm not sure what has happened. I opened up another issue asking about proper DNS configuration. I'll close this one for now. Thanks for the help @sielaq!

cookandy commented 8 years ago

Ok, actually, I'm reopening this issue because I've gotten to the point where I can get my masters and slaves up and running without the DNS flood. However, the first docker container I spawn with Marathon causes the issue.

I start a docker image with marathon and it spawns a container ID of fe7c4453eb31. After that, I immediately start seeing the following DNS messages being flooded:

U 10.7.9.11:53 -> 10.7.112.99:56479
  .q...........fe7c4453eb31.....
U 10.7.112.99:56479 -> 10.7.9.12:53
  .q...........fe7c4453eb31.....
U 10.7.112.99:56479 -> 10.7.9.11:53
  .q...........fe7c4453eb31.....
U 10.7.9.12:53 -> 10.7.112.99:56479
  .q...........fe7c4453eb31.....
U 10.7.9.11:53 -> 10.7.112.99:56479
  .q...........fe7c4453eb31.....

As you can see, the query in the above snippet is referencing fe7c4453eb31. So it appears the docker container is trying to use DNS for some reason.

Even after I stop and remove the container, I still continue to see the DNS floods with the same reference to fe7c4453eb31. The only way I can get the flooding to stop is to stop all masters and slaves and restart.

Do you have any ideas of why this would be happening?

sielaq commented 8 years ago

Yea I also don't think it is a consul issue.

my 2 hypotheses :

  1. Looks like some loop for me. PanteraS asks your DNS U 10.7.112.99:56479 -> 10.7.9.12:53 which means it looks for your DNS instead of himself - where to looks for, then your DNS returns back (probably forward zones) that PanteraS should (know) resolve it. And circle repeats and flood starts. Can you show me from PanteraS container env | grep DNS
  2. It could be connected with last change, but I cannot reproduce it. Can you check if problem still exists, after ./generateyaml.sh remove stuff we have added: --bind-interfaces --listen-address=x.x.x.x from docker-compose.yml where DNSMASQ_APP_PARAMS is configured and run fresh:
docker-compose stop
docker-compose rm --force
docker-compose up -d
cookandy commented 8 years ago

I tend to agree with your theory about the DNS loop. Is there any way to tell my PanteraS hosts to only send requests to the corporate DNS servers if they are not service.consul lookups?

Here is the output of my env | grep DNS from each container:

master (IP address 10.0.0.10)

DNSMASQ_APP_PARAMS=-d  -u dnsmasq  -r /etc/resolv.conf.orig  -7 /etc/dnsmasq.d  --server=/consul/10.0.0.10#8600  --host-record=master-1.mydomain.corp,10.0.0.10  --bind-interfaces  --listen-address=0.0.0.0

slave 1 (IP address 10.0.0.11)

DNSMASQ_APP_PARAMS=-d  -u dnsmasq  -r /etc/resolv.conf.orig  -7 /etc/dnsmasq.d  --server=/consul/10.0.0.11#8600  --host-record=slave-1.mydomain.corp,10.0.0.11  --bind-interfaces  --listen-address=0.0.0.0  --address=/consul/10.0.0.11

slave 2 (IP address 10.0.0.12)

DNSMASQ_APP_PARAMS=-d  -u dnsmasq  -r /etc/resolv.conf.orig  -7 /etc/dnsmasq.d  --server=/consul/10.0.0.12#8600  --host-record=slave-2.mydomain.corp,10.0.0.12  --bind-interfaces  --listen-address=0.0.0.0  --address=/consul/10.0.0.12

You are correct. Removing the --bind-interfaces --listen-address=x.x.x.x from the generate_yml.sh hasn't seemed to make a difference. I verified that the parameters were gone by running env | grep DNS from within the running containers again.

cookandy commented 8 years ago

Actually, I retract my last comment. Removing the --bind-interfaces --listen-address=x.x.x.x from the generate_yml.sh has fixed the issue! I thought I saw flooding, but it was just our service starting up checking for updates, etc. With the --bind-interfaces --listen-address=x.x.x.x options our service never actually starts, but rather just gets stuck in the lookup loop (and continues to do so, even after the container has been removed/destroyed).

cookandy commented 8 years ago

More specifically, it seems to be the --bind-interfaces argument causing the problem. I'm not doing anything fancy with regards to networking. I've only got a single NIC (eth0) on each of the masters and slaves, and only have the one additional interface for docker0.

sielaq commented 8 years ago

I was able to reproduce it. That was very unexpected behavior from dnsmasq. I have made correction.

cookandy commented 8 years ago

Actually, it looks like --listen-address is also causing issues for my running container. When I start a container which calls out to lots of urls (maybe > 100), I notice a considerable delay in starting the service. My service normally takes around 2 minutes to start, and with the --listen-address active, it takes approximately 16 minutes.

The only change I have made is the --listen-address under DNSMASQ_APP_PARAMS in docker-compose.yml, so I know for sure it is the culprit.

Have you noticed the same? Should I open a new issue?

sielaq commented 8 years ago

I will fix it with this issue.