You can raise the default (150) limit on concurrent queries with:
-0 300
or --dns-forward-max=300
You can add it like this:
echo 'DNSMASQ_PARAMS="--dns-forward-max=300"' >> restricted/host
You can also use consul as your default DNS instead of dnsmasq:
export START_DNSMASQ=false
export CONSUL_PARAMS="-config-dir=/etc/consul.d/"
./generate_yml.sh
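A rough way to double-check that the option was actually picked up after regenerating and restarting (the container name is just an example, and ps being available inside the image is an assumption):
docker exec -ti panteras_panteras_1 env | grep DNS
docker exec -ti panteras_panteras_1 ps aux | grep dnsmasq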
Ok, thanks @sielaq! I will try making that change. The weird thing is, I'm not sure why I am suddenly getting this error. I haven't added any new servers to the cluster, so I wouldn't expect any more DNS traffic than usual. Is there anything inside of PanteraS that would be flooding dnsmasq? Lastly, do you know if there's a way to look closer at the dnsmasq logs to see which host is causing the problem?
If this is caused by one of your services then yes, it looks like your service is flooding. The PanteraS infrastructure itself does not use much DNS.
Can you trace what exactly is flooding?
Strange because I don't actually have any services running and I still get the error.
Can I look at the dnsmasq logs to determine which server is the culprit? I've only got 2 slaves and 1 master and they're all configured to use the master as the DNS server... so I wouldn't expect it to be flooded...
apt-get install ngrep tcpdump
check what's going on:
ngrep -l -q udp and port 53
or
tcpdump -vvv -s 0 -l -n port 53
Yeah, you can log queries too, with an additional dnsmasq option:
-q
You should see them in the docker-compose logs.
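For example, assuming you use the restricted/host mechanism from above (and that your setup appends to rather than overrides DNSMASQ_PARAMS), a minimal sketch to enable query logging and look at it could be:
echo 'DNSMASQ_PARAMS="-q"' >> restricted/host
./generate_yml.sh
docker-compose up -d
docker-compose logs | grep dnsmasq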
Thanks! I ran the commands and originally saw a bunch of traffic (probably outbound) going to my corporate DNS servers. So I removed the corp DNS servers from /etc/resolv.conf (so my master node is the only nameserver listed) and re-ran the commands.
I never saw any traffic (using either of the commands), yet I'm still experiencing the error. I have stopped and removed the PanteraS container, but I'm still seeing the issue.
I'm actually seeing the error immediately after startup:
panteras_1 | dnsmasq stderr | dnsmasq: dnsmasq stderr | started, version 2.68 cachesize 150 dnsmasq stderr |
panteras_1 | dnsmasq stderr | dnsmasq: dnsmasq stderr | compile time options: IPv6 GNU-getopt DBus i18n IDN DHCP DHCPv6 no-Lua TFTP conntrack ipset auth dnsmasq stderr |
panteras_1 | dnsmasq stderr | dnsmasq: dnsmasq stderr | using nameserver 10.7.112.99#8600 for domain consul dnsmasq stderr |
panteras_1 | dnsmasq stderr | dnsmasq: mesos-master stderr | I0107 20:41:21.575283 66 contender.cpp:149] Joining the ZK group
panteras_1 | dnsmasq stderr | reading /etc/resolv.conf.orig dnsmasq stderr |
panteras_1 | dnsmasq stderr | dnsmasq: dnsmasq stderr | using nameserver 10.7.9.12#53 dnsmasq stderr |
panteras_1 | dnsmasq stderr | dnsmasq: dnsmasq stderr | using nameserver 10.7.9.11#53 dnsmasq stderr |
panteras_1 | dnsmasq: using nameserver 10.7.112.99#53
panteras_1 | dnsmasq: using nameserver 10.7.112.99#8600 for domain consul
panteras_1 | dnsmasq: read /etc/hosts - 9 addresses
panteras_1 | chronos stdout | [2016-01-07 20:41:22,283] INFO --------------------- (org.apache.mesos.chronos.scheduler.Main$:26)
panteras_1 | chronos stdout | [2016-01-07 20:41:22,287] INFO Initializing chronos. (org.apache.mesos.chronos.scheduler.Main$:27)
panteras_1 | chronos stdout | [2016-01-07 20:41:22,290] INFO --------------------- (org.apache.mesos.chronos.scheduler.Main$:28)
panteras_1 | marathon stdout | [2016-01-07 20:41:22,901] INFO Starting Marathon 0.13.0 with --master zk://C01NHVD652:2181/mesos --zk zk://C01NHVD652:2181/marathon --hostname c01nhvd652.nh.corp --http_address 0.0.0.0 --https_address 0.0.0.0 (mesosphere.marathon.Main$:main)
panteras_1 | consul stdout | ==> WARNING: BootstrapExpect Mode is specified as 1; this is the same as Bootstrap mode.
panteras_1 | ==> WARNING: Bootstrap mode enabled! Do not enable unless necessary
panteras_1 | ==> Starting Consul agent...
panteras_1 | consul stdout | ==> Starting Consul agent RPC...
panteras_1 | ==> Joining cluster...
panteras_1 | dnsmasq stderr | dnsmasq: Maximum number of concurrent DNS queries reached (max: 150)
How can dnsmasq be saturated after it has only been running for less than 5 seconds?
" I have stopped and removed the PanteraS container, but I'm still seeing the issue." I'm confused now, is it related to PanteraS or not ?
Yes, I believe so. Right now I've only got a single server (master) running. When I start fresh with a new docker-compose up, I see the error during startup (as shown above). I don't understand how dnsmasq can reach its max connections immediately during startup.
Log into the PaaS container, e.g. docker exec -ti panteras_panteras_1 bash (the name of the container might be different).
Check supervisorctl status multiple times to see whether some services are running or respawning all the time (the PID would change) - one of the services might have problems.
If everything is stable, stop everything except dnsmasq (supervisorctl stop <service>) and check if you still have the problem.
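A rough interactive sketch of that (the service names are the ones from a typical PanteraS master and may differ on your host):
docker exec -ti panteras_panteras_1 bash
supervisorctl status            # run a few times and watch whether any PID changes
supervisorctl stop consul       # then marathon, chronos, mesos-master, zookeeper, ...
supervisorctl status            # in the end only dnsmasq should be left RUNNING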
I suspect that this is some external application.
Can you confirm the following requests are showing inbound to 10.7.112.99?
U 10.7.9.12:53 -> 10.7.112.99:19762
v............MASTER-NODE.....
U 10.7.9.11:53 -> 10.7.112.99:19762
.............MASTER-NODE.....
U 10.7.9.12:53 -> 10.7.112.99:19762
.............MASTER-NODE.....
U 10.7.9.11:53 -> 10.7.112.99:19762
My master node is 10.7.112.99, and my corp DNS servers are 10.7.9.11 and 10.7.9.12. I only see these requests coming in when PanteraS is running, but it appears they're coming into port 19762.
This is rather the response: an answer from port 53 back to port 19762. Something running on 10.7.112.99 is asking 10.7.9.11:53.
Ahh, 10.7.112.99 is your PanteraS. Did you try to stop mesos/marathon/zookeeper with supervisorctl as I asked?
Yes, 10.7.112.99 is my only PanteraS. I tried what you suggested with supervisord and I think the problem is with consul. First, the services don't appear to be restarting on their own. You can tell by the uptime:
chronos RUNNING pid 23, uptime 0:25:02
consul RUNNING pid 122, uptime 0:25:00
consul-template_haproxy STOPPED Not started
dnsmasq RUNNING pid 15, uptime 0:25:02
marathon RUNNING pid 19, uptime 0:25:02
mesos-master RUNNING pid 18, uptime 0:25:02
mesos-slave STOPPED Not started
registrator STOPPED Not started
stdout RUNNING pid 14, uptime 0:25:02
zookeeper RUNNING pid 17, uptime 0:25:02
I stopped each of the services, one by one, until only dnsmasq was left running:
chronos STOPPED Jan 07 09:25 PM
consul STOPPED Jan 07 09:25 PM
consul-template_haproxy STOPPED Not started
dnsmasq RUNNING pid 15, uptime 0:29:25
marathon STOPPED Jan 07 09:25 PM
mesos-master STOPPED Jan 07 09:26 PM
mesos-slave STOPPED Not started
registrator STOPPED Not started
stdout STOPPED Jan 07 09:26 PM
zookeeper STOPPED Jan 07 09:26 PM
At this point, I was still receiving requests on port 53 as shown in my previous comment. So I stopped dnsmasq, and then the requests stopped.
When I started dnsmasq again, I did not see the requests coming in. So I started each of the services one by one, and started noticing the flood immediately after starting consul. Could this have to do with the recent consul LISTEN_IP changes?
I stopped just consul and dnsmasq and the flooding stopped. I then started dnsmasq and didn't see the flooding. Only after starting consul do I see the flooding...
I went back to a build before the consul upgrade to 0.6 and the problem went away. So I'm thinking it's either something to do with the new version of consul, or the new LISTEN arguments in the docker-compose.
I reverted to commit eec2c7673ebd49aa6a240234725d07aa79328e3d (before consul was updated to 0.6), and everything works as expected. Are you experiencing this issue?
Ok, after hours of switching between commits, I finally got it working on the latest build. I haven't changed anything, so I'm not sure what has happened. I opened up another issue asking about proper DNS configuration. I'll close this one for now. Thanks for the help @sielaq!
Ok, actually, I'm reopening this issue because I've gotten to the point where I can get my masters and slaves up and running without the DNS flood. However, the first docker container I spawn with Marathon causes the issue.
I start a docker image with marathon and it spawns a container with ID fe7c4453eb31. After that, I immediately start seeing the following DNS messages being flooded:
U 10.7.9.11:53 -> 10.7.112.99:56479
.q...........fe7c4453eb31.....
U 10.7.112.99:56479 -> 10.7.9.12:53
.q...........fe7c4453eb31.....
U 10.7.112.99:56479 -> 10.7.9.11:53
.q...........fe7c4453eb31.....
U 10.7.9.12:53 -> 10.7.112.99:56479
.q...........fe7c4453eb31.....
U 10.7.9.11:53 -> 10.7.112.99:56479
.q...........fe7c4453eb31.....
As you can see, the query in the above snippet is referencing fe7c4453eb31, so it appears the docker container is trying to use DNS for some reason.
Even after I stop and remove the container, I still continue to see the DNS floods with the same reference to fe7c4453eb31. The only way I can get the flooding to stop is to stop all masters and slaves and restart.
Do you have any idea why this is happening?
Yeah, I also don't think it is a consul issue.
My 2 hypotheses:
U 10.7.112.99:56479 -> 10.7.9.12:53
means the lookup goes to your corporate DNS instead of being resolved locally; your DNS then (probably via forward zones) points back at PanteraS as the server that should resolve it. The circle repeats and the flood starts.
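One rough way to check for such a loop by hand (a sketch; the addresses and the queried name are taken from your capture above):
dig @10.7.9.11 fe7c4453eb31 +time=2 +tries=1
dig @10.7.112.99 fe7c4453eb31 +time=2 +tries=1
If both sides just forward to each other, you would typically see timeouts or SERVFAIL on both, while ngrep keeps showing the same query bouncing between the two addresses.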
Can you show me env | grep DNS from the PanteraS container?
Run ./generate_yml.sh, remove the stuff we have added (--bind-interfaces --listen-address=x.x.x.x) from docker-compose.yml where DNSMASQ_APP_PARAMS is configured, and start fresh:
docker-compose stop
docker-compose rm --force
docker-compose up -d
I tend to agree with your theory about the DNS loop. Is there any way to tell my PanteraS hosts to only send requests to the corporate DNS servers if they are not service.consul lookups?
Here is the output of env | grep DNS from each container:
master (IP address 10.0.0.10)
DNSMASQ_APP_PARAMS=-d -u dnsmasq -r /etc/resolv.conf.orig -7 /etc/dnsmasq.d --server=/consul/10.0.0.10#8600 --host-record=master-1.mydomain.corp,10.0.0.10 --bind-interfaces --listen-address=0.0.0.0
slave 1 (IP address 10.0.0.11)
DNSMASQ_APP_PARAMS=-d -u dnsmasq -r /etc/resolv.conf.orig -7 /etc/dnsmasq.d --server=/consul/10.0.0.11#8600 --host-record=slave-1.mydomain.corp,10.0.0.11 --bind-interfaces --listen-address=0.0.0.0 --address=/consul/10.0.0.11
slave 2 (IP address 10.0.0.12)
DNSMASQ_APP_PARAMS=-d -u dnsmasq -r /etc/resolv.conf.orig -7 /etc/dnsmasq.d --server=/consul/10.0.0.12#8600 --host-record=slave-2.mydomain.corp,10.0.0.12 --bind-interfaces --listen-address=0.0.0.0 --address=/consul/10.0.0.12
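For what it's worth, the split I'm asking about seems to be essentially what dnsmasq is already doing here: --server=/consul/<ip>#8600 sends only names under .consul to consul on port 8600, while everything else follows the servers from -r /etc/resolv.conf.orig. A stand-alone dnsmasq.conf sketch of the same idea (IPs are the example addresses from above) would be:
server=/consul/10.0.0.10#8600
resolv-file=/etc/resolv.conf.orig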
You are correct. Removing the --bind-interfaces --listen-address=x.x.x.x from generate_yml.sh hasn't seemed to make a difference. I verified that the parameters were gone by running env | grep DNS from within the running containers again.
Actually, I retract my last comment. Removing the --bind-interfaces --listen-address=x.x.x.x from generate_yml.sh has fixed the issue! I thought I saw flooding, but it was just our service checking for updates, etc., as it started up. With the --bind-interfaces --listen-address=x.x.x.x options, our service never actually starts, but rather just gets stuck in the lookup loop (and continues to do so even after the container has been removed/destroyed).
More specifically, it seems to be the --bind-interfaces argument causing the problem. I'm not doing anything fancy with regards to networking. I've only got a single NIC (eth0) on each of the masters and slaves, and only have the one additional interface for docker0.
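For reference, a quick way to see which addresses dnsmasq actually ended up listening on inside the container (assuming netstat or ss is present in the image):
netstat -ulnp | grep ':53'
or
ss -ulnp | grep ':53'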
I was able to reproduce it.
That was very unexpected behavior from dnsmasq. I have made a correction.
Actually, it looks like --listen-address is also causing issues for my running container. When I start a container which calls out to lots of URLs (maybe > 100), I notice a considerable delay in starting the service. My service normally takes around 2 minutes to start, and with --listen-address active, it takes approximately 16 minutes.
The only change I have made is the --listen-address under DNSMASQ_APP_PARAMS in docker-compose.yml, so I know for sure it is the culprit.
Have you noticed the same? Should I open a new issue?
I will fix it with this issue.
Hi,
For some reason, recently, I've started experiencing the following error:
As a result, my lookups are failing for my marathon services. Do you have any idea what could be causing this, or if there's a way to look at the dnsmasq logs to determine where the queries are coming from?
Thanks!