Consul_agent is deleting dns entries in resolv.conf

cloudfoundry-attic / consul-release

This is a BOSH release for consul.

Apache License 2.0


Consul_agent is deleting dns entries in resolv.conf #9

Closed vvraskin closed 8 years ago

vvraskin commented 8 years ago

The commit f57c7e58f1665369f040a459a46d6a0036aaace9 partially fixes the problem. However, when resolvconf updates are enabled, it does not protect against changes made directly to /etc/resolv.conf without writing to /etc/resolvconf/resolv.conf.d/head.

So if my /etc/resolv.conf is updated by bosh_agent or any BOSH job before the consul agent is deployed, these entries will be wiped by the consul_agent start script.

My suggestion would be to check the content of /etc/resolv.conf even if resolvconf updates are enabled, add the already existing config to /etc/resolvconf/resolv.conf.d/tail, and then execute 'resolvconf -u'. (I assume it picks up the tail when updating.)
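The suggestion above could be sketched roughly as follows. This is only an illustration, not the release's start script: the function names are hypothetical, and skipping 127.0.0.1 assumes the consul agent's own entry is managed separately through the head file. The sketch also does not distinguish entries that resolvconf already manages from entries written directly to /etc/resolv.conf, so it may copy more than strictly needed.

```python
import subprocess

RESOLV_CONF = "/etc/resolv.conf"
TAIL_FILE = "/etc/resolvconf/resolv.conf.d/tail"


def nameservers(resolv_conf_text):
    """Return nameserver addresses in the order they appear."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def merge_into_tail(resolv_conf_text, tail_text, skip=("127.0.0.1",)):
    """Append to the tail any nameserver from resolv.conf that the tail lacks.

    Addresses in `skip` are left out (assumption: the local consul entry
    is handled elsewhere).
    """
    known = set(nameservers(tail_text))
    lines = tail_text.splitlines()
    for server in nameservers(resolv_conf_text):
        if server not in known and server not in skip:
            lines.append("nameserver " + server)
            known.add(server)
    return ("\n".join(lines) + "\n") if lines else ""


def preserve_existing_entries():
    """Hypothetical helper: needs root and a system with resolvconf installed."""
    with open(RESOLV_CONF) as f:
        current = f.read()
    try:
        with open(TAIL_FILE) as f:
            tail = f.read()
    except FileNotFoundError:
        tail = ""
    with open(TAIL_FILE, "w") as f:
        f.write(merge_into_tail(current, tail))
    # Regenerate /etc/resolv.conf from the head, interface records, and tail.
    subprocess.run(["resolvconf", "-u"], check=True)
```

Only `merge_into_tail` is pure; `preserve_existing_entries` touches system files and is shown to indicate where 'resolvconf -u' would fit.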

cf-gitbot commented 8 years ago

We have created an issue in Pivotal Tracker to manage this. You can view the current status of your issue at: https://www.pivotaltracker.com/story/show/112226229.

Amit-PivotalLabs commented 8 years ago

@vvraskin the consul_agent start script will only mess with /etc/resolv.conf if resolvconf updates are disabled, which is not the case on the AWS, vSphere, and OpenStack stemcells, and is not even the case on newer BOSH-Lite stemcells. We plan to remove that logic altogether at some point soon: https://www.pivotaltracker.com/story/show/109858930.

In the meantime, is this issue affecting you in your setup? Can it be fixed by moving to a more recent BOSH-Lite stemcell?

vvraskin commented 8 years ago

Hi @Amit-PivotalLabs. Thanks for the update.

In our case resolvconf updates are enabled, since the BOSH agent uses it to update DNS. But having this option enabled does not stop anyone from updating /etc/resolv.conf directly. For example, we have some workarounds that change the order of DNS entries on Diego nodes by operating directly on /etc/resolv.conf. That is one more place where DNS can get messed up. Maybe we need to think about using resolvconf -u as well.

We operate on SoftLayer and need to check when the latest stemcell will be picked up.

Thank you.

Amit-PivotalLabs commented 8 years ago

@vvraskin the old behaviour was:

On BOSH-Lite, prepend 127.0.0.1 to the top of /etc/resolv.conf
On other stemcells, add 127.0.0.1 to /etc/resolv.conf.d/head (can't remember the exact name of the file)

Both had the same effect of making the local consul agent the primary DNS server, and then falling back to other DNS servers configured by the BOSH agent. We saw this eventually lead to some strange 5s or 10s timeouts for certain requests.
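As a rough illustration of the BOSH-Lite variant of the old behaviour, the sketch below puts a 127.0.0.1 entry at the top of a resolv.conf-style text while keeping the remaining lines in order. The function name is hypothetical and this is not the code from the release.

```python
def prepend_local_resolver(resolv_conf_text, local="127.0.0.1"):
    """Return the text with `nameserver <local>` as the first line.

    Any existing entry for `local` is dropped first, so applying the
    function twice gives the same result as applying it once.
    """
    wanted = ["nameserver", local]
    kept = [
        line for line in resolv_conf_text.splitlines()
        if line.split() != wanted
    ]
    return "\n".join([" ".join(wanted)] + kept) + "\n"
```

With this ordering the resolver asks the local consul agent first and only moves on to the servers configured by the BOSH agent afterwards.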

We solved that by configuring consul_agent to act as a recursor, so that it would recurse to the other DNS servers in /etc/resolv.conf if it could not resolve something (i.e. anything other than *.service.cf.internal).
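A minimal sketch of how a recursor list might be derived from /etc/resolv.conf: Consul's agent configuration accepts a `recursors` key listing upstream DNS servers, but how the release's start script actually builds that list is not shown in this thread, so the function names and the exclusion of 127.0.0.1 are assumptions made for illustration.

```python
import json


def recursor_config(resolv_conf_text, local_addresses=("127.0.0.1",)):
    """Build a Consul config fragment from resolv.conf-style text.

    Local addresses are excluded so the agent does not recurse to itself
    (assumption), and duplicates are dropped while preserving order.
    """
    recursors = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            address = fields[1]
            if address not in local_addresses and address not in recursors:
                recursors.append(address)
    return {"recursors": recursors}


def render(config):
    """Serialize the fragment as JSON, the format Consul config files use."""
    return json.dumps(config, indent=2, sort_keys=True)
```

The agent would then answer *.service.cf.internal itself and forward everything else to the listed servers.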

Prior to the recursor feature in consul_agent, the workaround people were trying was to move 127.0.0.1 to the bottom of /etc/resolv.conf. So if that's the workaround you're doing for this reason, you might just want to pick up a newer consul release that has the recursor behaviour.

vvraskin commented 8 years ago

@Amit-PivotalLabs Our workaround was to change the position of the Google DNS entry on the Diego cells. The issue was that Diego cells were failing to resolve GitHub's address during the staging phase: the response returned by PowerDNS did not allow the resolver to proceed to the next DNS entry.

Now I think the problem is in our BOSH director. We are going to update it in the coming days, check the cells, and remove our scripts. Then, if no other component is changing the resolv.conf entries, I think consul should not cause any problems.

Thank you!

Amit-PivotalLabs commented 8 years ago

@vvraskin I'll close out this issue for now. If you hit problems after trying your updates, please feel free to open up a new issue.

Best, Amit