bosh-dns query resolves ip from other deployment

cloudfoundry / bosh-dns-release

BOSH DNS release

Apache License 2.0

18 stars 36 forks source link

bosh-dns query resolves ip from other deployment #77

Closed kinjelom closed 3 years ago

kinjelom commented 3 years ago

Steps to reproduce (works only in some cases):

create deployment A with 2 nodes: A/0, A/1
dig q-s0.mg.default.a.bosh - two ips
bosh delete-vm A/0
dig q-s0.mg.default.a.bosh - one ip
create deployment B with 2 nodes: B/0, B/1 - in some cases deployment B uses ip of deleted VM A/0=B/0
dig q-s0.mg.default.a.bosh - two ips: A/0=B/0, A/1

I don't know where the bug is: bosh-dns, bosh-director or maybe bosh-cpi for OpenStack.

I think that bosh-dns should not consider as healthy machines that responds from other deployment that being queried.

klakin-pivotal commented 3 years ago

I'm a little confused about step 6.

You mention that you get back two IPs, A/0 and B/0... however you also mention: a) Deleting the A/0 VM b) Sometimes one of the B instances uses A/0's IP address.

Some questions: 1) dig in step 6 returns two A records, yes? 2) If it does return two A records, are the IPs different? 3) If two A records are returned, how do you know that one corresponds to A/0 and the other B/0?

kinjelom commented 3 years ago

@klakin-pivotal sorry, my mistake - now I have corrected the description of steps 5 and 6.

It really happened and one Postgres Haproxy was redirecting to the nodes from different cluster/deployment. Unfortunately, despite many attempts, I was unable to recreate this situation.

rkoster commented 3 years ago

Sound like the instance from which the dig was performed was not yet updated with the latest DNS entries. These are synced by the sync-dns process (logs: /var/vcap/sys/log/director/sync_dns.stdout.log). Maybe that process was having problems at the time our your test. Normally it should sync dns entries every 10 seconds. So with old state bosh-dns would just perform a local health check to B/0 and think A/0 is back.

Point 4. could be explained by bosh-dns removing A/0 from its response because a local failed health check.

Since the issue is not reproducible I'm gonna close this issue.