cloudfoundry / diego-release

BOSH Release for Diego

Multiple BBS instances think they are the leader: route-emitter not working properly. #201

Closed: keymon closed this issue 8 years ago

keymon commented 8 years ago

Summary

We had a situation in our production Cloud Foundry where two different BBS instances thought they were the leader.

Both BBS instances were registered in consul, which returned two IPs for bbs.service.cf.internal.

The issue seems to happen after one of the consul nodes (in a cluster of three) is marked as failed.

Environment description

Due to this situation, the route-emitter was connecting to the wrong BBS server and not receiving models.ActualLRPCreatedEvent events. This delayed the creation of routes for new instances by up to 60 seconds, until the periodic sync+emit routes routine ran.

Checking the connections of the route-emitter, we found connections to two different BBS instances:

# netstat -tpn  |grep 8889
tcp        0      0 10.0.17.8:55427         10.0.17.5:8889          ESTABLISHED 20622/route-emitter
tcp        0      0 10.0.17.8:55477         10.0.17.5:8889          ESTABLISHED 20622/route-emitter
tcp        0      0 10.0.17.8:51114         10.0.16.5:8889          ESTABLISHED 20622/route-emitter

This should not happen, as only one BBS should be active.

The consul service DNS responded with two IPs:

$ host bbs.service.cf.internal
bbs.service.cf.internal has address 10.0.16.5
bbs.service.cf.internal has address 10.0.17.5

Logs and events:

Attached you can find all the lock-related logs from the two BBS services in bbs.log. The issue starts around September 14th 2016, 19:41.

consul.log contains the logs from the consul servers around September 14th 2016, 19:41.

consul_agent_bbs.log contains the logs from the consul agents running on the BBS VMs. Some interesting messages there:

2016/09/14 19:40:41 [ERR] http: Request GET /v1/kv/v1/locks/bbs_lock?index=1214439&wait=15000ms, error: rpc error: No cluster leader from=127.0.0.1:53672

bbs.log.gz consul_agent_bbs.log.gz consul.log.gz
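The request path in the agent error above shows the consul KV key the BBS lock lives under (v1/locks/bbs_lock). As a hedged aside for anyone debugging a similar state, the current holder can be inspected through the local consul agent's HTTP API (this assumes the default agent port 8500):

$ curl -s http://127.0.0.1:8500/v1/kv/v1/locks/bbs_lock | python -m json.tool

The Session field in the response identifies the session that currently owns the lock (an absent Session means the key is not locked), which can be compared against which BBS instance believes it is the leader.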

cf-gitbot commented 8 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/130658109

The labels on this github issue will be updated when the story is started.

keymon commented 8 years ago

Specific Incident details (for reference)

Symptoms

Starting 18 September 2016 at 04:17, our monitoring, which is based on the CF smoke tests, started failing. Applications did not get their new routes registered after being pushed.

Investigating further, we found that any application, after a push, was taking up to 60 seconds to get its routes registered.

We analysed the route-emitter service. What we found suggested that the route-emitter was not receiving the models.ActualLRPCreatedEvent event from the BBS, even though the logs of the active BBS contained the related bbs.request.start-actual-lrp.* entries and the app was starting.

Root cause

Checking the connections of the route-emitter, we found connections to two different BBS instances:

# netstat -tpn  |grep 8889
tcp        0      0 10.0.17.8:55427         10.0.17.5:8889          ESTABLISHED 20622/route-emitter
tcp        0      0 10.0.17.8:55477         10.0.17.5:8889          ESTABLISHED 20622/route-emitter
tcp        0      0 10.0.17.8:51114         10.0.16.5:8889          ESTABLISHED 20622/route-emitter

This should not happen, as only one BBS should be active.

The consul service DNS responded with two IPs:

$ host bbs.service.cf.internal
bbs.service.cf.internal has address 10.0.16.5
bbs.service.cf.internal has address 10.0.17.5

Diagnosis

For several days (last lock acquired September 7th 2016, 17:31:12.157), BBS id 0 had held the lock and was running as the sole BBS leader instance.

But on September 14th 2016, 19:41:32.837, BBS id 1 managed to acquire the lock and became leader, while BBS id 0 did not lose its lock.

This caused a split-brain situation in the BBS cluster, but the impact was low, as both BBS instances used the same backend (etcd). The problem was mainly event distribution inside the BBS, but this was mitigated because all services, including the route-emitter, were still running and keeping their connections open to the same BBS instance.

At some point in time, the route-emitter re-established some of its connections and stopped getting the models.ActualLRPCreatedEvent message.

Question to investigate and report upstream: why did two BBS instances acquire the lock and remain in that state?
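For context on what acquiring the lock involves, here is a hedged sketch of consul's session-based locking using the plain HTTP API; these are the underlying primitives, not the exact calls the BBS client makes, and <session-id> / <presence-data> are illustrative placeholders:

# Create a session tied to the local node's health checks.
$ curl -s -X PUT http://127.0.0.1:8500/v1/session/create -d '{"Name": "bbs-lock"}'

# Try to take the lock with that session; consul returns "true" only if no
# live session currently holds the key.
$ curl -s -X PUT "http://127.0.0.1:8500/v1/kv/v1/locks/bbs_lock?acquire=<session-id>" -d '<presence-data>'

Under these semantics a second holder should only appear after the first session has been invalidated (for example during the leaderless period visible in the consul agent logs), so the question above presumably comes down to why BBS id 0 never noticed that its session was gone.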

Resolution

Restarting BBS id 1 did not solve the issue, as it was the instance holding the lock: the lock was simply released and acquired again.

Restarting BBS id 0 did fix the issue. That BBS service blocked while acquiring the lock in consul, and the service was de-registered from consul DNS.

All services with open connections reconnected to the right BBS instance.
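A quick check after such a restart (a sketch, assuming the local consul agent serves DNS on its default port 8600) is to confirm that the service record is back to a single address:

$ dig +short bbs.service.cf.internal @127.0.0.1 -p 8600

Only one IP should come back once the passive BBS is blocked on the lock and de-registered from DNS.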

emalm commented 8 years ago

Thanks for the excellent report, @keymon. We'll investigate this right away.

Best, Eric, CF Runtime Diego PM

jenspinney commented 8 years ago

Hi keymon,

Thanks for the detailed bug report. We have a few questions for you.

  1. Are the logs you provided filtered at all, or are they exactly the same as what you downloaded from bosh (apart from isolating the logs to a specific timeframe)?
  2. Were you doing a deployment around 19:40 on Sept 14th (the timestamp of the consul server logs)? It's interesting that so many nodes were unreachable by consul during that time.
  3. In your deployment manifest, do the static IPs listed for the consul cluster match across VMs that list them? We've seen a similar problem before when some nodes were missing one of the three consul cluster IPs.

Also, we believe it's unrelated here, but in general we recommend deploying an odd number of database (BBS) VMs if ETCD is your backend. The ETCD job is deployed on the same VM as the BBS job by default, and ETCD itself is a raft cluster. Thus, having 1 BBS or 2 BBSes makes no difference to high availability. Elections won't be possible within ETCD when there are only 2 instances, so if one fails, the other won't be able to be elected. For high availability with ETCD as the backend, we recommend deploying 3 BBS nodes.
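For reference, the majority arithmetic behind that recommendation (a raft cluster needs floor(n/2) + 1 live members to elect a leader):

    cluster size    quorum    failures tolerated
    1               1         0
    2               2         0
    3               2         1

A two-node cluster therefore tolerates no failures while doubling the chance that some member is down, which is why three nodes is the smallest useful HA size.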

Thanks, @jvshahid and @jenspinney

keymon commented 8 years ago

Are the logs you provided filtered at all, or are they exactly the same as what you downloaded from bosh (apart from isolating the logs to a specific timeframe)?

I extracted all the logs from our ELK platform. I don't think they are on the servers anymore; they have rotated. We use the logsearch distribution, and loggregator with the metron agent. I got the logs by filtering by job name & component. I cannot rule out that some logs might have been dropped.

The BBS logs were indeed filtered to remove everything not related to locks or start/stop. The query was something like: NOT @message:bbs.request.* AND NOT @message:bbs.converge-lrps* AND NOT @message:bbs.executing-convergence* AND NOT @message:bbs.task-completion-workpool* AND NOT @message:bbs.actual-lrp-handler* AND @message:bbs*

Were you doing a deployment around 19:40 on Sept 14th (the timestamp of the consul server logs)? It's interesting that so many nodes were unreachable by consul during that time.

No, nothing AFAIK. No tasks ran on bosh at that time.

I will have a look in the AWS console and come back later.

In your deployment manifest, do the static IPs listed for the consul cluster match across VMs that list them? We've seen a similar problem before when some nodes were missing one of the three consul cluster IPs.

Yes, all the consul IPs are set up everywhere. We deploy consul agents on all systems with the same config, including the three different IPs of the masters.

Also, we believe it's unrelated here, but in general, we recommend deploying an odd number of database (BBS) VMs if ETCD is your backend.

We deploy etcd on three dedicated servers, one in each AZ, so the etcd cluster is fine.

keymon commented 8 years ago

Were you doing a deployment around 19:40 on Sept 14th (the timestamp of the consul server logs)? It's interesting that so many nodes were unreachable by consul during that time.

No, nothing AFAIK. No tasks ran on bosh at that time. I will have a look in the AWS console and come back later.

I cannot find anything at all in the AWS console or on the servers. No interface errors. Only consul going nuts.

jvshahid commented 8 years ago

I think the most confusing thing about the bbs logs is that the bbs job with index 0 doesn't have any log lines after September 11th 2016, 09:05:12. It is very hard to know what was going on on that bbs when the consul cluster had no leader. Is it possible to get the full logs?

keymon commented 8 years ago

Attaching all the logs from the "database" (bbs) VMs from 2016-09-14 19:35 to 19:45: database-all.log.gz

keymon commented 8 years ago

I have more information. The database/0 VM stopped sending log messages on Sept 14th at 19:40, and they only came back on Sept 17th at 17:19, even though the VM was up the whole time. This correlates with the CPU usage:

[image: CPU usage graph]

In the ELK stack I don't have many logs, and I suspect that some logs might have been dropped. But I was able to recover the logs from the bbs on database.0 and the consul_agent on database.0 and database.1; please find them attached.

You can observe that on the 17th at 17:18 there is a suspicious message that might mean that database.0 registered itself back as a service:

    2016/09/17 17:18:28 [INFO] agent: Synced service 'bbs'
    2016/09/17 17:18:28 [INFO] agent: Synced check 'service:bbs'

database.1.consul_agent.stdout.log.gz database.0.consul_agent.stdout.log.gz database.0.bbs.stdout.log.gz

keymon commented 8 years ago

Adding logs from route_emitter/1. The route-emitter.watcher.handling-actual-create.complete message was not being logged from 09/19 10:57:39.51 to 15:35:03.05: route_emitter.0.route_emitter.stdout.log.gz
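For anyone looking at the attached file, the gap is visible by grepping for that message and comparing the timestamps of consecutive matches, for example:

$ zcat route_emitter.0.route_emitter.stdout.log.gz | grep handling-actual-create.complete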

keymon commented 8 years ago

All raw logs from the BBS hosts are here: https://s3-eu-west-1.amazonaws.com/hector-state/database.0.bbs.stdout.log.gz https://s3-eu-west-1.amazonaws.com/hector-state/database.1.bbs.stdout.log.gz

Note: bbs.service.cf.internal resolving to two entries was happening between (at least) 09/19 10:57 and 15:35.

bbs.service.cf.internal has address 10.0.17.5

jvshahid commented 8 years ago

@keymon I'm not sure what to do with these logs. It is really hard to make sense of them. I'm trying to find a timestamp in the logs where the two bbs instances were responding to requests or thought they had the locks, but it is really difficult. bbs.log.gz makes it even more confusing, since the last log line that appears from job database index 0 is on 2016/09/07, but the raw logs have lock.started on the following dates:

which makes me believe that the bbs logs provided earlier are aggressively filtered. Can you please provide the full bbs logs from when you saw multiple bbs job IPs in consul's DNS response? Another question: are the timestamps mentioned in the previous comment in UTC or another timezone?

jvshahid commented 8 years ago

I'm also curious to know if you have the consul.agent.dns_config.allow_stale property set to true for the consul agents. If yes, what's the value of consul.agent.dns_config.max_stale?
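As a hedged aside for anyone checking their own deployment: the running agent's view of these settings can usually be read back from its HTTP API (this assumes the default agent port 8500 and that the DNS configuration is included in the /v1/agent/self output for the consul version in use):

$ curl -s http://127.0.0.1:8500/v1/agent/self | python -m json.tool | grep -i stale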

keymon commented 8 years ago

No, we don't have the stale setting enabled.

About the logs: in comment https://github.com/cloudfoundry/diego-release/issues/201#issuecomment-248671156 you can find all the raw logs we have.

I cannot rule out that some logs were dropped in the first ones.

caod123 commented 8 years ago

Combined, sorted, and formatted all the lager log lines into one file, combined-sorted-formatted.log, in hopes of providing better readability. View it formatted with cat combined-sorted-formatted.log | less -r. Also stripped out the lager log lines from the consul_agent stdout from database-0 and database-1.

logs-combined.tar.gz
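A rough sketch of how such a merge can be reproduced locally (not necessarily the exact steps used for the attached file; it assumes jq is available and that every input line is a lager JSON record with a numeric string timestamp field):

$ zcat database.0.bbs.stdout.log.gz database.1.bbs.stdout.log.gz \
    | jq -c -s 'sort_by(.timestamp | tonumber) | .[]' > combined-sorted.log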

caod123 commented 8 years ago

@keymon Unfortunately, we are unable to make much progress diagnosing this issue due to the gaps in the logs and the lack of information about what exactly happened to cause this. As a result we'll be closing this out for now, but please let us know if you encounter it again or determine reproducible steps so we can investigate more on our end. If you do see it again, try to dump as many logs as you can while they're still accessible. In the meantime we'll try to forcibly reproduce this issue in a test environment.

emalm commented 8 years ago

Thanks again, @keymon. I've prioritized a separate story in the Diego backlog for the team to attempt to reproduce this situation independently: https://www.pivotaltracker.com/story/show/131822105.

Best, Eric