We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/130658109
The labels on this github issue will be updated when the story is started.
Starting 18 September 2016 at 04:17, our monitoring, which is based on the CF smoke tests, started failing. Applications were not getting their new routes registered after being pushed.
Investigating further, we found that any application, after a push, was taking up to 60s to get its routes registered.
We analysed the route-emitter service and discovered that:
- route-emitter was not sending the corresponding NATS message when an application was created: the log lines route-emitter.watcher.handling-actual-create.complete and route-emitter.watcher.handling-actual-update.watcher-add-and-emit were missing.
- route-emitter was only sending the routes after a sync (route-emitter.syncer.syncing) and emit (route-emitter.watcher.handling-actual-update.watcher-add-and-emit).
This suggested that route-emitter was not receiving the models.ActualLRPCreatedEvent event from the BBS. But the logs of the active BBS did contain the related bbs.request.start-actual-lrp.* entries, and the app was starting.
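A minimal way to confirm the gap, as a sketch: count the relevant lager messages in the route-emitter logs. The path below follows standard BOSH log conventions and is an assumption.

```sh
# Sketch: count watcher "create" events vs periodic syncer events per log file.
# The log path is an assumption based on standard BOSH locations.
for f in /var/vcap/sys/log/route_emitter/route_emitter.stdout.log*; do
  echo "== $f"
  echo -n "handling-actual-create.complete: "
  zgrep -c 'route-emitter.watcher.handling-actual-create.complete' "$f" || true
  echo -n "syncer.syncing: "
  zgrep -c 'route-emitter.syncer.syncing' "$f" || true
done
```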
Checking the connections of the route-emitter, we found that there were connections to two different BBS instances:
# netstat -tpn |grep 8889
tcp 0 0 10.0.17.8:55427 10.0.17.5:8889 ESTABLISHED 20622/route-emitter
tcp 0 0 10.0.17.8:55477 10.0.17.5:8889 ESTABLISHED 20622/route-emitter
tcp 0 0 10.0.17.8:51114 10.0.16.5:8889 ESTABLISHED 20622/route-emitter
This should not happen, as only one BBS should be active.
The consul service DNS responded with two IPs:
$ host bbs.service.cf.internal
bbs.service.cf.internal has address 10.0.16.5
bbs.service.cf.internal has address 10.0.17.5
BBS id 0 had held the lock for several days (last lock acquired September 7th 2016, 17:31:12.157) and was running as the single BBS leader instance.
But on September 14th 2016, 19:41:32.837, BBS id 1 managed to acquire the lock and became leader, while BBS id 0 never lost its lock.
This caused a split-brain situation in the BBS cluster, but the impact was low because both BBS instances used the same backend (etcd). The problem was mainly the event distribution inside the BBS, and this was mitigated because all services, including route-emitter, were running and keeping their connections open to the same BBS instance.
At some point in time, route-emitter re-established some of its connections and stopped receiving the models.ActualLRPCreatedEvent messages.
Question to investigate and report upstream: why did two BBS instances acquire the lock and remain in that state?
Restarting BBS id 1 did not solve the issue, as it was the instance holding the lock: the lock was released and acquired again by the same instance.
Restarting BBS id 0 did fix the issue: the BBS service blocked while acquiring the lock in consul, and the service was de-registered from the consul DNS.
All services with open connections then reconnected to the right BBS instance.
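For reference, a sketch of the mitigation in BOSH terms (v1 CLI; it assumes BBS id 0 corresponds to the database/0 VM of this deployment and that the restart was done through BOSH rather than monit):

```sh
# Restart BBS id 0, the instance with the stale lock/registration; per the report this fixed the issue.
bosh restart database 0

# Verify the service DNS collapsed back to a single entry...
host bbs.service.cf.internal

# ...and that route-emitter only holds connections to that one BBS.
netstat -tpn | grep 8889
```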
Thanks for the excellent report, @keymon. We'll investigate this right away.
Best, Eric, CF Runtime Diego PM
Hi keymon,
Thanks for the detailed bug report. We had a few questions for you.
Also, we believe it's unrelated here, but in general we recommend deploying an odd number of database (BBS) VMs if ETCD is your backend. The ETCD job is deployed on the same VM as the BBS job by default, and ETCD itself is a raft cluster, so having one BBS VM or two makes no difference to high availability. Elections won't be possible within ETCD when there are only two instances, so if one fails, the other won't be able to be elected. For high availability with ETCD as the backend, we recommend deploying three BBS nodes.
Thanks, @jvshahid and @jenspinney
Are the logs you provided filtered at all, or are they exactly what you downloaded from bosh (apart from restricting the logs to a specific timeframe)?
I extracted all the logs from our ELK platform. I don't think they are on the servers anymore; they have been rotated. We use the logsearch distribution, and loggregator with the metron agent. I got the logs by filtering on jobname & component. I cannot rule out that some logs might have been dropped.
The BBS logs were indeed filtered, to remove all the stuff not related to locks or start/stop. The query was like: NOT @message:bbs.request.* AND NOT @message:bbs.converge-lrps* AND NOT @message:bbs.executing-convergence* AND NOT @message:bbs.task-completion-workpool* AND NOT @message:bbs.actual-lrp-handler* AND @message:bbs*
Were you doing a deployment around 19:40 on Sept 14th (the timestamp of the consul server logs)? It's interesting that so many nodes were unreachable by consul during that time.
No, nothing AFAIK. No tasks ran on bosh at that time.
I will have a look in the AWS console and come back later.
In your deployment manifest, do the static IPs listed for the consul cluster match across VMs that list them? We've seen a similar problem before when some nodes were missing one of the three consul cluster IPs.
Yes, all the consul IPs are set up everywhere. We deploy consul agents on all the systems with the same config, including the three different IPs of the masters.
Also, we believe it's unrelated here, but in general, we recommend deploying an odd number of database (BBS) VMs if ETCD is your backend.
We deploy etcd on three dedicated servers, one in each AZ, so the etcd cluster is fine.
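A sketch of how the dedicated etcd cluster's health could be double-checked (the DNS name and port here are assumptions for this deployment, and a TLS-enabled etcd would need client certs):

```sh
# etcd v2 API: list the cluster members and check the health endpoint.
curl -s http://etcd.service.cf.internal:4001/v2/members
curl -s http://etcd.service.cf.internal:4001/health
echo
```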
Were you doing a deployment around 19:40 on Sept 14th (the timestamp of the consul server logs)? It's interesting that so many nodes were unreachable by consul during that time.
No, nothing AFAIK. No tasks ran on bosh at that time. I will have a look in the AWS console and come back later.
I cannot find anything at all in the AWS console or on the servers. No interface errors. Only consul going nuts.
I think the most confusing thing about the bbs logs is that the bbs job with index 0 doesn't have any log lines after September 11th 2016, 09:05:12. It is very hard to know what was going on on that bbs when the consul cluster had no leader. Is it possible to get the full logs?
Attaching all the logs from the "database" (bbs) VMs from 2016-09-14 19:35 to 19:45:
database-all.log.gz
I got more information. The database/0 VM stopped sending log messages on Sept 14 at 19:40, but it came back on Sept 17th at 17:19, even though the VM was up the whole time. This correlates with the CPU usage.
In the ELK stack I don't have too many logs and I suspect that some logs might have been dropped. But I was able to recover the logs from the bbs on database.0, and from the consul_agent on database.0 and database.1; please find them attached.
You can observe that on the 17th at 17:18 there is a suspicious message, which might mean that database.0 registered back as a service:
2016/09/17 17:18:28 [INFO] agent: Synced service 'bbs'
2016/09/17 17:18:28 [INFO] agent: Synced check 'service:bbs'
database.1.consul_agent.stdout.log.gz database.0.consul_agent.stdout.log.gz database.0.bbs.stdout.log.gz
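One way to build a timeline of the (de)registrations from those attachments, as a sketch:

```sh
# Pull the agent's "Synced service/check" lines out of the attached consul_agent logs.
for f in database.0.consul_agent.stdout.log.gz database.1.consul_agent.stdout.log.gz; do
  echo "== $f"
  zgrep -E "agent: Synced (service|check)" "$f"
done
```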
Adding logs from route_emitter/1. The route-emitter.watcher.handling-actual-create.complete message was not being logged from 09/19 10:57:39.51 to 15:35:03.05.
route_emitter.0.route_emitter.stdout.log.gz
All the raw logs from the BBS hosts are here: https://s3-eu-west-1.amazonaws.com/hector-state/database.0.bbs.stdout.log.gz https://s3-eu-west-1.amazonaws.com/hector-state/database.1.bbs.stdout.log.gz
Note: bbs.service.cf.internal was resolving to two entries between (at least) 09/19 10:57 and 15:35:
bbs.service.cf.internal has address 10.0.17.5
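A small sketch of how the duration of the double entry can be recorded, by polling the service DNS name (the 30-second interval is arbitrary):

```sh
# Log the set of addresses returned for the BBS service name every 30 seconds.
while true; do
  echo "$(date -u +%FT%TZ) $(dig +short bbs.service.cf.internal | sort | paste -sd, -)"
  sleep 30
done
```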
@keymon I'm not sure what to do with these logs. It is really hard to make sense of them. I'm trying to find a timestamp in the logs where the two bbs instances were responding to requests or thought they had the lock, but it is really difficult. bbs.log.gz makes it even more confusing, since the last log line that appears from job database index 0 is on 2016/09/07, but the raw logs have lock.started on the following dates:
2016/09/14
2016/09/17
This makes me believe that the bbs logs provided earlier were aggressively filtered. Can you please provide the full bbs logs from when you saw multiple bbs job IPs in consul's DNS response? Another question: are the timestamps mentioned in the previous comment in UTC or another timezone?
I'm also curious to know if you have the property consul.agent.dns_config.allow_stale set to true for the consul agents. If yes, what's the value of consul.agent.dns_config.max_stale?
No, we don't have the stale setting enabled.
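(For completeness, one way to confirm this on a running agent, as a sketch, assuming the agent reports its DNS settings as part of the /v1/agent/self output:)

```sh
# Query the local consul agent for its own configuration and pull out the DNS settings.
curl -s http://127.0.0.1:8500/v1/agent/self | grep -io '"dnsconfig":{[^}]*}'
```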
About the logs: in comment https://github.com/cloudfoundry/diego-release/issues/201#issuecomment-248671156 you can find all the raw logs we have.
I cannot rule out that some of the earlier logs were dropped.
Combined, sorted, and formatted all the lager log lines into one file, combined-sorted-formatted.log, in hopes of providing better readability. View it formatted with:
cat combined-sorted-formatted.log | less -r
Also stripped out the lager log lines from the consul_agent stdout from database-0 and database-1.
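For anyone repeating this, a sketch of how such a combined file can be produced, given that lager lines are JSON objects starting with a numeric "timestamp" field:

```sh
# Merge the lager (JSON) lines from the attached logs and sort them on the timestamp field.
zcat database.0.bbs.stdout.log.gz database.1.bbs.stdout.log.gz \
     database.0.consul_agent.stdout.log.gz database.1.consul_agent.stdout.log.gz \
  | grep '^{"timestamp"' \
  | sort -t'"' -k4,4n \
  > combined-sorted.log
```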
@keymon Unfortunately, we are unable to make much progress diagnosing this issue due to the gaps in the logs and the lack of information about what exactly caused it. As a result we'll be closing this out for now, but please let us know if you encounter it again or determine reproducible steps so we can investigate more on our end. If you do see it again, try to dump as many logs as you can as soon as possible, while they're still accessible. In the meantime we'll try to force a reproduction of this issue in a test environment.
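If it recurs, a sketch for grabbing the relevant logs before they rotate (BOSH v1 CLI syntax, with the job names of this deployment):

```sh
# Fetch logs from the BBS (database) and route-emitter VMs for both indices.
for i in 0 1; do
  bosh logs database "$i"
  bosh logs route_emitter "$i"
done

# And from the consul servers.
for i in 0 1 2; do
  bosh logs consul "$i"
done
```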
Thanks again, @keymon. I've prioritized a separate story in the Diego backlog for the team to attempt to reproduce this situation independently: https://www.pivotaltracker.com/story/show/131822105.
Best, Eric
Summary
We had a situation in our production Cloud Foundry where two different BBS instances each thought it was the leader. Both BBS instances were registered in consul, which returned two IPs for bbs.service.cf.internal. The issue seems to happen after one of the consul nodes (in a cluster of three) is marked as failed.
Environment description
route-emitter = route_emitter/{0,1}, bbs = database/{0,1}, consul = consul/{0,1,2}
Observed symptoms
Due to this situation, the route-emitter was connecting to the wrong BBS server and not receiving the models.ActualLRPCreatedEvent events. This delayed the creation of routes to new instances by up to 60 seconds, until the periodic sync+emit of routes ran.
Checking the connections of the route-emitter, we found that there were connections to two different BBS instances. This should not happen, as only one should be active.
The consul service DNS responded with two IPs (see the outputs above).
Logs and events:
You can find attached all the logs related to locks from the two BBS services in bbs.log. The issue starts around September 14th 2016, 19:41.
In consul.log there are the logs related to the consul servers around September 14th 2016, 19:41.
In consul_agent_bbs.log there are the logs from the consul agent running on the BBS VMs. Some interesting messages appear there.
bbs.log.gz consul_agent_bbs.log.gz consul.log.gz