We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/156085504
The labels on this github issue will be updated when the story is started.
Hi @prashantgnet,
It looks like you may be having connectivity issues between your bbs job and the database nodes. This could be attributed to IaaS network issues.
In order to further diagnose this, we could use some database and diego-api logs downloaded at the same time so we can correlate the failures with each other. It looks like the logs you gave us for the diego-api and database nodes were downloaded at different times (given the timestamps in the logs); is this correct?
At this moment we suspect that something may be misconfigured in OpenStack (possibly network related). Is this a new cf-deployment? If it is not, were you able to push apps to it before?
Thanks, Nitya & Al
Hi @APShirley
I will check with our infrastructure support team for connectivity or networking issues, and I will share with them the log files for the time range when the connection was actually lost. We have been using this OpenStack-based environment for the last couple of years, so I really don't suspect any connectivity or network issues. And if there were, why would they affect only the bbs and database instances every time, and not the other instances in the CF environment?
I have been facing this issue consistently, only for BBS and the MySQL database, for the last 15 days. I have done multiple fresh deployments of the cf-deployment environment, starting from v1.15.0; right now I am using v1.19.0 with a single-zone configuration, and to avoid any further connectivity issues I am using a single MySQL instance instead of a cluster.
For the last five hours I have been pushing apps to the CF environment to reproduce the mentioned issue, and I will share the exact log files with you once I am able to reproduce it.
This time, I restarted the monit jobs on the database, diego-api and scheduler instances after cleaning up the old log files, so we can be sure the log files contain exactly the relevant information.
Right now, I am using the following performance / fine-tuning parameters for MySQL, which I got from https://github.com/cloudfoundry/cf-deployment/issues/425. Just let me know in case these values are not feasible.
# MySQL performance improvement parameters
- type: replace
path: /instance_groups/name=database/jobs/name=mysql/properties/cf_mysql/mysql/innodb_buffer_pool_size?
value: 24_000_000_000
- type: replace
path: /instance_groups/name=database/jobs/name=mysql/properties/cf_mysql/mysql/innodb_buffer_pool_instances?
value: 32
- type: replace
path: /instance_groups/name=database/jobs/name=mysql/properties/cf_mysql/mysql/innodb_flush_log_at_trx_commit?
value: 2
- type: replace
path: /instance_groups/name=database/jobs/name=mysql/properties/cf_mysql/mysql/binlog_enabled?
value: false
- type: replace
path: /instance_groups/name=database/jobs/name=mysql/properties/cf_mysql/mysql/max_connections?
value: 10000
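As a sanity check, the effect of these settings can be read back from the running server. The following is only a sketch in Go; the DSN (user, password, host) is a placeholder and not this deployment's real credentials.

```go
// check_mysql_vars.go: reads back a few of the tuned server variables so
// the ops file's effect can be confirmed. The DSN is a placeholder.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "admin:PASSWORD@tcp(192.168.15.15:3306)/")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	vars := []string{
		"innodb_buffer_pool_size",
		"innodb_buffer_pool_instances",
		"innodb_flush_log_at_trx_commit",
		"max_connections",
	}
	for _, name := range vars {
		var value string
		// Global system variables can be read directly as @@GLOBAL.<name>.
		if err := db.QueryRow("SELECT @@GLOBAL." + name).Scan(&value); err != nil {
			log.Fatalf("reading %s: %v", name, err)
		}
		fmt.Printf("%s = %s\n", name, value)
	}
}
```

Note that an innodb_buffer_pool_size of 24 GB only makes sense if the database VM actually has that much memory available for MySQL.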
cf-deployment instance information, v1.19.0
Instance Process State AZ IPs
adapter/fa078a5e-b473-47d5-bc71-6d818f692ee4 running z1 192.168.15.14
api/003c8a20-e0f7-40f7-83d3-b15207d58736 running z1 192.168.15.19
cc-worker/3e27d15b-d4e7-4ac9-9a37-d8dd35d6660f running z1 192.168.15.20
consul/e76a366a-a7b6-4781-8968-1a198b8c2446 running z1 192.168.15.12
database/8f7a790c-a781-4f38-a53b-5c510a398cb0 running z1 192.168.15.15
diego-api/adb112a1-db04-4457-a59b-56ac1b3628c1 running z1 192.168.15.16
diego-cell/144a2d62-3741-4d91-8695-226a55fa670e running z1 192.168.15.28
diego-cell/237013ca-9abd-48d4-9c7a-c4617ae65dc5 running z1 192.168.15.29
diego-cell/38ac3525-07a3-4b87-a66c-68c4bc1661c4 running z1 192.168.15.30
diego-cell/54a88371-faf0-4950-b806-b5cdbdc7a287 running z1 192.168.15.31
diego-cell/7affcd79-6730-480a-82de-37e8d6de1e90 running z1 192.168.15.32
diego-cell/8057b1fb-6350-44cc-b9ea-170d7a572711 running z1 192.168.15.25
diego-cell/88e09136-f050-42f7-be8e-2edd584b31cf running z1 192.168.15.33
diego-cell/ab55f241-3223-407d-adba-8a75e8348488 running z1 192.168.15.34
diego-cell/cc807e44-0b6f-4096-a3c3-d18075727665 running z1 192.168.15.35
diego-cell/ff3305ad-35b5-41c4-8e56-739f8a508710 running z1 192.168.15.27
doppler/992ee3df-d2d0-4b76-9a56-f2e7100c2000 running z1 192.168.15.24
log-api/c1b1b63d-4499-431a-a4c4-b44dd8c6edce running z1 192.168.15.26
nats/95521e92-6651-47a5-a888-cada160174f3 running z1 192.168.15.13
router/d7001baa-16cf-4a80-b2e3-24d6aab62013 running z1 192.168.15.21, 10.10x.xx.xx
scheduler/b6dfc166-f0a9-411e-a463-d71e71fe1d61 running z1 192.168.15.23
singleton-blobstore/9c381ac6-28e0-4d41-9689-75eb7a59c89f running z1 192.168.15.18
tcp-router/bf45cfbf-df63-43de-bd8a-4e2ed3b7768c running z1 192.168.15.22
uaa/1f06ee00-6f7c-4809-926a-c26e9188f83f running z1 192.168.15.17
Release details
binary-buildpack 1.0.17*
capi 1.50.0*
cf-mysql 36.11.0*
cf-networking 1.11.0*
cf-smoke-tests 40*
cf-syslog-drain 6.0*
cflinuxfs2 1.190.0*
consul 192*
diego 2.1.0*
dotnet-core-buildpack 2.0.4*
garden-runc 1.12.0*
go-buildpack 1.8.20*
java-buildpack 4.9*
loggregator 102*
nats 23*
nodejs-buildpack 1.6.20*
php-buildpack 4.3.51*
python-buildpack 1.6.11*
routing 0.174.0*
ruby-buildpack 1.7.15*
staticfile-buildpack 1.4.24*
statsd-injector 1.2.0*
uaa 56*
Thanks, Prashant
Hey @prashantgnet. Looking at the logs from the database nodes we can explain the problem, although we do not know the exact root cause.
We saw that the proxies on two of the nodes (192.168.16.12 and 192.168.17.12) failed to get the database status from 192.168.16.12 and switched their backends to 192.168.17.12. This only lasted for a few seconds before they re-established contact with 192.168.16.12.
The third node (192.168.15.15) did not have a problem getting the status from 192.168.16.12, so it continued to route traffic to 192.168.16.12. Diego is particularly sensitive to different proxies routing traffic to different nodes because of the way its components try to acquire locks. It looks like during these few seconds when the proxies were pointing at different backends, Diego had some lock errors, as its components were talking to two different backends at the same time.
So this explains the behavior that was seen, but we don't know why the two nodes had trouble getting the status of 192.168.16.12. It could have been that the process was unresponsive due to high load on the IaaS. It seems unlikely to have been a networking issue, since the proxy running on 192.168.16.12 was one of the ones detecting the problem, so it should have been communicating only with itself.
Running with a single database node should prevent this problem since there is no chance of Diego talking to multiple backends at the same time, but your deployment will no longer be HA at this point either.
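To illustrate why split backends cause lock errors, here is a simplified sketch of a compare-and-swap style lock refresh against a SQL backend. This is not Locket's actual schema or queries, just an illustration of the pattern: if the proxy silently routes you to a backend that recorded a different owner (or never saw your acquisition), the refresh matches zero rows and fails.

```go
// Simplified compare-and-swap lock refresh; illustrative only, not
// Locket's real implementation.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

// refreshLock bumps the lock's index only if this owner still holds it on
// the backend we happen to be routed to. A backend switch mid-session makes
// this fail even though the lock is still "ours" on the other backend.
func refreshLock(db *sql.DB, key, owner string, lastIndex int64) error {
	res, err := db.Exec(
		"UPDATE locks SET modified_index = modified_index + 1 WHERE lock_key = ? AND owner = ? AND modified_index = ?",
		key, owner, lastIndex,
	)
	if err != nil {
		return err
	}
	n, err := res.RowsAffected()
	if err != nil {
		return err
	}
	if n == 0 {
		return fmt.Errorf("lost lock %q: owner or index changed on this backend", key)
	}
	return nil
}

func main() {
	// Placeholder DSN; in a real test this would point at the proxy address.
	db, err := sql.Open("mysql", "locket:PASSWORD@tcp(192.168.15.15:3306)/locket")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if err := refreshLock(db, "auctioneer", "owner-guid", 352); err != nil {
		log.Println(err)
	}
}
```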
Hi,
The environment is a single zone with a single MySQL instance, using cf-deployment v1.19.0.
Here are more details on the mentioned issue. I will share all the log files for the same time slot for easier debugging.
The bbs job on diego-api shows the following error messages under /var/vcap/monit/monit.log; it restarted around 7 times in the last 24 hours.
[UTC Mar 20 14:40:01] error : 'bbs' process is not running
[UTC Mar 20 14:40:02] info : 'bbs' trying to restart
[UTC Mar 20 14:40:02] info : 'bbs' start: /var/vcap/jobs/bbs/bin/bbs_ctl
[UTC Mar 20 14:40:13] info : 'bbs' process is running with pid 29349
[UTC Mar 20 17:23:32] error : 'bbs' process is not running
[UTC Mar 20 17:23:32] info : 'bbs' trying to restart
[UTC Mar 20 17:23:32] info : 'bbs' start: /var/vcap/jobs/bbs/bin/bbs_ctl
[UTC Mar 20 17:23:43] info : 'bbs' process is running with pid 4322
[UTC Mar 20 18:08:55] error : 'bbs' process is not running
[UTC Mar 20 18:08:55] info : 'bbs' trying to restart
[UTC Mar 20 18:08:55] info : 'bbs' start: /var/vcap/jobs/bbs/bin/bbs_ctl
[UTC Mar 20 18:09:06] info : 'bbs' process is running with pid 6355
[UTC Mar 20 20:43:55] error : 'bbs' process is not running
[UTC Mar 20 20:43:55] info : 'bbs' trying to restart
[UTC Mar 20 20:43:55] info : 'bbs' start: /var/vcap/jobs/bbs/bin/bbs_ctl
[UTC Mar 20 20:44:06] info : 'bbs' process is running with pid 13117
[UTC Mar 20 21:08:48] error : 'bbs' process is not running
[UTC Mar 20 21:08:48] info : 'bbs' trying to restart
[UTC Mar 20 21:08:48] info : 'bbs' start: /var/vcap/jobs/bbs/bin/bbs_ctl
[UTC Mar 20 21:08:59] info : 'bbs' process is running with pid 14270
[UTC Mar 20 21:15:29] error : 'bbs' process is not running
[UTC Mar 20 21:15:29] info : 'bbs' trying to restart
[UTC Mar 20 21:15:29] info : 'bbs' start: /var/vcap/jobs/bbs/bin/bbs_ctl
[UTC Mar 20 21:15:40] info : 'bbs' process is running with pid 14633
[UTC Mar 21 01:00:53] error : 'bbs' process is not running
[UTC Mar 21 01:00:53] info : 'bbs' trying to restart
[UTC Mar 21 01:00:53] info : 'bbs' start: /var/vcap/jobs/bbs/bin/bbs_ctl
[UTC Mar 21 01:01:04] info : 'bbs' process is running with pid 24474
[UTC Mar 21 03:37:42] error : 'bbs' process is not running
[UTC Mar 21 03:37:42] info : 'bbs' trying to restart
[UTC Mar 21 03:37:42] info : 'bbs' start: /var/vcap/jobs/bbs/bin/bbs_ctl
[UTC Mar 21 03:37:53] info : 'bbs' process is running with pid 31315
/var/vcap/sys/log/bbs/bbs.stderr.log from the diego-api instance shows the following error messages at the same time as the log above, i.e. around 18:08:55, when monit reports the bbs process is not running.
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:54 connection.go:96: invalid connection
[mysql] 2018/03/20 18:08:54 connection.go:96: invalid connection
[mysql] 2018/03/20 18:08:54 connection.go:96: invalid connection
[mysql] 2018/03/20 18:08:54 connection.go:96: invalid connection
[mysql] 2018/03/20 18:08:54 connection.go:96: invalid connection
[mysql] 2018/03/20 18:08:54 connection.go:96: invalid connection
[mysql] 2018/03/20 18:08:54 connection.go:96: invalid connection
[mysql] 2018/03/20 18:08:54 connection.go:96: invalid connection
[mysql] 2018/03/20 18:08:54 connection.go:96: invalid connection
[mysql] 2018/03/20 18:08:54 connection.go:96: invalid connection
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
2018/03/20 18:08:54 http: multiple response.WriteHeader calls
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
2018/03/20 18:08:54 http: multiple response.WriteHeader calls
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
2018/03/20 18:08:54 http: multiple response.WriteHeader calls
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
2018/03/20 18:08:54 http: multiple response.WriteHeader calls
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
2018/03/20 18:08:54 http: multiple response.WriteHeader calls
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
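For context, the "packets.go: unexpected EOF" and "connection.go: invalid connection" lines are emitted by the Go MySQL driver (github.com/go-sql-driver/mysql) that bbs uses when an established connection is broken underneath it; the burst at 18:08:53-54 is consistent with every pooled connection to the database dropping at once. A minimal out-of-band probe run on the diego-api VM could help correlate such drops independently of bbs; the sketch below is only illustrative and the DSN is a placeholder, not this deployment's credentials.

```go
// dbprobe.go: pings the MySQL endpoint once a second and logs any failure
// with a timestamp, so connection drops can be correlated independently of
// bbs. The DSN is a placeholder.
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "probe:PASSWORD@tcp(192.168.15.15:3306)/?timeout=5s&readTimeout=5s")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	// Recycle connections regularly so each ping exercises the network path.
	db.SetConnMaxLifetime(time.Minute)

	for {
		start := time.Now()
		if err := db.Ping(); err != nil {
			log.Printf("ping failed after %v: %v", time.Since(start), err)
		}
		time.Sleep(time.Second)
	}
}
```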
There are multiple error messages under /var/vcap/sys/log/bbs/bbs.stdout.log on the diego-api instance. Please check the attached zip file for more information.
/var/vcap/sys/log/locket/locket.stderr.log from the diego-api instance shows the following error messages; the timing is the same.
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
2018/03/20 18:08:54 grpc: Server.processUnaryRPC failed to write status: stream error: code = 4 desc = "context deadline exceeded"
/var/vcap/sys/log/locket/locket.stdout.log from the diego-api instance shows the following error messages; I am not sure about the timing.
{"timestamp":"1521520927.388530254","source":"locket","message":"locket.lock.register-ttl.failed-compare-and-release","log_level":2,"data":{"error":"rpc error: code = 5 desc = resource-not-found","key":"tps_watcher","modified-index":42,"session":"312.2","type":"lock"}}
{"timestamp":"1521520927.478413343","source":"locket","message":"locket.lock.register-ttl.failed-compare-and-release","log_level":2,"data":{"error":"rpc error: code = 5 desc = resource-not-found","key":"auctioneer","modified-index":352,"session":"314.2","type":"lock"}}
The metron agent shows the following message for the same time on the diego-api instance:
2018/03/20 18:08:54 Failed to receive data: rpc error: code = Canceled desc = context canceled
/var/vcap/sys/log/silk-controller/silk-controller.stderr.log from the diego-api instance shows the following:
[mysql] 2018/03/20 18:08:48 packets.go:36: read tcp 192.168.15.16:36998->192.168.15.15:3306: i/o timeout
[mysql] 2018/03/20 18:08:48 packets.go:36: read tcp 192.168.15.16:36992->192.168.15.15:3306: i/o timeout
[mysql] 2018/03/20 18:08:48 packets.go:36: read tcp 192.168.15.16:36990->192.168.15.15:3306: i/o timeout
[mysql] 2018/03/20 18:08:48 packets.go:36: read tcp 192.168.15.16:36994->192.168.15.15:3306: i/o timeout
[mysql] 2018/03/20 18:08:48 packets.go:36: read tcp 192.168.15.16:36996->192.168.15.15:3306: i/o timeout
[mysql] 2018/03/20 18:08:48 packets.go:36: read tcp 192.168.15.16:36998->192.168.15.15:3306: i/o timeout
[mysql] 2018/03/20 18:08:48 packets.go:36: read tcp 192.168.15.16:37000->192.168.15.15:3306: i/o timeout
[mysql] 2018/03/20 18:08:53 packets.go:36: read tcp 192.168.15.16:37014->192.168.15.15:3306: i/o timeout
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
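The i/o timeouts above show TCP reads from the diego-api VM (192.168.15.16) to the database VM (192.168.15.15:3306) timing out. A raw TCP-level probe can help separate an unreachable network path from a server or proxy that accepts connections but stops responding; this is only a sketch, with the address hard-coded as an assumption.

```go
// tcpprobe.go: dials the MySQL port and waits for the server's initial
// handshake bytes, distinguishing "cannot connect at all" from
// "connected but the server or proxy is not responding".
package main

import (
	"log"
	"net"
	"time"
)

func main() {
	addr := "192.168.15.15:3306"
	for {
		start := time.Now()
		conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
		if err != nil {
			log.Printf("dial %s failed after %v: %v", addr, time.Since(start), err)
			time.Sleep(time.Second)
			continue
		}
		// MySQL sends its handshake packet first; reading nothing within the
		// deadline means the TCP path is up but the server is not answering.
		conn.SetReadDeadline(time.Now().Add(5 * time.Second))
		buf := make([]byte, 1)
		if _, err := conn.Read(buf); err != nil {
			log.Printf("connected to %s but no handshake within 5s: %v", addr, err)
		} else {
			log.Printf("ok: %s responded in %v", addr, time.Since(start))
		}
		conn.Close()
		time.Sleep(time.Second)
	}
}
```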
As such, there are no critical error messages on the database instance.
The following message appears on the api instance for the cloud_controller_ng job, under cloud_controller_worker_ctl.err.log:
[2018-03-20 18:08:54+0000] Sequel::DatabaseDisconnectError: Mysql2::Error: Lost connection to MySQL server during query
The following message appears on the cc-worker instance for the cloud_controller_worker job:
{"timestamp":1521569333.571033,"message":"2018-03-20T18:08:53+0000: [Worker(cc_global_worker.cc-worker.0.1)] Error while reserving job: Mysql2::Error: Lost connection to MySQL server during query","log_level":"info","source":"cc-worker","data":{},"thread_id":47087408492380,"fiber_id":47087433721400,"process_id":11362,"file":"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.4.0/gems/delayed_job-4.1.3/lib/delayed/worker.rb","lineno":285,"method":"say"}
Lock collision messages for the tps job on the scheduler instance:
{"timestamp":"1521556797.335353374","source":"tps-watcher","message":"tps-watcher.locket-lock.failed-to-acquire-lock","log_level":2,"data":{"error":"rpc error: code = AlreadyExists desc = lock-collision","lock":{"key":"tps_watcher","owner":"5db0c889-7346-48e3-5843-a8681e9cbb8c","type":"lock","type_code":1},"session":"2","ttl_in_seconds":15}}
Please check the attached zip file, which contains logs from all instances except api and router.
I am validating the cf-deployment environment using a single network and a single router with a single-interface configuration, just to make sure there is no bottleneck on the IaaS side.
Looking at the database proxy logs, it looks again like the proxy lost the connection to the database. This is surprising since they are co-located on the same VM. It seems like there is some problem going on in the IaaS; we are unsure what it could be, though.
The fact that many other VMs had problems at the exact same time adds evidence to the theory that there was an IAAS level event at that point in time.
Thanks for the update. I am working with our IaaS support team and will update you.
I am not able to reproduce the mentioned issue for a new tenant under OpenStack, so I am closing this issue. Thanks a lot for debugging and providing valuable inputs. Thanks ~ Prashant
I have set up a cf-deployment-based Cloud Foundry environment under OpenStack, which by default comes with a MySQL database instance. I am getting the following error messages under diego-api's bbs job for the MySQL database.
Under bbs.stderr.log
Under bbs.stdout.log
I have already opened https://github.com/cloudfoundry/cf-deployment/issues/425 against cf-deployment for the mentioned issue, and they asked me to raise this particular issue against the MySQL release. You may check more details there.
Attaching logs from diego-api, the database cluster (3 instances) and the scheduler:
logs.zip
Kindly let me know in case any more details are required from me.
Thanks, Prashant