cloudfoundry / cf-mysql-release

Cloud Foundry MySQL Release
Apache License 2.0

"packets.go:36: unexpected EOF" errors for CF deployment's MySQL instance #203

Closed prashantgnet closed 6 years ago

prashantgnet commented 6 years ago

I have set up a cf-deployment-based Cloud Foundry environment on OpenStack, which by default comes with a MySQL database instance. I am getting the following error messages from the diego-api instance's bbs job for the MySQL database.

Under bbs.stderr.log

goroutine 1 [running]:
code.cloudfoundry.org/lager.(*logger).Fatal(0xc42008c480, 0xb8e7f5, 0x26, 0xbf1fc0, 0xc420071740, 0x0, 0x0, 0x0)
        /var/vcap/packages/bbs/src/code.cloudfoundry.org/lager/logger.go:154 +0x504
main.main()
        /var/vcap/packages/bbs/src/code.cloudfoundry.org/bbs/cmd/bbs/main.go:159 +0x3c01
[mysql] 2018/03/19 04:06:33 packets.go:36: unexpected EOF
panic: invalid connection

goroutine 1 [running]:
code.cloudfoundry.org/lager.(*logger).Fatal(0xc420226000, 0xb8e7f5, 0x26, 0xbf1fc0, 0xc4201c2110, 0x0, 0x0, 0x0)
        /var/vcap/packages/bbs/src/code.cloudfoundry.org/lager/logger.go:154 +0x504
main.main()
        /var/vcap/packages/bbs/src/code.cloudfoundry.org/bbs/cmd/bbs/main.go:159 +0x3c01
[mysql] 2018/03/19 04:06:48 packets.go:36: unexpected EOF
panic: invalid connection

goroutine 1 [running]:
code.cloudfoundry.org/lager.(*logger).Fatal(0xc42024c000, 0xb8e7f5, 0x26, 0xbf1fc0, 0xc4201f0120, 0x0, 0x0, 0x0)
        /var/vcap/packages/bbs/src/code.cloudfoundry.org/lager/logger.go:154 +0x504
main.main()
        /var/vcap/packages/bbs/src/code.cloudfoundry.org/bbs/cmd/bbs/main.go:159 +0x3c01
2018/03/19 04:09:50 http: multiple response.WriteHeader calls
2018/03/19 04:09:50 http: multiple response.WriteHeader calls
[mysql] 2018/03/19 06:37:03 packets.go:36: unexpected EOF
[mysql] 2018/03/19 06:37:03 packets.go:36: unexpected EOF
[mysql] 2018/03/19 06:37:03 packets.go:36: unexpected EOF
[mysql] 2018/03/19 06:37:03 packets.go:36: unexpected EOF
[mysql] 2018/03/19 06:37:03 packets.go:36: unexpected EOF
[mysql] 2018/03/19 06:37:04 connection.go:96: invalid connection
[mysql] 2018/03/19 06:37:04 connection.go:96: invalid connection
[mysql] 2018/03/19 06:37:04 connection.go:96: invalid connection
[mysql] 2018/03/19 06:37:04 connection.go:96: invalid connection
[mysql] 2018/03/19 06:37:04 connection.go:96: invalid connection
[mysql] 2018/03/19 06:37:04 packets.go:36: unexpected EOF
[mysql] 2018/03/19 06:37:04 packets.go:36: unexpected EOF
[mysql] 2018/03/19 06:37:08 packets.go:36: unexpected EOF
[mysql] 2018/03/19 06:37:08 connection.go:96: invalid connection

Under bbs.stdout.log

{"timestamp":"1521180205.880357027","source":"bbs","message":"bbs.exited-with-failure","log_level":2,"data":{"error":"Exit trace for group:\nlock exited with error: Exit trace for group:\nsql-lock exited with error: rpc error: code = 4 desc = context deadline exceeded\nlock-maintainer exited with nil\n\nregistration-runner exited with nil\nconverger exited with nil\nperiodic-metrics exited with nil\nbbs-election-metrics exited with nil\nhub-maintainer exited with nil\nencryptor exited with nil\nmigration-manager exited with nil\nserver exited with nil\nworkpool exited with nil\nset-lock-held-metrics exited with nil\nlock-held-metrics exited with nil\nperiodic-filedescriptor-metrics exited with nil\ntask-stat-metron-notifier exited with nil\nhealthcheck exited with nil\ndebug-server exited with nil\n"}}

I have already opened https://github.com/cloudfoundry/cf-deployment/issues/425 against cf-deployment for the same problem, and they asked me to raise this particular issue against the MySQL release. You can find more details there.

Attaching logs from the diego-api instance, the database cluster (3 instances), and the scheduler:

logs.zip

Kindly let me know if any more details are required from me.

Thanks, Prashant

cf-gitbot commented 6 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/156085504

The labels on this github issue will be updated when the story is started.

APShirley commented 6 years ago

Hi @prashantgnet,

It looks like you may be having connectivity issues between your bbs job and the database nodes. This could be attributed to IaaS network issues. To diagnose this further, we could use database and diego-api logs downloaded at the same time, so we can correlate the failures with each other. It looks like the logs you gave us for the diego-api and the database nodes were downloaded at different times (given the timestamps in the logs). Is this correct?

At this moment we suspect that something may be misconfigured in OpenStack (possibly network related). Is this a new cf-deployment? If it is not, were you able to push apps to it before?
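
To help separate an IaaS/network problem from the BBS itself, a standalone probe along the following lines could be run from the diego-api VM. This is only a minimal sketch, not part of any CF component: USER and PASSWORD are placeholders, and 192.168.15.15:3306 is the database instance address that appears elsewhere in this thread. If this probe also sees its connections die, the failure is below the BBS.

package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Explicit timeouts so a dead link surfaces quickly. On a severed
	// connection it is this driver that logs "packets.go:36: unexpected EOF"
	// and "connection.go:96: invalid connection", as seen in bbs.stderr.log.
	dsn := "USER:PASSWORD@tcp(192.168.15.15:3306)/?timeout=5s&readTimeout=5s&writeTimeout=5s"
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		log.Fatalf("bad DSN: %v", err)
	}
	defer db.Close()

	for {
		if err := db.Ping(); err != nil {
			log.Printf("ping failed: %v", err)
		}
		time.Sleep(time.Second)
	}
}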

Thanks, Nitya & Al

prashantgnet commented 6 years ago

Hi @APShirley

I will check with our infrastructure support team for connectivity or networking issues, and I will share with them the log files for the time range when the connection is actually lost. We have been using this OpenStack-based environment for the last couple of years, so I really don't expect any connectivity or network issues. And if there were any, why would they hit only the bbs and database instances every time, and not the other instances in the CF environment?

I have been facing this issue consistently, and only between the BBS and the MySQL database, for the last 15 days. I have done multiple fresh deployments of the cf-deployment environment starting from v1.15.0; right now I am using v1.19.0 with a single-zone configuration, and to avoid any further connectivity issues I am using a single MySQL instance instead of a cluster.

For the last five hours I have been pushing apps to the CF environment to reproduce the mentioned issue, and I will share the exact log files once I am able to reproduce it.

This time I restarted the monit jobs on the database, diego-api, and scheduler instances after cleaning up the old log files, so we can be sure the log files contain exactly the relevant information.

Right now I am using the following MySQL performance/fine-tuning parameters, which I got from https://github.com/cloudfoundry/cf-deployment/issues/425. Just let me know if these values are not feasible.

# MySQL performance improvement parameters
# buffer pool size is in bytes (~24 GB here)
- type: replace
  path: /instance_groups/name=database/jobs/name=mysql/properties/cf_mysql/mysql/innodb_buffer_pool_size?
  value: 24_000_000_000
# split the buffer pool to reduce mutex contention
- type: replace
  path: /instance_groups/name=database/jobs/name=mysql/properties/cf_mysql/mysql/innodb_buffer_pool_instances?
  value: 32
# flush the redo log to the OS on each commit, fsync once per second
# (can lose up to ~1s of transactions on a crash, in exchange for throughput)
- type: replace
  path: /instance_groups/name=database/jobs/name=mysql/properties/cf_mysql/mysql/innodb_flush_log_at_trx_commit?
  value: 2
# disable binary logging
- type: replace
  path: /instance_groups/name=database/jobs/name=mysql/properties/cf_mysql/mysql/binlog_enabled?
  value: false
# raise the connection limit
- type: replace
  path: /instance_groups/name=database/jobs/name=mysql/properties/cf_mysql/mysql/max_connections?
  value: 10000
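
(For context: these are ordinary ops-file operations applied on top of the base manifest, e.g. with something like bosh -d cf deploy cf-deployment.yml -o mysql-tuning.yml, where the file names are placeholders.)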

cf-deployment instance information, v1.19.0:

Instance                                                  Process State  AZ  IPs
adapter/fa078a5e-b473-47d5-bc71-6d818f692ee4              running        z1  192.168.15.14
api/003c8a20-e0f7-40f7-83d3-b15207d58736                  running        z1  192.168.15.19
cc-worker/3e27d15b-d4e7-4ac9-9a37-d8dd35d6660f            running        z1  192.168.15.20
consul/e76a366a-a7b6-4781-8968-1a198b8c2446               running        z1  192.168.15.12
database/8f7a790c-a781-4f38-a53b-5c510a398cb0             running        z1  192.168.15.15
diego-api/adb112a1-db04-4457-a59b-56ac1b3628c1            running        z1  192.168.15.16
diego-cell/144a2d62-3741-4d91-8695-226a55fa670e           running        z1  192.168.15.28
diego-cell/237013ca-9abd-48d4-9c7a-c4617ae65dc5           running        z1  192.168.15.29
diego-cell/38ac3525-07a3-4b87-a66c-68c4bc1661c4           running        z1  192.168.15.30
diego-cell/54a88371-faf0-4950-b806-b5cdbdc7a287           running        z1  192.168.15.31
diego-cell/7affcd79-6730-480a-82de-37e8d6de1e90           running        z1  192.168.15.32
diego-cell/8057b1fb-6350-44cc-b9ea-170d7a572711           running        z1  192.168.15.25
diego-cell/88e09136-f050-42f7-be8e-2edd584b31cf           running        z1  192.168.15.33
diego-cell/ab55f241-3223-407d-adba-8a75e8348488           running        z1  192.168.15.34
diego-cell/cc807e44-0b6f-4096-a3c3-d18075727665           running        z1  192.168.15.35
diego-cell/ff3305ad-35b5-41c4-8e56-739f8a508710           running        z1  192.168.15.27
doppler/992ee3df-d2d0-4b76-9a56-f2e7100c2000              running        z1  192.168.15.24
log-api/c1b1b63d-4499-431a-a4c4-b44dd8c6edce              running        z1  192.168.15.26
nats/95521e92-6651-47a5-a888-cada160174f3                 running        z1  192.168.15.13
router/d7001baa-16cf-4a80-b2e3-24d6aab62013               running        z1  192.168.15.21
                                                                             10.10x.xx.xx
scheduler/b6dfc166-f0a9-411e-a463-d71e71fe1d61            running        z1  192.168.15.23
singleton-blobstore/9c381ac6-28e0-4d41-9689-75eb7a59c89f  running        z1  192.168.15.18
tcp-router/bf45cfbf-df63-43de-bd8a-4e2ed3b7768c           running        z1  192.168.15.22
uaa/1f06ee00-6f7c-4809-926a-c26e9188f83f                  running        z1  192.168.15.17

Release details

binary-buildpack       1.0.17* 
capi                   1.50.0* 
cf-mysql               36.11.0*
cf-networking          1.11.0* 
cf-smoke-tests         40*     
cf-syslog-drain        6.0*    
cflinuxfs2             1.190.0*
consul                 192*    
diego                  2.1.0*  
dotnet-core-buildpack  2.0.4*  
garden-runc            1.12.0* 
go-buildpack           1.8.20* 
java-buildpack         4.9*    
loggregator            102*    
nats                   23*     
nodejs-buildpack       1.6.20* 
php-buildpack          4.3.51* 
python-buildpack       1.6.11* 
routing                0.174.0*
ruby-buildpack         1.7.15* 
staticfile-buildpack   1.4.24* 
statsd-injector        1.2.0*  
uaa                    56*     

Thanks, Prashant

ndhanushkodi commented 6 years ago

Hey @prashantgnet. Looking at the logs from the database nodes we can explain the problem, although we do not know the exact root cause.

We saw that the proxies on two of the nodes (192.168.16.12 and 192.168.17.12) failed to get the database status from 192.168.16.12 and switched their backends to 192.168.17.12. This only lasted a few seconds before they re-established contact with 192.168.16.12.

The third node, 192.168.15.15, did not have a problem getting the status from 192.168.16.12, so it continued to route traffic to 192.168.16.12. Diego is particularly sensitive to different proxies routing traffic to different nodes because of the way its components acquire locks. It looks like during the few seconds when the proxies were pointing at different backends, Diego had some lock errors because its components were talking to two different backends at the same time (sketched below).
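
As a rough illustration of that lock mechanism (the table and column names here are hypothetical, inferred from the "key", "owner", and "modified-index" fields in the locket logs later in this thread, not the real locket schema), a lock holder keeps its lock alive with a compare-and-set that is only safe against a single consistent backend:

package main

import (
	"database/sql"
	"errors"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

var errLockLost = errors.New("lock lost")

// refreshLock bumps the lock's index only if this owner still holds it at
// the index it last observed. With every client routed to one backend this
// is safe; with proxies pointing clients at different Galera nodes, two
// owners can each see their update succeed until the cluster reconciles,
// which then surfaces as lock-collision / resource-not-found errors.
func refreshLock(db *sql.DB, key, owner string, lastIndex int64) (int64, error) {
	res, err := db.Exec(
		`UPDATE locks SET modified_index = modified_index + 1
		   WHERE lock_key = ? AND owner = ? AND modified_index = ?`,
		key, owner, lastIndex)
	if err != nil {
		return 0, err
	}
	n, err := res.RowsAffected()
	if err != nil {
		return 0, err
	}
	if n == 0 {
		return 0, errLockLost // another owner took or changed the lock row
	}
	return lastIndex + 1, nil
}

func main() {
	// Placeholders: USER, PASSWORD, DBNAME.
	db, err := sql.Open("mysql", "USER:PASSWORD@tcp(192.168.15.15:3306)/DBNAME")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if _, err := refreshLock(db, "tps_watcher", "some-owner-guid", 42); err != nil {
		log.Printf("refresh failed: %v", err)
	}
}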

So this explains the behavior that was seen, but we don't know why the two nodes had trouble getting the status of 192.168.16.12. It could be that the process was unresponsive due to high load on the IaaS. It seems unlikely to have been a networking issue, since the proxy running on 192.168.16.12 was one of the ones detecting the problem, and it should have been communicating only with itself.

Running with a single database node should prevent this problem since there is no chance of Diego talking to multiple backends at the same time, but your deployment will no longer be HA at this point either.

prashantgnet commented 6 years ago

Hi,

The environment is a single zone with a single MySQL instance, using cf-deployment v1.19.0.

Here are more details on the mentioned issue. This time I am sharing all the log files from the same time slot for easier debugging.

The bbs job on diego-api logs the following in /var/vcap/monit/monit.log; it restarted around 7 times in the last 24 hours.

[UTC Mar 20 14:40:01] error    : 'bbs' process is not running
[UTC Mar 20 14:40:02] info     : 'bbs' trying to restart
[UTC Mar 20 14:40:02] info     : 'bbs' start: /var/vcap/jobs/bbs/bin/bbs_ctl
[UTC Mar 20 14:40:13] info     : 'bbs' process is running with pid 29349
[UTC Mar 20 17:23:32] error    : 'bbs' process is not running
[UTC Mar 20 17:23:32] info     : 'bbs' trying to restart
[UTC Mar 20 17:23:32] info     : 'bbs' start: /var/vcap/jobs/bbs/bin/bbs_ctl
[UTC Mar 20 17:23:43] info     : 'bbs' process is running with pid 4322
[UTC Mar 20 18:08:55] error    : 'bbs' process is not running
[UTC Mar 20 18:08:55] info     : 'bbs' trying to restart
[UTC Mar 20 18:08:55] info     : 'bbs' start: /var/vcap/jobs/bbs/bin/bbs_ctl
[UTC Mar 20 18:09:06] info     : 'bbs' process is running with pid 6355
[UTC Mar 20 20:43:55] error    : 'bbs' process is not running
[UTC Mar 20 20:43:55] info     : 'bbs' trying to restart
[UTC Mar 20 20:43:55] info     : 'bbs' start: /var/vcap/jobs/bbs/bin/bbs_ctl
[UTC Mar 20 20:44:06] info     : 'bbs' process is running with pid 13117
[UTC Mar 20 21:08:48] error    : 'bbs' process is not running
[UTC Mar 20 21:08:48] info     : 'bbs' trying to restart
[UTC Mar 20 21:08:48] info     : 'bbs' start: /var/vcap/jobs/bbs/bin/bbs_ctl
[UTC Mar 20 21:08:59] info     : 'bbs' process is running with pid 14270
[UTC Mar 20 21:15:29] error    : 'bbs' process is not running
[UTC Mar 20 21:15:29] info     : 'bbs' trying to restart
[UTC Mar 20 21:15:29] info     : 'bbs' start: /var/vcap/jobs/bbs/bin/bbs_ctl
[UTC Mar 20 21:15:40] info     : 'bbs' process is running with pid 14633
[UTC Mar 21 01:00:53] error    : 'bbs' process is not running
[UTC Mar 21 01:00:53] info     : 'bbs' trying to restart
[UTC Mar 21 01:00:53] info     : 'bbs' start: /var/vcap/jobs/bbs/bin/bbs_ctl
[UTC Mar 21 01:01:04] info     : 'bbs' process is running with pid 24474
[UTC Mar 21 03:37:42] error    : 'bbs' process is not running
[UTC Mar 21 03:37:42] info     : 'bbs' trying to restart
[UTC Mar 21 03:37:42] info     : 'bbs' start: /var/vcap/jobs/bbs/bin/bbs_ctl
[UTC Mar 21 03:37:53] info     : 'bbs' process is running with pid 31315

/var/vcap/sys/log/bbs/bbs.stderr.log on the diego-api instance shows the following error messages at the same times as the monit log above, e.g. around 18:08:55, when monit reports the bbs process is not running.

[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:54 connection.go:96: invalid connection
[mysql] 2018/03/20 18:08:54 connection.go:96: invalid connection
[mysql] 2018/03/20 18:08:54 connection.go:96: invalid connection
[mysql] 2018/03/20 18:08:54 connection.go:96: invalid connection
[mysql] 2018/03/20 18:08:54 connection.go:96: invalid connection
[mysql] 2018/03/20 18:08:54 connection.go:96: invalid connection
[mysql] 2018/03/20 18:08:54 connection.go:96: invalid connection
[mysql] 2018/03/20 18:08:54 connection.go:96: invalid connection
[mysql] 2018/03/20 18:08:54 connection.go:96: invalid connection
[mysql] 2018/03/20 18:08:54 connection.go:96: invalid connection
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
2018/03/20 18:08:54 http: multiple response.WriteHeader calls
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
2018/03/20 18:08:54 http: multiple response.WriteHeader calls
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
2018/03/20 18:08:54 http: multiple response.WriteHeader calls
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
2018/03/20 18:08:54 http: multiple response.WriteHeader calls
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
2018/03/20 18:08:54 http: multiple response.WriteHeader calls
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:54 packets.go:36: unexpected EOF

There are multiple error messages in /var/vcap/sys/log/bbs/bbs.stdout.log on the diego-api instance; please check the attached zip file for more information.

/var/vcap/sys/log/locket/locket.stderr.log on the diego-api instance shows the following error messages at the same time:

[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
2018/03/20 18:08:54 grpc: Server.processUnaryRPC failed to write status: stream error: code = 4 desc = "context deadline exceeded"

/var/vcap/sys/log/locket/locket.stdout.log on the diego-api instance shows the following error messages (I am not sure whether the timing matches):

{"timestamp":"1521520927.388530254","source":"locket","message":"locket.lock.register-ttl.failed-compare-and-release","log_level":2,"data":{"error":"rpc error: code = 5 desc = resource-not-found","key":"tps_watcher","modified-index":42,"session":"312.2","type":"lock"}}
{"timestamp":"1521520927.478413343","source":"locket","message":"locket.lock.register-ttl.failed-compare-and-release","log_level":2,"data":{"error":"rpc error: code = 5 desc = resource-not-found","key":"auctioneer","modified-index":352,"session":"314.2","type":"lock"}}

The Metron agent on the diego-api instance logs the following message at the same time:

2018/03/20 18:08:54 Failed to receive data: rpc error: code = Canceled desc = context canceled

/var/vcap/sys/log/silk-controller/silk-controller.stderr.log on the diego-api instance shows the following:

[mysql] 2018/03/20 18:08:48 packets.go:36: read tcp 192.168.15.16:36998->192.168.15.15:3306: i/o timeout
[mysql] 2018/03/20 18:08:48 packets.go:36: read tcp 192.168.15.16:36992->192.168.15.15:3306: i/o timeout
[mysql] 2018/03/20 18:08:48 packets.go:36: read tcp 192.168.15.16:36990->192.168.15.15:3306: i/o timeout
[mysql] 2018/03/20 18:08:48 packets.go:36: read tcp 192.168.15.16:36994->192.168.15.15:3306: i/o timeout
[mysql] 2018/03/20 18:08:48 packets.go:36: read tcp 192.168.15.16:36996->192.168.15.15:3306: i/o timeout
[mysql] 2018/03/20 18:08:48 packets.go:36: read tcp 192.168.15.16:36998->192.168.15.15:3306: i/o timeout
[mysql] 2018/03/20 18:08:48 packets.go:36: read tcp 192.168.15.16:37000->192.168.15.15:3306: i/o timeout
[mysql] 2018/03/20 18:08:53 packets.go:36: read tcp 192.168.15.16:37014->192.168.15.15:3306: i/o timeout
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF
[mysql] 2018/03/20 18:08:53 packets.go:36: unexpected EOF

There are no critical error messages on the database instance itself.

The following message appears on the api instance, for the cloud_controller_ng job, in cloud_controller_worker_ctl.err.log:

[2018-03-20 18:08:54+0000] Sequel::DatabaseDisconnectError: Mysql2::Error: Lost connection to MySQL server during query

The following message appears on the cc-worker instance for the cloud_controller_worker job:

{"timestamp":1521569333.571033,"message":"2018-03-20T18:08:53+0000: [Worker(cc_global_worker.cc-worker.0.1)] Error while reserving job: Mysql2::Error: Lost connection to MySQL server during query","log_level":"info","source":"cc-worker","data":{},"thread_id":47087408492380,"fiber_id":47087433721400,"process_id":11362,"file":"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.4.0/gems/delayed_job-4.1.3/lib/delayed/worker.rb","lineno":285,"method":"say"}

Lock-collision messages from the tps job on the scheduler instance:

{"timestamp":"1521556797.335353374","source":"tps-watcher","message":"tps-watcher.locket-lock.failed-to-acquire-lock","log_level":2,"data":{"error":"rpc error: code = AlreadyExists desc = lock-collision","lock":{"key":"tps_watcher","owner":"5db0c889-7346-48e3-5843-a8681e9cbb8c","type":"lock","type_code":1},"session":"2","ttl_in_seconds":15}}

Please check the attached zip file, which contains logs from all instances except api and router.

Runtime.zip

prashantgnet commented 6 years ago

I am now validating the cf-deployment environment using a single network and a single router with a single-interface configuration, just to make sure there is no bottleneck on the IaaS side.

APShirley commented 6 years ago

Looking at the database proxy logs, it again looks like the proxy lost its connection to the database. This is surprising since they are co-located on the same VM. It seems like there is some problem going on in the IaaS, though we are unsure what it could be.

The fact that many other VMs had problems at the exact same time adds evidence to the theory that there was an IaaS-level event at that point in time.

prashantgnet commented 6 years ago

Thanks for the update, I am working with our IaaS support team and will update you.

prashantgnet commented 6 years ago

I am not able to reproduce the mentioned issue in a new tenant under OpenStack, so I am closing this issue. Thanks a lot for debugging and providing valuable inputs. Thanks ~ Prashant