cloudfoundry / bosh

Cloud Foundry BOSH is an open source tool chain for release engineering, deployment and lifecycle management of large scale distributed services.
https://bosh.io
Apache License 2.0

UAA Route_registrar is not running after update #1518

Closed StBurcher closed 7 years ago

StBurcher commented 7 years ago

Hi,

I have a problem with the route_registrar during a cf deployment. It will not start, but only for the UAA. All other registrar entries (e.g. Blobstore) are working. I get the following error message:

Director task 154 Started preparing deployment > Preparing deployment. Done (00:00:00)

Started preparing package compilation > Finding packages to compile. Done (00:00:01)

Started updating job uaa > uaa/0 (3fd560a3-ae9a-4bf9-a1de-70e33dee8783) (canary). Failed: 'uaa/0 (3fd560a3-ae9a-4bf9-a1de-70e33dee8783)' is not running after update. Review logs for failed jobs: route_registrar (00:02:34)

Error 400007: 'uaa/0 (3fd560a3-ae9a-4bf9-a1de-70e33dee8783)' is not running after update. Review logs for failed jobs: route_registrar

The UAA part of manifest looks like:

- name: uaa
  instances: 1
  vm_type: default
  azs: [z1]
  stemcell: ubuntu-trusty
  templates:
  - {name: uaa, release: cf}
  - {name: metron_agent, release: cf}
  - {name: route_registrar, release: cf}
  networks:
  - name: default
    static_ips: [10.0.0.8]
  properties:
    login:
      catalina_opts: -Xmx768m -XX:MaxPermSize=256m
    route_registrar:
      routes:
      - name: uaa
        registration_interval: 20s
        port: 8080
        uris:
        - "uaa.<%= system_domain %>"
        - "*.uaa.<%= system_domain %>"
        - "login.<%= system_domain %>"
        - "*.login.<%= system_domain %>"
    uaa:
      admin:
        client_secret: password
      batch:
        password: password
        username: batch_user
      cc:
        client_secret: password
      scim:
        userids_enabled: true
        users:
        - admin|<%= cf_admin_pwd %>|scim.write,scim.read,openid,cloud_controller.admin,doppler.firehose,routing.router_groups.read
    uaadb:
      address: 10.0.0.5
      databases:
      - {name: uaadb, tag: uaa}
      db_scheme: postgresql
      port: 5524
      roles:
      - {name: uaaadmin, password: password, tag: admin}

/var/vcap/sys/log/route_registrar/route_registrar.stderr.log is empty.

Last entries in route_registrar.stdout.log are

{"timestamp":"1479736657.834472179","source":"Route Registrar","message":"Route Registrar.Registered routes successfully","log_level":1,"data":{}}
{"timestamp":"1479736677.834125042","source":"Route Registrar","message":"Route Registrar.no healthchecker found for route","log_level":1,"data":{"route":{"Name":"uaa","Port":8082,"Tags":null,"URIs":["uaa.de.cloudlab.com",".uaa.de.cloudlab.com","login.de.cloudlab.com",".login.de.cloudlab.com"],"RouteServiceUrl":"","RegistrationInterval":20000000000,"HealthCheck":null}}}
{"timestamp":"1479736677.834250212","source":"Route Registrar","message":"Route Registrar.Registering route","log_level":1,"data":{"route":{"Name":"uaa","Port":8082,"Tags":null,"URIs":["uaa.de.cloudlab.com",".uaa.de.cloudlab.com","login.de.cloudlab.com",".login.de.cloudlab.com"],"RouteServiceUrl":"","RegistrationInterval":20000000000,"HealthCheck":null}}}
{"timestamp":"1479736677.834356546","source":"Route Registrar","message":"Route Registrar.Registered routes successfully","log_level":1,"data":{}}

Why do I get this timeout?

dpb587-pivotal commented 7 years ago

When you ssh onto the VM, does monit summary show the route_registrar as running? If so, it might just be taking longer than you expect to start up, and increasing your update_watch_time may help.

StBurcher commented 7 years ago

After redeploying, I got the error everywhere:

Failed updating job doppler > doppler/0 (b0d7ef00-f2ad-4b17-9634-528af3e9e51b) (canary): 'doppler/0 (b0d7ef00-f2ad-4b17-9634-528af3e9e51b)' is not running after update. Review logs for failed jobs: doppler (00:01:14)
Failed updating job loggregator_trafficcontroller > loggregator_trafficcontroller/0 (e532f1d7-bd62-4564-bb56-274076d00f63) (canary): 'loggregator_trafficcontroller/0 (e532f1d7-bd62-4564-bb56-274076d00f63)' is not running after update. Review logs for failed jobs: loggregator_trafficcontroller, metron_agent, route_registrar (00:01:15)
Failed updating job route_emitter > route_emitter/0 (0d8a9b0e-bb72-48c1-a0ef-d749a7935953) (canary): 'route_emitter/0 (0d8a9b0e-bb72-48c1-a0ef-d749a7935953)' is not running after update. Review logs for failed jobs: consul_agent, route_emitter, metron_agent (00:01:22)
Failed updating job router > router/0 (1ac5183f-2652-4409-9a7a-d4ebc641c323) (canary): 'router/0 (1ac5183f-2652-4409-9a7a-d4ebc641c323)' is not running after update. Review logs for failed jobs: gorouter, metron_agent, consul_agent (00:01:24)
Failed updating job brain > brain/0 (b768c8b4-10ae-48ec-a38b-62b819a81d5d) (canary): 'brain/0 (b768c8b4-10ae-48ec-a38b-62b819a81d5d)' is not running after update. Review logs for failed jobs: consul_agent, auctioneer, converger, metron_agent (00:01:24)
Failed updating job cc_bridge > cc_bridge/0 (aa452943-63a3-4777-86c1-f69db78b34f3) (canary): 'cc_bridge/0 (aa452943-63a3-4777-86c1-f69db78b34f3)' is not running after update. Review logs for failed jobs: consul_agent, stager, nsync_listener, nsync_bulker, tps_listener, tps_watcher, cc_uploader, metron_agent (00:01:28)
Failed updating job access > access/0 (f198a985-f981-45f4-8384-bd12f9bb8995) (canary): 'access/0 (f198a985-f981-45f4-8384-bd12f9bb8995)' is not running after update. Review logs for failed jobs: consul_agent, ssh_proxy, metron_agent, file_server (00:01:29)
Failed updating job api > api/0 (414b6459-253f-41f8-a227-641906b33988) (canary): Action Failed get_task: Task 331c4b02-3379-4fca-4828-34b14e6ee1bc result: 1 of 4 pre-start scripts failed. Failed Jobs: cloud_controller_ng. Successful Jobs: cloud_controller_worker, cloud_controller_clock, consul_agent. (00:01:56)
Failed updating job cell > cell/0 (ab5d973c-0959-4e0e-9132-5b03a30a6fe8) (canary): 'cell/0 (ab5d973c-0959-4e0e-9132-5b03a30a6fe8)' is not running after update. Review logs for failed jobs: consul_agent, rep, garden, metron_agent (00:02:19)
Failed updating job database > database/0 (75ebf467-81b7-413d-b5fc-942694402d2f) (canary): 'database/0 (75ebf467-81b7-413d-b5fc-942694402d2f)' is not running after update. Review logs for failed jobs: consul_agent, etcd, etcd_consistency_checker, bbs, metron_agent (00:02:35)
Failed updating job uaa > uaa/0 (fe92dab6-299c-4fe2-ab07-4cf0aa653b17) (canary): 'uaa/0 (fe92dab6-299c-4fe2-ab07-4cf0aa653b17)' is not running after update. Review logs for failed jobs: metron_agent, route_registrar (00:02:38)

Error 400007: 'doppler/0 (b0d7ef00-f2ad-4b17-9634-528af3e9e51b)' is not running after update. Review logs for failed jobs: doppler

Monit summary on uaa shows:

Process 'uaa' running
Process 'metron_agent' running
Process 'route_registrar' Execution failed
System 'system_localhost' running

the log file route_registrar.stderr.log:

panic: nats: No servers available for connection

goroutine 1 [running]:
github.com/cloudfoundry-incubator/route-registrar/Godeps/_workspace/src/github.com/pivotal-golang/lager.(*logger).Fatal(0xc8200162a0, 0x7d9850, 0x12, 0x7fce1d89a028, 0xc820011280, 0x0, 0x0, 0x0)
    /var/vcap/packages/route_registrar/src/github.com/cloudfoundry-incubator/route-registrar/Godeps/_workspace/src/github.com/pivotal-golang/lager/logger.go:152 +0x698
main.main()
    /var/vcap/packages/route_registrar/src/github.com/cloudfoundry-incubator/route-registrar/main.go:85 +0x126f

goroutine 17 [syscall, locked to thread]:
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1721 +0x1

goroutine 5 [syscall]:
os/signal.loop()
    /usr/local/go/src/os/signal/signal_unix.go:22 +0x18
created by os/signal.init.1
    /usr/local/go/src/os/signal/signal_unix.go:28 +0x37

goroutine 6 [select, locked to thread]:
runtime.gopark(0x836d90, 0xc820022728, 0x7a59e0, 0x6, 0x42d918, 0x2)
    /usr/local/go/src/runtime/proc.go:185 +0x163
runtime.selectgoImpl(0xc820022728, 0x0, 0x18)
    /usr/local/go/src/runtime/select.go:392 +0xa64
runtime.selectgo(0xc820022728)
    /usr/local/go/src/runtime/select.go:212 +0x12
runtime.ensureSigM.func1()
    /usr/local/go/src/runtime/signal1_unix.go:227 +0x353
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1721 +0x1

cppforlife commented 7 years ago

@StBurcher seems like nats machines are not reachable/failing?

StBurcher commented 7 years ago

It is getting confusing. After removing the whole system and reinstalling it with CF-238, only UAA is not running. Error:

Director task 315 Started preparing deployment > Preparing deployment. Done (00:00:01)

Started preparing package compilation > Finding packages to compile. Done (00:00:01)

Started updating job uaa > uaa/0 (cff050ec-dc6b-4560-9a61-3282f92a3f83) (canary). Failed: 'uaa/0 (cff050ec-dc6b-4560-9a61-3282f92a3f83)' is not running after update. Review logs for failed jobs: metron_agent (00:03:34)

Error 400007: 'uaa/0 (cff050ec-dc6b-4560-9a61-3282f92a3f83)' is not running after update. Review logs for failed jobs: metron_agent

Task 315 error

But bosh instances --ps shows:

| uaa/0 (cff050ec-dc6b-4560-9a61-3282f92a3f83)* | running | INDIA | medium | 10.0.0.40 |
| uaa | running | | | |
| metron_agent | running | | | |
| route_registrar | running | | | |

dpb587-pivotal commented 7 years ago

Since metron_agent was listed as not running, but it was running when you ran bosh instances --ps, you might just need to increase the canary_watch_time/update_watch_time (http://bosh.io/docs/deployment-manifest.html#update). It might have just needed some additional time to start depending on your environment.
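
For illustration, a minimal sketch of how the update block in the deployment manifest might look with longer watch times. The numbers here are assumptions chosen for the example, not values taken from this deployment or recommended defaults; pick bounds that match how long the jobs take to start in your environment.

```yaml
# Hypothetical values for illustration only; tune them for your environment.
update:
  canaries: 1
  max_in_flight: 1
  # Watch times are in milliseconds. With a range, the Director waits for the
  # lower bound, then keeps polling until the upper bound before it reports
  # the instance as "not running after update".
  canary_watch_time: 30000-600000
  update_watch_time: 30000-600000
```

With a range like this, a job that needs a few minutes to come up (as metron_agent apparently did here) still has time to reach the running state before the deploy is marked as failed.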

StBurcher commented 7 years ago

Hi,

Thank you, that worked. I have changed the time.