cloudfoundry / bosh

Cloud Foundry BOSH is an open source tool chain for release engineering, deployment and lifecycle management of large scale distributed services.
https://bosh.io
Apache License 2.0

Another 400007 error: consul_z1/0 is not running after update. #1748

Closed (cwb124 closed this issue 7 years ago)

cwb124 commented 7 years ago

I've seen several issues similar to mine but none of the fixes seem to apply. Here is the pertinent info I think can be helpful in troubleshooting:

Here is the actual error after I run bosh deploy:

**Started updating instance consul_z1 > consul_z1/77424cef-3304-446f-b518-5c4aa7b22a25 (0) (canary). Failed: 'consul_z1/0 (77424cef-3304-446f-b518-5c4aa7b22a25)' is not running after update. Review logs for failed jobs: consul_agent, metron_agent (00:12:39)

Error 400007: 'consul_z1/0 (77424cef-3304-446f-b518-5c4aa7b22a25)' is not running after update. Review logs for failed jobs: consul_agent, metron_agent**

bosh stemcells: bosh-vsphere-esxi-ubuntu-trusty-go_agent | ubuntu-trusty | 3421.11* | sc-a4954fde-2446-49b6-8804-cd66eddbd30d

bosh instances --ps:

| Instance | State | AZ | VM Type | IPs |
|---|---|---|---|---|
| api_worker_z1/0 (814fa456-2879-4af2-8ad4-98fbe6bf113b) | running | n/a | small_z1 | 172.28.158.30 |
| api_z1/0 (3287ac60-2fb0-4ea7-9377-359131442780) | running | n/a | large_z1 | 172.28.158.28 |
| blobstore_z1/0 (2d7b22d5-2bdc-4c3d-93b5-52a70af1bc5e) | running | n/a | medium_z1 | 172.28.158.26 |
| clock_z1/0 (3bdee9b5-d038-4751-858b-9ebcaf02dcb7) | running | n/a | medium_z1 | 172.28.158.29 |
| consul_z1/0 (77424cef-3304-446f-b518-5c4aa7b22a25) | failing | n/a | small_z1 | 172.28.158.117 |
| consul_agent | failing | | | |
| metron_agent | unknown | | | |
| doppler_z1/0 (0b045131-4c3a-49dc-8954-64afe0439a91) | running | n/a | medium_z1 | 172.28.158.31 |

Here is an interesting bit I haven't seen anyone bring up yet. When I SSH to the consul instance and look in the /var/vcap/jobs/consul_agent/config/certs directory, I see the following:

    -rw-r----- 1 vcap vcap    1 Aug 2 18:04 agent.crt
    -rw-r----- 1 vcap vcap    1 Aug 2 18:04 agent.key
    -rw-r----- 1 vcap vcap 1733 Aug 2 18:04 ca.crt
    -rw-r----- 1 vcap vcap 1489 Aug 2 18:04 server.crt
    -rw-r----- 1 vcap vcap 1653 Aug 2 18:04 server.key
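The listing shows agent.crt and agent.key at 1 byte each, which is too small to hold a certificate or key. A minimal Python sketch for spotting such files follows. It assumes it runs somewhere that can read the certs directory, and the helper name summarize_cert_dir is illustrative, not part of BOSH or consul-release.

```python
import re
from pathlib import Path

# One complete PEM block: a BEGIN line, some body, and a matching END line.
PEM_BLOCK = re.compile(r"-----BEGIN ([A-Z0-9 ]+)-----.+?-----END \1-----", re.DOTALL)


def summarize_cert_dir(cert_dir):
    """Return (name, size_in_bytes, has_pem_block) for each regular file in cert_dir."""
    rows = []
    for path in sorted(Path(cert_dir).iterdir()):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going even if a file holds binary junk.
        text = path.read_text(errors="replace")
        rows.append((path.name, path.stat().st_size, bool(PEM_BLOCK.search(text))))
    return rows


# Example call, using the directory from the report above:
# for name, size, has_pem in summarize_cert_dir("/var/vcap/jobs/consul_agent/config/certs"):
#     print(name, size, "ok" if has_pem else "NO PEM DATA")
```

A file flagged as having no PEM data points back to the manifest property that feeds it, since BOSH renders these files from the deployment manifest.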

The agent.crt and agent.key files are effectively empty (1 byte each). This is probably related to the error I see in /var/vcap/sys/log/consul_agent/consul_agent.stdout.log:

    -recursor=69.252.81.81"],"cmd":"/var/vcap/packages/consul/bin/consul"}}
    {"timestamp":"1501699080.103437662","source":"confab","message":"confab.agent-runner.run.success","log_level":1,"data":{}}
    {"timestamp":"1501699080.103584528","source":"confab","message":"confab.controller.boot-agent.agent-client.waiting-for-agent","log_level":1,"data":{}}
    ==> WARNING: LAN keyring exists but -encrypt given, using keyring
    ==> WARNING: WAN keyring exists but -encrypt given, using keyring
    ==> Starting Consul agent...
    ==> Error starting agent: Failed to start Consul server: Failed to parse any CA certificates
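This log mixes two formats: JSON entries from confab and plain Consul lines prefixed with "==>". The following Python sketch pulls out only the lines that look like problems. It assumes one entry per line, as in the log file on disk, and it treats a confab log_level of 2 or higher as an error. That threshold is an assumption, since the entries shown here only use level 1.

```python
import json


def extract_problems(log_text, min_level=2):
    """Return (source, message) pairs for Consul warnings/errors and confab errors."""
    problems = []
    for raw in log_text.splitlines():
        line = raw.strip()
        if line.startswith("==>"):
            message = line[3:].strip()
            # Keep only Consul lines that announce a warning or an error.
            if message.startswith(("Error", "WARNING")):
                problems.append(("consul", message))
        elif line.startswith("{"):
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # truncated or non-JSON line; skip it
            # min_level is an assumed threshold for "error" in confab's output.
            if entry.get("log_level", 0) >= min_level:
                problems.append((entry.get("source", "unknown"), entry.get("message", "")))
    return problems
```

Run against the excerpt above, this would surface the two keyring warnings and the "Failed to parse any CA certificates" error while skipping the informational confab entries.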

The consul_agent.stderr.log just has this: error during start: timeout exceeded: "Get http://127.0.0.1:8500/v1/agent/self: dial tcp 127.0.0.1:8500: getsockopt: connection refused"

Hope this is enough info and logs to get some help. I've been going in circles for two days around this. Thanks much in advance!

EDIT: Also here is the entry from the metron.log: Could not use GRPC creds for client: failed to load keypair: tls: failed to find any PEM data in certificate input

cwb124 commented 7 years ago

I reformatted all the certs and keys under the consul section of the deployment manifest, and the YAML lints cleanly. I reran bosh deploy and it still failed, but the failure is slightly different now: instead of consul_agent failing and metron_agent being unknown, consul_agent is now running and metron_agent is unknown. A slight improvement.

The exact error is:

'consul_z1/0 (d47b3957-676c-47c2-b1bd-07bb4af9732b)' is not running after update. Review logs for failed jobs: metron_agent (00:10:58)

Error 400007: 'consul_z1/0 (d47b3957-676c-47c2-b1bd-07bb4af9732b)' is not running after update. Review logs for failed jobs: metron_agent

The metron.log shows the same error as I pasted above, but now it's apparently the only roadblock I have to a successful deployment:

2017/08/02 21:04:51 Could not use GRPC creds for client: failed to load keypair: tls: failed to find any PEM data in certificate input

Should I still tinker with the cert format? Which specific cert/key is the issue here?
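The message "failed to find any PEM data in certificate input" suggests that a certificate value handed to metron_agent contains no parseable PEM block. One way to narrow down which value is malformed is to run each cert and key string from the manifest through a small check like the Python sketch below. The function name diagnose_pem and the property names in the usage comment are hypothetical, and the check only looks at PEM structure; it does not verify that a cert and key actually belong together.

```python
import base64
import re

# A PEM block whose BEGIN and END markers each sit on their own line.
PEM_BLOCK = re.compile(r"-----BEGIN ([A-Z0-9 ]+)-----\n(.+?)\n-----END \1-----", re.DOTALL)


def diagnose_pem(value):
    """Return a short diagnosis for one manifest value expected to hold PEM data."""
    text = (value or "").strip()
    if not text:
        return "empty"
    if "-----BEGIN" not in text:
        return "no BEGIN marker"
    match = PEM_BLOCK.search(text)
    if match is None:
        return "no complete BEGIN/END block on separate lines (newlines may have been lost)"
    # Join the body lines and confirm they decode as base64.
    body = "".join(match.group(2).split())
    try:
        base64.b64decode(body, validate=True)
    except ValueError:
        return "body is not valid base64"
    return "ok"


# Usage with hypothetical property names; substitute the ones from your manifest:
# values = {"example_job.client_cert": cert_string, "example_job.client_key": key_string}
# for name, value in values.items():
#     print(name, "->", diagnose_pem(value))
```

A common cause of the "newlines may have been lost" result is a multi-line certificate pasted into YAML without a block scalar, so the lines get folded into one.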

dpb587-pivotal commented 7 years ago

Hi - this is fairly release-specific so I'm not sure how helpful I can be, but from the BOSH perspective it sounds like it is deploying things correctly (according to the manifest it was given), and perhaps the deployment manifest was not fully configured.

I noticed the following documentation page about generating keys for Consul, which may be helpful: http://docs.cloudfoundry.org/deploying/common/consul-security.html

More specifically, perhaps you can double check that your deployment manifest has been sufficiently configured. If you're getting started with cf-release, I'd recommend checking out cf-deployment which provides a great base CF manifest that you can use.

I assume you're using the consul-release based on some of the log messages. I'd recommend you check out that repository and perhaps their GitHub issues - they're more likely to have seen this error or know more information about it.

cwb124 commented 7 years ago

We can close the issue. I'm not sure what the fix was, but I rebuilt the deployment manifest from the stub example in the Cloud Foundry docs, paid close attention to the certs, and it worked just fine. Apologies for not having a solution for the next person who finds this.