cloudfoundry / loggregator-release

Cloud Native Logging
Apache License 2.0
215 stars 149 forks source link

Error 400007: 'etcd_z1/0 (4290a801-094d-420d-96a1-cab8441cf549)' is not running after update. Review logs for failed jobs: etcd, etcd_consistency_checker, metron_agent #196

Closed harshalk91 closed 7 years ago

harshalk91 commented 7 years ago

I am trying to deploy cloud foundry. But i am getting this error during bosh deploy command Here is the error:-

ubuntu@bosh:~/my-bosh/final_cf_yml$ bosh instances --ps
Acting as user 'admin' on deployment 'dev' on 'my-bosh'

Director task 244

Task 244 done

+----------------------------------------------------------------------------+---------+-----+-----------+---------------+
| Instance                                                                   | State   | AZ  | VM Type   | IPs           |
+----------------------------------------------------------------------------+---------+-----+-----------+---------------+
| api_z1/0 (d0f4bccb-aaa7-4782-b17f-1011c76c6ca3)*                           | running | n/a | large_z1  | 192.168.1.87  |
+----------------------------------------------------------------------------+---------+-----+-----------+---------------+
| blobstore_z1/0 (c1fcd081-c636-4645-8c48-bdbea9fb0d4a)*                     | running | n/a | medium_z1 | 192.168.1.89  |
+----------------------------------------------------------------------------+---------+-----+-----------+---------------+
| consul_z1/0 (88d8ae24-fa63-41dc-a677-84a6b570a323)*                        | failing | n/a | small_z1  | 192.168.1.73  |
|   consul_agent                                                             | running |     |           |               |
|   metron_agent                                                             | failing |     |           |               |
+----------------------------------------------------------------------------+---------+-----+-----------+---------------+
| doppler_z1/0 (9d43051f-a828-4bcb-aa64-a8743edc08b8)*                       | running | n/a | medium_z1 | 192.168.1.84  |
+----------------------------------------------------------------------------+---------+-----+-----------+---------------+
| etcd_z1/0 (4290a801-094d-420d-96a1-cab8441cf549)*                          | failing | n/a | large_z1  | 192.168.1.69  |
|   consul_agent                                                             | running |     |           |               |
|   etcd                                                                     | unknown |     |           |               |
|   etcd_consistency_checker                                                 | failing |     |           |               |
|   etcd_metrics_server                                                      | running |     |           |               |
|   metron_agent                                                             | failing |     |           |               |
+----------------------------------------------------------------------------+---------+-----+-----------+---------------+
| ha_proxy_z1/0 (f228f861-881a-44dc-b732-d59be31197ad)*                      | running | n/a | router_z1 | 192.168.1.61  |
|                                                                            |         |     |           | 10.104.160.87 |
+----------------------------------------------------------------------------+---------+-----+-----------+---------------+
| hm9000_z1/0 (850b809b-63e7-431d-b9b8-3a139bcd8f9b)*                        | running | n/a | medium_z1 | 192.168.1.86  |
+----------------------------------------------------------------------------+---------+-----+-----------+---------------+
| loggregator_trafficcontroller_z1/0 (49198284-aa78-40ca-9ded-d1668d423ccf)* | running | n/a | small_z1  | 192.168.1.83  |
+----------------------------------------------------------------------------+---------+-----+-----------+---------------+
| nats_z1/0 (de7219bd-f51f-4d26-95e7-83159c9f092a)*                          | failing | n/a | medium_z1 | 192.168.1.63  |
|   consul_agent                                                             | running |     |           |               |
|   nats                                                                     | running |     |           |               |
|   nats_stream_forwarder                                                    | running |     |           |               |
|   metron_agent                                                             | failing |     |           |               |
+----------------------------------------------------------------------------+---------+-----+-----------+---------------+
| postgres_z1/0 (a75068b4-9d08-4ad5-a17d-d55bb1dec8bf)*                      | running | n/a | medium_z1 | 192.168.1.65  |
+----------------------------------------------------------------------------+---------+-----+-----------+---------------+
| router_z1/0 (1bf1eeb3-79e0-4ed7-aec8-183667ba24e5)*                        | running | n/a | router_z1 | 192.168.1.66  |
+----------------------------------------------------------------------------+---------+-----+-----------+---------------+
| runner_z1/0 (eeba8b8e-0b40-48d8-be12-f8a9945cb1de)*                        | running | n/a | runner_z1 | 192.168.1.85  |
+----------------------------------------------------------------------------+---------+-----+-----------+---------------+
| stats_z1/0 (53fba011-f68a-42f4-95c1-316c4f5e1ab4)*                         | running | n/a | small_z1  | 192.168.1.90  |
+----------------------------------------------------------------------------+---------+-----+-----------+---------------+
| uaa_z1/0 (59aa0e71-0754-4629-b329-cec75d31d457)*                           | running | n/a | medium_z1 | 192.168.1.88  |
+----------------------------------------------------------------------------+---------+-----+-----------+---------------+
etcd_z1/4290a801-094d-420d-96a1-cab8441cf549:/var/vcap/data/sys/log/etcd# ls -ltr
total 12
-rw-r----- 1 vcap vcap    0 Jan 19 15:46 pre-start.stdout.log
-rw-r----- 1 vcap vcap  516 Jan 19 15:46 pre-start.stderr.log
-rw-r--r-- 1 vcap vcap    0 Jan 19 15:47 etcd_consistency_checker_ctl.log
-rw-r--r-- 1 vcap vcap 7657 Jan 19 16:52 etcd_consistency_checker_ctl.err.log
etcd_z1/4290a801-094d-420d-96a1-cab8441cf549:/var/vcap/data/sys/log/etcd# tailf etcd_consistency_checker_ctl.err.log
[2017-01-19 16:48:04+0000] + set +x
[2017-01-19 16:48:04+0000] /var/vcap/jobs/etcd/bin/etcd_bosh_utils.sh: line 348: unexpected EOF while looking for matching `''
[2017-01-19 16:50:15+0000] + main start
[2017-01-19 16:50:15+0000] + case "${1}" in
[2017-01-19 16:50:15+0000] + set +x
[2017-01-19 16:50:15+0000] /var/vcap/jobs/etcd/bin/etcd_bosh_utils.sh: line 348: unexpected EOF while looking for matching `''
[2017-01-19 16:52:26+0000] + main start
[2017-01-19 16:52:26+0000] + case "${1}" in
[2017-01-19 16:52:26+0000] + set +x
[2017-01-19 16:52:26+0000] /var/vcap/jobs/etcd/bin/etcd_bosh_utils.sh: line 348: unexpected EOF while looking for matching `''
etcd_z1/4290a801-094d-420d-96a1-cab8441cf549:/var/vcap/data/sys/log/etcd# tailf pre-start.stderr.log
+ main
+ create_directories_and_chown_to_vcap
+ mkdir -p /var/vcap/sys/log/etcd
+ chown -R vcap:vcap /var/vcap/sys/log/etcd
+ mkdir -p /var/vcap/store/etcd
+ chown -R vcap:vcap /var/vcap/store/etcd
+ mkdir -p /var/vcap/jobs/etcd/config/certs
+ chown -R vcap:vcap /var/vcap/jobs/etcd/config/certs
+ mkdir -p /var/vcap/sys/run/etcd
+ chown -R vcap:vcap /var/vcap/sys/run/etcd

Here are etcd_metrics_server logs

etcd_z1/4290a801-094d-420d-96a1-cab8441cf549:/var/vcap/data/sys/log/etcd_metrics_server# ls -ltr
total 80
-rw-r----- 1 vcap vcap     0 Jan 19 15:46 pre-start.stdout.log
-rw-r----- 1 vcap vcap   525 Jan 19 15:46 pre-start.stderr.log
-rw-r--r-- 1 vcap vcap   120 Jan 19 15:48 etcd_metrics_server_ctl.log
-rw-r--r-- 1 vcap vcap     0 Jan 19 15:48 etcd_metrics_server.stderr.log
-rw-r--r-- 1 vcap vcap  1785 Jan 19 15:48 etcd_metrics_server_ctl.err.log
-rw-r--r-- 1 vcap vcap 65076 Jan 19 16:56 etcd_metrics_server.stdout.log
etcd_z1/4290a801-094d-420d-96a1-cab8441cf549:/var/vcap/data/sys/log/etcd_metrics_server# tail -1  etcd_metrics_server.stdout.log
{"timestamp":"1484845085.324359179","source":"etcd-metrics-server","message":"etcd-metrics-server.failed-to-collect-store-stats","log_level":2,"data":{"error":"Get https://etcd-z1-0.cf-etcd.service.cf.internal:4001/v2/stats/store: dial tcp: lookup etcd-z1-0.cf-etcd.service.cf.internal on 127.0.0.1:53: no such host"}}
etcd_z1/4290a801-094d-420d-96a1-cab8441cf549:/var/vcap/data/sys/log/etcd_metrics_server# tailf etcd_metrics_server_ctl.err.log
[2017-01-19 15:48:05+0000] + '[' -f /var/vcap/sys/run/etcd_metrics_server/etcd_metrics_server.pid ']'
[2017-01-19 15:48:05+0000] + start_etcd_metrics_server
[2017-01-19 15:48:05+0000] + local node_name
[2017-01-19 15:48:05+0000] + node_name=etcd-z1-0
[2017-01-19 15:48:05+0000] + echo 8117
[2017-01-19 15:48:05+0000] + /var/vcap/packages/etcd_metrics_server/bin/etcd-metrics-server -index=0 -port=5678 -username= -password= -etcdScheme=https -cert=/var/vcap/jobs/etcd_metrics_server/certs/client.crt -key=/var/vcap/jobs/etcd_metrics_server/certs/client.key -caCert=/var/vcap/jobs/etcd_metrics_server/certs/ca.crt -etcdAddress=etcd-z1-0.cf-etcd.service.cf.internal:4001
[2017-01-19 15:48:05+0000] ++ logger -p user.error -t vcap.etcd_metrics_server
[2017-01-19 15:48:05+0000] ++ logger -p user.info -t vcap.etcd_metrics_server
[2017-01-19 15:48:05+0000] ++ tee -a /var/vcap/sys/log/etcd_metrics_server/etcd_metrics_server.stdout.log
[2017-01-19 15:48:05+0000] ++ tee -a /var/vcap/sys/log/etcd_metrics_server/etcd_metrics_server.stderr.log
etcd_z1/4290a801-094d-420d-96a1-cab8441cf549:/var/vcap/data/sys/log/etcd_metrics_server# tailf etcd_metrics_server_ctl.log
[2017-01-19 15:48:05+0000] ------------ STARTING etcd_metrics_server_ctl at Thu Jan 19 15:48:05 UTC 2017 --------------

etcd_z1/4290a801-094d-420d-96a1-cab8441cf549:/var/vcap/data/sys/log/etcd_metrics_server# tailf pre-start.stderr.log
+ LOG_DIR=/var/vcap/sys/log/etcd_metrics_server
+ CERTS_DIR=/var/vcap/jobs/etcd_metrics_server/certs
+ main
+ create_directories_and_chown_to_vcap
+ mkdir -p /var/vcap/sys/log/etcd_metrics_server
+ chown -R vcap:vcap /var/vcap/sys/log/etcd_metrics_server
+ mkdir -p /var/vcap/sys/run/etcd_metrics_server
+ chown -R vcap:vcap /var/vcap/sys/run/etcd_metrics_server
+ mkdir -p /var/vcap/jobs/etcd_metrics_server/certs
+ chown -R vcap:vcap /var/vcap/jobs/etcd_metrics_server/certs

Here is my etcd part from the manifest

- default_networks:
  - name: prvt
  instances: 1
  name: etcd_z1
  networks:
  - name: prvt
    static_ips:
    - 192.168.1.69
  persistent_disk: 10024
  properties:
    consul:
      agent:
        services:
          etcd:
            name: cf-etcd
    metron_agent:
      zone: z1
  resource_pool: large_z1
  templates:
  - name: consul_agent
    release: cf
  - name: etcd
    release: cf
  - name: etcd_metrics_server
    release: cf
  - name: metron_agent
    release: cf
  update:
    max_in_flight: 1
    serial: true
cf-gitbot commented 7 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/138128541

The labels on this github issue will be updated when the story is started.

harshalk91 commented 7 years ago

@wfernandes @dsabeti Can you guys help me out? I am stuck on this since last couple of weeks. :)

Thanks

poy commented 7 years ago

@harshalk91 The consistency checker is responsible for ensuring etcd is not experiencing a split brain. It seems to be failing because there is one. Please see https://github.com/cloudfoundry-incubator/etcd-release#failure-recovery for how to resolve this.

harshalk91 commented 7 years ago

@apoydence Tried with following commands:

monit stop etcd
rm -rf /var/vcap/store/etcd/* 
monit start etcd
etcd_z1/51beb17f-30db-48b0-813a-04a0af79aae3:~# monit summary
The Monit daemon 5.2.5 uptime: 11h 51m

Process 'consul_agent'              running
Process 'etcd'                      not monitored
Process 'etcd_consistency_checker'  Execution failed
Process 'etcd_metrics_server'       running
Process 'metron_agent'              Does not exist
System 'system_localhost'           running

Facing same issue even after troubleshooting.

hev commented 7 years ago

@dsabeti any thoughts as to what might be occuring here?

harshalk91 commented 7 years ago

@ahevenor @dsabeti I have verified this from cf-247 to cf-251 release. Same issue occurs at bosh deploy command. Even i tried with scratch and redeploying director, re-uploading release and stemcells still issue persists.

dsabeti commented 7 years ago

@harshalk91 It seems to me that the consistency checker is failing because it can't reach the etcd node. It seems that it can't reach the etcd node because the node is not yet alive:

| etcd_z1/0 (4290a801-094d-420d-96a1-cab8441cf549)*  | failing | ...
|   consul_agent             | running |     ...
|   etcd                     | unknown |     ... <=== HERE
|   etcd_consistency_checker | failing | ...

Also here:

Process 'consul_agent'              running
Process 'etcd'                      not monitored <==== HERE
Process 'etcd_consistency_checker'  Execution failed

Your output above suggests that there are no logs for the etcd node itself. Can you double check that there isn't a file like /var/vcap/sys/log/etcd/etcd_ctl* or /var/vcap/sys/log/etcd/etcd*.log?

@evanfarrar @christianang Could we get your help debugging this?

christianang commented 7 years ago

@harshalk91 I think I might have a better idea of why etcd looks like it is not starting, I have continued the thread on https://github.com/cloudfoundry-incubator/etcd-release/issues/27 to help me keep track of it.

jasonkeene commented 7 years ago

I'm closing this issue since conversation has moved to https://github.com/cloudfoundry-incubator/etcd-release/issues/27