cloudfoundry-attic / etcd-release

Apache License 2.0

monit fails to terminate etcd processes #33

Jason-Crowe closed 7 years ago

Jason-Crowe commented 7 years ago

I'm using cf-deployment.yml and etcd-release v89:

- name: etcd
  url: https://bosh.io/d/github.com/cloudfoundry-incubator/etcd-release?v=89
  version: '89'

After issuing a bosh stop --hard on my deployment and then bringing the instance groups back up one by one, I end up having issues with etcd.

Further, when I investigate, I find that there are sometimes several instances of the monit-launched processes running.

Here is an example process tree showing the initial state and then the state a bit later:

## from first iteration
# ps -fwwe --forest|grep etcd
root      7111 10845  0 23:26 pts/2    00:00:00                      \_ grep --color=auto etcd
root      5322     1  0 23:26 ?        00:00:00 /usr/bin/timeout 55 /var/vcap/jobs/etcd/bin/etcd_ctl_wrapper start
root      5323  5322  0 23:26 ?        00:00:00  \_ /bin/bash -exu /var/vcap/jobs/etcd/bin/etcd_ctl_wrapper start
root      5325  5323  0 23:26 ?        00:00:00      \_ sudo -u vcap /var/vcap/jobs/etcd/bin/etcd_ctl start
vcap      5327  5325  0 23:26 ?        00:00:00          \_ /bin/bash -exu /var/vcap/jobs/etcd/bin/etcd_ctl start
vcap      5330  5327  0 23:26 ?        00:00:00              \_ /bin/bash -exu /var/vcap/jobs/etcd/bin/etcd_ctl start
vcap      5337  5334  0 23:26 ?        00:00:00              |   |   \_ /bin/bash -exu /var/vcap/jobs/etcd/bin/etcd_ctl start
vcap      5339  5337  0 23:26 ?        00:00:00              |   |       \_ logger -p user.info -t vcap.etcd_ctl.stdout
vcap      5331  5327  0 23:26 ?        00:00:00              \_ /bin/bash -exu /var/vcap/jobs/etcd/bin/etcd_ctl start
vcap      5336  5332  0 23:26 ?        00:00:00              |   |   \_ /bin/bash -exu /var/vcap/jobs/etcd/bin/etcd_ctl start
vcap      5338  5336  0 23:26 ?        00:00:00              |   |       \_ logger -p user.error -t vcap.etcd_ctl.stderr
vcap      5369     1  0 23:26 ?        00:00:00 /bin/bash -xu /var/vcap/jobs/etcd/bin/etcd_network_diagnostics_run.sh
vcap      5373  5369  0 23:26 ?        00:00:00  \_ /bin/bash -exu /var/vcap/jobs/etcd/bin/etcd_network_diagnostics_run_ctl.sh start
vcap      5379  5373  0 23:26 ?        00:00:00  |   \_ tee -a /var/vcap/sys/log/etcd/etcd-network-diagnostics.stderr.log
vcap      5380  5373  0 23:26 ?        00:00:00  |   \_ logger -p user.error -t vcap.etcd-network-diagnostics
vcap      5374  5369  0 23:26 ?        00:00:00  \_ /bin/bash -exu /var/vcap/jobs/etcd/bin/etcd_network_diagnostics_run_ctl.sh start
vcap      5377  5374  0 23:26 ?        00:00:00  |   \_ tee -a /var/vcap/sys/log/etcd/etcd-network-diagnostics.stdout.log
vcap      5378  5374  0 23:26 ?        00:00:00  |   \_ logger -p user.info -t vcap.etcd-network-diagnostics

## a bit later
# ps -fwwe --forest|grep etcd
root     14436 10845  0 23:28 pts/2    00:00:00                      \_ grep --color=auto etcd
vcap      5330     1  0 23:26 ?        00:00:00 /bin/bash -exu /var/vcap/jobs/etcd/bin/etcd_ctl start
vcap      5337  5334  0 23:26 ?        00:00:00  |   \_ /bin/bash -exu /var/vcap/jobs/etcd/bin/etcd_ctl start
vcap      5339  5337  0 23:26 ?        00:00:00  |       \_ logger -p user.info -t vcap.etcd_ctl.stdout
vcap      5331     1  0 23:26 ?        00:00:00 /bin/bash -exu /var/vcap/jobs/etcd/bin/etcd_ctl start
vcap      5336  5332  0 23:26 ?        00:00:00  |   \_ /bin/bash -exu /var/vcap/jobs/etcd/bin/etcd_ctl start
vcap      5338  5336  0 23:26 ?        00:00:00  |       \_ logger -p user.error -t vcap.etcd_ctl.stderr
vcap      5369     1  0 23:26 ?        00:00:00 /bin/bash -xu /var/vcap/jobs/etcd/bin/etcd_network_diagnostics_run.sh
vcap      5373  5369  0 23:26 ?        00:00:00  \_ /bin/bash -exu /var/vcap/jobs/etcd/bin/etcd_network_diagnostics_run_ctl.sh start
vcap      5379  5373  0 23:26 ?        00:00:00  |   \_ tee -a /var/vcap/sys/log/etcd/etcd-network-diagnostics.stderr.log
vcap      5380  5373  0 23:26 ?        00:00:00  |   \_ logger -p user.error -t vcap.etcd-network-diagnostics
vcap      5374  5369  0 23:26 ?        00:00:00  \_ /bin/bash -exu /var/vcap/jobs/etcd/bin/etcd_network_diagnostics_run_ctl.sh start
vcap      5377  5374  0 23:26 ?        00:00:00  |   \_ tee -a /var/vcap/sys/log/etcd/etcd-network-diagnostics.stdout.log
vcap      5378  5374  0 23:26 ?        00:00:00  |   \_ logger -p user.info -t vcap.etcd-network-diagnostics
vcap      8714     1  0 23:27 ?        00:00:00 /bin/bash -exu /var/vcap/jobs/etcd/bin/etcd_ctl start
vcap      8722  8719  0 23:27 ?        00:00:00  |   \_ /bin/bash -exu /var/vcap/jobs/etcd/bin/etcd_ctl start
vcap      8725  8722  0 23:27 ?        00:00:00  |       \_ logger -p user.info -t vcap.etcd_ctl.stdout
vcap      8715     1  0 23:27 ?        00:00:00 /bin/bash -exu /var/vcap/jobs/etcd/bin/etcd_ctl start
vcap      8721  8717  0 23:27 ?        00:00:00  |   \_ /bin/bash -exu /var/vcap/jobs/etcd/bin/etcd_ctl start
vcap      8724  8721  0 23:27 ?        00:00:00  |       \_ logger -p user.error -t vcap.etcd_ctl.stderr
vcap      8747     1  0 23:27 ?        00:00:00 /bin/bash -xu /var/vcap/jobs/etcd/bin/etcd_network_diagnostics_run.sh
vcap      8750  8747  0 23:27 ?        00:00:00  \_ /bin/bash -exu /var/vcap/jobs/etcd/bin/etcd_network_diagnostics_run_ctl.sh start
vcap      8754  8750  0 23:27 ?        00:00:00  |   \_ tee -a /var/vcap/sys/log/etcd/etcd-network-diagnostics.stderr.log
vcap      8755  8750  0 23:27 ?        00:00:00  |   \_ logger -p user.error -t vcap.etcd-network-diagnostics
vcap      8751  8747  0 23:27 ?        00:00:00  \_ /bin/bash -exu /var/vcap/jobs/etcd/bin/etcd_network_diagnostics_run_ctl.sh start
vcap      8752  8751  0 23:27 ?        00:00:00  |   \_ tee -a /var/vcap/sys/log/etcd/etcd-network-diagnostics.stdout.log
vcap      8753  8751  0 23:27 ?        00:00:00  |   \_ logger -p user.info -t vcap.etcd-network-diagnostics
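
In the second listing, the leftover etcd_ctl start shells have been reparented to PID 1. A minimal sketch for spotting such processes is below. The function name find_orphans is hypothetical and not part of etcd-release, and it assumes orphans are reparented to PID 1, which may not hold in containers or under a subreaper.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of etcd-release): print the PIDs of processes
# whose parent is PID 1 and whose command line contains the given string.
# Expects "PID PPID ARGS..." lines on stdin.
find_orphans() {
  local pattern="$1"
  # $2 is the PPID column; index() does a plain substring match on the line.
  awk -v pat="$pattern" '$2 == 1 && index($0, pat) > 0 { print $1 }'
}

# Possible use on a live VM (the trailing "=" suppresses the ps header):
#   ps -eo pid=,ppid=,args= | find_orphans "etcd_ctl start"
```

Matching on "etcd_ctl start" deliberately skips the etcd_ctl_wrapper and network diagnostics lines, which also show PPID 1 in the listings above.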

cf-gitbot commented 7 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/140723929

The labels on this github issue will be updated when the story is started.

evanfarrar commented 7 years ago

Hey! We've released v96 to fix the orphaning issue. Please let us know if you still see other issues (we're not sure why your cluster hadn't synced after 55 seconds, so that could still be a problem).
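
For readers hitting similar orphaning with their own start scripts, one general way to avoid leaving children behind when a deadline expires is to signal the whole process group rather than a single PID. The sketch below illustrates that idea only; it is not the change made in v96. The helper name run_with_group_timeout is hypothetical, and it assumes a Linux host with setsid available and a non-interactive shell.

```shell
#!/usr/bin/env bash
# Sketch only: run_with_group_timeout is a hypothetical helper, not part of
# etcd-release, monit, or BOSH.
#
# Usage: run_with_group_timeout SECONDS COMMAND [ARGS...]
# Runs COMMAND as the leader of a new session (and process group). If it is
# still running after SECONDS, sends TERM to every process in that group.
run_with_group_timeout() {
  local seconds="$1"; shift
  local pid waited=0 status=0

  # In a non-interactive script the background child is not a group leader,
  # so setsid execs COMMAND directly and $! doubles as the new group ID.
  setsid "$@" &
  pid=$!

  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$seconds" ]; then
      # A negative PID addresses the whole process group.
      kill -TERM -- "-$pid" 2>/dev/null || true
      wait "$pid" 2>/dev/null || true
      return 124   # the exit status timeout(1) uses on expiry
    fi
    sleep 1
    waited=$((waited + 1))
  done

  # COMMAND finished in time: pass its exit status through.
  wait "$pid" || status=$?
  return "$status"
}
```

For example, run_with_group_timeout 55 /path/to/start_script start would send TERM to the script and any descendants still in its process group after 55 seconds. Descendants that move to another session or group are not covered.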