cloudfoundry / bosh

Cloud Foundry BOSH is an open source tool chain for release engineering, deployment and lifecycle management of large scale distributed services.
https://bosh.io
Apache License 2.0

VM state is failing after it is stopped and started #956

Closed bingosummer closed 9 years ago

bingosummer commented 9 years ago

I used BOSH to deploy a multi-VM Cloud Foundry, with two runners for the DEA.

I stopped runner_z1/1 and then started it again. I expected runner_z1/1 to come back to the running state automatically, but instead it cycles between starting and failing:

+------------------------------------+---------+----------------------------------------+----------------+
| Job/index                          | State   | Resource Pool                          | IPs            |
+------------------------------------+---------+----------------------------------------+----------------+
| api_z1/0                           | running | resource_api                           | 10.0.16.101    |
| etcd_z1/0                          | running | resource_z1                            | 10.0.16.14     |
| ha_proxy_z1/0                      | running | resource_z1                            | 10.0.16.4      |
|                                    |         |                                        | 168.63.204.149 |
| hm9000_z1/0                        | running | resource_hm                            | 10.0.16.102    |
| loggregator_trafficcontroller_z1/0 | running | resource_loggregator_trafficcontroller | 10.0.16.104    |
| loggregator_z1/0                   | running | resource_loggregator                   | 10.0.16.103    |
| login_z1/0                         | running | resource_login                         | 10.0.16.105    |
| nats_z1/0                          | running | resource_nats                          | 10.0.16.13     |
| nfs_z1/0                           | running | resource_z1                            | 10.0.16.15     |
| postgres_z1/0                      | running | resource_z1                            | 10.0.16.11     |
| router_z1/0                        | running | resource_router                        | 10.0.16.12     |
| runner_z1/0                        | running | resource_runner                        | 10.0.16.106    |
| runner_z1/1                        | failing | resource_runner                        | 10.0.16.107    |
| stats_z1/0                         | running | resource_z1                            | 10.0.16.109    |
+------------------------------------+---------+----------------------------------------+----------------+

I SSHed into runner_z1/1:

root@cb282e3f-0107-43ab-965a-b738a304960d:~# monit summary
/var/vcap/monit/job/0002_dea_next.monitrc:3: Warning: the executable does not exist '/var/vcap/jobs/dea_next/bin/warden_ctl'
/var/vcap/monit/job/0002_dea_next.monitrc:5: Warning: the executable does not exist '/var/vcap/jobs/dea_next/bin/warden_ctl'
/var/vcap/monit/job/0002_dea_next.monitrc:17: Warning: the executable does not exist '/var/vcap/jobs/dea_next/bin/dea_ctl'
/var/vcap/monit/job/0002_dea_next.monitrc:18: Warning: the executable does not exist '/var/vcap/jobs/dea_next/bin/dea_ctl'
/var/vcap/monit/job/0002_dea_next.monitrc:24: Warning: the executable does not exist '/var/vcap/jobs/dea_next/bin/dir_server_ctl'
/var/vcap/monit/job/0002_dea_next.monitrc:25: Warning: the executable does not exist '/var/vcap/jobs/dea_next/bin/dir_server_ctl'
/var/vcap/monit/job/0001_dea_logging_agent.monitrc:3: Warning: the executable does not exist '/var/vcap/jobs/dea_logging_agent/bin/dea_logging_agent_ctl'
/var/vcap/monit/job/0001_dea_logging_agent.monitrc:4: Warning: the executable does not exist '/var/vcap/jobs/dea_logging_agent/bin/dea_logging_agent_ctl'
/var/vcap/monit/job/0000_metron_agent.monitrc:3: Warning: the executable does not exist '/var/vcap/jobs/metron_agent/bin/metron_agent_ctl'
/var/vcap/monit/job/0000_metron_agent.monitrc:4: Warning: the executable does not exist '/var/vcap/jobs/metron_agent/bin/metron_agent_ctl'
The Monit daemon 5.2.4 uptime: 1h 18m

Process 'warden'                    Execution failed
Process 'dea_next'                  initializing
Process 'dir_server'                Execution failed
Process 'dea_logging_agent'         Execution failed
Process 'metron_agent'              not monitored
System 'system_cb282e3f-0107-43ab-965a-b738a304960d' running

The executables do not exist because the directory /var/vcap/data was lost when the VM was stopped and started (/dev/sdb2 is an ephemeral disk):

root@cb282e3f-0107-43ab-965a-b738a304960d:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       2.8G  1.4G  1.3G  51% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
udev            828M  4.0K  828M   1% /dev
tmpfs           168M  372K  168M   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            839M     0  839M   0% /run/shm
none            100M     0  100M   0% /run/user
/dev/sdb2        68G   57M   64G   1% /var/vcap/data
tmpfs           1.0M     0  1.0M   0% /var/vcap/data/sys/run
/dev/loop0      120M  1.6M  115M   2% /tmp
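The monit warnings above amount to a simple existence check on each job's control script. A minimal sketch of that check, with the job paths taken from the monit output and a JOBS_ROOT variable added here as an assumption so it can be pointed at a different directory:

```shell
# Check each DEA job control script the way monit does: warn when the
# executable is missing. Paths are from the monit warnings above;
# JOBS_ROOT is a hypothetical override (defaults to the real location).
JOBS_ROOT="${JOBS_ROOT:-/var/vcap/jobs}"

for ctl in dea_next/bin/warden_ctl \
           dea_next/bin/dea_ctl \
           dea_next/bin/dir_server_ctl \
           dea_logging_agent/bin/dea_logging_agent_ctl \
           metron_agent/bin/metron_agent_ctl; do
  if [ ! -x "$JOBS_ROOT/$ctl" ]; then
    echo "missing: $JOBS_ROOT/$ctl"
  fi
done
```

On a VM whose ephemeral disk was wiped, every ctl script is reported missing, matching the monit summary above.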

So my questions are: Should I expect the VM to come back to the running state automatically after a stop/start operation? If so, how should I fix the failing state? If not, should I delete the failing VM, recreate a new one, and attach the persistent disk manually?

Thanks

cppforlife commented 9 years ago

Should I expect the VM automatically come to running state after the stopping/starting operation?

Nope, since the ephemeral disk is lost.

If so, how should I fix the failing state?

You can run bosh recreate job_name job_index.

If not, should I delete the failing VM, recreate one and attach the persistent disk manually?

Nope, just let bosh recreate do that for you.

Maybe the same issue happened in AWS?

Yes, it does.

BOSH expects that machines stopped outside of its control will be brought back up through BOSH, either automatically via the resurrector or manually via bosh cck / bosh recreate.
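The fix described above can be sketched as the following commands (a hedged sketch: the v1-era BOSH CLI syntax of the time, with the job name and index taken from the deployment table earlier in this issue):

```shell
# Option A: recreate only the failing instance. BOSH rebuilds the VM,
# re-renders the jobs onto a fresh ephemeral disk, and reattaches the
# persistent disk automatically.
bosh recreate runner_z1 1

# Option B: run cloud check, which scans the deployment for unresponsive
# or broken VMs and offers repair actions interactively.
bosh cck
```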

bingosummer commented 9 years ago

Thanks @cppforlife for your quick response and kind help.