When running "bosh deploy" to upgrade the rabbit_node_ng job, bosh first stops all the jobs on the vm and then unmounts the persistent disk. But sometimes stopping the jobs fails to stop the "beam" processes (created by the rabbit_node job), so the persistent disk is still in use by "beam" and can't be unmounted, which causes the bosh deploy to fail. To recover, we have to log in to the rabbit_node_ng vm, kill all the beam processes, and re-run bosh deploy. Because of this problem, we can't run bosh deploy fully automatically without manual intervention.
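For reference, a rough sketch of the manual recovery we perform today (the job name follows this report; the exact "bosh ssh" syntax depends on the CLI version in use):

    bosh ssh rabbit_node_ng 0     # log in to the affected vm
    sudo pkill -9 -f beam         # kill the leftover Erlang beam processes
    exit
    bosh deploy                   # re-run the deploy, which can now unmount the disk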
We investigated and found that the problem is caused by some warden processes (the parent processes of the beam processes) that are not stopped by stopping the rabbit_node job (/var/vcap/jobs/rabbit_node_ng/bin/rabbit_node_ctl stop). Normally, when bosh stops the rabbit_node job, the warden processes (like "wshd: 19gdipma38k") are killed, and the beam processes are killed along with them. But sometimes the warden processes fail to be killed, so the beam processes stay alive and keep the persistent disk busy.
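A quick way to confirm this on an affected vm is to ask which processes still hold the persistent disk mount (assuming the standard bosh mount point /var/vcap/store):

    fuser -vm /var/vcap/store     # lists the beam processes still using the mount
    ps -ef | grep 'wshd:'         # shows the surviving warden container processes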
One way to fix this problem we can think of is to extend the "/var/vcap/jobs/rabbit_node_ng/bin/rabbit_node_ctl stop" command to check whether any warden processes are still alive after "kill_and_wait $PIDFILE 60", and if so, kill them. A sketch follows.
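A minimal sketch of what that addition to the stop case of rabbit_node_ctl could look like (the "wshd:" pattern matches the warden container processes named in this report; the sleep duration is an assumption, added to give the kernel time to reap the beam children before bosh tries to unmount the disk):

    # existing shutdown attempt
    kill_and_wait $PIDFILE 60

    # new: check for warden container processes that survived the stop
    survivors=$(pgrep -f 'wshd:' || true)
    if [ -n "$survivors" ]; then
      echo "warden processes still alive after stop, killing: $survivors"
      kill -9 $survivors
      sleep 2   # assumed grace period before the persistent disk is unmounted
    fi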