Stop of rabbit app within start_rmq_server_app (OCF `rabbitmq-server-ha`) #1833

Open f-schie opened 1 year ago

f-schie commented 1 year ago

Hi,

In the OCF agent rabbitmq-server-ha, I don't understand why the function stop_rmq_server_app is called after a successful start: https://github.com/ClusterLabs/resource-agents/blob/50b6cd7d363e1208379c12349a2f3a2a83b8999c/heartbeat/rabbitmq-server-ha#L1402

As seen in the snippet below, why would I want to stop the RMQ server app when it has just been started successfully as master of the cluster:

    if [ $rc -eq $OCF_SUCCESS ] ; then
        # rabbitmq-server started successfully as master of cluster
        master_score $MIN_MASTER_SCORE
        stop_rmq_server_app
        rc=$?
        if [ $rc -ne 0 ] ; then
            ocf_log err "${LH} RMQ-server app can't be stopped. Beam will be killed."
            kill_rmq_and_remove_pid
            unblock_client_access "${LH}"
            return $OCF_ERR_GENERIC
        fi

Clearly I am missing something; could someone please explain why it is done this way? @bogdando, maybe you can help me out here?

We are using the OCF rabbitmq-server-ha agent within a Pacemaker cluster of 3 nodes and are experiencing slow starts and a somewhat strange election of the RabbitMQ master (newly booted node tears down active master and starts its own promotion...).

bogdando commented 1 year ago

Firstly, thank you for using this agent and taking care of its health!

In the repository from which this OCF agent originates (now in openstack-archive), there had been a related change and the corresponding Gerrit change. Some related LP bugs are linked in the commit message for more context.

For the record: setting master_score 1 (the minimal positive master score for this node) means that the application is stopped on a non-master node (that is, on all of them). The master normally takes a master_score of 1000, and a node which should never be promoted takes a score of 0.
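
For illustration, the promotion score is just a transient node attribute that Pacemaker consults when choosing which node to promote. A minimal sketch of how an agent could set it (not the agent's actual master_score implementation; the helper and comments here are illustrative):

    # Illustrative helper: set this node's master (promotion) score.
    #   0    - this node must never be promoted
    #   1    - minimal positive score (app stopped on a non-master node)
    #   1000 - the current/preferred master
    master_score() {
        # crm_master stores the score as a transient node attribute
        crm_master -Q -l reboot -v "$1"
    }

    master_score 1       # e.g. right after a start, before promotion
    master_score 1000    # on the node acting as master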

So, as the follow-up fix clarifies, we stop the app only to test whether it can really be started and stopped. There had been some corner cases where the application reports it has started but is in fact not functioning properly; the linked LP bug explains that in detail. FWIW, we want to make sure that the app can be stopped without errors after we have started it. And if it cannot, the Mnesia DB will be cleaned up, so that the next time Pacemaker runs monitor or processes other events, the app should start without problems (most likely!).
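
In other words, the start action follows a "start, then verify by stopping, then clean up on failure" pattern. A simplified sketch of that idea (not the agent's actual code; reset_mnesia stands in for whatever cleanup the agent really performs):

    # Sketch: start the app, then prove it can be stopped cleanly;
    # otherwise wipe local state so the next attempt starts from scratch.
    verify_rmq_start() {
        try_to_start_rmq_app || return $OCF_ERR_GENERIC

        if ! stop_rmq_server_app; then
            # The app claims to be running but cannot be stopped cleanly,
            # so clean up Mnesia for the next monitor/start cycle.
            reset_mnesia
            return $OCF_ERR_GENERIC
        fi

        return $OCF_SUCCESS
    }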

bogdando commented 1 year ago

By the way, there is some automation around customized Jepsen tests, which I used to run from time to time in a fork of the rabbitmq-server repo via GitHub Actions.

It used to always reassemble the cluster after network partitions caused by the testing framework, which allowed the tests to complete. I no longer maintain that automation and fork, as we moved the script from the rabbitmq-server repo to this new home. Having that Jepsen CI around here could be a good idea...

bogdando commented 1 year ago

newly booted node tears down active master and starts its own promotion

This could be a valid issue, and it would also explain the suboptimal Jepsen testing results (many pending messages).

f-schie commented 1 year ago

Thanks for the quick reply! So if I understand correctly, it is OK (and partially expected) to have a scenario like this:

  1. Initiate a restart of msRabbitMQ via Pacemaker.
  2. RabbitMQ is stopped via action_stop() (which calls stop_server_process()).
  3. After everything has been stopped successfully, action_start() invokes start_rmq_server_app().
  4. start_rmq_server_app() starts the RMQ-server app via try_to_start_rmq_app(), which calls /usr/sbin/rabbitmqctl start_app.
  5. Even if the start succeeds, stop_app is executed to verify the correct stop/start behavior (see the sketch after this list).
  6. If stop_app fails, the Mnesia DB is reset and the whole process is repeated.
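
At the rabbitmqctl level, steps 4-6 roughly correspond to the following (a hedged illustration only; the agent wraps these calls with timeouts, retries and logging):

    # Start the RabbitMQ application on the already-running beam process
    /usr/sbin/rabbitmqctl start_app

    # Even after a successful start, stop the app again to verify that a
    # clean stop is possible
    if ! /usr/sbin/rabbitmqctl stop_app; then
        # Verification failed: per the snippet above, the agent then kills
        # the beam process and cleans up Mnesia so the next start attempt
        # begins from a clean state
        echo "stop_app failed; beam will be killed and Mnesia cleaned up" >&2
        exit 1
    fi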

Resetting the Mnesia DB means losing all durable exchanges/queues and the data in them, doesn't it? If there is a cluster outage because of a power loss, does that mean all data that has not been processed is lost upon recovery?

Regarding my other scenario:

newly booted node tears down active master and starts its own promotion

I need to investigate this further. The remaining node receives a notify followed by a demote action by the DC as soon as the "old" master reboots.
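
One way to dig into this is to watch the promotion scores and the scheduler's decisions while the old master comes back online (crm_mon and crm_simulate are standard Pacemaker tools; the exact attribute names depend on the agent):

    # One-shot cluster status including node attributes such as master scores
    crm_mon -1 -A

    # Show the scores the scheduler assigned in the last transition
    # (run against the live CIB)
    crm_simulate -sL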

Thanks for the link to the automation repo - I'll check this out!

bogdando commented 1 year ago

Resetting the Mnesia DB is the standard handling for unrecoverable start/stop/join failures and the like. When using HA queues or Raft (quorum) queues (which also require durable queues), the data loss can perhaps be minimized.
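
For example, classic mirrored (HA) queues are enabled through a policy, while quorum (Raft) queues are chosen per queue at declaration time. An illustrative policy (the policy name and pattern here are arbitrary):

    # Mirror all queues across all cluster nodes (classic HA queues)
    rabbitmqctl set_policy --apply-to queues ha-all "^" \
        '{"ha-mode":"all","ha-sync-mode":"automatic"}'

    # Quorum queues are not enabled via policy; clients request them at
    # declaration time with the x-queue-type=quorum queue argument.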