docker-library / rabbitmq

Docker Official Image packaging for RabbitMQ
http://www.rabbitmq.com/
MIT License

Docker-based, Single-Node RabbitMQ 3.10 to 3.11 cluster upgrade won't start #601

Closed by cwilso03 1 year ago

cwilso03 commented 1 year ago

I have an existing Docker-based single-node RabbitMQ cluster, which I'm trying to upgrade from version 3.10-management-alpine to 3.11-management-alpine. When I spin up the new container (using a docker-compose.yml), I get the following error:

BOOT FAILED
2023-01-09T16:27:50.816872654Z ===========
2023-01-09T16:27:50.816892589Z Error during startup: {error,failed_to_initialize_feature_flags_registry}
2023-01-09T16:27:50.816903007Z 
2023-01-09T16:27:50.816765449Z 2023-01-09 16:27:50.816607+00:00 [error] <0.229.0> 
2023-01-09T16:27:50.816907194Z 2023-01-09 16:27:50.816607+00:00 [error] <0.229.0> BOOT FAILED
2023-01-09T16:27:50.816908598Z 2023-01-09 16:27:50.816607+00:00 [error] <0.229.0> ===========
2023-01-09T16:27:50.816910639Z 2023-01-09 16:27:50.816607+00:00 [error] <0.229.0> Error during startup: {error,failed_to_initialize_feature_flags_registry}
2023-01-09T16:27:50.816912183Z 2023-01-09 16:27:50.816607+00:00 [error] <0.229.0> 
2023-01-09T16:27:51.824192703Z 2023-01-09 16:27:51.817408+00:00 [error] <0.228.0>   crasher:
2023-01-09T16:27:51.824220001Z 2023-01-09 16:27:51.817408+00:00 [error] <0.228.0>     initial call: application_master:init/4
2023-01-09T16:27:51.824226726Z 2023-01-09 16:27:51.817408+00:00 [error] <0.228.0>     pid: <0.228.0>
2023-01-09T16:27:51.824228840Z 2023-01-09 16:27:51.817408+00:00 [error] <0.228.0>     registered_name: []
2023-01-09T16:27:51.824230536Z 2023-01-09 16:27:51.817408+00:00 [error] <0.228.0>     exception exit: {failed_to_initialize_feature_flags_registry,
2023-01-09T16:27:51.824231929Z 2023-01-09 16:27:51.817408+00:00 [error] <0.228.0>                         {rabbit,start,[normal,[]]}}
2023-01-09T16:27:51.824233176Z 2023-01-09 16:27:51.817408+00:00 [error] <0.228.0>       in function  application_master:init/4 (application_master.erl, line 142)
2023-01-09T16:27:51.824234434Z 2023-01-09 16:27:51.817408+00:00 [error] <0.228.0>     ancestors: [<0.227.0>]
2023-01-09T16:27:51.824235750Z 2023-01-09 16:27:51.817408+00:00 [error] <0.228.0>     message_queue_len: 1
2023-01-09T16:27:51.824236968Z 2023-01-09 16:27:51.817408+00:00 [error] <0.228.0>     messages: [{'EXIT',<0.229.0>,normal}]
2023-01-09T16:27:51.824238150Z 2023-01-09 16:27:51.817408+00:00 [error] <0.228.0>     links: [<0.227.0>,<0.44.0>]
2023-01-09T16:27:51.824239538Z 2023-01-09 16:27:51.817408+00:00 [error] <0.228.0>     dictionary: []
2023-01-09T16:27:51.824240802Z 2023-01-09 16:27:51.817408+00:00 [error] <0.228.0>     trap_exit: true
2023-01-09T16:27:51.824241948Z 2023-01-09 16:27:51.817408+00:00 [error] <0.228.0>     status: running
2023-01-09T16:27:51.824243050Z 2023-01-09 16:27:51.817408+00:00 [error] <0.228.0>     heap_size: 233
2023-01-09T16:27:51.824244192Z 2023-01-09 16:27:51.817408+00:00 [error] <0.228.0>     stack_size: 28
2023-01-09T16:27:51.824245309Z 2023-01-09 16:27:51.817408+00:00 [error] <0.228.0>     reductions: 162
2023-01-09T16:27:51.824246419Z 2023-01-09 16:27:51.817408+00:00 [error] <0.228.0>   neighbours:
2023-01-09T16:27:51.824247561Z 2023-01-09 16:27:51.817408+00:00 [error] <0.228.0> 
2023-01-09T16:27:51.824248732Z 2023-01-09 16:27:51.824019+00:00 [notice] <0.44.0> Application rabbit exited with reason: {failed_to_initialize_feature_flags_registry,{rabbit,start,[normal,[]]}}
2023-01-09T16:27:53.334406595Z {"Kernel pid terminated",application_controller,"{application_start_failure,rabbit,{failed_to_initialize_feature_flags_registry,{rabbit,start,[normal,[]]}}}"}
2023-01-09T16:27:53.334421245Z Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{failed_to_initialize_feature_flags_registry,{rabbit,start,[normal,[]]}}})
2023-01-09T16:27:53.334513828Z 
2023-01-09T16:27:53.352036844Z Crash dump is being written to: erl_crash.dump...
2023-01-09T16:27:56.115100778Z 2023-01-09 16:27:56.106066+00:00 [error] <0.229.0> Feature flags: `maintenance_mode_status`: required feature flag not enabled! It must be enabled before upgrading RabbitMQ.
2023-01-09T16:27:56.116072231Z 2023-01-09 16:27:56.114947+00:00 [error] <0.229.0> Failed to initialize feature flags registry: {disabled_required_feature_flag,
2023-01-09T16:27:56.116107861Z 2023-01-09 16:27:56.114947+00:00 [error] <0.229.0>                                               maintenance_mode_status}

I've read that 3.11 now requires feature flags that were previously optional. I don't understand why a single-node cluster would need maintenance_mode_status enabled (a rolling upgrade would never be possible), but I nonetheless added the environment variable RABBITMQ_FEATURE_FLAGS and set it to maintenance_mode_status. I still get the error above.

lukebakken commented 1 year ago

Since you didn't share the files you're using, I'm guessing at how to reproduce this issue. Please see this repository:

https://github.com/lukebakken/docker-library_rabbitmq-601

I can do an upgrade from 3.10 to 3.11 without the issue you report. If you would like further assistance, you'll have to provide all of the necessary information to reproduce this issue.

cwilso03 commented 1 year ago

Hi @lukebakken, thanks for taking the time to create a test scenario (definitely above-and-beyond!). After seeing that, I tried creating a test project to isolate the RabbitMQ portion from the rest of my project, and found that it, too, upgraded just fine.

Ultimately, I resolved the problem in my main project by deleting the Docker volume I had attached (via compose) to /var/lib/rabbitmq. After doing that and restarting 3.10, then changing the version to 3.11, the upgrade worked. Not sure what 3.11 didn't like about that volume, but the issue is resolved, from my perspective.

Thanks again!

michaelklishin commented 1 year ago

disabled_required_feature_flag means that the original data directory did not have some or any feature flags that 3.11.x requires enabled.

By wiping the data directory you've made the node automatically enable all feature flags on first boot. For an existing installation that's not done, and it is up to you to make sure all feature flags are enabled before the upgrade to 3.11.
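One way to follow this advice on a Docker-based node is to enable the flags with rabbitmqctl while the 3.10 container is still running, and only then switch the image tag to 3.11. The sketch below is a minimal illustration, not an official procedure. It assumes a container named `rabbitmq` (hypothetical; substitute your own) and that `rabbitmqctl list_feature_flags` and `rabbitmqctl enable_feature_flag all` are available in the running version. The helper names are made up for this example. By default it only prints the commands instead of running them.

```python
import subprocess
from typing import List


def rabbitmqctl_in_container(container: str, *args: str) -> List[str]:
    """Build a `docker exec` command that runs rabbitmqctl in a container."""
    return ["docker", "exec", container, "rabbitmqctl", *args]


def pre_upgrade_commands(container: str) -> List[List[str]]:
    """Commands to run against the still-running 3.10 node before upgrading."""
    return [
        # Inspect the current state of the feature flags.
        rabbitmqctl_in_container(container, "list_feature_flags"),
        # Enable every feature flag the running version supports.
        rabbitmqctl_in_container(container, "enable_feature_flag", "all"),
        # List again to confirm nothing is left disabled.
        rabbitmqctl_in_container(container, "list_feature_flags"),
    ]


def run_all(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print the commands, or execute them when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first failing command.
            subprocess.run(cmd, check=True)


# "rabbitmq" is a hypothetical container name.
run_all(pre_upgrade_commands("rabbitmq"))
```

Once the second listing shows every flag enabled, the image tag in docker-compose.yml can be changed to 3.11 and the container recreated on the same volume.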

lukebakken commented 1 year ago

@michaelklishin beat me to it

cwilso03 commented 1 year ago

"By wiping the data directory you've made the node automatically enable all feature flags on first boot. For an existing installation that's not done, and it is up to you to make sure all feature flags are enabled before the upgrade to 3.11."

Understood. So, why didn't enabling the maintenance_mode_status flag via the RABBITMQ_FEATURE_FLAGS environment variable work?

serut commented 1 year ago

Indeed, setting RABBITMQ_FEATURE_FLAGS does not activate maintenance_mode_status (tested on RabbitMQ 3.8.10):

$ env
[...]
RABBITMQ_FEATURE_FLAGS=maintenance_mode_status
[...]

And the RabbitMQ logs:

2023-01-27 18:07:34.864 [info] <0.272.0> Feature flags: list of feature flags found:
2023-01-27 18:07:34.865 [info] <0.272.0> Feature flags:   [x] drop_unroutable_metric
2023-01-27 18:07:34.865 [info] <0.272.0> Feature flags:   [x] empty_basic_get_metric
2023-01-27 18:07:34.865 [info] <0.272.0> Feature flags:   [x] implicit_default_bindings
2023-01-27 18:07:34.865 [info] <0.272.0> Feature flags:   [ ] maintenance_mode_status
2023-01-27 18:07:34.865 [info] <0.272.0> Feature flags:   [x] quorum_queue
2023-01-27 18:07:34.865 [info] <0.272.0> Feature flags:   [ ] user_limits
2023-01-27 18:07:34.865 [info] <0.272.0> Feature flags:   [x] virtual_host_metadata
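
For anyone who wants to check such a listing automatically before upgrading, the following is a minimal sketch that scans startup log text for the `[x]` / `[ ]` markers shown above and reports the flags that are still disabled. It assumes the log format is exactly as in the lines quoted in this comment; other RabbitMQ versions may format these lines differently. The function names are illustrative only.

```python
import re
from typing import Dict, List

# Matches lines such as "... Feature flags:   [x] quorum_queue".
FLAG_LINE = re.compile(r"Feature flags:\s+\[(x| )\]\s+(\S+)")


def parse_feature_flags(log_text: str) -> Dict[str, bool]:
    """Map each feature flag name found in the log to True if enabled."""
    flags: Dict[str, bool] = {}
    for line in log_text.splitlines():
        match = FLAG_LINE.search(line)
        if match:
            flags[match.group(2)] = match.group(1) == "x"
    return flags


def disabled_flags(log_text: str) -> List[str]:
    """Return the names of the flags marked [ ], sorted alphabetically."""
    return sorted(
        name for name, enabled in parse_feature_flags(log_text).items()
        if not enabled
    )


# Sample copied from the log lines quoted in this thread.
SAMPLE = """\
2023-01-27 18:07:34.864 [info] <0.272.0> Feature flags: list of feature flags found:
2023-01-27 18:07:34.865 [info] <0.272.0> Feature flags:   [x] drop_unroutable_metric
2023-01-27 18:07:34.865 [info] <0.272.0> Feature flags:   [x] empty_basic_get_metric
2023-01-27 18:07:34.865 [info] <0.272.0> Feature flags:   [x] implicit_default_bindings
2023-01-27 18:07:34.865 [info] <0.272.0> Feature flags:   [ ] maintenance_mode_status
2023-01-27 18:07:34.865 [info] <0.272.0> Feature flags:   [x] quorum_queue
2023-01-27 18:07:34.865 [info] <0.272.0> Feature flags:   [ ] user_limits
2023-01-27 18:07:34.865 [info] <0.272.0> Feature flags:   [x] virtual_host_metadata
"""

print(disabled_flags(SAMPLE))  # ['maintenance_mode_status', 'user_limits']
```

A non-empty result on the old node suggests that at least one flag still needs to be enabled before switching versions.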