ansible-middleware / amq

A collection to manage AMQ brokers
Apache License 2.0

Version 1.3.2 Unable to start broker due to problem in systemctl #67

Closed: RobertFloor closed this issue 1 year ago

RobertFloor commented 1 year ago
SUMMARY

After updating to version 1.3.2 of the collection, I can no longer start the broker via systemctl. I can start the master via systemctl, but the second broker fails to start. We are deploying a two-node shared-storage setup and have set up an NFS mount. In versions 1.3.0 and 1.3.1 the same playbook works fine.

ISSUE TYPE

Bug Report
ANSIBLE VERSION
ansible [core 2.14.5]
  config file = /home/robert/asb2/AMQ-Ansible-config/ansible-configuration/ansible.cfg
  configured module search path = ['/home/robert/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /home/linuxbrew/.linuxbrew/Cellar/ansible/7.5.0/libexec/lib/python3.11/site-packages/ansible
  ansible collection location = /home/robert/.ansible/collections:/usr/share/ansible/collections
  executable location = /home/linuxbrew/.linuxbrew/bin/ansible
  python version = 3.11.3 (main, Apr  4 2023, 22:36:41) [GCC 11.3.0] (/home/linuxbrew/.linuxbrew/Cellar/ansible/7.5.0/libexec/bin/python3.11)
  jinja version = 3.1.2
  libyaml = True
COLLECTION VERSION
# /home/linuxbrew/.linuxbrew/Cellar/ansible/7.5.0/libexec/lib/python3.11/site-packages/ansible_collections
Collection                    Version
----------------------------- -------
amazon.aws                    5.4.0
ansible.netcommon             4.1.0
ansible.posix                 1.5.2
ansible.utils                 2.9.0
ansible.windows               1.13.0
arista.eos                    6.0.1
awx.awx                       21.14.0
azure.azcollection            1.15.0
check_point.mgmt              4.0.0
chocolatey.chocolatey         1.4.0
cisco.aci                     2.6.0
cisco.asa                     4.0.0
cisco.dnac                    6.7.1
cisco.intersight              1.0.27
cisco.ios                     4.5.0
cisco.iosxr                   4.1.0
cisco.ise                     2.5.12
cisco.meraki                  2.15.1
cisco.mso                     2.4.0
cisco.nso                     1.0.3
cisco.nxos                    4.3.0
cisco.ucs                     1.8.0
cloud.common                  2.1.3
cloudscale_ch.cloud           2.2.4
community.aws                 5.4.0
community.azure               2.0.0
community.ciscosmb            1.0.5
community.crypto              2.12.0
community.digitalocean        1.23.0
community.dns                 2.5.3
community.docker              3.4.3
community.fortios             1.0.0
community.general             6.6.0
community.google              1.0.0
community.grafana             1.5.4
community.hashi_vault         4.2.0
community.hrobot              1.8.0
community.libvirt             1.2.0
community.mongodb             1.5.2
community.mysql               3.6.0
community.network             5.0.0
community.okd                 2.3.0
community.postgresql          2.3.2
community.proxysql            1.5.1
community.rabbitmq            1.2.3
community.routeros            2.8.0
community.sap                 1.0.0
community.sap_libs            1.4.1
community.skydive             1.0.0
community.sops                1.6.1
community.vmware              3.5.0
community.windows             1.12.0
community.zabbix              1.9.3
containers.podman             1.10.1
cyberark.conjur               1.2.0
cyberark.pas                  1.0.17
dellemc.enterprise_sonic      2.0.0
dellemc.openmanage            6.3.0
dellemc.os10                  1.1.1
dellemc.os6                   1.0.7
dellemc.os9                   1.0.4
dellemc.powerflex             1.6.0
dellemc.unity                 1.6.0
f5networks.f5_modules         1.23.0
fortinet.fortimanager         2.1.7
fortinet.fortios              2.2.3
frr.frr                       2.0.2
gluster.gluster               1.0.2
google.cloud                  1.1.3
grafana.grafana               1.1.1
hetzner.hcloud                1.11.0
hpe.nimble                    1.1.4
ibm.qradar                    2.1.0
ibm.spectrum_virtualize       1.11.0
infinidat.infinibox           1.3.12
infoblox.nios_modules         1.4.1
inspur.ispim                  1.3.0
inspur.sm                     2.3.0
junipernetworks.junos         4.1.0
kubernetes.core               2.4.0
lowlydba.sqlserver            1.3.1
mellanox.onyx                 1.0.0
microsoft.ad                  1.0.0
netapp.aws                    21.7.0
netapp.azure                  21.10.0
netapp.cloudmanager           21.22.0
netapp.elementsw              21.7.0
netapp.ontap                  22.5.0
netapp.storagegrid            21.11.1
netapp.um_info                21.8.0
netapp_eseries.santricity     1.4.0
netbox.netbox                 3.12.0
ngine_io.cloudstack           2.3.0
ngine_io.exoscale             1.0.0
ngine_io.vultr                1.1.3
openstack.cloud               1.10.0
openvswitch.openvswitch       2.1.0
ovirt.ovirt                   2.4.1
purestorage.flasharray        1.17.2
purestorage.flashblade        1.11.0
purestorage.fusion            1.4.2
sensu.sensu_go                1.13.2
splunk.es                     2.1.0
t_systems_mms.icinga_director 1.32.2
theforeman.foreman            3.10.0
vmware.vmware_rest            2.3.1
vultr.cloud                   1.7.0
vyos.vyos                     4.0.2
wti.remote                    1.0.4

# /home/robert/.ansible/collections/ansible_collections
Collection                                Version
----------------------------------------- -------
ansible.posix                             1.5.2
community.general                         6.0.1
middleware_automation.amq                 1.3.2
middleware_automation.common              1.1.0
middleware_automation.redhat_csp_download 1.2.2
STEPS TO REPRODUCE

Run the playbook with the command ansible-playbook -e "activemq_version=7.10.2" -i hostfiles/AMQ-dev-shared-storage.yml playbooks/mount-nfs-install-broker.yml, using the following inventory:

all:
  children:
    amq:
      children:
        ha1:
          hosts: amq1
          vars:
            artemis: "amq1"
            node0: "amq2"
        ha2:
          hosts: amq2
          vars:
            artemis: "amq2"
            node0: "amq1"
      vars:
        iface: enp0s8
        activemq_configure_firewalld: True
        activemq_prometheus_enabled: False
        activemq_cors_strict_checking: False
        activemq_disable_hornetq_protocol: true
        activemq_disable_mqtt_protocol: true
        activemq_ha_enabled: true
        activemq_shared_storage: true
        activemq_shared_storage_path: /data/amq-broker/shared/mount
        ansible_user: ansible
        activemq_offline_install: True
        activemq_version: 7.10.2
        activemq_dest: /opt/amq
        activemq_archive: "amq-broker-{{ activemq_version }}-bin.zip"
        activemq_installdir: "{{ activemq_dest }}/amq-broker-{{ activemq_version }}"
        activemq_shared_storage_mounted: true
        activemq_port: 61616
        nfs_mount_source: "192.168.2.221:/"
        activemq_instance_username: amq-admin
        activemq_instance_password: activemq_instance_password
        activemq_sa_password: "amq-sa-password"
        activemq_testers_password: "amq-testers-password"
        activemq_address_settings:
        - match: "#"
          parameters:
            dead_letter_address: DLQ
            expiry_address: ExpiryQueue
            redelivery_delay: 2000
            max_size_bytes: -1
            message_counter_history_day_limit: 10
            max_delivery_attempts: -1
            max_redelivery_delay: 300000
            redelivery_delay_multiplier: 2
            address_full_policy: PAGE
            auto_create_queues: true
            auto_create_addresses: true
            auto_create_jms_queues: true
            auto_create_jms_topics: true 
        activemq_users:
        - user: "{{ activemq_instance_username }}"
          password: "{{ activemq_instance_password }}"
          roles: [ amq ]
        - user: "amq-application-sa"
          password: "{{ activemq_sa_password }}"
          roles: [ amq-sa ]
        - user: "amq-testers-sa"
          password: "{{ activemq_testers_password }}"
          roles: [ amq ]
        activemq_roles:
        - name: amq
          match: '#'
          permissions: [ createDurableQueue, deleteDurableQueue, createAddress, deleteAddress, consume, browse, send, manage ]   
        - name: amq-sa
          match: '#'
          permissions: [ createDurableQueue, deleteDurableQueue, createAddress, deleteAddress, consume, browse, send, manage ]   
        - name: amq-testers
          match: '#'
          permissions: [ createDurableQueue, deleteDurableQueue, createAddress, deleteAddress, consume, browse, send, manage ]   
        activemq_acceptors:
          - name: amqp
            bind_address: "0.0.0.0"
            bind_port: "{{ activemq_port }}"
            parameters:
              tcpSendBufferSize: 1048576
              tcpReceiveBufferSize: 1048576
              protocols: CORE,AMQP,OPENWIRE
              useEpoll: true
              verifyHost: False
        activemq_connectors:
        - name: artemis
          address: "{{ artemis }}"
          port: "{{ activemq_port }}"
          parameters:
            tcpSendBufferSize: 1048576
            tcpReceiveBufferSize: 1048576
            protocols: CORE,AMQP,STOMP,HORNETQ,MQTT,OPENWIRE
            useEpoll: true
            amqpMinLargeMessageSize: 102400
            amqpCredits: 1000
            amqpLowCredits: 300
            amqpDuplicateDetection: true
            supportAdvisory: False
            suppressInternalManagementObjects: False
        - name: node0
          address: "{{ node0 }}"
          port: "{{ activemq_port }}"
          parameters:
            tcpSendBufferSize: 1048576
            tcpReceiveBufferSize: 1048576
            protocols: CORE,AMQP,STOMP,HORNETQ,MQTT,OPENWIRE
            useEpoll: true
            amqpMinLargeMessageSize: 102400
            amqpCredits: 1000
            amqpLowCredits: 300
            amqpDuplicateDetection: true
            supportAdvisory: False
            suppressInternalManagementObjects: False
EXPECTED RESULTS

I expect both brokers to start via systemctl during the playbook run.

ACTUAL RESULTS

The second broker fails to start via systemctl. I could not find a specific reason why the broker would not start.

RUNNING HANDLER [middleware_automation.amq.activemq : Restart and enable instance amq-broker for activemq service] *******************************************************************************************************************************************************
changed: [amq1]
fatal: [amq2]: FAILED! => changed=false
  msg: |-
    Unable to start service amq-broker: Job for amq-broker.service failed because the control process exited with error code.
    See "systemctl status amq-broker.service" and "journalctl -xe" for details.

[ansible@amq2 ~]$ sudo journalctl -xe -n 200 -u amq-broker.service
--
-- Unit amq-broker.service has begun starting up.
May 05 12:59:00 amq2.test.local systemd[190182]: amq-broker.service: Executing: /opt/amq/amq-broker/bin/artemis-service start
May 05 12:59:00 amq2.test.local artemis-service[190182]: Starting artemis-service
May 05 12:59:01 amq2.test.local artemis-service[190182]: artemis-service is now running (190186)
May 05 12:59:01 amq2.test.local systemd[1]: amq-broker.service: Child 190182 belongs to amq-broker.service.
May 05 12:59:01 amq2.test.local systemd[1]: amq-broker.service: Control process exited, code=exited status=0
May 05 12:59:01 amq2.test.local systemd[1]: amq-broker.service: Got final SIGCHLD for state start.
May 05 12:59:01 amq2.test.local systemd[1]: amq-broker.service: Permission denied while opening PID file or potentially unsafe symlink chain, will now retry with relaxed checks: /opt/amq/amq-broker/data/artemis.pid
May 05 12:59:01 amq2.test.local systemd[1]: amq-broker.service: New main PID 190186 belongs to service, we are happy.
May 05 12:59:01 amq2.test.local systemd[1]: amq-broker.service: Main PID loaded: 190186
May 05 12:59:01 amq2.test.local systemd[1]: amq-broker.service: About to execute: /usr/bin/timeout 60 sh -c 'tail -n 15 -f /opt/amq/amq-broker/log/artemis.log | sed "/AMQ221001/ q" && /bin/sleep 10'
May 05 12:59:01 amq2.test.local systemd[1]: amq-broker.service: Forked /usr/bin/timeout as 190215
May 05 12:59:01 amq2.test.local systemd[1]: amq-broker.service: Changed start -> start-post
May 05 12:59:01 amq2.test.local systemd[190215]: amq-broker.service: Executing: /usr/bin/timeout 60 sh -c 'tail -n 15 -f /opt/amq/amq-broker/log/artemis.log | sed "/AMQ221001/ q" && /bin/sleep 10'
May 05 13:00:01 amq2.test.local systemd[1]: amq-broker.service: Child 190215 belongs to amq-broker.service.
May 05 13:00:01 amq2.test.local systemd[1]: amq-broker.service: Control process exited, code=exited status=124
May 05 13:00:01 amq2.test.local systemd[1]: amq-broker.service: Got final SIGCHLD for state start-post.
May 05 13:00:01 amq2.test.local systemd[1]: amq-broker.service: Changed start-post -> stop-sigterm
May 05 13:00:01 amq2.test.local systemd[1]: amq-broker.service: Child 190217 belongs to amq-broker.service.
May 05 13:00:01 amq2.test.local systemd[1]: amq-broker.service: Child 190186 belongs to amq-broker.service.
May 05 13:00:01 amq2.test.local systemd[1]: amq-broker.service: Permission denied while opening PID file or potentially unsafe symlink chain, will now retry with relaxed checks: /opt/amq/amq-broker/data/artemis.pid
May 05 13:00:01 amq2.test.local systemd[1]: amq-broker.service: Main process exited, code=exited, status=143/n/a
May 05 13:00:01 amq2.test.local systemd[1]: amq-broker.service: Failed with result 'exit-code'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
--
-- The unit amq-broker.service has entered the 'failed' state with result 'exit-code'.
May 05 13:00:01 amq2.test.local systemd[1]: amq-broker.service: Changed stop-sigterm -> failed
May 05 13:00:01 amq2.test.local systemd[1]: amq-broker.service: Job amq-broker.service/start finished, result=failed
May 05 13:00:01 amq2.test.local systemd[1]: Failed to start amq-broker Apache ActiveMQ Service.
-- Subject: Unit amq-broker.service has failed
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
--
-- Unit amq-broker.service has failed.
--
-- The result is failed.
May 05 13:00:01 amq2.test.local systemd[1]: amq-broker.service: Unit entered failed state.
May 05 13:00:01 amq2.test.local systemd[1]: amq-broker.service: Changed failed -> auto-restart

----------------------------
[ansible@amq2 ~]$ cat /etc/systemd/system/amq-broker.service
# Ansible managed
[Unit]
Description=amq-broker Apache ActiveMQ Service
After=network.target
RequiresMountsFor=/data/amq-broker/shared/mount

[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/amq-broker
PIDFile=/opt/amq/amq-broker/data/artemis.pid
ExecStart=/opt/amq/amq-broker/bin/artemis-service start
ExecStop=/opt/amq/amq-broker/bin/artemis-service stop
SuccessExitStatus = 0 143
RestartSec = 120
Restart = on-failure
LimitNOFILE=102642
TimeoutSec=600
ExecStartPost=/usr/bin/timeout 60 sh -c 'tail -n 15 -f /opt/amq/amq-broker/log/artemis.log | sed "/AMQ221001/ q" && /bin/sleep 10'

[Install]
WantedBy=multi-user.target
---------------------------------
[ansible@amq2 ~]$ cat /etc/sysconfig/amq-broker
# Ansible managed
JAVA_ARGS='-Xms512M -Xmx2G -XX:+PrintClassHistogram -XX:+UseG1GC -XX:+UseStringDeduplication -Dhawtio.disableProxy=true -Dhawtio.realm=activemq -Dhawtio.offline=true -Dhawtio.rolePrincipalClasses=org.apache.activemq.artemis.spi.core.security.jaas.RolePrincipal -Djolokia.policyLocation=file:/opt/amq/amq-broker/etc/jolokia-access.xml'
JAVA_HOME=/usr/lib/jvm/java-11-openjdk-11.0.19.0.7-1.el8_7.x86_64
HAWTIO_ROLE='amq'
ARTEMIS_INSTANCE_URI='file:/opt/amq/amq-broker/'
ARTEMIS_INSTANCE_ETC_URI='file:/opt/amq/amq-broker/etc/'
ARTEMIS_HOME='/opt/amq/amq-broker-7.10.2'
ARTEMIS_INSTANCE='/opt/amq/amq-broker'
ARTEMIS_DATA_DIR='/data/amq-broker/shared/mount'
ARTEMIS_ETC_DIR='/opt/amq/amq-broker/etc'
RobertFloor commented 1 year ago

This is the unit file that works in version 1.3.1

[ansible@amq2 system]$ cat amq-broker.service
# Ansible managed
[Unit]
Description=amq-broker Apache ActiveMQ Service
After=network.target
RequiresMountsFor=/data/amq-broker/shared/mount

[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/amq-broker
PIDFile=/opt/amq/amq-broker/data/artemis.pid
ExecStart=/opt/amq/amq-broker/bin/artemis-service start
ExecStop=/opt/amq/amq-broker/bin/artemis-service stop
SuccessExitStatus = 0 143
RestartSec = 120
Restart = on-failure
LimitNOFILE=102642
TimeoutSec=600
ExecStartPost=/usr/bin/timeout 60 sh -c 'tail -f /opt/amq/amq-broker/log/artemis.log | sed "/AMQ221034/ q"'

[Install]
WantedBy=multi-user.target

The difference seems to be in the line ExecStartPost=/usr/bin/timeout 60 sh -c 'tail -f /opt/amq/amq-broker/log/artemis.log | sed "/AMQ221034/ q"'

[ansible@amq2 sysconfig]$ cat amq-broker
# Ansible managed
JAVA_ARGS='-Xms512M -Xmx2G -XX:+PrintClassHistogram -XX:+UseG1GC -XX:+UseStringDeduplication -Dhawtio.disableProxy=true -Dhawtio.realm=activemq -Dhawtio.offline=true -Dhawtio.rolePrincipalClasses=org.apache.activemq.artemis.spi.core.security.jaas.RolePrincipal -Djolokia.policyLocation=file:/opt/amq/amq-broker/etc/jolokia-access.xml'
JAVA_HOME=/usr/lib/jvm/java-11-openjdk-11.0.19.0.7-1.el8_7.x86_64
HAWTIO_ROLE='amq'
ARTEMIS_INSTANCE_URI='file:/opt/amq/amq-broker/'
ARTEMIS_INSTANCE_ETC_URI='file:/opt/amq/amq-broker/etc/'
ARTEMIS_HOME='/opt/amq/amq-broker-7.10.2'
ARTEMIS_INSTANCE='/opt/amq/amq-broker'
ARTEMIS_DATA_DIR='/data/amq-broker/shared/mount'
ARTEMIS_ETC_DIR='/opt/amq/amq-broker/etc'
RobertFloor commented 1 year ago

Why was the log code changed from AMQ221034 to AMQ221001? I can confirm that changing it back to AMQ221034 fixes the problem.
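
In Ansible terms, a minimal sketch of that manual fix as tasks (illustrative only, not part of the collection; it assumes the unit path shown above):

```yaml
# Illustrative workaround, not part of the collection: swap the start-post
# wait code in the generated unit back to the value used by 1.3.1.
- name: Revert ExecStartPost wait code to AMQ221034
  ansible.builtin.replace:
    path: /etc/systemd/system/amq-broker.service
    regexp: 'AMQ221001'
    replace: 'AMQ221034'

- name: Reload systemd so the edited unit takes effect
  ansible.builtin.systemd:
    daemon_reload: true
```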

guidograzioli commented 1 year ago

Hello @RobertFloor; with 1.3.2, a master/backup shared-store policy is implemented. Formerly, two live-only masters would race on the live lock (in the shared store). That is still the default, and in this configuration you found the bug, namely the wrong AMQ log code; sorry about that.

But if you wish, you can switch to a proper master/backup setup by setting activemq_ha_role: 'master' for one node and activemq_ha_role: 'slave' for the other node.
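
For illustration, a minimal sketch of where those variables could go in the reproduction inventory above (the group layout comes from the issue; the exact placement is an assumption):

```yaml
ha1:
  hosts: amq1
  vars:
    activemq_ha_role: master   # live node of the pair
ha2:
  hosts: amq2
  vars:
    activemq_ha_role: slave    # backup node of the pair
```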

In that scenario, the systemd unit will respectively wait for:

RobertFloor commented 1 year ago

Hi, thanks for the answer and the fix. However, I believe this behavior is not desired. Say we have an emergency where the master is down and stays down (OS corruption, hardware failure, or similar). The new setting means we would never be able to manage the slave using systemctl in this case. It is not guaranteed that the slave will always be the backup; sometimes the slave needs to be the active broker, and we would still like to control it using systemctl (say, systemctl restart on the slave while the master is down).

guidograzioli commented 1 year ago

> The new setting means we would never be able to manage the slave using systemctl in this case.

I am not sure I follow your requirement here; the backup node would be managed the same as before, it will just emit different logging when its master goes down and it starts picking up connections. Anyway, the pre-change default will stay, both because the artemis create command does not set up an ha-policy when no HA role is passed (meaning it defaults to the XSD 'live-only'), and for backwards compatibility. There are a few issues at the moment between GitHub Actions, Molecule, and Docker in our CI; as soon as that is back under control, merging the linked PR should fix this issue.

andytaylor commented 1 year ago

I can't answer for the Ansible side of things, but I think you have your brokers misconfigured: you can't have two live-only brokers using shared store as a pair, or even clustered with other brokers. The choices you have are:

Clustered Masters

This is a group of master brokers, all live, that are clustered to distribute messages between them; each broker has its own journal.

HA Pairs (Shared store or replicated)

A master and a slave broker, where only the master is live; in the shared-store case they share the same journal on a shared filesystem.

Live Only

A single distinct broker with its own journal, that is, no backup and no other brokers.

Of course, you can also have a cluster of HA pairs.

Hope that helps.
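
Putting that together with the variables already used in this issue, the "HA pair, shared store" choice might look like the following sketch (the variable names come from the issue; the exact combination is an assumption, not a verified configuration):

```yaml
amq:
  vars:
    activemq_ha_enabled: true
    activemq_shared_storage: true
    activemq_shared_storage_path: /data/amq-broker/shared/mount
  children:
    ha1:
      hosts: amq1
      vars:
        activemq_ha_role: master   # the live broker
    ha2:
      hosts: amq2
      vars:
        activemq_ha_role: slave    # the backup broker sharing the same journal
```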