Closed MrMEEE closed 5 years ago
Hello, we have a successful HA deployment thanks to your RPM.
Here is what we have done: set up rabbitmq clustering, disabled the celery-beat service, and modified the celery-worker ExecStart command.
We have run a lot of tests on it and everything seems fine.
That is great to hear... thanks for your feedback..
If you have a more detailed installation description, I would love to add it to the documentation..
OK, so here is the process:

1. Install the DB on an external server with your install guide.
2. Install the 1st AWX server with your install guide (connect it to the DB).
3. Install AWX servers 2 and 3 with your install guide (connect them to the DB, but do NOT run these commands:

echo "from django.contrib.auth.models import User; User.objects.create_superuser('admin', 'root@localhost', 'password')" | sudo -u awx /opt/awx/bin/awx-manage shell
sudo -u awx /opt/awx/bin/awx-manage create_preload_data)
When all nodes are installed, we can build the rabbitmq cluster. Connect to node 1 and copy the Erlang cookie (/var/lib/rabbitmq/.erlang.cookie) to nodes 2 and 3.
Connect to nodes 2 and 3 and restart the app so it picks up the new cookie:

rabbitmqctl stop_app
rabbitmqctl start_app

Create the rabbitmq cluster:

rabbitmqctl stop_app
rabbitmqctl join_cluster rabbit@node1
rabbitmqctl start_app

Set the HA policy:

rabbitmq-plugins enable rabbitmq_management
rabbitmqctl set_policy ha-all "" '{"ha-mode":"all","ha-sync-mode":"automatic"}'
systemctl restart rabbitmq-server
rabbitmq is now clustered.
The second step is celery. First, disable and stop celery-beat on all servers. Then modify the ExecStart command of the celery worker service (/etc/systemd/system/multi-user.target.wants/awx-celery-worker.service):

ExecStart=/opt/awx/bin/celery worker -A awx -l info --autoscale=50,4 -Ofair -Q tower_scheduler,tower,%(ENV_HOSTNAME)s -n celery@%(ENV_HOSTNAME)s

Finally, restart the celery services on all nodes.
We also saw that at this step it can be better to reboot all 3 nodes, but one by one, to keep the rabbitmq cluster in good shape.
Hope that helps.
I forgot: the final step is, of course, to go to the web interface and create the instances for all 3 nodes.
@MrMEEE, this is a bit off topic, but for those who wish to explore and automate the HA / instance group setup using the official Docker standalone method, it can be found in my repository. Were you able to add this piece of info to your wiki? It could be helpful for people out there. Thanks
@sujiar37
So, basically, everything that is needed for an HA setup is a standalone postgresql (cluster) and a rabbitmq (cluster),
and then frontends that connect to these??
Should be pretty simple to implement..
I have added a links section
@MrMEEE, thank you for adding this piece of info to your wiki.
The only requirement is to set up a standalone postgresql; the rest, such as building and configuring the rabbitmq cluster and enabling the Docker version of HA on all nodes, will be taken care of by the playbook. And yes, it is pretty simple to implement now through my playbook.
Hi guys, as we worked with @Aglidic to build the first HA implementation of the RPM, we have a playbook which does the full setup automatically. It's corporate-internal at the moment, so I need to find some time to generalize it if you want to add it somewhere.
Best,
Tim.
@powertim
I would love to include playbooks for installing in the RPM...
OK, so I'll add that to my TODO for the next few days...
Did you have a chance to do it? I would love to try it.
Hi all, I'm very much interested in the playbook. If the playbook is not available yet, can someone highlight what's needed to point to an external Postgres server using the RPM installation method, please? Something like this:
pg_hostname=hostname pg_username=awx pg_password=xxxxx pg_database=awx pg_port=5432
Thanks everyone for great efforts!
In regards to the external postgres, you basically only need to set up an external postgres (cluster?) and change the configuration in /etc/tower/settings.py to point to that server, before running the database initialization..
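For reference, the database section of a Django-style settings.py looks roughly like this. All values below are placeholders (the host, credentials, and exact layout are assumptions), so verify them against the actual /etc/tower/settings.py from the RPM:

```python
# Illustrative only: a Django-style DATABASES block pointing AWX at an
# external PostgreSQL server. Every value here is a placeholder; check
# your real /etc/tower/settings.py before changing anything.
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'awx',
        'USER': 'awx',
        'PASSWORD': 'xxxxx',
        'HOST': 'pg.example.com',  # external Postgres host (placeholder)
        'PORT': '5432',
    }
}
```

This matches the pg_hostname/pg_username/pg_password/pg_database/pg_port variables mentioned above, just expressed in the settings file instead of installer variables.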
Thanks much for the quick response. Yes, it will be a 2-node Postgres cluster with streaming replication. I see there's a section for configuring USER/PASSWORD/HOST/PORT in settings.py. So initializing the DB / all the steps listed in awx.wiki/multi-section-page/configuration are still required?
No issues setting up the external Postgres DB, but I'm having issues setting up the cluster. I followed the previous comments on setting up clustering and got to the point of enabling the rabbitmq cluster within 2 nodes, but AWX didn't detect the additional node. The endpoint /api/v2/ping only displays one active node. Also, there's no awx-celery-worker; has this service been deprecated? Thanks.
Hi..
I think you have to enable each of the awx nodes with the command:
sudo -u awx scl enable rh-python36 rh-postgresql10 "awx-manage register_queue --queuename=tower --hostnames=$(hostname)"
and yes, the celery worker is deprecated...
Thanks much, it worked! I had to run this command first: sudo -u awx scl enable rh-python36 rh-postgresql10 "awx-manage provision_instance --hostname=$(hostname)"
before running your command - sudo -u awx scl enable rh-python36 rh-postgresql10 "awx-manage register_queue --queuename=tower --hostnames=$(hostname)"
Also, as far as upgrading to the latest AWX version, I presume it will still work, just that we have to upgrade all nodes within the cluster. Thanks again!
Ah, yes.. of course you have to do the provision_instance first :)..
I will do a write-up on this and put it on awx.wiki as soon as possible.. I'm also planning a setup tool for simpler installation and configuration, which will also cover the HA setup... Could you share the exact changes you made to the systemd files??
Remember not to change the files themselves, but to override them with copies in /etc/systemd/system.. otherwise they will be reverted to the defaults on the next update...
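A minimal sketch of that override workflow. The unit name mirrors the one discussed above, but to keep the demo harmless it writes into scratch directories standing in for /usr/lib/systemd/system and /etc/systemd/system (the real paths are shown in comments):

```shell
# Stand-ins for the real systemd directories (demo only).
pkg_dir=./usr-lib-systemd-system    # stands in for /usr/lib/systemd/system
etc_dir=./etc-systemd-system        # stands in for /etc/systemd/system
mkdir -p "$pkg_dir" "$etc_dir"

# Pretend this is the packaged unit shipped by the RPM (contents simplified).
printf '[Service]\nExecStart=/opt/awx/bin/celery worker -A awx\n' \
    > "$pkg_dir/awx-celery-worker.service"

# 1. Copy the unit instead of editing it in place, so a package update
#    cannot revert the change.
cp "$pkg_dir/awx-celery-worker.service" "$etc_dir/awx-celery-worker.service"

# 2. Edit ExecStart in the copy (example tweak only; use your real options).
sed -i 's|^ExecStart=.*|ExecStart=/opt/awx/bin/celery worker -A awx -l info -Ofair|' \
    "$etc_dir/awx-celery-worker.service"

# 3. On a real host you would finish with:
#    systemctl daemon-reload && systemctl restart awx-celery-worker
grep '^ExecStart=' "$etc_dir/awx-celery-worker.service"
```

Units in /etc/systemd/system take precedence over same-named units in /usr/lib/systemd/system, which is why the copy survives package updates.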
In regards to updating, I think you should update the ansible-awx (and dependencies) on all nodes before running the database migrations...
I spoke too soon :) Yes, both nodes are within the cluster, but for some reason jobs couldn't execute on the newly added node. When attempting to run a job against the new node, it goes into a "Wait" state before timing out with the message: "Task was marked as running in Tower but was not present in the job queue, so it has been marked as failed."
I tried rabbitmqctl stop_app / rabbitmqctl start_app and systemctl restart rabbitmq-server on the server, and also bounced both nodes. In the web GUI I switched the node from OFF to ON, but USED CAPACITY eventually becomes "UNAVAILABLE."
Ignore that; the issue was AWX not running at startup on the new node :) Still doing more testing. Thanks again..
So far so good. I didn't make any systemd changes since celery has been deprecated.
One issue that came up so far is that when a job finishes running on the new node, the node's USED CAPACITY goes into "UNAVAILABLE." It is as though the node lost its heartbeat to the rabbitmq cluster. I need to troubleshoot further.
I'm in Prague for the week for a Red Hat event.. I will try to set up an HA environment when I get home, and then we can debug together.
This is the error msg I'm getting:
2019-06-27 14:01:34.390 [info] <0.1498.0> connection <0.1498.0> (127.0.0.1:42950 -> 127.0.0.1:5672): user 'guest' authenticated and granted access to vhost '/'
2019-06-27 14:01:44.556 [warning] <0.1498.0> closing AMQP connection <0.1498.0> (127.0.0.1:42950 -> 127.0.0.1:5672, vhost: '/', user: 'guest'): client unexpectedly closed TCP connection
Thanks
Looks like the issue has to do with the fact that AWX requires the 'tower' vhost. Currently we're using the default vhost '/', so we're getting a bunch of closing AMQP connections.
2019-06-27 15:30:29.289 [info] <0.2130.0> connection <0.2130.0> (127.0.0.1:47012 -> 127.0.0.1:5672): user 'guest' authenticated and granted access to vhost '/'
2019-06-27 15:30:49.603 [info] <0.2139.0> accepting AMQP connection <0.2139.0> (127.0.0.1:47390 -> 127.0.0.1:5672)
2019-06-27 15:30:49.610 [info] <0.2139.0> connection <0.2139.0> (127.0.0.1:47390 -> 127.0.0.1:5672): user 'guest' authenticated and granted access to vhost '/'
2019-06-27 15:30:49.619 [info] <0.2139.0> closing AMQP connection <0.2139.0> (127.0.0.1:47390 -> 127.0.0.1:5672, vhost: '/', user: 'guest')
2019-06-27 15:30:49.687 [info] <0.2150.0> accepting AMQP connection <0.2150.0> (127.0.0.1:47394 -> 127.0.0.1:5672)
2019-06-27 15:30:49.695 [info] <0.2150.0> connection <0.2150.0> (127.0.0.1:47394 -> 127.0.0.1:5672): user 'guest' authenticated and granted access to vhost '/'
2019-06-27 15:30:49.710 [info] <0.2150.0> closing AMQP connection <0.2150.0> (127.0.0.1:47394 -> 127.0.0.1:5672, vhost: '/', user: 'guest')
2019-06-27 15:30:49.748 [info] <0.2161.0> accepting AMQP connection <0.2161.0> (127.0.0.1:47396 -> 127.0.0.1:5672)
2019-06-27 15:30:49.755 [info] <0.2161.0> connection <0.2161.0> (127.0.0.1:47396 -> 127.0.0.1:5672): user 'guest' authenticated and granted access to vhost '/'
2019-06-27 15:30:49.771 [info] <0.2161.0> closing AMQP connection <0.2161.0> (127.0.0.1:47396 -> 127.0.0.1:5672, vhost: '/', user: 'guest')
@dnc92301 Let's move the discussion to #121
Hi guys,
Finally, the playbook is here: https://github.com/powertim/deploy_awx-rpm. It is currently designed for RHEL7 x86_64 with Satellite repos. I will try to update it with manual repos as described at https://awx.wiki/installation/repositories/rhel7-x86_64, and why not, in the future, for the different OSes supported on awx.wiki...
Please try to adapt the playbook first before opening an issue. I'll fill out the README soon.
Best,
Tim
Hi Tim,
Good to see the repo. Waiting for the README. I'm wondering, will it work on CentOS 7 as well?
Best regards,
Gowtham
Hi @gowthamakanthan ,
It should work on CentOS 7 with a few changes:
Add local repos with the 'yum_repository' module instead of the Satellite repos I'm using with the 'rhsm_repository' module, in the files roles/db_prereqs/tasks/main.yml and roles/nodes_prereqs/tasks/main.yml.
Maybe change line #26 of roles/nodes_prereqs/tasks/main.yml so that the installation of dependencies succeeds.
But I'll try to add this content when I find the time for that (I hope quickly).
@powertim - Thanks for the efforts! I've tested it and it works as expected. However, the previously reported issue still exists: the 2nd node (I have a 2-node rabbitmq cluster) goes into an "UNAVAILABLE" state as soon as a job finishes running. hostnameB is the 2nd node, which has a capacity of 0 because it's NOT available. The primary node I've DISABLED intentionally.
[root@hostnameA deploy_awx-rpm]# sudo -u awx scl enable rh-python36 rh-postgresql10 "awx-manage list_instances"
[tower capacity=0]
    hostnameB capacity=0 version=6.1.0
    [DISABLED] hostnameA capacity=0 version=6.1.0
This is installed using the latest AWX, 6.1.0. Here is an example of a run where the node becomes "unavailable" and the job no longer exists in the queue, with the explanation below.
EXPLANATION: Task was marked as running in Tower but was not present in the job queue, so it has been marked as failed.
STARTED: 7/26/2019 1:18:34 PM
FINISHED: 7/26/2019 1:20:25 PM
Hi all, it looks like that problem no longer occurs after setting up a new server. However, I'm hitting the following issues when starting up AWX.
Issue with - scl: RuntimeError: Django version other than 2.2.2 detected: 2.2.4.
Django is what comes by default - rh-python36-Django-2.2.4-1.noarch
Thanks.
Aug 6 18:40:19 hostnameA scl: Traceback (most recent call last):
Aug 6 18:40:19 hostnameA scl: File "/opt/rh/rh-python36/root/usr/bin/daphne", line 11, in <module>
...
Aug 6 18:40:19 hostnameA scl: RuntimeError: Django version other than 2.2.2 detected: 2.2.4. Overriding names_digest is known to work for Django 2.2.2 and may not work in other Django versions.
Aug 6 18:40:19 hostnameA systemd: awx-daphne.service: main process exited, code=exited, status=1/FAILURE
Aug 6 18:40:19 hostnameA systemd: Unit awx-daphne.service entered failed state.
Aug 6 18:40:19 hostnameA systemd: awx-daphne.service failed.
Aug 6 18:40:21 hostnameA systemd: awx-cbreceiver.service holdoff time over, scheduling restart.
Aug 6 18:40:21 hostnameA systemd: awx-channels-worker.service holdoff time over, scheduling restart.
Aug 6 18:40:21 hostnameA systemd: awx-dispatcher.service holdoff time over, scheduling restart.
Aug 6 18:40:21 hostnameA systemd: Stopped AWX Dispatcher.
Aug 6 18:40:21 hostnameA systemd: Stopped AWX channels worker service.
Aug 6 18:40:21 hostnameA systemd: Stopping AWX web service...
Aug 6 18:40:21 hostnameA systemd: Stopped AWX cbreceiver service.
Aug 6 18:40:21 hostnameA systemd: awx-daphne.service holdoff time over, scheduling restart.
Aug 6 18:40:21 hostnameA systemd: Stopped AWX daphne service.
Did you install your cluster using the playbook? I had this issue when relaunching the playbook. It should be OK with a full clean install.
Cheers,
Tim.
@dnc92301 Please create new issues, instead of reusing old ones...
Have you remembered to update the ansible-awx package???
@powertim Maybe the playbook doesn't update the ansible-awx package??
@tim - yes, this happens after rerunning the playbook. After upgrading to the latest ansible-awx version, it worked!
It's updated now! See commit df571c0
Yeah, unfortunately re-running the playbook causes failures. I need to improve that.
Hello, I have offline VMs where I need to build AWX. As listed above, I saw about 160 rh-python36-* dependencies. Where can I find a tarball or URL for all the RPMs I need for AWX? Not using Docker; I plan to use RHEL7 VMs to create HA. I'm lost trying to collect all the rh-python36-* packages from mirror sites one by one, and I would appreciate knowing in what order the RPMs need to be installed. Thanks.
@VJoshi0: yum install --downloadonly --downloaddir=/to/here/ rh-python36-*
Further example at https://unix.stackexchange.com/questions/259640/how-to-use-yum-to-get-all-rpms-required-for-offline-use
So, after having the 3 instances clustered, is a load balancer used at all?
What about manual projects that are on the local filesystem? Rsync them?
Hi all, thanks for the great work you are doing. I was wondering if there is a step-by-step guide for the HA setup, similar to the standalone setup in this wiki guide: https://awx.wiki/installation/installation
Hi @elstoncawley,
Unfortunately not, and I haven't worked on the HA setup for a long time, but you'll find the steps in the playbook here: https://github.com/powertim/deploy_awx-rpm. The role names should help you find the steps for building the cluster.
Cheers,
Tim.
Thanks @powertim. I am actually installing on a CentOS 7 server and was wondering about the repo in the vars/nodes.yml file. Could I use https://awx.wiki/repository/ for the awx_repo variable?
Yes, in theory you can use any repos you want, but you need to change the way you enable and reference them, because I only provided a RHEL configuration with Satellite, so the subscription-manager command won't be available for you.
Hi everybody!
Did anyone get HA/Clustering running with AWX 11.X.X and redis?
Responding to myself, and leaving reference material for those who need it: https://github.com/sujiar37/AWX-HA-InstanceGroup/issues/26 seems to shed some light.
I will test asap.
https://github.com/fitbeard/awx-ha-cluster - this playbook works well. I've been using it for a while.
Moved from here: https://github.com/subuk/awx-rpm/issues/11