chrismeyersfsu commented 4 years ago

ISSUE TYPE

Feature Idea

SUMMARY

Replace our clustered implementation of RabbitMQ with something that is easier to understand and operate (and that matches AWX's needs better).

AWX currently makes extensive use of clustered RabbitMQ:

As a form of direct topic-based RPC for dispatching jobs (e.g., playbook runs) to underlying AWX instances. This process involves a periodic scheduler that wakes up, finds work to do, picks an available node with capacity, and places a message on its queue, which is treated as a sort of per-instance "task queue" ala https://python-rq.org or https://docs.celeryproject.org/en/stable/. Certain special messages (which generally are used to perform internal housekeeping tasks in AWX) are "broadcast" to all nodes instead of following a direct RPC topology.
As a buffer for processing job output (Ansible callback events/stdout) via AWX's "callback receiver" process running on each AWX instance.
As a backend for AWX's websocket support for in-browser streaming stdout and live job status updates. Our websocket implementation is based on a custom AMQP-specific ASGI backend which we wrote and maintain, https://github.com/ansible/asgi_amqp/. As time has marched on, and the upstream channels library has drastically changed its architecture in anticipation of native async support in python3, it has become an increased maintenance burden for us to continue to support a custom backend specific to AMQP (especially when it appears that pretty much everybody upstream that uses Channels is just using Redis).

When we originally designed this system years ago, we optimized as heavily as possible for data integrity and safety. But in the scenarios described above, the data we manage under this system is largely ephemeral. In the most extreme cases, it doesn't persist beyond the lifetime of a running playbook. In other words, if a node running a playbook were to suddenly go offline, we can't really recover from that sort of scenario anyways without re-running the playbook. Similarly, if messages are lost in flight in rare circumstances, you can always just relaunch a playbook.

We're paying a heavy cost for this cluster-wide data mirroring/replication. Historically, we've heard from many of our users that:

RabbitMQ clustering doesn't work well in environments unless cluster peers have very low latency. In fact, this is a limitation called out repeatedly in RabbitMQ's clustering documentation. It's an aspect of RabbitMQ clustering that we knew about when we chose it years ago, but it's turned out to be much more painful than we anticipated.
Especially in environments with unreliable networks, RabbitMQ can be very difficult to administer and troubleshoot. In particular, we regularly have users that report network partitioning scenarios that require manual intervention via manual erlang and/or RabbitMQ-specific remediation.

When cluster nodes disappear for prolonged periods of time (hours, days), we've seen many situations where RabbitMQ clustering just isn't able to recover on its own, which causes a myriad of issues when the node returns. Detecting and remediating this often leads to service outages.
The firewall/security group requirements for inter-node replication is a common source of confusion for users, and failing to do it properly can result in situations where adding a node to an existing cluster fails and results in an unanticipated cluster-wide outage.

What we've come to realize is that this architecture is likely not worth the operational and architectural cost we're paying.

Long-term, we'd prefer to move to a model that does not require a control plane that relies on a clustered message bus, but instead one where members of the control plane can largely drop off with minimal effect beyond lowered total execution capacity. RabbitMQ clustering explicitly is not reliable across AZs, and especially not regions, and while newer topologies we're considering don't absolve of this entirely, our goal is to move AWX to a model which is much more forgiving of low-latency networks in general.

In the next major version of AWX, we'd like to investigate replacing RabbitMQ with a combination of features provided by Redis (a new dependency) and Postgres itself. This would most likely look something like this:

Dispatching tasks is still treated as “direct RPC”. In other words, when the task manager runs, it picks one cluster node with capacity, and assigns it as the “execution node”. Dispatcher processes running on every node listen for “tasks” via PostgreSQL channel notification support (https://www.postgresql.org/docs/10/sql-notify.html)
Events emitted from playbooks are no longer sent to a distributed message queue (previously RabbitMQ), but instead a local redis running on each node. Callback receivers on each node listen for events on that node and persist them into the database.
When an event is persisted to the database by the callback receiver, it also is broadcasted to all cluster peers via ASGI. In this way, if a playbook runs on Node A, users connected to Daphne on Nodes B, C, and D will receive a broadcast of these events and see the output in their browser tabs.

Longer term, introducing Redis would potentially allow us to also lose our dependence on memcached (so in other words, we might be able to swap out two dependencies, and replace them with one single new dependency).

kdelee commented 4 years ago

some additional work items under this:

[ ] include an awx-manage based health check for the redis system
[ ] include ^ health check in the sos report as well as find way to depend on/enable the redis sos report https://github.com/ansible/awx/blob/devel/tools/sosreport/tower.py so we can get redis logs in sos report
[ ] send one final unsubscribed message back to ws client when tower ACKs the unsubscribe request so we can know when we have actually been unsubscribed

ryanpetrello commented 4 years ago

cc @MrMEEE in case you haven't seen this yet

also: https://groups.google.com/forum/#!topic/awx-project/lRnm2vB1oEQ

MrMEEE commented 4 years ago

@ryanpetrello Thanks for the heads-up.. I will follow this closely :)

ryanpetrello commented 4 years ago

@MrMEEE the biggest change is "install and configure Redis, not RabbitMQ". Also, we lost a number of RabbitMQ toggle-ables in the installer.

You may be interested in any changes under ./installer in the PR:

https://github.com/ansible/awx/pull/6034/files#diff-bfa9126dc8059138bf7554d741cb6a5d https://github.com/ansible/awx/pull/6034/files#diff-fabe539e09ace3de67486bba9b5b3be6 https://github.com/ansible/awx/pull/6034/files#diff-0091f8a83b63dafea8313c794ba726b3

elyezer commented 4 years ago

Extensive testing was done before merge to ensure the installation was working as expected and that replacing rabbitmq with redis would not introduce regressions.

With that said, we can consider this as being verified and any further polishing will be handled by separated issues (we already got some of those already opened).

ryanpetrello commented 4 years ago

Just a heads up @MrMEEE - 10.0.0 is out now, and includes this change.

MrMEEE commented 4 years ago

@ryanpetrello thanks.. A completely new build platform, CentOS8/RHEL8 support and the Redis changes are in the works.. I hope for a release after easter

aak1989 commented 4 years ago

Hey there, I installed AWX on kubernetes after redis was introduced , the installation compled with no issue but when i access the UI and try to do anything on the UI i get error related to api. i am attaching couple of screenshot of the error i am getting.

aak1989 commented 4 years ago

I have installed 10.0.0 version of AWX and that has resolved the above issue, however the AWX UI no longer appears to refresh automatically, For example, starting a job, the job always remains at pending unless you manually refresh the page. The job appears to stay in pending until the browser is manually refreshed.

ryanpetrello commented 4 years ago

@aak1989 you've got HTTP 500 errors - can you share any errors you might see in the awx_web logs?

ryanpetrello commented 4 years ago

Also, could you file a new issue describing what you're encountering? Thanks.

rkatta22 commented 4 years ago

I have installed 10.0.0 version of AWX and that has resolved the above issue, however the AWX UI no longer appears to refresh automatically, For example, starting a job, the job always remains at pending unless you manually refresh the page. The job appears to stay in pending until the browser is manually refreshed.

Hi I got same error when I am upgrading ansible tower from 7.0.0 to 11.0.0 through Docker-Compose file, now when I run the docker-compose up command, I am getting below error. please help me what is the mistake I am doing hear in the configuration.

ValueError: Redis URL must specify one of the followingschemes (redis://, rediss://, unix://) task_1 | 2020-05-09 11:00:41,341 INFO exited: callback-receiver (exit status 1; not expected) task_1 | 2020-05-09 11:00:42,344 INFO spawned: 'callback-receiver' with pid 1276 task_1 | 2020-05-09 11:00:43,345 INFO success: callback-receiver entered RUNNING state, process has stayed up for > than 1 seconds (startsecs) web_1 | 2020-05-09 11:00:43,549 WARNING awx.main.analytics.broadcast_websocket ('Unsupported URI scheme', 'amqp') task_1 | 2020-05-09 11:00:43,977 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1282 task_1 | 2020-05-09 11:00:43,977 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1282 task_1 | 2020-05-09 11:00:43,982 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1283 task_1 | 2020-05-09 11:00:43,982 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1283 task_1 | 2020-05-09 11:00:43,986 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1284 task_1 | 2020-05-09 11:00:43,986 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1284 task_1 | 2020-05-09 11:00:43,991 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1285 task_1 | 2020-05-09 11:00:43,991 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1285 task_1 | Traceback (most recent call last):

below is my compose file: version: '2' services:

web: image: ansible/awx_web:11.0.0 container_name: awx_web depends_on:

redis
memcached ports:
"80:8052"
"443:8443" hostname: awxweb user: root restart: unless-stopped volumes:
"/var/lib/awx/SECRET_KEY:/etc/tower/SECRET_KEY"
"/var/lib/awx/environment.sh:/etc/tower/conf.d/environment.sh"
"/var/lib/awx/credentials.py:/etc/tower/conf.d/credentials.py"
"/var/lib/awx/projects:/var/lib/awx/projects:rw"
"/var/lib/awx/projects/nginx.conf:/etc/nginx/nginx.conf:rw"

dns:
10.204.226.77
10.204.226.111 environment: http_proxy: https_proxy: no_proxy:

task: image: ansible/awx_task:11.0.0 container_name: awx_task depends_on:
redis
memcached
web hostname: awx user: root restart: unless-stopped volumes:
"/var/lib/awx/SECRET_KEY:/etc/tower/SECRET_KEY"
"/var/lib/awx/environment.sh:/etc/tower/conf.d/environment.sh"
"/var/lib/awx/credentials.py:/etc/tower/conf.d/credentials.py"
"/var/lib/awx/projects:/var/lib/awx/projects:rw" dns:
10.204.226.77
10.204.226.111 environment: http_proxy: https_proxy: no_proxy: redis: image: redis:6.0-rc4-alpine3.11 container_name: tools_redis_1 environment: REDIS_PASSWORD: password ports:
"6379:6379" volumes:
"/var/lib/awx/redis.conf:/usr/local/etc/redis/redis.conf"
"/var/lib/awx/redis_socket_standalone:/var/run/redis/" command: ["/usr/local/etc/redis/redis.conf"] memcached: image: "memcached:alpine" container_name: awx_memcached restart: unless-stopped environment: http_proxy: https_proxy: no_proxy:

and In environment.sh file I have done the below configuration: REDIS_URL=redis://ansible-ro.rbkm0e.ng.0002.use1.cache.amazonaws.com:6379 REDIS_PORT=6379 REDIS_SOCKET=/var/lib/awx/redis.sock REDIS_PASSWORD=password

rkatta22 commented 4 years ago

@aak1989 you've got HTTP 500 errors - can you share any errors you might see in the awx_web logs?

ValueError: Redis URL must specify one of the followingschemes (redis://, rediss://, unix://) task_1 | 2020-05-09 11:00:41,341 INFO exited: callback-receiver (exit status 1; not expected) task_1 | 2020-05-09 11:00:42,344 INFO spawned: 'callback-receiver' with pid 1276 task_1 | 2020-05-09 11:00:43,345 INFO success: callback-receiver entered RUNNING state, process has stayed up for > than 1 seconds (startsecs) web_1 | 2020-05-09 11:00:43,549 WARNING awx.main.analytics.broadcast_websocket ('Unsupported URI scheme', 'amqp') task_1 | 2020-05-09 11:00:43,977 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1282 task_1 | 2020-05-09 11:00:43,977 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1282 task_1 | 2020-05-09 11:00:43,982 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1283 task_1 | 2020-05-09 11:00:43,982 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1283 task_1 | 2020-05-09 11:00:43,986 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1284 task_1 | 2020-05-09 11:00:43,986 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1284 task_1 | 2020-05-09 11:00:43,991 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1285 task_1 | 2020-05-09 11:00:43,991 WARNING awx.main.commands.run_callback_receiver scaling up worker pid:1285 task_1 | Traceback (most recent call last):

tima commented 4 years ago

@rkatta22 making comments on a closed ticket is not going to receive a reply. You need to file a new issue if you think you've encountered a bug.

ryanpetrello commented 4 years ago

@rkatta22 you haven't encountered a bug - you just have old configuration of some sort laying around pointed at an old AMQP connection string from a prior install (which is no longer valid):

('Unsupported URI scheme', 'amqp')

If you need help troubleshooting an AWX install, try our mailing list or IRC channel:

http://webchat.freenode.net/?channels=ansible-awx https://groups.google.com/forum/#!forum/awx-project

rkatta22 commented 4 years ago

Hi Team I am working on ansible tower upgrade from 7 to 10, I can see the containers are up but in awx web contaner log I see below error, could some one please help me.

task_1 | Traceback (most recent call last): task_1 | File "/usr/bin/awx-manage", line 8, in task_1 | sys.exit(manage()) task_1 | File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/init.py", line 152, in manage task_1 | execute_from_command_line(sys.argv) task_1 | File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/init.py", line 381, in execute_from_command_line task_1 | utility.execute() task_1 | File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/init.py", line 375, in execute task_1 | self.fetch_command(subcommand).run_from_argv(self.argv) task_1 | File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/base.py", line 323, in run_from_argv task_1 | self.execute(*args, **cmd_options) task_1 | File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/base.py", line 364, in execute task_1 | output = self.handle(*args, options) task_1 | File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/management/commands/run_callback_receiver.py", line 26, in handle task_1 | consumer.run() task_1 | File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/dispatch/worker/base.py", line 119, in run task_1 | queue = redis.Redis.from_url(settings.BROKER_URL) task_1 | File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/redis/client.py", line 673, in from_url task_1 | connection_pool = ConnectionPool.from_url(url, db=db, kwargs) task_1 | File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/redis/connection.py", line 1046, in from_url task_1 | 'schemes (%s)' % valid_schemes) task_1 | ValueError: Redis URL must specify one of the followingschemes (redis://, rediss://, unix://) task_1 | 2020-05-13 13:01:44,167 INFO exited: callback-receiver (exit status 1; not expected) task_1 | 2020-05-13 13:01:45,169 INFO spawned: 'callback-receiver' with pid 907 task_1 | 2020-05-13 13:01:46,171 INFO success: callback-receiver entered RUNNING state, process has stayed up for > than 1 seconds (startsecs) web_1 | 2020-05-13 13:01:46,669 WARNING awx.main.analytics.broadcast_websocket ('Unsupported URI scheme', 'amqp')

----below is my docker-compose file----> version: '2' services:

web: image: ansible/awx_web:10.0.0 container_name: awx_web depends_on:

redis
memcached ports:
"80:8052"
"443:8443" hostname: awxweb user: root restart: unless-stopped volumes:
"/var/lib/awx/SECRET_KEY:/etc/tower/SECRET_KEY"
"/var/lib/awx/environment.sh:/etc/tower/conf.d/environment.sh"
"/var/lib/awx/credentials.py:/etc/tower/conf.d/credentials.py"
"/var/lib/awx/projects:/var/lib/awx/projects:rw"
"/var/lib/awx/projects/nginx.conf:/etc/nginx/nginx.conf:rw"

dns:
10.204.226.77
10.204.226.111 environment: http_proxy: https_proxy: no_proxy:

task: image: ansible/awx_task:10.0.0 container_name: awx_task depends_on:
redis
memcached
web hostname: awx user: root restart: unless-stopped volumes:
"/var/lib/awx/SECRET_KEY:/etc/tower/SECRET_KEY"
"/var/lib/awx/environment.sh:/etc/tower/conf.d/environment.sh"
"/var/lib/awx/credentials.py:/etc/tower/conf.d/credentials.py"
"/var/lib/awx/projects:/var/lib/awx/projects:rw" dns:
10.204.226.77
10.204.226.111 environment: http_proxy: https_proxy: no_proxy: redis: image: redis:6.0-rc4-alpine3.11 container_name: tools_redis_1 environment: REDIS_PASSWORD: password ports:
"6379:6379" volumes:
"/var/lib/awx/redis.conf:/usr/local/etc/redis/redis.conf"
"/var/lib/awx/redis_socket_standalone:/var/run/redis/"
"/var/lib/awx/environment.sh:/etc/tower/conf.d/environment.sh"
"/var/lib/awx/credentials.py:/etc/tower/conf.d/credentials.py" command: ["/usr/local/etc/redis/redis.conf"] memcached: image: "memcached:alpine" container_name: awx_memcached restart: unless-stopped environment: http_proxy: https_proxy: no_proxy:

below is my credentials.py file configuration----> DATABASES = { 'default': { 'ATOMIC_REQUESTS': True, 'ENGINE': 'django.db.backends.postgresql', 'NAME': "awx", 'USER': "awx", 'PASSWORD': "awxpass1", 'HOST': "awx-tower-upgrade.cnectdraqndy.us-east-1.rds.amazonaws.com", 'PORT': "5432", } }

BROKER_URL = 'amqp://{}:{}@{}:{}/{}'.format( "guest", "awxpass", "redis", "5672", "awx")

CHANNEL_LAYERS = { 'default': {'BACKEND': 'asgi_amqp.AMQPChannelLayer', 'ROUTING': 'awx.main.routing.channel_routing', 'CONFIG': {'url': BROKER_URL}} }

CACHES = { 'default': { 'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache', 'LOCATION': '{}:{}'.format("memcached", "11211") }, 'ephemeral': { 'BACKEND': 'django.core.cache.backends.locmem.LocMemCache', }, }

below is my environment.sh file ---> DATABASE_USER=awx DATABASE_NAME=awx DATABASE_HOST=awx-tower-upgrade.cnectdraqndy.us-east-1.rds.amazonaws.com DATABASE_PORT=5432 DATABASE_PASSWORD=awxpass1 MEMCACHED_HOST=memcached MEMCACHED_PORT=11211 RABBITMQ_HOST=rabbitmq RABBITMQ_PORT=5672 AWX_ADMIN_USER=admin AWX_ADMIN_PASSWORD=password

ANSIBLE_REDIS_HOST=ansible-tower.rbkm0e.ng.0001.use1.cache.amazonaws.com:6379

REDIS_URL="redis://ansible-tower-ro.rbkm0e.ng.0001.use1.cache.amazonaws.com:6379" REDIS_PORT=6379 REDIS_SOCKET=/var/lib/awx/redis.sock REDIS_PASSWORD=password

ryanpetrello commented 4 years ago

@rkatta22,

If you need help troubleshooting an AWX install, try our mailing list or IRC channel:

http://webchat.freenode.net/?channels=ansible-awx https://groups.google.com/forum/#!forum/awx-project

ansible / awx

Replace clustered RabbitMQ with something simpler #5443

ISSUE TYPE

SUMMARY

"/var/lib/awx/credentials.py:/etc/tower/conf.d/credentials.py" command: ["/usr/local/etc/redis/redis.conf"] memcached: image: "memcached:alpine" container_name: awx_memcached restart: unless-stopped environment: http_proxy: https_proxy: no_proxy:

CACHES = { 'default': { 'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache', 'LOCATION': '{}:{}'.format("memcached", "11211") }, 'ephemeral': { 'BACKEND': 'django.core.cache.backends.locmem.LocMemCache', }, }

ANSIBLE_REDIS_HOST=ansible-tower.rbkm0e.ng.0001.use1.cache.amazonaws.com:6379