geerlingguy closed this issue 4 years ago
Strangely, it seems the AWX/Tower installer is using a shared volume for Redis instead of communicating via a service/port on TCP...?
Ah, maybe it gets configured via the BROKER_URL.
It seems like for the tower image, it used to be in Quay (quay.io/ansible-tower/ansible-tower), but the official Tower OpenShift installer now lists it at the Red Hat Registry (registry.redhat.io/ansible-tower-37/ansible-tower-rhel7)... which requires a valid Red Hat subscription and your cluster to be tied in/authenticated to be able to pull the images.
It's a bit annoying, but I guess the intention may be to not run Tower on non-OpenShift Kubernetes clusters? Or maybe someone just forgot to run the job to push the tower images out to Quay.io.
For now I'll push up my working branch (with the redis changes) since it could help with getting at least the latest AWX versions supported.
Hmm... getting:
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/dispatch/worker/base.py", line 122, in run
    res = queue.blpop(self.queues)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/redis/client.py", line 1865, in blpop
    return self.execute_command('BLPOP', *keys)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/redis/client.py", line 875, in execute_command
    conn = self.connection or pool.get_connection(command_name, **options)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/redis/connection.py", line 1185, in get_connection
    connection.connect()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/redis/connection.py", line 557, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 111 connecting to example-tower-redis.example-tower.svc.cluster.local:6379. Connection refused.
I tried connecting with a debug container:
$ kubectl run redis-cli --rm -n example-tower -it --image=goodsmileduck/redis-cli
/ # redis-cli -h example-tower-redis.example-tower.svc.cluster.local -p 6379 ping
Could not connect to Redis at example-tower-redis.example-tower.svc.cluster.local:6379: Connection refused
So then I checked the redis container logs:
$ kubectl logs -n example-tower example-tower-redis-6d5c655f9f-h495j
1:C 26 May 2020 20:30:40.828 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 26 May 2020 20:30:40.828 # Redis version=6.0.3, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 26 May 2020 20:30:40.828 # Configuration loaded
1:M 26 May 2020 20:30:40.829 * Running mode=standalone, port=0.
1:M 26 May 2020 20:30:40.829 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:M 26 May 2020 20:30:40.829 # Server initialized
1:M 26 May 2020 20:30:40.829 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
1:M 26 May 2020 20:30:40.830 * The server is now ready to accept connections at /var/run/redis/redis.sock
So it looks like it's only listening on the socket, and not on TCP...
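The log above reports port=0, which means Redis has its TCP listener disabled and is serving only the unix socket. As a rough sketch, the same check I did with redis-cli can be done from Python with just the standard library. The function name tcp_port_open is hypothetical; it only tests whether a TCP connection can be opened and does not speak the Redis protocol.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the name and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and DNS resolution failures.
        return False
```

Calling tcp_port_open("example-tower-redis.example-tower.svc.cluster.local", 6379) from a pod in the cluster would be expected to return False while Redis is socket-only.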
Got that fixed by switching Redis to listen on a TCP port only, but now I'm getting the following when I try to launch a job from a template:
Call to /api/v2/job_templates/7/launch failed. POST returned status: 500. A server error has occurred.
Error in task container:
2020-05-26 21:12:35,014 ERROR awx.main.dispatch Worker failed to run task awx.main.scheduler.tasks.run_task_manager(*[], **{}
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/channels/layers.py", line 67, in _make_backend
    backend_class = import_string(self.configs[name]["BACKEND"])
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/utils/module_loading.py", line 17, in import_string
    module = import_module(module_path)
  File "/var/lib/awx/venv/awx/lib64/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'awx.main.channels'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/dispatch/worker/task.py", line 86, in perform_work
    result = self.run_callable(body)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/dispatch/worker/task.py", line 62, in run_callable
    return _call(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/scheduler/tasks.py", line 16, in run_task_manager
    TaskManager().schedule()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/scheduler/task_manager.py", line 583, in schedule
    self._schedule()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/transaction.py", line 284, in __exit__
    connection.set_autocommit(True)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/backends/base/base.py", line 410, in set_autocommit
    self.run_and_clear_commit_hooks()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/backends/base/base.py", line 636, in run_and_clear_commit_hooks
    func()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/models/unified_jobs.py", line 1255, in <lambda>
    connection.on_commit(lambda: self._websocket_emit_status(status))
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/models/unified_jobs.py", line 1245, in _websocket_emit_status
    emit_channel_notification('jobs-status_changed', status_data)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/consumers.py", line 230, in emit_channel_notification
    channel_layer = get_channel_layer()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/channels/layers.py", line 363, in get_channel_layer
    return channel_layers[alias]
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/channels/layers.py", line 80, in __getitem__
    self.backends[key] = self.make_backend(key)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/channels/layers.py", line 46, in make_backend
    return self._make_backend(name, config)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/channels/layers.py", line 73, in _make_backend
    % (self.configs[name]["BACKEND"], name)
channels.exceptions.InvalidChannelLayerError: Cannot import BACKEND 'awx.main.channels.RedisGroupBroadcastChannelLayer' specified for default
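The final error comes from Channels trying to import the dotted path in BACKEND and finding that the module awx.main.channels does not exist in this AWX version. A rough standard-library approximation of that lookup, for sanity-checking a BACKEND value before deploying, might look like this. It is not the Channels or Django implementation, and backend_importable is a hypothetical helper name.

```python
import importlib


def backend_importable(dotted_path: str) -> bool:
    """Return True if 'package.module.ClassName' resolves to an attribute."""
    module_path, _, attr = dotted_path.rpartition(".")
    if not module_path:
        # No dot in the path, so there is no module to import.
        return False
    try:
        module = importlib.import_module(module_path)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError.
        return False
    return hasattr(module, attr)
```

Run inside the task container's virtualenv, this would presumably return False for 'awx.main.channels.RedisGroupBroadcastChannelLayer', matching the traceback.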
It looks like my CHANNEL_LAYERS / BROKER_URL needed some updating: https://github.com/ansible/awx/blob/devel/awx/settings/defaults.py#L932-L941
It looks like AWX's default install uses:
BROKER_URL = 'unix:///var/run/redis/redis.sock'
Which is a little wild, as that assumes Redis is running on the same host and has the unix socket available... that's not a very sustainable solution if you want to run Redis with HA or in a separate scalable instance.
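For a Redis that runs as its own Service, the settings would need to point at a TCP URL instead of the socket. Here is a minimal sketch of what that could look like, not the operator's actual template: the helper redis_settings is hypothetical, and it assumes the channels_redis package provides the channel layer backend and that AWX hands BROKER_URL to a client that accepts redis:// URLs.

```python
def redis_settings(host: str, port: int = 6379) -> dict:
    """Build BROKER_URL and CHANNEL_LAYERS for a Redis reachable over TCP."""
    broker_url = f"redis://{host}:{port}"
    return {
        "BROKER_URL": broker_url,
        "CHANNEL_LAYERS": {
            "default": {
                # Assumption: channels_redis is installed and is the backend
                # this AWX version expects.
                "BACKEND": "channels_redis.core.RedisChannelLayer",
                # The channel layer reuses the same Redis as the dispatcher.
                "CONFIG": {"hosts": [broker_url]},
            }
        },
    }
```

With the service name from the traceback, redis_settings("example-tower-redis.example-tower.svc.cluster.local") yields a BROKER_URL of redis://example-tower-redis.example-tower.svc.cluster.local:6379, and the same URL is reused for the channel layer hosts.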
I was asking about the choice of socket instead of TCP by default, and two main reasons were given:
I concede that these reasons are okay for single-server deployments but it gets a bit murky when talking about deploying in K8s/OCP, or even Docker (though the Docker setup is probably on the same machine 99% of the time).
Latest commit works fine in CI, as I reduced the memory commitments (I think with the task/web using 1Gi each, we're bumping into CI instance RAM limits!).
I haven't fully tested Tower 3.7.0 yet, but may try again tonight when I get a little more time to set up the pull secret for the Red Hat Registry.
However, this image is ready to go, and for those who are using it for AWX, they'll be happy to be able to install the latest version again, using Redis for the queue.
See Tower Release Notes: https://docs.ansible.com/ansible-tower/3.7.0/html/release-notes/#
One of the major changes:
So we'll need to update the operator to use Redis instead of RabbitMQ.
From #39:
The latest versions (3.7.0 / 10(?).0.0) will soon be using Redis instead of RabbitMQ; more info here: https://github.com/ansible/awx/issues/5443
Docker Compose changes: https://github.com/ansible/awx/pull/6034/files#diff-ee215160a0808b30b25efa63ca9ac0f9
Kubernetes role changes:
Though some have run into issues (see https://github.com/ansible/awx/issues/6365), so for this operator it may be prudent to wait a little.