geerlingguy / tower-operator

DEPRECATED: This project was moved and renamed to: https://github.com/ansible/awx-operator

Update to AWX 11.2.0, Tower 3.7 #42

Closed geerlingguy closed 4 years ago

geerlingguy commented 4 years ago

See Tower Release Notes: https://docs.ansible.com/ansible-tower/3.7.0/html/release-notes/#

One of the major changes:

Updated Tower to no longer rely on RabbitMQ; Redis is added as a new dependency

So we'll need to update the operator to use Redis instead of Rabbit.

From #39:

The latest versions (3.7.0 / 10(?).0.0) will soon be using Redis instead of RabbitMQ; more info here: https://github.com/ansible/awx/issues/5443

Docker Compose changes: https://github.com/ansible/awx/pull/6034/files#diff-ee215160a0808b30b25efa63ca9ac0f9

Kubernetes role changes:

Though some have run into issues (see https://github.com/ansible/awx/issues/6365) — so for this operator it may be prudent to wait a little.

geerlingguy commented 4 years ago

Strangely, it seems the AWX/Tower installer is using a shared volume for Redis instead of communicating via a service/port on TCP...?

geerlingguy commented 4 years ago

Ah, maybe it gets configured via the BROKER_URL.

geerlingguy commented 4 years ago

It seems the Tower image used to be on Quay (quay.io/ansible-tower/ansible-tower), but the official Tower OpenShift installer now lists it at the Red Hat Registry (registry.redhat.io/ansible-tower-37/ansible-tower-rhel7), which requires a valid Red Hat subscription and a cluster that is authenticated to the registry in order to pull the images.

It's a bit annoying, but I guess the intention may be to not run Tower on non-OpenShift Kubernetes clusters? Or maybe someone just forgot to run the job to push the tower images out to Quay.io.

geerlingguy commented 4 years ago

For now I'll push up my working branch (with the redis changes) since it could help with getting at least the latest AWX versions supported.

geerlingguy commented 4 years ago

Hmm... getting:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/dispatch/worker/base.py", line 122, in run
    res = queue.blpop(self.queues)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/redis/client.py", line 1865, in blpop
    return self.execute_command('BLPOP', *keys)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/redis/client.py", line 875, in execute_command
    conn = self.connection or pool.get_connection(command_name, **options)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/redis/connection.py", line 1185, in get_connection
    connection.connect()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/redis/connection.py", line 557, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 111 connecting to example-tower-redis.example-tower.svc.cluster.local:6379. Connection refused.

geerlingguy commented 4 years ago

I tried connecting with a debug container:

$ kubectl run redis-cli --rm -n example-tower -it --image=goodsmileduck/redis-cli
/ # redis-cli -h example-tower-redis.example-tower.svc.cluster.local -p 6379 ping
Could not connect to Redis at example-tower-redis.example-tower.svc.cluster.local:6379: Connection refused

So then I checked the redis container logs:

$ kubectl logs -n example-tower example-tower-redis-6d5c655f9f-h495j
1:C 26 May 2020 20:30:40.828 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 26 May 2020 20:30:40.828 # Redis version=6.0.3, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 26 May 2020 20:30:40.828 # Configuration loaded
1:M 26 May 2020 20:30:40.829 * Running mode=standalone, port=0.
1:M 26 May 2020 20:30:40.829 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:M 26 May 2020 20:30:40.829 # Server initialized
1:M 26 May 2020 20:30:40.829 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
1:M 26 May 2020 20:30:40.830 * The server is now ready to accept connections at /var/run/redis/redis.sock

So it looks like it's only listening on the socket, and not on TCP...
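For anyone who wants to check the same thing without a redis-cli image, a plain TCP probe from any pod with Python should show the same result. This is only a minimal sketch: `tcp_port_open` is a hypothetical helper name (not part of AWX or this operator), and the hostname in the usage comment is just the service name from the logs above.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established.

    A refused connection, DNS failure, or timeout all count as "not open".
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # ConnectionRefusedError, socket.gaierror and timeouts are all OSError.
        return False


# Example usage (hostname taken from the thread, adjust for your namespace):
# tcp_port_open("example-tower-redis.example-tower.svc.cluster.local", 6379)
```

With Redis reporting `port=0` as in the log above, a probe like this would be expected to return False, since only the unix socket is being served.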

geerlingguy commented 4 years ago

Got that fixed by switching redis to run on TCP port only, but now I'm getting the following when I try to launch a job from a template:

Call to /api/v2/job_templates/7/launch failed. POST returned status: 500. A server error has occurred.

geerlingguy commented 4 years ago

Error in task container:

2020-05-26 21:12:35,014 ERROR    awx.main.dispatch Worker failed to run task awx.main.scheduler.tasks.run_task_manager(*[], **{}
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/channels/layers.py", line 67, in _make_backend
    backend_class = import_string(self.configs[name]["BACKEND"])
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/utils/module_loading.py", line 17, in import_string
    module = import_module(module_path)
  File "/var/lib/awx/venv/awx/lib64/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'awx.main.channels'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/dispatch/worker/task.py", line 86, in perform_work
    result = self.run_callable(body)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/dispatch/worker/task.py", line 62, in run_callable
    return _call(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/scheduler/tasks.py", line 16, in run_task_manager
    TaskManager().schedule()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/scheduler/task_manager.py", line 583, in schedule
    self._schedule()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/transaction.py", line 284, in __exit__
    connection.set_autocommit(True)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/backends/base/base.py", line 410, in set_autocommit
    self.run_and_clear_commit_hooks()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/backends/base/base.py", line 636, in run_and_clear_commit_hooks
    func()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/models/unified_jobs.py", line 1255, in <lambda>
    connection.on_commit(lambda: self._websocket_emit_status(status))
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/models/unified_jobs.py", line 1245, in _websocket_emit_status
    emit_channel_notification('jobs-status_changed', status_data)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/consumers.py", line 230, in emit_channel_notification
    channel_layer = get_channel_layer()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/channels/layers.py", line 363, in get_channel_layer
    return channel_layers[alias]
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/channels/layers.py", line 80, in __getitem__
    self.backends[key] = self.make_backend(key)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/channels/layers.py", line 46, in make_backend
    return self._make_backend(name, config)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/channels/layers.py", line 73, in _make_backend
    % (self.configs[name]["BACKEND"], name)
channels.exceptions.InvalidChannelLayerError: Cannot import BACKEND 'awx.main.channels.RedisGroupBroadcastChannelLayer' specified for default

geerlingguy commented 4 years ago

It looks like my CHANNEL_LAYERS / BROKER_URL needed some updating: https://github.com/ansible/awx/blob/devel/awx/settings/defaults.py#L932-L941

It looks like AWX's default install uses:

BROKER_URL = 'unix:///var/run/redis/redis.sock'

Which is a little wild, as that assumes Redis is running on the same host and has the unix socket available... that's not a very sustainable solution if you want to run Redis with HA or in a separate scalable instance.
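For comparison, a TCP-based variant of those settings might look roughly like the sketch below. The assumptions are loud here: `redis_settings` is a hypothetical helper for illustration, the backend path is the stock `channels_redis` layer rather than anything confirmed in this thread, and the real AWX defaults set more `CONFIG` keys than shown.

```python
def redis_settings(host: str, port: int = 6379, db: int = 0) -> dict:
    """Build BROKER_URL and CHANNEL_LAYERS values that point at Redis over TCP.

    Hypothetical helper: it only assembles the values, it does not touch
    Django or AWX settings itself.
    """
    broker_url = f"redis://{host}:{port}/{db}"
    return {
        "BROKER_URL": broker_url,
        "CHANNEL_LAYERS": {
            "default": {
                # Assumption: the stock channels_redis backend, not the old
                # awx.main.channels class from the traceback above.
                "BACKEND": "channels_redis.core.RedisChannelLayer",
                "CONFIG": {"hosts": [broker_url]},
            }
        },
    }


# Example usage with the service name seen earlier in the thread:
settings = redis_settings("example-tower-redis.example-tower.svc.cluster.local")
```

The point of routing both values through one URL is that the dispatcher and the websocket channel layer end up talking to the same Redis instance, whether that is a socket or a service.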

geerlingguy commented 4 years ago

I was asking about the choice of socket instead of TCP by default, and two main reasons were given:

I concede that these reasons are okay for single-server deployments, but it gets a bit murky when talking about deploying in K8s/OCP, or even Docker (though the Docker setup is probably on the same machine 99% of the time).

geerlingguy commented 4 years ago

Latest commit works fine in CI, as I reduced the memory commitments (I think with the task/web using 1Gi each, we're bumping into CI instance RAM limits!).

I haven't fully tested Tower 3.7.0 yet, but may try again tonight when I get a little more time to set up the pull secret for the Red Hat Registry.

However, this image is ready to go, and those who are using it for AWX will be happy to be able to install the latest version again, with Redis for the queue.