ansible / awx

AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.

AWX - Platform instability and project sync failure #14250

Open sylvain-de-fuster opened 1 year ago

sylvain-de-fuster commented 1 year ago

Bug Summary

Our environment is described in the fields below.

We have recently been having issues with AWX. They seem random (possibly related to some sort of accumulation or burst). The web interface stops working correctly (slow, intermittent errors) and every project sync fails. Without knowing the root cause, the only thing we can do is scale down the web/task deployments so that the operator restarts them; everything is fine after that.
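
For reference, the scale-down workaround amounts to something like this (a rough sketch; the deployment names follow our instance naming as shown in the metrics below, and the namespace is an assumption):

kubectl -n awx scale deployment awx-fake-name-web awx-fake-name-task --replicas=0
# the awx-operator then reconciles the deployments back up, and the fresh pods behave normally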

We checked the servers' system metrics (CPU/memory/disk/network, etc.) and the logs without any luck. There was no specific activity on the platform and no unusually high number of jobs running.

AWX version

22.1.0

Select the relevant components

Installation method

kubernetes

Modifications

no

Ansible version

2.14

Operating system

RHEL 8.6

Web browser

No response

Steps to reproduce

We didn't find a way to reproduce it at will.

Expected results

The interface is smooth. Project sync works.

Actual results

The interface is slow to load and all project syncs fail (so no job with an outdated project cache can be executed).

Additional information

• Example of project sync error

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 604, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 317, in run
    res = self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 392, in _run_internal
    res = processor_future.result()
  File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/usr/lib64/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/utils/common.py", line 1188, in wrapper_cleanup_new_process
    return func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 460, in processor
    return ansible_runner.interface.run(
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/interface.py", line 211, in run
    r.run()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/streaming.py", line 355, in run
    self.event_callback(data)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/streaming.py", line 310, in event_callback
    should_write = self.event_handler(event_data)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/callback.py", line 222, in event_handler
    super_return_value = super(RunnerCallbackForProjectUpdate, self).event_handler(event_data)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/callback.py", line 166, in event_handler
    self.dispatcher.dispatch(event_data)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/queue.py", line 32, in dispatch
    self.connection.rpush(self.queue, json.dumps(obj, cls=AnsibleJSONEncoder))
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/redis/commands/core.py", line 2695, in rpush
    return self.execute_command("RPUSH", name, *values)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/redis/client.py", line 1238, in execute_command
    return conn.retry.call_with_retry(
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/redis/retry.py", line 49, in call_with_retry
    fail(error)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/redis/client.py", line 1242, in <lambda>
    lambda error: self._disconnect_raise(conn, error),
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/redis/client.py", line 1228, in _disconnect_raise
    raise error
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/redis/retry.py", line 46, in call_with_retry
    return do()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/redis/client.py", line 1239, in <lambda>
    lambda: self._send_command_parse_response(
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/redis/client.py", line 1215, in _send_command_parse_response
    return self.parse_response(conn, command_name, **options)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/redis/client.py", line 1254, in parse_response
    response = connection.read_response()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/redis/connection.py", line 824, in read_response
    response = self._parser.read_response(disable_decoding=disable_decoding)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/redis/connection.py", line 487, in read_response
    raise response
redis.exceptions.ConnectionError: max number of clients reached

We checked the Redis configuration (~20 clients connected for a 10K limit). Looking at the AWX metrics, I found some differences, but I can't properly interpret them (I suspect some kind of burst, but I don't really know).

• Metrics near platform issues

# HELP awx_database_connections_total Number of connections to database
# TYPE awx_database_connections_total gauge
awx_database_connections_total 15.0
# HELP callback_receiver_events_queue_size_redis Current number of events in redis queue
# TYPE callback_receiver_events_queue_size_redis gauge
callback_receiver_events_queue_size_redis{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 0
# HELP callback_receiver_events_popped_redis Number of events popped from redis
# TYPE callback_receiver_events_popped_redis gauge
callback_receiver_events_popped_redis{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 241058
# HELP callback_receiver_events_in_memory Current number of events in memory (in transfer from redis to db)
# TYPE callback_receiver_events_in_memory gauge
callback_receiver_events_in_memory{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 0
# HELP callback_receiver_batch_events_errors Number of times batch insertion failed
# TYPE callback_receiver_batch_events_errors gauge
callback_receiver_batch_events_errors{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 0
# HELP callback_receiver_events_insert_db_seconds Total time spent saving events to database
# TYPE callback_receiver_events_insert_db_seconds gauge
callback_receiver_events_insert_db_seconds{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 6633.5608182273245
# HELP callback_receiver_events_insert_db Number of events batch inserted into database
# TYPE callback_receiver_events_insert_db gauge
callback_receiver_events_insert_db{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 236855
# HELP callback_receiver_events_broadcast Number of events broadcast to other control plane nodes
# TYPE callback_receiver_events_broadcast gauge
callback_receiver_events_broadcast{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 175081
# HELP callback_receiver_batch_events_insert_db Number of events batch inserted into database
# TYPE callback_receiver_batch_events_insert_db histogram
callback_receiver_batch_events_insert_db_bucket{le="10",node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 122780
callback_receiver_batch_events_insert_db_bucket{le="50",node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 124498
callback_receiver_batch_events_insert_db_bucket{le="150",node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 124528
callback_receiver_batch_events_insert_db_bucket{le="350",node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 124528
callback_receiver_batch_events_insert_db_bucket{le="650",node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 124528
callback_receiver_batch_events_insert_db_bucket{le="2000",node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 124528
callback_receiver_batch_events_insert_db_bucket{le="+Inf",node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 124528
callback_receiver_batch_events_insert_db_count{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 124528
callback_receiver_batch_events_insert_db_sum{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 236855
# HELP callback_receiver_event_processing_avg_seconds Average processing time per event per callback receiver batch
# TYPE callback_receiver_event_processing_avg_seconds gauge
callback_receiver_event_processing_avg_seconds{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 1.6913103333333332
# HELP subsystem_metrics_pipe_execute_seconds Time spent saving metrics to redis
# TYPE subsystem_metrics_pipe_execute_seconds gauge
subsystem_metrics_pipe_execute_seconds{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 118.10413664561273
# HELP subsystem_metrics_pipe_execute_calls Number of calls to pipe_execute
# TYPE subsystem_metrics_pipe_execute_calls gauge
subsystem_metrics_pipe_execute_calls{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 71599
# HELP subsystem_metrics_send_metrics_seconds Time spent sending metrics to other nodes
# TYPE subsystem_metrics_send_metrics_seconds gauge
subsystem_metrics_send_metrics_seconds{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 355.07566434563284
# HELP task_manager_get_tasks_seconds Time spent in loading tasks from db
# TYPE task_manager_get_tasks_seconds gauge
task_manager_get_tasks_seconds{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 0.005228978581726551
# HELP task_manager_start_task_seconds Time spent starting task
# TYPE task_manager_start_task_seconds gauge
task_manager_start_task_seconds{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 0.0
# HELP task_manager_process_running_tasks_seconds Time spent processing running tasks
# TYPE task_manager_process_running_tasks_seconds gauge
task_manager_process_running_tasks_seconds{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 0.0
# HELP task_manager_process_pending_tasks_seconds Time spent processing pending tasks
# TYPE task_manager_process_pending_tasks_seconds gauge
task_manager_process_pending_tasks_seconds{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 0.0
# HELP task_manager__schedule_seconds Time spent in running the entire _schedule
# TYPE task_manager__schedule_seconds gauge
task_manager__schedule_seconds{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 0.016287230886518955
# HELP task_manager__schedule_calls Number of calls to _schedule, after lock is acquired
# TYPE task_manager__schedule_calls gauge
task_manager__schedule_calls{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 254603
# HELP task_manager_recorded_timestamp Unix timestamp when metrics were last recorded
# TYPE task_manager_recorded_timestamp gauge
task_manager_recorded_timestamp{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 1688769903.3931773
# HELP task_manager_tasks_started Number of tasks started
# TYPE task_manager_tasks_started gauge
task_manager_tasks_started{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 0
# HELP task_manager_running_processed Number of running tasks processed
# TYPE task_manager_running_processed gauge
task_manager_running_processed{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 0
# HELP task_manager_pending_processed Number of pending tasks processed
# TYPE task_manager_pending_processed gauge
task_manager_pending_processed{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 0
# HELP task_manager_tasks_blocked Number of tasks blocked from running
# TYPE task_manager_tasks_blocked gauge
task_manager_tasks_blocked{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 0
# HELP task_manager_commit_seconds Time spent in db transaction, including on_commit calls
# TYPE task_manager_commit_seconds gauge
task_manager_commit_seconds{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 0.0002639293670654297
# HELP dependency_manager_get_tasks_seconds Time spent loading pending tasks from db
# TYPE dependency_manager_get_tasks_seconds gauge
dependency_manager_get_tasks_seconds{node="awx-fake-name-task-c7dcdcbcf-9gxzj"} 0.007382605224847794

• Metrics with healthy platform

# HELP awx_database_connections_total Number of connections to database
# TYPE awx_database_connections_total gauge
awx_database_connections_total 15.0
# HELP callback_receiver_events_queue_size_redis Current number of events in redis queue
# TYPE callback_receiver_events_queue_size_redis gauge
callback_receiver_events_queue_size_redis{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 0
# HELP callback_receiver_events_popped_redis Number of events popped from redis
# TYPE callback_receiver_events_popped_redis gauge
callback_receiver_events_popped_redis{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 4836
# HELP callback_receiver_events_in_memory Current number of events in memory (in transfer from redis to db)
# TYPE callback_receiver_events_in_memory gauge
callback_receiver_events_in_memory{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 0
# HELP callback_receiver_batch_events_errors Number of times batch insertion failed
# TYPE callback_receiver_batch_events_errors gauge
callback_receiver_batch_events_errors{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 0
# HELP callback_receiver_events_insert_db_seconds Total time spent saving events to database
# TYPE callback_receiver_events_insert_db_seconds gauge
callback_receiver_events_insert_db_seconds{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 107.60422206111252
# HELP callback_receiver_events_insert_db Number of events batch inserted into database
# TYPE callback_receiver_events_insert_db gauge
callback_receiver_events_insert_db{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 4779
# HELP callback_receiver_events_broadcast Number of events broadcast to other control plane nodes
# TYPE callback_receiver_events_broadcast gauge
callback_receiver_events_broadcast{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 3373
# HELP callback_receiver_batch_events_insert_db Number of events batch inserted into database
# TYPE callback_receiver_batch_events_insert_db histogram
callback_receiver_batch_events_insert_db_bucket{le="10",node="awx-fake-name-task-c7dcdcbcf-f54ls"} 2962
callback_receiver_batch_events_insert_db_bucket{le="50",node="awx-fake-name-task-c7dcdcbcf-f54ls"} 2990
callback_receiver_batch_events_insert_db_bucket{le="150",node="awx-fake-name-task-c7dcdcbcf-f54ls"} 2990
callback_receiver_batch_events_insert_db_bucket{le="350",node="awx-fake-name-task-c7dcdcbcf-f54ls"} 2990
callback_receiver_batch_events_insert_db_bucket{le="650",node="awx-fake-name-task-c7dcdcbcf-f54ls"} 2990
callback_receiver_batch_events_insert_db_bucket{le="2000",node="awx-fake-name-task-c7dcdcbcf-f54ls"} 2990
callback_receiver_batch_events_insert_db_bucket{le="+Inf",node="awx-fake-name-task-c7dcdcbcf-f54ls"} 2990
callback_receiver_batch_events_insert_db_count{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 2990
callback_receiver_batch_events_insert_db_sum{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 4779
# HELP callback_receiver_event_processing_avg_seconds Average processing time per event per callback receiver batch
# TYPE callback_receiver_event_processing_avg_seconds gauge
callback_receiver_event_processing_avg_seconds{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 1.7844824285714285
# HELP subsystem_metrics_pipe_execute_seconds Time spent saving metrics to redis
# TYPE subsystem_metrics_pipe_execute_seconds gauge
subsystem_metrics_pipe_execute_seconds{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 2.358793856576085
# HELP subsystem_metrics_pipe_execute_calls Number of calls to pipe_execute
# TYPE subsystem_metrics_pipe_execute_calls gauge
subsystem_metrics_pipe_execute_calls{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 1727
# HELP subsystem_metrics_send_metrics_seconds Time spent sending metrics to other nodes
# TYPE subsystem_metrics_send_metrics_seconds gauge
subsystem_metrics_send_metrics_seconds{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 5.858100923709571
# HELP task_manager_get_tasks_seconds Time spent in loading tasks from db
# TYPE task_manager_get_tasks_seconds gauge
task_manager_get_tasks_seconds{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 0.005362008698284626
# HELP task_manager_start_task_seconds Time spent starting task
# TYPE task_manager_start_task_seconds gauge
task_manager_start_task_seconds{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 0.0
# HELP task_manager_process_running_tasks_seconds Time spent processing running tasks
# TYPE task_manager_process_running_tasks_seconds gauge
task_manager_process_running_tasks_seconds{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 0.0
# HELP task_manager_process_pending_tasks_seconds Time spent processing pending tasks
# TYPE task_manager_process_pending_tasks_seconds gauge
task_manager_process_pending_tasks_seconds{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 0.0
# HELP task_manager__schedule_seconds Time spent in running the entire _schedule
# TYPE task_manager__schedule_seconds gauge
task_manager__schedule_seconds{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 0.019680513069033623
# HELP task_manager__schedule_calls Number of calls to _schedule, after lock is acquired
# TYPE task_manager__schedule_calls gauge
task_manager__schedule_calls{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 355
# HELP task_manager_recorded_timestamp Unix timestamp when metrics were last recorded
# TYPE task_manager_recorded_timestamp gauge
task_manager_recorded_timestamp{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 1689609843.2095385
# HELP task_manager_tasks_started Number of tasks started
# TYPE task_manager_tasks_started gauge
task_manager_tasks_started{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 0
# HELP task_manager_running_processed Number of running tasks processed
# TYPE task_manager_running_processed gauge
task_manager_running_processed{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 0
# HELP task_manager_pending_processed Number of pending tasks processed
# TYPE task_manager_pending_processed gauge
task_manager_pending_processed{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 0
# HELP task_manager_tasks_blocked Number of tasks blocked from running
# TYPE task_manager_tasks_blocked gauge
task_manager_tasks_blocked{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 0
# HELP task_manager_commit_seconds Time spent in db transaction, including on_commit calls
# TYPE task_manager_commit_seconds gauge
task_manager_commit_seconds{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 0.0001938343048095703
# HELP dependency_manager_get_tasks_seconds Time spent loading pending tasks from db
# TYPE dependency_manager_get_tasks_seconds gauge
dependency_manager_get_tasks_seconds{node="awx-fake-name-task-c7dcdcbcf-f54ls"} 0.011181914247572422
fosterseth commented 1 year ago

@sylvain-de-fuster: "Checked redis configuration (~20 clients connected for 10K limit)."

How are you checking that exactly, i.e. was this a redis-cli command? Also, we have Redis containers in both the web and task pods; which one did you check?

sylvain-de-fuster commented 1 year ago

Hello @fosterseth ,

I wasn't clear about my checks.

I do indeed use this command: redis-cli -s /var/run/redis/redis.sock. However, I apparently checked the wrong Redis container... I checked the web one.

The metrics are completely different between the two:

root@awx-fake-name-web-7b745964dc-p7czc:/data# redis-cli -s /var/run/redis/redis.sock
redis /var/run/redis/redis.sock> info clients
# Clients
connected_clients:30
cluster_connections:0
maxclients:10000
client_recent_max_input_buffer:20559
client_recent_max_output_buffer:20504
blocked_clients:1
tracking_clients:0
clients_in_timeout_table:1
redis /var/run/redis/redis.sock>

root@awx-fake-name-task-c7dcdcbcf-f54ls:/data# redis-cli -s /var/run/redis/redis.sock
redis /var/run/redis/redis.sock> info clients
# Clients
connected_clients:1576
cluster_connections:0
maxclients:10000
client_recent_max_input_buffer:20559
client_recent_max_output_buffer:0
blocked_clients:5
tracking_clients:0
clients_in_timeout_table:5
redis /var/run/redis/redis.sock>

These metrics and all the logs given above are from our dev environment (both the dev and prod environments are affected by this type of issue).

Our production environment has ~3900 clients connected to the task pod's Redis (same default 10K maxclients limit).

What is the purpose of each Redis instance? With no jobs running, are these connection counts normal?

Thanks for your help

sylvain-de-fuster commented 1 year ago

Hello,

I just checked our current Redis metrics, and there is clearly an accumulation of client connections on the task pod's Redis.

• Evolution since last week (Jul 20):
  • AWX dev env: 1576 → 2172
  • AWX prod env: 3900 → 7617

Any idea what could cause this? (And, hopefully, a workaround/solution.)

Thanks

fosterseth commented 1 year ago

@shanemcd @relrod it seems we may be leaking open Redis clients?

@sylvain-de-fuster thanks for this info. Do you have a lot of users in the UI daily? AWX uses Django Channels with a Redis backend to handle UI websocket connections, so I'm wondering whether there are unclosed websocket connections from browsers connected to the AWX server.
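
A quick, generic way to check whether those extra connections are mostly idle (plain redis-cli against the task pod's socket, same as above; not an AWX-specific tool):

redis-cli -s /var/run/redis/redis.sock client list | wc -l
redis-cli -s /var/run/redis/redis.sock client list | grep -o 'idle=[0-9]*' | sort -t= -k2 -n | tail
# the first command counts connected clients; the second lists the ten largest idle times (seconds)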

sylvain-de-fuster commented 1 year ago

@fosterseth ,

Here are some details about our usage:

• AWX prod: ~580 users indexed in the AWX interface
• AWX dev: ~150 users indexed in the AWX interface

AWX is mainly accessed "manually" by users, but we also have some calls from Jenkins/GitHub/curl. I couldn't say how many users use it daily (unless there are metrics I could check?); I think it is safe to cut the total indexed user count at least in half.

We are currently in a vacation period, which strongly reduces UI access (nevertheless, the number of clients in Redis increases pretty quickly).

Don't hesitate to ask if there is any information you need or tests you would like us to run.

fosterseth commented 1 year ago

I want to investigate the subsystem_metrics broadcasting as well.

@sylvain-de-fuster what does your AWX scaling look like, i.e. the web and task replica counts?

sylvain-de-fuster commented 1 year ago

@fosterseth

Same configuration on both the dev and production platforms: the replicas field is set to 1, so the task/web deployments are scaled to 1.

jgrantls commented 1 year ago

I ran into this issue today. It primarily affected project syncs to GitHub, but we were also unable to run management jobs while we were impacted. Some of the error messages we were seeing (included here for searchability):

redis.exceptions.ConnectionError: max number of clients reached
redis.exceptions.ConnectionError: Connection not ready

For those out there looking for a quick workaround:

1. (Optional) Prune the system:

minikube ssh -- docker system prune

2. List your pods and take note of the awx-task pod name:

kubectl get pods

3. Start a bash shell in the awx-task pod (this should default to the redis container; use kubectl exec's -c flag to target it explicitly if it does not):

kubectl exec --stdin --tty <awx-task-pod-name> -- /bin/bash

4. Connect to the redis-server via the Unix socket from within the redis container:

redis-cli -s /var/run/redis/redis.sock

5. Set the client timeout to 300 seconds:

redis /var/run/redis/redis.sock> config set timeout 300
OK
redis /var/run/redis/redis.sock> config get timeout
1) "timeout"
2) "300"

6. (Optional) Run the Cleanup Expired Sessions job within AWX.

Now attempt to sync a project or run any kind of job; AWX should be functioning again.

sylvain-de-fuster commented 1 year ago

Hello,

Thanks a lot @jgrantls, your information is very helpful! Below are my latest tests with this in mind:

• Workaround

root@awx-fake-task-5fb8abc4c5-5f4pk:/data# redis-cli -s /var/run/redis/redis.sock
redis /var/run/redis/redis.sock> info clients
# Clients
connected_clients:2227
cluster_connections:0
maxclients:10000
client_recent_max_input_buffer:40960
client_recent_max_output_buffer:0
blocked_clients:5
tracking_clients:0
clients_in_timeout_table:5
redis /var/run/redis/redis.sock> config get timeout
1) "timeout"
2) "0"
redis /var/run/redis/redis.sock> config set timeout 300
OK
redis /var/run/redis/redis.sock> config get timeout
1) "timeout"
2) "300"
redis /var/run/redis/redis.sock> info clients
# Clients
connected_clients:36
cluster_connections:0
maxclients:10000
client_recent_max_input_buffer:20559
client_recent_max_output_buffer:0
blocked_clients:5
tracking_clients:0
clients_in_timeout_table:5
redis /var/run/redis/redis.sock> config set timeout 0
OK
redis /var/run/redis/redis.sock> config get timeout
1) "timeout"
2) "0"
redis /var/run/redis/redis.sock>

Note: the purge of connections happens immediately after setting the timeout to 300.

• Check of management jobs: if I execute the jobs manually, they work, but they don't seem to be related to the "connected_clients" value from Redis (or the purge doesn't work properly), because the value stays unchanged.

For the record, the schedules for Cleanup Expired Sessions and Cleanup Expired OAuth 2 Tokens are broken. I get an error if I try to open the schedule page of these two jobs:

TypeError: Cannot read properties of undefined (reading 'sort')
    at oo (https://awx-fake.local/static/js/main.5a8f0ae4.js:2:850657)
    at Z9 (https://awx-fake.local/static/js/main.5a8f0ae4.js:2:2495944)
    at Ks (https://awx-fake.local/static/js/main.5a8f0ae4.js:2:902957)
    at Al (https://awx-fake.local/static/js/main.5a8f0ae4.js:2:890198)
    at Tl (https://awx-fake.local/static/js/main.5a8f0ae4.js:2:890126)
    at El (https://awx-fake.local/static/js/main.5a8f0ae4.js:2:889989)
    at yl (https://awx-fake.local/static/js/main.5a8f0ae4.js:2:886955)
    at https://awx-fake.local/static/js/main.5a8f0ae4.js:2:836449
    at t.unstable_runWithPriority (https://awx-fake.local/static/js/main.5a8f0ae4.js:2:932711)
    at Vi (https://awx-fake.local/static/js/main.5a8f0ae4.js:2:836226)

If I edit the schedules, I can see that they are missing some parameters. The schedules don't work.

(Screenshots: AWX_2_cleanup_oauthsessions, AWX_2_cleanup_sessions)

I tried a fresh install and noticed that the two broken schedules are present there too (so it is not related to the update/migration process).

rakesh561 commented 8 months ago

Any updates on this issue? We are also facing a similar issue in our environments, with the same Redis error message.

danielabelski commented 8 months ago

Same issue. Fixed with config set timeout 300 and then running Cleanup Expired Sessions.

rakesh561 commented 8 months ago

@danielabelski I believe the config set timeout 300 setting is lost when the deployment is restarted. Just curious, that was my thinking.

mhempstock commented 8 months ago

Same issue; updating the timeout also fixed it, but we need to get this configuration persisted.

jenniferKaiser21 commented 8 months ago

Any ideas on how to get this configuration to persist? We have had to perform rollout restarts of the deployments to resolve the issue (which, from what I can see using the info clients command above, resets connected_clients once it has reached 10,000).

mqufflc commented 7 months ago

The same issue occurred on our platform.

awx-operator version: 2.5.1

Here is what I got when trying to get client info from Redis on one of our awx-task pods:

root@awx-task-574c888989-4lxg2:/data# redis-cli -s /var/run/redis/redis.sock
redis /var/run/redis/redis.sock> info clients
ERR max number of clients reached
redis /var/run/redis/redis.sock> info clients
Error: Broken pipe

We have 3 task pods, which may explain why some jobs still run as intended, but the issue prevented project updates as well as workflow/job executions.

kladiv commented 7 months ago

+1, the same issue occurred here. Maybe the AWX Operator should make some redis.conf settings configurable (e.g. maxclients).

jgrantls commented 4 months ago

This issue occurred again on AWX 22.7.0 😞

The same fix from August is still working. The default timeout of 0 needs to be changed.

From the redis docs:

By default recent versions of Redis don't close the connection with the client if the client is idle for many seconds: the connection will remain open forever.

Source

Leaving 10k clients connected forever seems to be unintended behavior as far as AWX is concerned, as it breaks the entire application.
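
Until the default changes, one stopgap (a sketch only, not an official operator option) is to re-apply the setting non-interactively after each pod restart, e.g. from a small CronJob or post-deploy step; this assumes the sidecar container is named redis and uses the same socket path as above:

kubectl exec <awx-task-pod-name> -c redis -- redis-cli -s /var/run/redis/redis.sock config set timeout 300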

TheRealHaoLiu commented 1 month ago

In https://github.com/ansible/awx/pull/15398 I'm adding a bit more debug information to hopefully track down how we are leaking Redis connections...

The current theory is that it is related to:

2024-07-11T13:32:51.432519957Z Task was destroyed but it is pending!
2024-07-11T13:32:51.432519957Z task: <Task pending name='Task-247757' coro=<RedisConnection._read_data() done, defined at /var/lib/awx/venv/awx/lib64/python3.9/site-packages/aioredis/connection.py:180> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7fba77bf1700>()]> cb=[RedisConnection.__init__.<locals>.<lambda>() at /var/lib/awx/venv/awx/lib64/python3.9/site-packages/aioredis/connection.py:168]> 

(There seems to be a correlation between that message and the increase in Redis connections, but it is inconclusive.)

The PR names all of the asyncio tasks so we can more easily track down which tasks are the offenders...

If anyone is willing to apply this patch to their system, I'm happy to produce images.
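
In the meantime, a rough way to check the suspected correlation on a running system (plain kubectl/redis-cli; the redis container name is an assumption, adjust to your deployment):

# count of the warning in whichever pod emits it
kubectl logs <awx-pod-name> --all-containers | grep -c "Task was destroyed but it is pending"
# client count on the task pod's redis, where the growth was observed
kubectl exec <awx-task-pod-name> -c redis -- redis-cli -s /var/run/redis/redis.sock info clients | grep connected_clients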

TheRealHaoLiu commented 1 month ago

Scanning through our code, the only place that still uses aioredis is the old channels_redis version we pin to; newer versions of channels_redis stopped using aioredis (it is a defunct project).

Upgrading channels_redis might help here: https://github.com/ansible/awx/pull/15329
upgrading channel_redis might help here https://github.com/ansible/awx/pull/15329