Closed: azrdev closed this issue 2 years ago.
It has been corrected in 19.2.2.
@azrdev is this fixed in >= 19.2.2 for you?
I'll tell you as soon as our AWX is updated, sorry I'll have to wait some days for that to happen
@chrismeyersfsu @Seb0042 on 19.3.0 we still see our long-running jobs' state not correctly synced with AWX, even though its stdout is displayed fully.
The UI-visible error has changed, though: it's not stuck "running" anymore, but fails with status
Task was marked as running but was not present in the job queue, so it has been marked as failed.
Is this the same issue as #10211 and #11087 ?
Are you using the database provisioned by the operator, or connecting to your own?
A couple more questions:
I have the same issue with a fresh 19.3.0 and 19.4.0 Kubernetes setup with an external Postgres 12 DB. Every job that runs longer than ~5 minutes ends with the same failed status, even though the playbook runs themselves were all successful.
$ awx jobs list -k -f human --filter id,name,created,finished,status,controller_node,job_explanation
id name created finished status controller_node job_explanation
== ============ =========================== =========================== ========== ==================== =================================================================================================
4 linux-baseos 2021-10-14T10:11:54.567438Z 2021-10-14T10:14:09.641713Z failed awx-6dc7478d4-vpcsw
5 linux-baseos 2021-10-14T10:19:06.411401Z 2021-10-14T10:24:28.104199Z failed awx-6dc7478d4-vpcsw Task was marked as running but was not present in the job queue, so it has been marked as failed.
7 linux-baseos 2021-10-18T03:13:13.772004Z 2021-10-18T04:23:28.737044Z successful awx-686f6c755c-7gn28
11 linux-baseos 2021-10-18T06:47:03.580629Z 2021-10-18T06:47:19.530979Z failed awx-686f6c755c-7gn28
13 linux-baseos 2021-10-18T06:51:37.844249Z 2021-10-18T07:03:12.246307Z failed awx-686f6c755c-7gn28 Task was marked as running but was not present in the job queue, so it has been marked as failed.
14 linux-baseos 2021-10-18T07:21:01.486229Z 2021-10-18T07:38:05.904964Z failed awx-686f6c755c-7gn28 Task was marked as running but was not present in the job queue, so it has been marked as failed.
15 linux-baseos 2021-10-18T08:42:38.580710Z 2021-10-18T09:02:33.781595Z failed awx-5b4f4f57c5-xhgpc Task was marked as running but was not present in the job queue, so it has been marked as failed.
17 linux-baseos 2021-10-18T12:00:44.663579Z 2021-10-18T12:23:20.193032Z canceled awx-5b4f4f57c5-xhgpc
18 linux-baseos 2021-10-18T12:23:23.649428Z 2021-10-18T12:42:54.849812Z failed awx-7bfcf8d59c-ggsbd Task was marked as running but was not present in the job queue, so it has been marked as failed.
Job 20 log
Oct 19 09:42:45 awx-7bfcf8d59c-ggsbd awx-task DEBUG [abc4f02e214443f3911319adb23d6591] awx.main.dispatch task 6b084733-eb45-45ca-9f26-577854b5ee10 starting awx.main.tasks.handle_success_and_failure_notifications(*[20])
Oct 19 09:42:45 awx-7bfcf8d59c-ggsbd awx-task INFO [00faf08544f440d98ab3e5e88a20dd4a] awx.main.commands.run_callback_receiver Event processing is finished for Job 20, sending notifications
Oct 19 09:42:45 awx-7bfcf8d59c-ggsbd awx-task INFO [00faf08544f440d98ab3e5e88a20dd4a] awx.main.commands.run_callback_receiver Event processing is finished for Job 20, sending notifications
Oct 19 09:42:46 awx-7bfcf8d59c-ggsbd awx-task ERROR [00faf08544f440d98ab3e5e88a20dd4a] awx.main.tasks job 20 (running) Exception occurred while running task
Oct 19 09:42:46 awx-7bfcf8d59c-ggsbd awx-task Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/fields/related_descriptors.py", line 164, in __get__
rel_obj = self.field.get_cached_value(instance)
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/fields/mixins.py", line 13, in get_cached_value
return instance._state.fields_cache[cache_name]
Oct 19 09:42:46 awx-7bfcf8d59c-ggsbd awx-task KeyError: 'instance_group'
Oct 19 09:42:46 awx-7bfcf8d59c-ggsbd awx-task During handling of the above exception, another exception occurred:
Oct 19 09:42:46 awx-7bfcf8d59c-ggsbd awx-task Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
Oct 19 09:42:46 awx-7bfcf8d59c-ggsbd awx-task psycopg2.OperationalError: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
Complete job 20 log --> job20-awx.txt
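Since the traceback ends in psycopg2's "server closed the connection unexpectedly", one workaround (a hedged sketch, not something AWX does out of the box) is to enable TCP keepalives on the client side so probes fire before the DB or a firewall kills an idle connection. The parameter names below are standard libpq connection options that psycopg2 passes through; the host and all values are illustrative:

```python
# Sketch: libpq TCP keepalive options, usable as psycopg2.connect() kwargs
# or under DATABASES["default"]["OPTIONS"] in a Django settings module.
# Values are examples chosen to stay under a 10-minute idle cutoff.
keepalive_kwargs = {
    "keepalives": 1,            # enable TCP keepalives (libpq option)
    "keepalives_idle": 120,     # idle seconds before the first probe
    "keepalives_interval": 30,  # seconds between unanswered probes
    "keepalives_count": 5,      # failed probes before dropping the link
}

# The same options rendered as a libpq connection-string fragment:
dsn_fragment = " ".join(f"{k}={v}" for k, v in keepalive_kwargs.items())
print(dsn_fragment)
# → keepalives=1 keepalives_idle=120 keepalives_interval=30 keepalives_count=5
```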
@shanemcd sorry for the delay after your questions:
Are you using the database provisioned by the operator, or connecting to your own?
We use an external database. We're currently looking into the possibility that something else (database, firewall, ...) drops long-lived (perhaps idle) connections, but given the two other tickets named above plus the report from @showblue here, I'm not sure this isn't AWX's fault.
Is it possible that some control plane pods are getting evicted and re-launching?
Are you maybe scaling the control plane up and down around the time this happens?
No, neither is the case.
The DB admins told us that indeed the DB closes unused connections after 10 minutes. Does AWX offer some kind of keepalive for the DB connection, in case we cannot change this on the DB side?
Same here, with AWX version 19.4; database and AWX launched by the AWX operator in the same k8s namespace.
AWX does not have a feature to explicitly keep the database connection alive. Maybe there is a Django feature you can tweak to do this for you?
@chrismeyersfsu django has CONN_MAX_AGE and mentions the problem explicitly:
If your database terminates idle connections after some time, you should set CONN_MAX_AGE to a lower value, so that Django doesn’t attempt to use a connection that has been terminated by the database server
on https://docs.djangoproject.com/en/3.2/ref/databases/#persistent-connections
I cannot see where AWX (specifically awx_task, I guess) sets its Django config/options, which would also be the place to override CONN_MAX_AGE to something smaller than our external DB's idle timeout -- by default it is apparently set to a high value or None, i.e. unlimited. Could you point to the relevant AWX code?
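For illustration, here is how such an override would look in a plain Django settings module (a hedged sketch, not AWX's actual settings code; all names and values are examples). A CONN_MAX_AGE lower than the DB side's idle timeout (10 minutes in this thread) makes Django recycle the connection before the server can kill it, and 0 closes it after each request:

```python
# Sketch of a Django DATABASES setting with CONN_MAX_AGE set below the
# database server's 10-minute idle cutoff. Host/name are placeholders.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "awx",
        "HOST": "db.example.com",  # placeholder external DB host
        "CONN_MAX_AGE": 300,       # seconds; keep below the DB's 600 s cutoff
    }
}

assert DATABASES["default"]["CONN_MAX_AGE"] < 600
```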
Hello. Based on the small number of folks seeing this, it seems likely that the problem is with your environment. If you need help troubleshooting, or are looking for help using AWX, try our mailing list or IRC channel:
https://groups.google.com/forum/#!forum/awx-project
If after further troubleshooting you still think this is a bug in AWX, please open a new issue with any information you find.
Please confirm the following
Summary
I have a WFJ which has completed all tasks but is stuck in the "running" state.
kubectl logs shows traceback(s).
AWX version
19.2.0
Installation method
kubernetes
Modifications
yes
Ansible version
No response
Operating system
No response
Web browser
No response
Steps to reproduce
If I knew /o\
Expected results
the job to terminate
Actual results
keeps "running"
Additional information
Customization: our database is external to the k8s cluster.
(Maybe) related tickets:
#4341
#10489
#10151