ansible / awx

AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.
Other
14.03k stars 3.42k forks source link

Job stuck "running", Post run hook errored, DB connection crashed? #11027

Closed azrdev closed 2 years ago

azrdev commented 3 years ago

Please confirm the following

Summary

I have a WFJ which has completed all tasks but is stuck in the "running" state. kubectl logs shows traceback(s):

2021-09-06 10:21:20,837 DEBUG    [8f2c583bfb674c6d9b3edaacd8a62ac6] awx.analytics.job_lifecycle job-209 preparing playbook
2021-09-06 10:21:21,000 DEBUG    [8f2c583bfb674c6d9b3edaacd8a62ac6] awx.analytics.job_lifecycle job-209 running playbook
...
2021-09-06 11:04:56,228 ERROR    [-] awx.conf.settings Database settings are not available, using defaults.
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 235, in _cursor
    return self._prepare_cursor(self.create_cursor(name))
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/postgresql/base.py", line 223, in create_cursor
    cursor = self.connection.cursor()
psycopg2.InterfaceError: connection already closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/conf/settings.py", line 81, in _ctit_db_wrapper
    yield
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/conf/settings.py", line 411, in __getattr__
    value = self._get_local(name)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/conf/settings.py", line 333, in _get_local
    self._preload_cache()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/conf/settings.py", line 296, in _preload_cache
    for setting in Setting.objects.filter(key__in=settings_to_cache.keys(), user__isnull=True).order_by('pk'):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 274, in __iter__
    self._fetch_all()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 1242, in _fetch_all
    self._result_cache = list(self._iterable_class(self))
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 55, in __iter__
    results = compiler.execute_sql(chunked_fetch=self.chunked_fetch, chunk_size=self.chunk_size)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/sql/compiler.py", line 1140, in execute_sql
    cursor = self.connection.cursor()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 256, in cursor
    return self._cursor()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 235, in _cursor
    return self._prepare_cursor(self.create_cursor(name))
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/utils.py", line 89, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 235, in _cursor
    return self._prepare_cursor(self.create_cursor(name))
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/postgresql/base.py", line 223, in create_cursor
    cursor = self.connection.cursor()
django.db.utils.InterfaceError: connection already closed
RESULT 2
OKREADY

...
2021-09-06 11:05:32,664 DEBUG    [f96d53f7b799434ca4f4a35cb069321e] awx.main.dispatch task 98e4ce8a-396e-4330-8f8c-eb50906f4516 starting awx.main.tasks.handle_success_and_failure_notifications(*[209])
2021-09-06 11:05:32,823 ERROR    [8f2c583bfb674c6d9b3edaacd8a62ac6] awx.main.tasks job 209 (running) Exception occurred while running task
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
psycopg2.errors.ProtocolViolation: server conn crashed?
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1384, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2900, in run
    return self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2941, in _run_internal
    res = list(first_future.done)[0].result()
  File "/usr/lib64/python3.8/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/usr/lib64/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
  File "/usr/lib64/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/utils/common.py", line 1036, in wrapper_cleanup_new_process
    return func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2982, in processor
    return ansible_runner.interface.run(
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/interface.py", line 249, in run
    r.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/streaming.py", line 251, in run
    self.event_callback(data)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/streaming.py", line 214, in event_callback
    should_write = self.event_handler(event_data)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1203, in event_handler
    self.instance.save(update_fields=['artifacts'])
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/models/unified_jobs.py", line 815, in save
    self_before = self.__class__.objects.get(pk=self.pk)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/manager.py", line 82, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 402, in get
    num = len(clone)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 256, in __len__
    self._fetch_all()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 1242, in _fetch_all
    self._result_cache = list(self._iterable_class(self))
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/polymorphic/query.py", line 60, in _polymorphic_iterator
    o = next(base_iter)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 55, in __iter__
    results = compiler.execute_sql(chunked_fetch=self.chunked_fetch, chunk_size=self.chunk_size)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/sql/compiler.py", line 1142, in execute_sql
    cursor.execute(sql, params)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/utils.py", line 67, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/utils.py", line 76, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/utils.py", line 89, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
django.db.utils.DatabaseError: server conn crashed?
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

2021-09-06 11:05:32,828 DEBUG    [8f2c583bfb674c6d9b3edaacd8a62ac6] awx.main.tasks job 209 (running) finished running, producing 147 events.
2021-09-06 11:05:32,830 ERROR    [8f2c583bfb674c6d9b3edaacd8a62ac6] awx.main.tasks job 209 (running) Post run hook errored.
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/fields/related_descriptors.py", line 164, in __get__
    rel_obj = self.field.get_cached_value(instance)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/fields/mixins.py", line 13, in get_cached_value
    return instance._state.fields_cache[cache_name]
KeyError: 'unified_job_template'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
psycopg2.errors.ProtocolViolation: server conn crashed?
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1410, in run
    self.post_run_hook(self.instance, status)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1105, in post_run_hook
    instance.log_lifecycle("post_run")
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/models/unified_jobs.py", line 1471, in log_lifecycle
    if self.unified_job_template:
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/fields/related_descriptors.py", line 178, in __get__
    rel_obj = self.get_object(instance)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/fields/related_descriptors.py", line 145, in get_object
    return qs.get(self.field.get_reverse_related_filter(instance))
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 402, in get
    num = len(clone)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 256, in __len__
    self._fetch_all()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 1242, in _fetch_all
    self._result_cache = list(self._iterable_class(self))
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/polymorphic/query.py", line 60, in _polymorphic_iterator
    o = next(base_iter)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 55, in __iter__
    results = compiler.execute_sql(chunked_fetch=self.chunked_fetch, chunk_size=self.chunk_size)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/sql/compiler.py", line 1142, in execute_sql
    cursor.execute(sql, params)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/utils.py", line 67, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/utils.py", line 76, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/utils.py", line 89, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
django.db.utils.DatabaseError: server conn crashed?
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

2021-09-06 11:05:32,839 ERROR    [8f2c583bfb674c6d9b3edaacd8a62ac6] awx.main.dispatch Worker failed to run task awx.main.tasks.RunJob(*[209], **{}
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/postgresql/base.py", line 239, in _set_autocommit
    self.connection.autocommit = autocommit
psycopg2.InterfaceError: connection already closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/dispatch/worker/task.py", line 90, in perform_work
    result = self.run_callable(body)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/dispatch/worker/task.py", line 65, in run_callable
    return _call(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 759, in _wrapped
    return f(self, *args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1420, in run
    self.instance = self.update_model(pk)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 790, in update_model
    with transaction.atomic():
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/transaction.py", line 201, in __enter__
    connection.set_autocommit(False, force_begin_transaction_with_broken_autocommit=True)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 405, in set_autocommit
    self._set_autocommit(autocommit)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/postgresql/base.py", line 239, in _set_autocommit
    self.connection.autocommit = autocommit
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/utils.py", line 89, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/postgresql/base.py", line 239, in _set_autocommit
    self.connection.autocommit = autocommit
django.db.utils.InterfaceError: connection already closed
2021-09-06 11:05:32,847 DEBUG    [8f2c583bfb674c6d9b3edaacd8a62ac6] awx.main.dispatch task 67c935a9-4d24-4652-b14c-9bd62b93fcda starting awx.main.tasks.handle_work_error(*['67c935a9-4d24-4652-b14c-9bd62b93fcda'])
2021-09-06 11:05:32,848 DEBUG    [8f2c583bfb674c6d9b3edaacd8a62ac6] awx.main.tasks Executing error task id 67c935a9-4d24-4652-b14c-9bd62b93fcda, subtasks: [{'type': 'job', 'id': 209}]
2021-09-06 11:05:32,850 ERROR    [8f2c583bfb674c6d9b3edaacd8a62ac6] awx.main.dispatch Worker failed to run task awx.main.tasks.handle_work_error(*['67c935a9-4d24-4652-b14c-9bd62b93fcda'], **{'subtasks': [{'type': 'job', 'id': 209}]}
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/postgresql/base.py", line 239, in _set_autocommit
    self.connection.autocommit = autocommit
psycopg2.InterfaceError: connection already closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/dispatch/worker/task.py", line 90, in perform_work
    result = self.run_callable(body)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/dispatch/worker/task.py", line 65, in run_callable
    return _call(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 759, in _wrapped
    return f(self, *args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1420, in run
    self.instance = self.update_model(pk)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 790, in update_model
    with transaction.atomic():
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/transaction.py", line 201, in __enter__
    connection.set_autocommit(False, force_begin_transaction_with_broken_autocommit=True)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 405, in set_autocommit
    self._set_autocommit(autocommit)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/postgresql/base.py", line 239, in _set_autocommit
    self.connection.autocommit = autocommit
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/utils.py", line 89, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/postgresql/base.py", line 239, in _set_autocommit
    self.connection.autocommit = autocommit
django.db.utils.InterfaceError: connection already closed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 235, in _cursor
    return self._prepare_cursor(self.create_cursor(name))
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/postgresql/base.py", line 223, in create_cursor
    cursor = self.connection.cursor()
psycopg2.InterfaceError: connection already closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/dispatch/worker/task.py", line 90, in perform_work
    result = self.run_callable(body)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/dispatch/worker/task.py", line 65, in run_callable
    return _call(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 575, in handle_work_error
    instance = UnifiedJob.get_instance_by_type(each_task['type'], each_task['id'])
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/models/unified_jobs.py", line 496, in get_instance_by_type
    return model.objects.get(id=job_id)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/manager.py", line 82, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 402, in get
    num = len(clone)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 256, in __len__
    self._fetch_all()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 1242, in _fetch_all
    self._result_cache = list(self._iterable_class(self))
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/polymorphic/query.py", line 60, in _polymorphic_iterator
    o = next(base_iter)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 55, in __iter__
    results = compiler.execute_sql(chunked_fetch=self.chunked_fetch, chunk_size=self.chunk_size)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/sql/compiler.py", line 1140, in execute_sql
    cursor = self.connection.cursor()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 256, in cursor
    return self._cursor()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 235, in _cursor
    return self._prepare_cursor(self.create_cursor(name))
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/utils.py", line 89, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 235, in _cursor
    return self._prepare_cursor(self.create_cursor(name))
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/postgresql/base.py", line 223, in create_cursor
    cursor = self.connection.cursor()
django.db.utils.InterfaceError: connection already closed
2021-09-06 11:05:32,851 DEBUG    [8f2c583bfb674c6d9b3edaacd8a62ac6] awx.main.dispatch task 67c935a9-4d24-4652-b14c-9bd62b93fcda starting awx.main.tasks.handle_work_success(*[])
2021-09-06 11:05:32,855 ERROR    [8f2c583bfb674c6d9b3edaacd8a62ac6] awx.main.dispatch Worker failed to run task awx.main.tasks.handle_work_success(*[], **{'task_actual': {'type': 'job', 'id': 209}}
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 235, in _cursor
    return self._prepare_cursor(self.create_cursor(name))
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/postgresql/base.py", line 223, in create_cursor
    cursor = self.connection.cursor()
psycopg2.InterfaceError: connection already closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/dispatch/worker/task.py", line 90, in perform_work
    result = self.run_callable(body)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/dispatch/worker/task.py", line 65, in run_callable
    return _call(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 556, in handle_work_success
    instance = UnifiedJob.get_instance_by_type(task_actual['type'], task_actual['id'])
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/models/unified_jobs.py", line 496, in get_instance_by_type
    return model.objects.get(id=job_id)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/manager.py", line 82, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 402, in get
    num = len(clone)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 256, in __len__
    self._fetch_all()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 1242, in _fetch_all
    self._result_cache = list(self._iterable_class(self))
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/polymorphic/query.py", line 60, in _polymorphic_iterator
    o = next(base_iter)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 55, in __iter__
    results = compiler.execute_sql(chunked_fetch=self.chunked_fetch, chunk_size=self.chunk_size)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/sql/compiler.py", line 1140, in execute_sql
    cursor = self.connection.cursor()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 256, in cursor
    return self._cursor()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 235, in _cursor
    return self._prepare_cursor(self.create_cursor(name))
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/utils.py", line 89, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 235, in _cursor
    return self._prepare_cursor(self.create_cursor(name))
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/postgresql/base.py", line 223, in create_cursor
    cursor = self.connection.cursor()
django.db.utils.InterfaceError: connection already closed
2021-09-06 11:05:33,665 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-09-06 11:05:33,665 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-09-06 11:05:33,666 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(3)
2021-09-06 11:05:33,666 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(3)
2021-09-06 11:05:37,724 WARNING  [f96d53f7b799434ca4f4a35cb069321e] awx.main.tasks Failed to even try to send notifications for job '2021-09-06 10:21:10.827687+00:00-209-running' due to job not being in finished state.

AWX version

19.2.0

Installation method

kubernetes

Modifications

yes

Ansible version

No response

Operating system

No response

Web browser

No response

Steps to reproduce

If I knew /o\

Expected results

the job to terminate

Actual results

keeps "running"

Additional information

Customization: our database is external to the k8s cluster.

(Maybe) related tickets:

Seb0042 commented 3 years ago

It has been corrected in 19.2.2.

chrismeyersfsu commented 3 years ago

@azrdev is this fixed in >= 19.2.2 for you?

azrdev commented 3 years ago

I'll tell you as soon as our AWX is updated, sorry I'll have to wait some days for that to happen

azrdev commented 3 years ago

@chrismeyersfsu @Seb0042 on 19.3.0 we still see our long-running jobs' state not correctly synced with AWX, even though its stdout is displayed fully.

The UI-visible error has changed, though: it's not stuck "running" anymore, but fails with status

Task was marked as running but was not present in the job queue, so it has been marked as failed.

Is this the same issue as #10211 and #11087 ?

shanemcd commented 3 years ago

Are you using the database provisioned by the operator, or connecting to your own?

shanemcd commented 3 years ago

A couple more questions:

showblue commented 3 years ago

I have the same issue with a fresh new 19.3.0 and 19.4.0 Kubernetes setup with external Postgres 12 DB. Every job that runs longer than 5 min will result in the same failed status except that the playbook runs were all successful.

$ awx jobs list -k -f human --filter id,name,created,finished,status,controller_node,job_explanation
id name         created                     finished                    status     controller_node      job_explanation
== ============ =========================== =========================== ========== ==================== =================================================================================================
4  linux-baseos 2021-10-14T10:11:54.567438Z 2021-10-14T10:14:09.641713Z failed     awx-6dc7478d4-vpcsw
5  linux-baseos 2021-10-14T10:19:06.411401Z 2021-10-14T10:24:28.104199Z failed     awx-6dc7478d4-vpcsw  Task was marked as running but was not present in the job queue, so it has been marked as failed.
7  linux-baseos 2021-10-18T03:13:13.772004Z 2021-10-18T04:23:28.737044Z successful awx-686f6c755c-7gn28
11 linux-baseos 2021-10-18T06:47:03.580629Z 2021-10-18T06:47:19.530979Z failed     awx-686f6c755c-7gn28
13 linux-baseos 2021-10-18T06:51:37.844249Z 2021-10-18T07:03:12.246307Z failed     awx-686f6c755c-7gn28 Task was marked as running but was not present in the job queue, so it has been marked as failed.
14 linux-baseos 2021-10-18T07:21:01.486229Z 2021-10-18T07:38:05.904964Z failed     awx-686f6c755c-7gn28 Task was marked as running but was not present in the job queue, so it has been marked as failed.
15 linux-baseos 2021-10-18T08:42:38.580710Z 2021-10-18T09:02:33.781595Z failed     awx-5b4f4f57c5-xhgpc Task was marked as running but was not present in the job queue, so it has been marked as failed.
17 linux-baseos 2021-10-18T12:00:44.663579Z 2021-10-18T12:23:20.193032Z canceled   awx-5b4f4f57c5-xhgpc
18 linux-baseos 2021-10-18T12:23:23.649428Z 2021-10-18T12:42:54.849812Z failed     awx-7bfcf8d59c-ggsbd Task was marked as running but was not present in the job queue, so it has been marked as failed.

Job 20 log

Oct 19 09:42:45  awx-7bfcf8d59c-ggsbd  awx-task DEBUG   [abc4f02e214443f3911319adb23d6591] awx.main.dispatch task 6b084733-eb45-45ca-9f26-577854b5ee10 starting awx.main.tasks.handle_success_and_failure_notifications(*[20])
Oct 19 09:42:45  awx-7bfcf8d59c-ggsbd  awx-task INFO    [00faf08544f440d98ab3e5e88a20dd4a] awx.main.commands.run_callback_receiver Event processing is finished for Job 20, sending notifications
Oct 19 09:42:45  awx-7bfcf8d59c-ggsbd  awx-task INFO    [00faf08544f440d98ab3e5e88a20dd4a] awx.main.commands.run_callback_receiver Event processing is finished for Job 20, sending notifications
Oct 19 09:42:46  awx-7bfcf8d59c-ggsbd  awx-task ERROR   [00faf08544f440d98ab3e5e88a20dd4a] awx.main.tasks job 20 (running) Exception occurred while running task
Oct 19 09:42:46  awx-7bfcf8d59c-ggsbd  awx-task  Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/fields/related_descriptors.py", line 164, in __get__
    rel_obj = self.field.get_cached_value(instance)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/fields/mixins.py", line 13, in get_cached_value
    return instance._state.fields_cache[cache_name]
Oct 19 09:42:46  awx-7bfcf8d59c-ggsbd  awx-task  KeyError: 'instance_group'
Oct 19 09:42:46  awx-7bfcf8d59c-ggsbd  awx-task  During handling of the above exception, another exception occurred:
Oct 19 09:42:46  awx-7bfcf8d59c-ggsbd  awx-task  Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
Oct 19 09:42:46  awx-7bfcf8d59c-ggsbd  awx-task  psycopg2.OperationalError: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.

Complete job 20 log --> job20-awx.txt

azrdev commented 3 years ago

@shanemcd sorry for the delay after your questions:

Are you using the database provisioned by the operator, or connecting to your own?

We use an external database. We're currently looking into the possibility that something else (database, firewall,...) drops long-lived (maybe unused) connections, but given there are the two named other tickets + showblue here, I'm not sure it's not a AWX's fault.

  • Is it possible that some control plane pods are getting evicted and re-launching?

  • Are you maybe scaling the control plane up and down around the time this happens?

no, this is not the case.

azrdev commented 3 years ago

The DB admins told us that indeed the DB closes unused connections after 10 minutes. Does AWX offer some kind of keepalive for the DB connection, in case we cannot change this on the DB side?

CWollinger commented 2 years ago

Same here. With AWX version 19.4. Database and AWX launched by the AWX operator in the same k8s namespace

chrismeyersfsu commented 2 years ago

AWX does not have a feature to explicitly keep the database connection alive. Maybe there is a Django feature you can tweak to do this for you?

azrdev commented 2 years ago

@chrismeyersfsu django has CONN_MAX_AGE and mentions the problem explicitly:

If your database terminates idle connections after some time, you should set CONN_MAX_AGE to a lower value, so that Django doesn’t attempt to use a connection that has been terminated by the database server

on https://docs.djangoproject.com/en/3.2/ref/databases/#persistent-connections

I cannot see where AWX (specifically awx_task, I guess) sets its django config/options, which would also be the place to override CONN_MAX_AGE to something smaller than our external DB has -- by default it's apparently on a high value or None aka unlimited. Could you point to the relevant AWX code?

shanemcd commented 2 years ago

Hello. Based on the few number of folks seeing this, it seems likely that the problem is with your environment. If you need help troubleshooting or are looking for help using AWX, try our mailing list or IRC channel:

ansible-awx on https://libera.chat/

https://groups.google.com/forum/#!forum/awx-project

If after further troubleshooting you still think this is a bug in AWX, please open a new issue with any information you find.