@AlanCoding pinging just to bring the django errors above to your attention
@elibogomolnyi Once it gets into this state, do any jobs launch successfully? or is the system completely unresponsive at that point?
@fosterseth, @AlanCoding, thank you for your reply. Once it gets into this state, the system is completely unresponsive. We can try to run another job, but these jobs won't create any pods in the cluster. I also want to share with you how we recover from this state.
I'd also like to add a few points:
Here is a script that can reproduce a big part of that traceback.
# Runs in awx-manage shell_plus (where this import is already implicit)
from awx.main.models import WorkflowApprovalTemplate

wat = WorkflowApprovalTemplate.objects.create(name='foo')
aj1 = wat.create_unified_job()
aj1.signal_start()
aj1.status = 'successful'
aj1.save()

# Re-fetch the template, then delete the job its last_job field points at
wat = WorkflowApprovalTemplate.objects.get(pk=wat.id)
aj1.delete()

# Launching again now walks through the dangling last_job reference
aj2 = wat.create_unified_job()
aj2.signal_start()
In the second block, we fetch the template and then, after fetching it, delete one of its related objects. It even reproduces the KeyError.
So, this is to say, "if the task manager did this, then that error would happen". But there is no clear reason the task manager would ever do that.
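To spell out the mechanics (a sketch, not AWX code): after the delete, wat.last_job_id still points at the removed job, so Django's related descriptor first misses its field cache (the KeyError) and then fails the fallback query (the DoesNotExist). A defensive read of that relation would look roughly like this hypothetical helper:

from awx.main.models import UnifiedJob

def last_job_or_none(template):
    # Hypothetical helper, not part of AWX: accessing .last_job raises
    # UnifiedJob.DoesNotExist when last_job_id references a job row
    # that has since been deleted.
    try:
        return template.last_job
    except UnifiedJob.DoesNotExist:
        return None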
It helps to know that this happens when the system is under load, but the mechanics of workflow approvals make it difficult to contrive a situation that could cause this. Every workflow node with an approval template should have its own approval template, which would make it less likely for multiple workflows to interfere... Are you running the same workflow concurrently?
If you go into awx-manage shell_plus, can you run this without getting a traceback?
[t.last_job for t in WorkflowApprovalTemplate.objects.all()]
If that throws the error by itself, then your data is corrupted due to something with on_delete behavior. Maybe check this while the system is in the unresponsive state.
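A slightly longer, read-only version of that check, as a sketch (it assumes the last_job foreign key on these templates), prints every template with a dangling reference instead of stopping at the first traceback:

from awx.main.models import WorkflowApprovalTemplate, UnifiedJob

# Report templates whose last_job_id points at a job row that no longer exists
for wat in WorkflowApprovalTemplate.objects.all():
    dangling = bool(wat.last_job_id) and not UnifiedJob.objects.filter(pk=wat.last_job_id).exists()
    print(wat.pk, wat.name, wat.last_job_id, 'DANGLING' if dangling else 'ok')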
How is approving being done in your system? Manually through the UI? Are you scripting that somehow? Do your scripts aggressively delete data?
It is extremely tempting to go through and refactor to avoid that call stack, because _update_parent_instance is a UnifiedJob method that is redundant with _set_status_and_last_job_run on the UnifiedJobTemplate model. However, as history shows, refactors are always very difficult to complete. If we could identify the root cause of this, it could help justify spending some work on that.
https://github.com/ansible/awx/pull/13553 has come together better than I thought it would, and you could try that as a potential fix.
I've actually come around to believe we should run with that PR. If we could get more info to have a full reproducer, that would be great.
@AlanCoding, sorry for the delay; it took us some time to reproduce this error. For now, we have 400 fake-running workflows without any actual worker pods in K8s.
Now I can see that I was wrong when I was answering your question:
Once it gets into this state, do any jobs launch successfully? or is the system completely unresponsive at that point?
When it gets into this state, I can run some workflow, and it finishes as expected. But if we run the same workflow with manual approvals, it gets stuck in the Running state without running any jobs from its schema.
Are you running the same workflow concurrently?
Yes, we ran the same workflow 400 times concurrently with different inventories and extra variables. We have a wait_for_approval workflow as part of the workflow that we run, and it is triggered each time one of the stages fails. Here is the schema of the workflow that we ran 400 times; I hope it helps to understand it better (I've changed the job and workflow names to explain the main idea). According to this schema, the workflow should finish successfully, but it doesn't; it stays stuck as "Running" in the UI.
We can also see only one wait_for_approval workflow stuck in the running state in the UI; not even its jobs show up. Can it be that this one makes all the workflows get stuck?
If you go into awx-manage shell_plus, can you run this without getting a traceback? [t.last_job for t in WorkflowApprovalTemplate.objects.all()] If that throws the error by itself, then your data is corrupted due to something with on_delete behavior. Maybe check this while the system is in the unresponsive state.
I've tried it when the system is in an unresponsive state. This is what I get:
[t.last_job for t in WorkflowApprovalTemplate.objects.all()]
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/fields/related_descriptors.py", line 173, in __get__
    rel_obj = self.field.get_cached_value(instance)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/fields/mixins.py", line 15, in get_cached_value
    return instance._state.fields_cache[cache_name]
KeyError: 'last_job'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "<console>", line 1, in <listcomp>
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/fields/related_descriptors.py", line 187, in __get__
    rel_obj = self.get_object(instance)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/fields/related_descriptors.py", line 154, in get_object
    return qs.get(self.field.get_reverse_related_filter(instance))
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/query.py", line 435, in get
    raise self.model.DoesNotExist(
awx.main.models.unified_jobs.UnifiedJob.DoesNotExist: UnifiedJob matching query does not exist.
So yes, we get the same traceback. Does it mean that our data is corrupted? We are not touching our system for now, so let us know if we should check anything else while it is in this broken state.
How is approving being done in your system? Manually through the UI? Are you scripting that somehow?
Only manually, with no automation
Do your scripts aggressively delete data?
No
@AlanCoding, today we have resolved this issue. You were right that it was about the corrupted database. When we checked the AWX UI, we saw that all workflow approvals were somehow deleted.
To resolve this issue, we erased the value of last_job_id in public.main_unifiedjobtemplate for our approval template. For some reason, it was pointing to a non-existent job (which, by the way, we could still find in the public.main_unifiedjobtemplate table, but it was not accessible from the UI). We deleted the link to this job, and everything started working.
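For anyone hitting the same thing, the equivalent cleanup from awx-manage shell_plus would look roughly like this (a sketch: the template name is a placeholder, and back up the database before changing anything):

from awx.main.models import WorkflowApprovalTemplate

# 'my_approval' is a placeholder for the affected approval template's name
wat = WorkflowApprovalTemplate.objects.get(name='my_approval')
wat.last_job = None  # clear the dangling last_job_id
wat.save(update_fields=['last_job'])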
So this issue is not related to load but to the jobs that trigger the manual approval.
Thank you very much for your help. We wonder what could delete all workflow approvals from the system without updating the record in the public.main_unifiedjobtemplate table.
Everything we have related to cleanup is this Cleanup Activity Schedule. Can this kind of schedule cause this issue?
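For reference, this is roughly how we look at what that schedule runs from awx-manage shell_plus (a sketch; the name filter is an assumption):

from awx.main.models import Schedule

# List schedules whose name mentions "cleanup" along with what they launch
for s in Schedule.objects.filter(name__icontains='cleanup'):
    print(s.name, s.unified_job_template.name, s.rrule)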
Interesting finds. I still come up empty-handed about why the workflow approvals might have been deleted. The system job you mentioned is about activity stream so I might rule that out. The view should prevent deleting the workflow approval while it's running.
Anyway, I still believe there is value in the refactor I linked earlier, which may avoid these issues even if the bad state happens. It is also valuable to increase processing speed to begin with.
@AlanCoding, thank you for your reply. I am closing this issue for now.
Please confirm the following
Bug Summary
When we run more than 300 workflows simultaneously, the workflows get stuck in the "Running" state. If we get inside the workflow schema, we can see that the first job of the workflow was successfully finished, but the second job was not even started.
We can also see the same behavior when the workflow gets stuck in the middle or even at the end of the workflow (in the last case, we can see that all jobs were successfully finished, but the workflow is still stuck in the "Running" state). We are running our jobs in containers in the awx-workers namespace. If I check the pods in this namespace when this issue happens, I can see that there are no pods at all, but all of the workflows are still "Running" in AWX's UI.
It is worth mentioning that we use 10 AWX replicas for our processes. By the way, we have upgraded the AWX version from 21.2.0 to 21.10.2; we are not sure if it is related because we have never run this many workflows before. The EKS version is v1.24.8-eks-ffeb93d.
When this issue happens, we get many error logs in the awx-task container that look like this:
AWX version
21.10.2
Select the relevant components
Installation method
kubernetes
Modifications
no
Ansible version
No response
Operating system
No response
Web browser
Chrome
Steps to reproduce
Run more than 300 workflows in AWX
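For illustration, launching the same workflow this many times could look roughly like the sketch below (hypothetical host, token, and template id; it also assumes the workflow prompts for extra variables on launch):

import requests

AWX_URL = "https://awx.example.com"   # placeholder host
TOKEN = "REDACTED"                    # placeholder OAuth2 token
WFJT_ID = 42                          # placeholder workflow job template id

for i in range(300):
    # POST to the standard workflow launch endpoint once per run
    resp = requests.post(
        f"{AWX_URL}/api/v2/workflow_job_templates/{WFJT_ID}/launch/",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"extra_vars": {"run_index": i}},  # varies per launch, assuming prompt on launch is enabled
    )
    resp.raise_for_status()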
Expected results
All workflows are finished successfully
Actual results
Workflows are stuck in the "Running" state
Additional information
No response