@AlanCoding pinging just to bring the django errors above to your attention
@elibogomolnyi Once it gets into this state, do any jobs launch successfully? or is the system completely unresponsive at that point?
@fosterseth, @AlanCoding, thank you for your reply. Once it gets into this state, the system is completely unresponsive. We can try to run another job, but these jobs won't create any pods in the cluster. I also want to share with you how we recover from this state.
I'd also like to add a few points:
Here is a script that can reproduce a big part of that traceback.
# Runs in awx-manage shell_plus (where this import is already implicit)
from awx.main.models import WorkflowApprovalTemplate

wat = WorkflowApprovalTemplate.objects.create(name='foo')
aj1 = wat.create_unified_job()
aj1.signal_start()
aj1.status = 'successful'
aj1.save()

# Re-fetch the template, then delete the job its last_job field points at
wat = WorkflowApprovalTemplate.objects.get(pk=wat.id)
aj1.delete()

# Launching again now walks through the dangling last_job reference
aj2 = wat.create_unified_job()
aj2.signal_start()
In the second block, we fetch the template and then, after fetching it, delete one of its related objects. It even reproduces the KeyError.
So, this is to say, "if the task manager did this, then that error would happen". But there is no clear reason the task manager would ever do that.
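To spell out the mechanics (a sketch, not AWX code): after the delete, wat.last_job_id still points at the removed job, so Django's related descriptor first misses its field cache (the KeyError) and then fails the fallback query (the DoesNotExist). A defensive read of that relation would look roughly like this hypothetical helper:

from awx.main.models import UnifiedJob

def last_job_or_none(template):
    # Hypothetical helper, not part of AWX: accessing .last_job raises
    # UnifiedJob.DoesNotExist when last_job_id references a job row
    # that has since been deleted.
    try:
        return template.last_job
    except UnifiedJob.DoesNotExist:
        return None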
It helps to know that this happens when the system is under load, but the mechanics of workflow approvals make it difficult to contrive a situation that could cause this. Every workflow node with an approval template should have its own approval template, which would make it less likely for multiple workflows to interfere... Are you running the same workflow concurrently?
If you go into awx-manage shell_plus, can you run this without getting a traceback?
[t.last_job for t in WorkflowApprovalTemplate.objects.all()]
If that throws the error by itself, then your data is corrupted due to something with on_delete behavior. Maybe check this while the system is in the unresponsive state.
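A slightly longer, read-only version of that check, as a sketch (it assumes the last_job foreign key on these templates), prints every template with a dangling reference instead of stopping at the first traceback:

from awx.main.models import WorkflowApprovalTemplate, UnifiedJob

# Report templates whose last_job_id points at a job row that no longer exists
for wat in WorkflowApprovalTemplate.objects.all():
    dangling = bool(wat.last_job_id) and not UnifiedJob.objects.filter(pk=wat.last_job_id).exists()
    print(wat.pk, wat.name, wat.last_job_id, 'DANGLING' if dangling else 'ok')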
How is approving being done in your system? Manually through the UI? Are you scripting that somehow? Do your scripts aggressively delete data?
It is extremely tempting to go through and refactor to avoid that call stack, because _update_parent_instance is a UnifiedJob method that is redundant with _set_status_and_last_job_run on the UnifiedJobTemplate model. However, as history shows, refactors are always very difficult to complete. If we could identify the root cause of this, it could help justify spending some work on that.
https://github.com/ansible/awx/pull/13553 has come together better than I thought it would, and you could try that as a potential fix.
I've actually come around to believe we should run with that PR. If we could get more info to have a full reproducer, that would be great.
@AlanCoding, sorry for the delay; it took us some time to reproduce this error. For now, we have 400 fake-running workflows without any actual worker pods in K8s.
Now I can see that I was wrong when I was answering your question:
Once it gets into this state, do any jobs launch successfully? or is the system completely unresponsive at that point?
When it gets into this state, I can run some workflow, and it finishes as expected. But if we run the same workflow with manual approvals, it gets stuck in the Running state without running any jobs from its schema.
Are you running the same workflow concurrently?
Yes, we ran the same workflow 400 times concurrently with different inventories and extra variables. We have a wait_for_approval workflow as part of the workflow that we run, and it is triggered each time one of the stages fails. Here is the schema of the workflow that we ran 400 times; I hope it helps to understand it better (I've changed the job and workflow names to explain the main idea). According to this schema, the workflow should finish successfully, but it doesn't; it stays stuck as "Running" in the UI.
We can also see only one wait_for_approval workflow stuck in the running state in the UI; not even its jobs show up. Can it be that this one makes all the workflows get stuck?
If you go into awx-manage shell_plus, can you run this without getting a traceback? [t.last_job for t in WorkflowApprovalTemplate.objects.all()] If that throws the error by itself, then your data is corrupted due to something with on_delete behavior. Maybe check this while the system is in the unresponsive state.
I've tried it when the system is in an unresponsive state. This is what I get:
[t.last_job for t in WorkflowApprovalTemplate.objects.all()]
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/fields/related_descriptors.py", line 173, in __get__
    rel_obj = self.field.get_cached_value(instance)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/fields/mixins.py", line 15, in get_cached_value
    return instance._state.fields_cache[cache_name]
KeyError: 'last_job'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "<console>", line 1, in <listcomp>
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/fields/related_descriptors.py", line 187, in __get__
    rel_obj = self.get_object(instance)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/fields/related_descriptors.py", line 154, in get_object
    return qs.get(self.field.get_reverse_related_filter(instance))
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/query.py", line 435, in get
    raise self.model.DoesNotExist(
awx.main.models.unified_jobs.UnifiedJob.DoesNotExist: UnifiedJob matching query does not exist.
So yes, we get the same traceback. Does it mean that our data is corrupted? We are not touching our system for now, so let us know if we should check anything else while it is in this broken state.
How is approving being done in your system? Manually through the UI? Are you scripting that somehow?
Only manually, with no automation
Do your scripts aggressively delete data?
No
@AlanCoding, today we have resolved this issue. You were right that it was about the corrupted database. When we checked the AWX UI, we saw that all workflow approvals were somehow deleted.
To resolve this issue, we erased the value of last_job_id in public.main_unifiedjobtemplate for our approval template. For some reason, it was pointing to a non-existent job (which, by the way, we could still find in the public.main_unifiedjobtemplate table, but it was not accessible from the UI). We deleted the link to this job, and everything started working.
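For anyone hitting the same thing, the equivalent cleanup from awx-manage shell_plus would look roughly like this (a sketch: the template name is a placeholder, and back up the database before changing anything):

from awx.main.models import WorkflowApprovalTemplate

# 'my_approval' is a placeholder for the affected approval template's name
wat = WorkflowApprovalTemplate.objects.get(name='my_approval')
wat.last_job = None  # clear the dangling last_job_id
wat.save(update_fields=['last_job'])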
So this issue is not related to load but to the jobs that trigger the manual approval.
Thank you very much for your help. We wonder what could delete all workflow approvals from the system without updating the record in the public.main_unifiedjobtemplate table.
Everything we have related to cleanup is this Cleanup Activity Schedule. Can this kind of schedule cause this issue?
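For reference, this is roughly how we look at what that schedule runs from awx-manage shell_plus (a sketch; the name filter is an assumption):

from awx.main.models import Schedule

# List schedules whose name mentions "cleanup" along with what they launch
for s in Schedule.objects.filter(name__icontains='cleanup'):
    print(s.name, s.unified_job_template.name, s.rrule)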
Interesting finds. I still come up empty-handed about why the workflow approvals might have been deleted. The system job you mentioned is about activity stream so I might rule that out. The view should prevent deleting the workflow approval while it's running.
Anyway, I still believe there is value in the refactor I linked earlier, which may avoid these issues even if the bad state happens. It is also valuable to increase processing speed to begin with.
@AlanCoding, thank you for your reply. I am closing this issue for now.
Please confirm the following
Bug Summary
When we run more than 300 workflows simultaneously, the workflows get stuck in the "Running" state. If we get inside the workflow schema, we can see that the first job of the workflow was successfully finished, but the second job was not even started.
We can also see the same behavior when the workflow gets stuck in the middle or even at the end of the workflow (in the last case, we can see that all jobs were successfully finished, but the workflow is still stuck in the "Running" state). We are running our jobs in containers in the awx-workers namespace. If I check the pods in this namespace when this issue happens, I can see that there are no pods at all, but all of the workflows are still "Running" in AWX's UI.
It is worth mentioning that we use 10 AWX replicas for our processes. By the way, we have upgraded the AWX version from 21.2.0 to 21.10.2; we are not sure if it is related because we have never run this many workflows before. The EKS version is v1.24.8-eks-ffeb93d.
When this issue happens, we get many error logs in the awx-task container that look like this:
AWX version
21.10.2
Select the relevant components
Installation method
kubernetes
Modifications
no
Ansible version
No response
Operating system
No response
Web browser
Chrome
Steps to reproduce
Run more than 300 workflows in AWX
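For illustration, launching the same workflow this many times could look roughly like the sketch below (hypothetical host, token, and template id; it also assumes the workflow prompts for extra variables on launch):

import requests

AWX_URL = "https://awx.example.com"   # placeholder host
TOKEN = "REDACTED"                    # placeholder OAuth2 token
WFJT_ID = 42                          # placeholder workflow job template id

for i in range(300):
    # POST to the standard workflow launch endpoint once per run
    resp = requests.post(
        f"{AWX_URL}/api/v2/workflow_job_templates/{WFJT_ID}/launch/",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"extra_vars": {"run_index": i}},  # varies per launch, assuming prompt on launch is enabled
    )
    resp.raise_for_status()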
Expected results
All workflows are finished successfully
Actual results
Workflows are stuck in the "Running" state
Additional information
No response