sparkacus closed this issue 5 years ago.
👍 This is also happening to me. I also created a second project because updating the project was blocked by this. I had been running on awx 1.0.1.0 and this issue persisted across an upgrade to 1.0.2.0 for me.
Also happening here. When I try to cancel a task, it won't cancel. I also tried via tower-cli and via the API; I can't cancel or delete it.
I think it got stuck because too little memory was available:
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: OSError: [Errno 12] Cannot allocate memory
fatal: [hostname]: FAILED! => {"failed": true, "msg": "Unexpected failure during module execution.", "stdout": ""}
+1 Happening to me too
Same here. Can you please extend tower-cli to allow interfacing with inventory sync jobs?
[Errno 12] Cannot allocate memory 12 fatal
It seems like you might not have enough memory? Low memory can have pretty dramatically bad effects... how much is allocated to AWX?
We have 8GB RAM assigned.
I had this issue start on a machine with 8 GB and have since migrated to a system with 16 GB of RAM. My job has been stuck for 797 hours so far.
I had a similar issue after upgrading to the latest AWX version. Before the update I stopped all containers while a job was running. After the update was done, the job was hanging in the running state, and I was unable to cancel or delete it via the GUI or API.
My solution was to delete the job via "awx-manage shell_plus".
# docker exec -it awx_web bash
# awx-manage shell_plus
>>> from awx.main.models import UnifiedJob
>>> job_id = 28766
>>> job = UnifiedJob.objects.get(id=job_id)
>>> job.delete()
(20, {u'main.ActivityStream_job': 1, u'main.UnifiedJob': 1, u'main.JobOrigin': 1, u'main.UnifiedJob_credentials': 2, u'main.Job': 1, u'main.JobEvent': 10, u'main.UnifiedJob_dependent_jobs': 4, u'main.ActivityStream_unified_job': 0, u'main.JobEvent_hosts': 0})
After doing that, all scheduled jobs continued to run.
Hope it helps!
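For anyone who'd rather not open a Django shell, the same cleanup can usually be done through AWX's REST API (`POST /api/v2/jobs/<id>/cancel/`, then `DELETE /api/v2/jobs/<id>/` as a fallback). A minimal sketch using only the standard library; the host name and job ID are placeholders, and you would still need to attach authentication before actually sending the requests:

```python
import urllib.request

AWX_HOST = "https://awx.example.com"  # placeholder host

def cancel_request(job_id):
    """Build (but don't send) a POST to the job-cancel endpoint."""
    url = f"{AWX_HOST}/api/v2/jobs/{job_id}/cancel/"
    return urllib.request.Request(url, method="POST")

def delete_request(job_id):
    """Build a DELETE for the job itself, if cancel is refused."""
    url = f"{AWX_HOST}/api/v2/jobs/{job_id}/"
    return urllib.request.Request(url, method="DELETE")

# To actually fire these, add an Authorization header and call
# urllib.request.urlopen(req) -- omitted here, the host is fake.
req = cancel_request(28766)
```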
To delete all pending jobs:
>>> from awx.main.models import UnifiedJob
>>> map(lambda x: x.delete(), UnifiedJob.objects.filter(status='pending'))
edit: changed from Job to UnifiedJob
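One caveat with the one-liner above: it relies on Python 2 semantics, where `map()` runs eagerly. Under Python 3, `map()` is lazy and nothing is deleted until the result is consumed; a plain loop is the safe form. A stand-in demonstration with a dummy class (not the AWX model):

```python
# Stand-in for a queryset row; delete() just records that it ran.
class FakeJob:
    def __init__(self):
        self.deleted = False
    def delete(self):
        self.deleted = True

jobs = [FakeJob(), FakeJob()]

# Python 3: map() is lazy -- building it deletes nothing.
map(lambda j: j.delete(), jobs)
assert not any(j.deleted for j in jobs)

# A plain loop (or list(map(...))) actually runs the deletes.
for j in jobs:
    j.delete()
assert all(j.deleted for j in jobs)
```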
Hmm, I can't get it to work @st0ne-dot-at
>>> from awx.main.models import Job
>>> job_id = 208
>>> jobs = Job.objects.get(id=job_id)
>>> job.delete()
Traceback (most recent call last):
File "<console>", line 1, in <module>
NameError: name 'job' is not defined
>>> from awx.main.models import Job
>>> job_id = 208
>>> jobs = Job.objects.get(id=job_id)
>>> Job.delete()
Traceback (most recent call last):
File "<console>", line 1, in <module>
TypeError: unbound method delete() must be called with Job instance as first argument (got nothing instead)
@st0ne-dot-at Found it: job.delete() should be jobs.delete(), matching the variable name used in the assignment. Thanks!
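For reference, both tracebacks in this exchange are plain Python errors that can be reproduced without AWX at all, using a throwaway class:

```python
class Job:
    def delete(self):
        return "deleted"

jobs = Job()  # note: bound to the name `jobs`, not `job`

# NameError: the variable was assigned as `jobs`, so `job` doesn't exist.
try:
    job.delete()
except NameError:
    pass

# TypeError: delete() needs an instance; calling it on the class
# passes no `self` argument at all.
try:
    Job.delete()
except TypeError:
    pass

assert jobs.delete() == "deleted"  # calling it on the instance works
```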
@st0ne-dot-at thank you, thank you & thank you
Same here: a job hasn't finished in 64 hours and can't be cancelled from the UI or API. The method provided by @st0ne-dot-at works. Any news on this issue? If a job gets stuck it blocks all future jobs and makes the environment unusable. I can't understand priority:medium.
@st0ne-dot-at Many thanks! I had exactly the same issue with a Workflow that prompted for a password. You saved me a few hours of work!
Hi everybody.
I'm having a similar problem: new jobs are not running in AWX. However, according to the task manager and to the API, no jobs are running at the moment, so I cannot apply the workaround posted by @st0ne-dot-at.
The startup of the awx_task node does not report any error or failure. However, no job can be executed in AWX.
Does anybody have the same problem?
The awx_task logs are below:
RESULT 2 OKREADY
[2018-03-28 08:45:05,636: INFO/Beat] Scheduler: Sending due task task_manager (awx.main.scheduler.tasks.run_task_manager)
[2018-03-28 08:45:05,639: DEBUG/Beat] awx.main.scheduler.tasks.run_task_manager sent. id->9a2a15a4-c33b-4ad2-a5db-afeb1a251d2d
[2018-03-28 08:45:05,639: DEBUG/Beat] beat: Waking up in 9.58 seconds.
[2018-03-28 08:45:15,233: INFO/Beat] Scheduler: Sending due task tower_scheduler (awx.main.tasks.awx_periodic_scheduler)
[2018-03-28 08:45:15,236: DEBUG/Beat] awx.main.tasks.awx_periodic_scheduler sent. id->adb8cf9b-b68b-456b-b168-aa578924dbeb
[2018-03-28 08:45:15,237: DEBUG/Beat] beat: Waking up in 10.39 seconds.
[2018-03-28 08:45:23,173: INFO/MainProcess] Scaling down -2 processes.
[2018-03-28 08:45:23,174: DEBUG/MainProcess] basic.qos: prefetch_count->536
[2018-03-28 08:45:25,637: INFO/Beat] Scheduler: Sending due task task_manager (awx.main.scheduler.tasks.run_task_manager)
[2018-03-28 08:45:25,639: DEBUG/Beat] awx.main.scheduler.tasks.run_task_manager sent. id->e51ff769-21e4-4fd4-b883-e3bc28296478
[2018-03-28 08:45:25,639: DEBUG/Beat] beat: Waking up in 19.58 seconds.
[2018-03-28 08:45:45,233: INFO/Beat] Scheduler: Sending due task cluster_heartbeat (awx.main.tasks.cluster_node_heartbeat)
[2018-03-28 08:45:45,235: DEBUG/Beat] awx.main.tasks.cluster_node_heartbeat sent. id->21a9db25-0528-4c30-a284-67788843c3cf
[2018-03-28 08:45:45,236: INFO/Beat] Scheduler: Sending due task tower_scheduler (awx.main.tasks.awx_periodic_scheduler)
[2018-03-28 08:45:45,237: DEBUG/Beat] awx.main.tasks.awx_periodic_scheduler sent. id->884a8138-b1cf-4141-98b2-847d2d3597da
[2018-03-28 08:45:45,237: INFO/MainProcess] Received task: awx.main.tasks.cluster_node_heartbeat[21a9db25-0528-4c30-a284-67788843c3cf] expires:[2018-03-28 08:46:35.234702+00:00]
[2018-03-28 08:45:45,237: DEBUG/Beat] beat: Waking up now.
[2018-03-28 08:45:45,238: DEBUG/MainProcess] TaskPool: Apply <function _fast_trace_task at 0x2812aa0> (args:(u'awx.main.tasks.cluster_node_heartbeat', u'21a9db25-0528-4c30-a284-67788843c3cf', [], {}, {u'utc': True, u'is_eager': False, u'chord': None, u'group': None, u'args': [], u'retries': 0, u'delivery_info': {u'priority': 0, u'redelivered': False, u'routing_key': u'awx', u'exchange': u'awx'}, u'expires': u'2018-03-28T08:46:35.234702+00:00', u'hostname': 'celery@awx', u'task': u'awx.main.tasks.cluster_node_heartbeat', u'callbacks': None, u'correlation_id': u'21a9db25-0528-4c30-a284-67788843c3cf', u'errbacks': None, u'timelimit': [None, None], u'taskset': None, u'kwargs': {}, u'eta': None, u'reply_to': u'5180b35a-9059-3cdd-8c3a-3d9ec1df0241', u'id': u'21a9db25-0528-4c30-a284-67788843c3cf', u'headers': {}}) kwargs:{})
[2018-03-28 08:45:45,241: DEBUG/MainProcess] Task accepted: awx.main.tasks.cluster_node_heartbeat[21a9db25-0528-4c30-a284-67788843c3cf] pid:262
2018-03-28 08:45:45,253 DEBUG awx.main.tasks Cluster node heartbeat task.
[2018-03-28 08:45:45,279: DEBUG/Worker-48] Start from server, version: 0.9, properties: {u'information': u'Licensed under the MPL. See http://www.rabbitmq.com/', u'product': u'RabbitMQ', u'copyright': u'Copyright (C) 2007-2018 Pivotal Software, Inc.', u'capabilities': {u'exchange_exchange_bindings': True, u'connection.blocked': True, u'authentication_failure_close': True, u'direct_reply_to': True, u'basic.nack': True, u'per_consumer_qos': True, u'consumer_priorities': True, u'consumer_cancel_notify': True, u'publisher_confirms': True}, u'cluster_name': u'rabbit@31be0b50a313', u'platform': u'Erlang/OTP 20.2.4', u'version': u'3.7.4'}, mechanisms: [u'AMQPLAIN', u'PLAIN'], locales: [u'en_US']
[2018-03-28 08:45:45,281: DEBUG/Worker-48] Open OK!
[2018-03-28 08:45:45,282: DEBUG/Worker-48] using channel_id: 1
[2018-03-28 08:45:45,283: DEBUG/Worker-48] Channel open
[2018-03-28 08:45:45,286: INFO/MainProcess] Received task: awx.main.tasks.handle_ha_toplogy_changes[377f726d-5bcf-49e2-b122-ac010b14cfd5]
[2018-03-28 08:45:45,286: INFO/MainProcess] Scaling up 1 processes.
[2018-03-28 08:45:45,294: DEBUG/Worker-48] Closed channel #1
[2018-03-28 08:45:45,475: DEBUG/MainProcess] TaskPool: Apply <function _fast_trace_task at 0x2812aa0> (args:(u'awx.main.tasks.handle_ha_toplogy_changes', u'377f726d-5bcf-49e2-b122-ac010b14cfd5', [], {}, {u'utc': True, u'is_eager': False, u'chord': None, u'group': None, u'args': [], u'retries': 0, u'delivery_info': {u'priority': None, u'redelivered': False, u'routing_key': u'', u'exchange': u'tower_broadcast_all'}, u'expires': None, u'hostname': 'celery@awx', u'task': u'awx.main.tasks.handle_ha_toplogy_changes', u'callbacks': None, u'correlation_id': u'377f726d-5bcf-49e2-b122-ac010b14cfd5', u'errbacks': None, u'timelimit': [None, None], u'taskset': None, u'kwargs': {}, u'eta': None, u'reply_to': u'8f348d0d-2f30-3d19-94ea-36c7edf8f574', u'id': u'377f726d-5bcf-49e2-b122-ac010b14cfd5', u'headers': {}}) kwargs:{})
[2018-03-28 08:45:45,477: DEBUG/MainProcess] basic.qos: prefetch_count->540
[2018-03-28 08:45:45,483: INFO/MainProcess] Task awx.main.tasks.cluster_node_heartbeat[21a9db25-0528-4c30-a284-67788843c3cf] succeeded in 0.244400362484s: None
[2018-03-28 08:45:45,486: DEBUG/MainProcess] Task accepted: awx.main.tasks.handle_ha_toplogy_changes[377f726d-5bcf-49e2-b122-ac010b14cfd5] pid:262
2018-03-28 08:45:45,500 DEBUG awx.main.tasks Reconfigure celeryd queues task on host celery@awx
[2018-03-28 08:45:45,529: DEBUG/Worker-48] Start from server, version: 0.9, properties: {u'information': u'Licensed under the MPL. See http://www.rabbitmq.com/', u'product': u'RabbitMQ', u'copyright': u'Copyright (C) 2007-2018 Pivotal Software, Inc.', u'capabilities': {u'exchange_exchange_bindings': True, u'connection.blocked': True, u'authentication_failure_close': True, u'direct_reply_to': True, u'basic.nack': True, u'per_consumer_qos': True, u'consumer_priorities': True, u'consumer_cancel_notify': True, u'publisher_confirms': True}, u'cluster_name': u'rabbit@31be0b50a313', u'platform': u'Erlang/OTP 20.2.4', u'version': u'3.7.4'}, mechanisms: [u'AMQPLAIN', u'PLAIN'], locales: [u'en_US']
[2018-03-28 08:45:45,535: DEBUG/Worker-48] Open OK!
[2018-03-28 08:45:45,535: DEBUG/Worker-48] using channel_id: 1
[2018-03-28 08:45:45,537: DEBUG/Worker-48] Channel open
[2018-03-28 08:45:45,547: DEBUG/MainProcess] pidbox received method active_queues() [reply_to:{u'routing_key': u'06a007df-cc12-3eb2-baac-0f79bbfe23eb', u'exchange': u'reply.celery.pidbox'} ticket:2013eb53-69ba-4c3f-856b-2d783712c263]
[2018-03-28 08:45:45,555: DEBUG/Worker-48] Closed channel #1
2018-03-28 08:45:45,558 INFO awx.main.tasks Workers on tower node 'awx' removed from queues [] and added to queues []
2018-03-28 08:45:45,561 INFO awx.main.tasks Worker on tower node 'awx' updated celery routes {'awx.main.tasks.purge_old_stdout_files': {'queue': 'awx', 'routing_key': 'awx'}, 'awx.main.tasks.cluster_node_heartbeat': {'queue': 'awx', 'routing_key': 'awx'}} all routes are now {'awx.main.tasks.purge_old_stdout_files': {'queue': 'awx', 'routing_key': 'awx'}, 'awx.main.tasks.cluster_node_heartbeat': {'queue': 'awx', 'routing_key': 'awx'}}
[2018-03-28 08:45:45,566: INFO/MainProcess] Task awx.main.tasks.handle_ha_toplogy_changes[377f726d-5bcf-49e2-b122-ac010b14cfd5] succeeded in 0.0819727219641s: None
[2018-03-28 08:45:45,631: DEBUG/Beat] beat: Synchronizing schedule...
[2018-03-28 08:45:45,638: INFO/Beat] Scheduler: Sending due task task_manager (awx.main.scheduler.tasks.run_task_manager)
[2018-03-28 08:45:45,639: DEBUG/Beat] awx.main.scheduler.tasks.run_task_manager sent. id->e9444a1a-2252-4854-b44e-7777a3298982
[2018-03-28 08:45:45,640: DEBUG/Beat] beat: Waking up in 19.99 seconds.
@daviarpfc Same here. Nothing shows as running, then the job just hangs until I cancel it. Inventory scripts won't sync either.
@st0ne-dot-at From the console, do you know if there is any way to safely delete all jobs and all inventory? My inventory is stuck in this state.
I basically want to purge almost everything, but the data model looks complex and I'm afraid of orphaning or corrupting records.
This, or a symptomatically similar issue, started happening to me after my upgrade from AWX 1.0.2 to 1.0.5, and I discovered that it was the result of my default instance group no longer using the local awx_task container. It's possible that this is configured correctly in new installs, but when upgrading it is necessary to configure the default instance group yourself.
After finding this, I edited the instance group, clicked the +, and added the already-configured instance called awx. This marked the group as available and caused all of my stuck jobs to resume.
Edit: After restarting a number of times due to another configuration issue, I found it was necessary to repeat my process to add the awx_task instance again. I suspect that my process is wrong: instead, it may be right to add awx to the Policy instance list for the instance group, so that it is automatically added (and then add it manually the first time).
Edit 2: I found this issue persisted after changing the Policy instance list, which was cleared during my nightly offline backups. That's weird and concerning, but I've worked around it by changing the Policy instance minimum to 1 and the Policy instance percentage to 100%, and that looks good so far.
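The two policy fields mentioned above can also be set through the API rather than the UI, which helps if a backup keeps clearing them. A sketch of the PATCH body for `/api/v2/instance_groups/<id>/`; the field names match the instance-group endpoint, but verify the path against your AWX version:

```python
import json

def policy_patch(minimum=1, percentage=100):
    """JSON body for PATCH /api/v2/instance_groups/<id>/ that keeps
    at least one instance assigned to the group at all times."""
    return json.dumps({
        "policy_instance_minimum": minimum,
        "policy_instance_percentage": percentage,
    })

payload = policy_patch()
```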
I'm having the same issue: a "Pending Delete" error in red.
Has anyone found a solution?
I tried instance groups, upgrading, and deleting all jobs, to no avail. Still the same error, and I'm locked up; no new jobs are executing.
Are people still seeing this with 1.0.6?
@wenottingham using Tower 3.2.5 the issue remains but I don't have the bandwidth to test 1.0.6 right now.
We were able to trigger this bug through an inadvertent change to our RabbitMQ security groups. What happens from there is a cascading failure that involves first removing the mnesia tables for each host in the cluster and/or, depending on how it happened, blowing away the entire queue via the management plugin. If you come across this early you can just delete the job(s), but if you have a lot of scheduled tasks before you find it, you're out of luck.
After that, it's best to run a full configuration of AWX again to ensure all settings are correct for RabbitMQ, and then restart the service. This bug's priority certainly needs to be bumped, because automated tasks are essentially stopped and the cascading failure means work will be lost. It's even worse if the worker nodes or the tower nodes themselves have work-intensive or time-sensitive jobs; it's a complete failure.
My issue resolved itself after waiting just over 48 hours. Something got reset automatically to correct it; I'm not sure what. Maybe RabbitMQ cleans up dead queues after a set time?
Also, I did stop and start each service individually. That may have corrected it?
@iis-software If a stop and start got things working again, that is actually a different bug where the job paused. This happens intermittently due to the stack and could be celery and AMQP issues, network issues, or a combination, etc.
If you're having issues with Tower 3.2.x, please contact your Red Hat support rep... this issue is for AWX.
I'm not sure what difference it makes; it's the same bug. We will be contacting support about another issue, however. Or are you saying this bug will be tracked differently?
@wenottingham I am seeing this issue on a fresh install of awx 1.0.6.8
Not sure if it's the same, but I got a job stuck, blocking everything else including project and inventory sync, when I launched the template from a playbook with vars_prompt. It seems that Ansible waits for prompt input forever, and that's why it's stuck.
Same here. I also hit this problem with a playbook waiting for a prompt. I cannot cancel the job 😟
The same problem, awx 1.0.16.6, ansible 2.5.5
For now, I always restart my Ansible Tower service to get it working again. I think the problem here is that RabbitMQ gets stuck waiting for a timeout. The way to correct it, I think, is to change BROKER_HEARTBEAT to something > 0. With heartbeats enabled, RabbitMQ checks for hung connections and clears them rather than waiting for the timeout, but this needs to be fully tested.
-Mitchell
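For anyone who wants to try the heartbeat idea above: in the Celery-era AWX releases this was a Django/Celery setting rather than a RabbitMQ-side one. A sketch of what such an override might look like; the value and the right settings file are assumptions that would need testing, as the commenter says:

```python
# Hypothetical override in an AWX/Tower settings file (e.g. a file under
# the settings drop-in directory). BROKER_HEARTBEAT is the Celery 3.x/4.x
# name for the AMQP heartbeat interval; 0 or None disables heartbeats.
BROKER_HEARTBEAT = 30  # seconds between AMQP heartbeats -- untested value
```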
Same problem here, AWX 1.0.6.5 running in a kubernetes cluster.
To delete stand-alone jobs stuck in "Running" status for AWX running on Kubernetes, I found this worked for me:
docker exec -it awx_web bash
awx-manage shell_plus
from awx.main.models import UnifiedJob
unified_job_obj = UnifiedJob()
unified_job_obj.id = ENTERJOBID
unified_job_obj.delete()
We've just released 1.0.7, which we believe resolves the underlying issue here. You can try it out here: https://github.com/ansible/awx/releases/tag/1.0.7
Let us know if you're still seeing this issue after installing the latest awx - thanks!
Hello,
I still have the same issue, BUT I can cancel the job via web interface.
AWX 1.0.7.2
Ansible 2.6.2
I think Ansible AWX does not support vars_prompt. I have the same problem. Can you tell me what alternative method I can use?
@tedfernandess You are correct, it does not support vars_prompt. However, by adding a survey to the job template you can prompt the user for vars at job execution time.
https://docs.ansible.com/ansible-tower/latest/html/userguide/job_templates.html#surveys
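To make the survey suggestion concrete, here is a sketch of a one-question survey spec in the JSON shape the survey endpoint accepts (`POST /api/v2/job_templates/<id>/survey_spec/`, then enable the survey on the template); the question text and variable name are made up for illustration:

```python
import json

# One-question survey spec, mirroring what the Tower/AWX UI creates.
survey_spec = {
    "name": "Prompt replacement",
    "description": "Collects the variable a vars_prompt used to ask for",
    "spec": [
        {
            "question_name": "Target release",  # hypothetical question
            "variable": "release_version",      # hypothetical variable name
            "type": "text",
            "required": True,
            "default": "",
        }
    ],
}

payload = json.dumps(survey_spec)
```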
No follow up on underlying issue. If you have discussions about other topics (like vars_prompt) please take them to the mailing list or irc.
@ryanpetrello @matburt AWX version: 2.1.2
I am also still facing the same issue: I created a new project (Git) and used its inventory file (.ini) as the source for an inventory. Now, after starting the inventory sync, the sync job is stuck (stays as Running) after producing the output:
2.548 INFO Updating inventory 4: Servers
2.565 INFO Reading Ansible inventory source: /var/lib/awx/projects/_6__demo/inventory/dev/demo.ini
However, it doesn't seem to have synced the inventory, because cancelling the sync and running any playbook that uses it results in no hosts at all.
Got this issue last week and again today, running AWX 23.8.1.
ISSUE TYPE
COMPONENT NAME
SUMMARY
Job stuck and cannot be cancelled. When cancelling the job, nothing happens and the job is still running.
Other job templates that use the same inventory are held in a queue and blocked from running.
I'm unable to sync the inventory in question either; it just hangs with no output.
To work around this I had to create a new inventory and assign all templates to it before I could run them, which isn't ideal.
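The manual reassignment described above can also be scripted against the API instead of clicking through every template. A sketch that only builds the PATCH requests (`/api/v2/job_templates/<id>/` with a new `inventory` ID); all IDs here are placeholders:

```python
import json

def reassign_patch(template_id, new_inventory_id):
    """(path, body) for a PATCH that points a job template at a
    replacement inventory -- IDs are placeholders, send with your
    own HTTP client and credentials."""
    path = f"/api/v2/job_templates/{template_id}/"
    body = json.dumps({"inventory": new_inventory_id})
    return path, body

# Example: move templates 42 and 43 onto replacement inventory 7.
requests_to_send = [reassign_patch(t, 7) for t in (42, 43)]
```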
ENVIRONMENT
AWX version: latest
AWX install method: docker on linux
Ansible version: 2.4
Operating System: Linux
ADDITIONAL INFORMATION
If anyone could please help me force cancel this job it would be great.
The AWX server is busy, so it's hard to know whether what I'm seeing in the log in real time while trying to cancel a task is related. Maybe this is related:
[2017-12-27 13:28:31,207: INFO/MainProcess] Scheduler: Sending due task task_manager (awx.main.scheduler.tasks.run_task_manager)
[2017-12-27 13:28:31,210: DEBUG/MainProcess] awx.main.scheduler.tasks.run_task_manager sent. id->9112f3b1-adee-42d6-ab9a-d490504ca624
[2017-12-27 13:28:31,210: DEBUG/MainProcess] beat: Waking up in 19.99 seconds.
[2017-12-27 13:28:31,215: INFO/MainProcess] Received task: awx.main.scheduler.tasks.run_task_manager[9112f3b1-adee-42d6-ab9a-d490504ca624] expires:[2017-12-27 13:28:51.209264+00:00]
[2017-12-27 13:28:31,215: INFO/MainProcess] Scaling down 1 processes.
[2017-12-27 13:28:31,216: DEBUG/MainProcess] TaskPool: Apply <function _fast_trace_task at 0x3095c80> (args:('awx.main.scheduler.tasks.run_task_manager', '9112f3b1-adee-42d6-ab9a-d490504ca624', {'origin': 'gen98@awx', 'lang': 'py', 'task': 'awx.main.scheduler.tasks.run_task_manager', 'group': None, 'root_id': '9112f3b1-adee-42d6-ab9a-d490504ca624', u'delivery_info': {u'priority': 0, u'redelivered': False, u'routing_key': u'tower', u'exchange': u''}, 'expires': '2017-12-27T13:28:51.209264+00:00', u'correlation_id': '9112f3b1-adee-42d6-ab9a-d490504ca624', 'retries': 0, 'timelimit': [None, None], 'argsrepr': '()', 'eta': None, 'parent_id': None, u'reply_to': '4ccaf1dd-88bd-3bce-a993-29e1f2156d88', 'id': '9112f3b1-adee-42d6-ab9a-d490504ca624', 'kwargsrepr': '{}'}, u'[[], {}, {"chord": null, "callbacks": null, "errbacks": null, "chain": null}]', 'application/json', 'utf-8') kwargs:{})
[2017-12-27 13:28:31,216: DEBUG/MainProcess] basic.qos: prefetch_count->96
[2017-12-27 13:28:31,232: DEBUG/MainProcess] Task accepted: awx.main.scheduler.tasks.run_task_manager[9112f3b1-adee-42d6-ab9a-d490504ca624] pid:1658
2017-12-27 13:28:31,248 DEBUG awx.main.scheduler Running Tower task manager.
2017-12-27 13:28:31,248 DEBUG awx.main.scheduler Running Tower task manager.
[2017-12-27 13:28:31,248: DEBUG/ForkPoolWorker-59] Running Tower task manager.
2017-12-27 13:28:31,256 DEBUG awx.main.scheduler Starting Scheduler
2017-12-27 13:28:31,256 DEBUG awx.main.scheduler Starting Scheduler
[2017-12-27 13:28:31,256: DEBUG/ForkPoolWorker-59] Starting Scheduler
[2017-12-27 13:28:31,413: INFO/ForkPoolWorker-59] Task awx.main.scheduler.tasks.run_task_manager[9112f3b1-adee-42d6-ab9a-d490504ca624] succeeded in 0.181198501028s: None
[2017-12-27 13:28:34,242: INFO/MainProcess] Scaling down 1 processes.
[2017-12-27 13:28:34,280: DEBUG/MainProcess] heartbeat_tick : for connection e2a1ec09a42b44a094b96282fa6433c2
[2017-12-27 13:28:34,281: DEBUG/MainProcess] heartbeat_tick : Prev sent/recv: 400/3485, now - 404/3525, monotonic - 2510462.00238, last_heartbeat_sent - 2510462.00237, heartbeat int. - 60 for connection e2a1ec09a42b44a094b96282fa6433c2