ansible / awx

AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.
Other
14.06k stars 3.42k forks source link

job failed with zipfile.BadZipFile: File is not a zip file error from time to time #12343

Open mick1627 opened 2 years ago

mick1627 commented 2 years ago

Please confirm the following

Summary

From time to time some jobs failed with the following error :

image

The job has the ID 818890, here the logs os awx_task pod

2022-06-08 08:59:57,895 DEBUG    [bd8cd35bcf2346ab948cf830785023cc] awx.analytics.job_lifecycle job-818890 acknowledged
2022-06-08 08:59:57,998 DEBUG    [bd8cd35bcf2346ab948cf830785023cc] awx.main.scheduler Spawned project_update 818891 (pending) as dependency of job 818890 (pending)
2022-06-08 08:59:58,099 DEBUG    [bd8cd35bcf2346ab948cf830785023cc] awx.main.scheduler Spawned inventory_update 818892 (pending) as dependency of job 818890 (pending)
2022-06-08 08:59:58,252 DEBUG    [bd8cd35bcf2346ab948cf830785023cc] awx.analytics.job_lifecycle job-818890 blocked by projectupdate-818891
2022-06-08 09:00:02,114 DEBUG    [51e06205750f454dbfe891f2d1d42b36] awx.analytics.job_lifecycle job-818890 blocked by projectupdate-818891
2022-06-08 09:00:06,457 DEBUG    [bd8cd35bcf2346ab948cf830785023cc] awx.analytics.job_lifecycle job-818890 blocked by inventoryupdate-818892
2022-06-08 09:00:17,519 DEBUG    [caed80d3bec34e74a35bbf8ad76835e6] awx.analytics.job_lifecycle job-818890 blocked by inventoryupdate-818892
2022-06-08 09:00:20,532 DEBUG    [bd8cd35bcf2346ab948cf830785023cc] awx.analytics.job_lifecycle job-818890 controller node chosen
2022-06-08 09:00:20,542 DEBUG    [bd8cd35bcf2346ab948cf830785023cc] awx.analytics.job_lifecycle job-818890 waiting
2022-06-08 09:00:20,543 DEBUG    [bd8cd35bcf2346ab948cf830785023cc] awx.main.scheduler job 818890 (waiting) consumed 6 capacity units from default with prior total of 6
2022-06-08 09:00:20,658 DEBUG    [bd8cd35bcf2346ab948cf830785023cc] awx.main.dispatch task 83173b34-9278-46e2-83cf-f6bac77638c5 starting awx.main.tasks.RunJob(*[818890])
2022-06-08 09:00:20,944 DEBUG    [bd8cd35bcf2346ab948cf830785023cc] awx.analytics.job_lifecycle job-818890 pre run
2022-06-08 09:00:20,967 DEBUG    [bd8cd35bcf2346ab948cf830785023cc] awx.main.tasks Skipping project sync for job 818890 (running) because commit is locally available
2022-06-08 09:00:26,243 DEBUG    [bd8cd35bcf2346ab948cf830785023cc] awx.analytics.job_lifecycle job-818890 preparing playbook
2022-06-08 09:00:26,465 DEBUG    [bd8cd35bcf2346ab948cf830785023cc] awx.analytics.job_lifecycle job-818890 running playbook
2022-06-08 09:01:01,616 DEBUG    [bd8cd35bcf2346ab948cf830785023cc] awx.analytics.job_lifecycle job-818890 work unit id received
2022-06-08 09:01:01,729 DEBUG    [bd8cd35bcf2346ab948cf830785023cc] awx.analytics.job_lifecycle job-818890 work unit id assigned
2022-06-08 09:01:04,360 INFO     [bd8cd35bcf2346ab948cf830785023cc] awx.main.commands.run_callback_receiver Event processing is finished for Job 818890, sending notifications
2022-06-08 09:01:04,360 INFO     [bd8cd35bcf2346ab948cf830785023cc] awx.main.commands.run_callback_receiver Event processing is finished for Job 818890, sending notifications
2022-06-08 09:01:07,887 DEBUG    [bd8cd35bcf2346ab948cf830785023cc] awx.main.tasks job 818890 (running) finished running, producing 0 events.
2022-06-08 09:01:07,895 DEBUG    [bd8cd35bcf2346ab948cf830785023cc] awx.analytics.job_lifecycle job-818890 post run
2022-06-08 09:01:08,212 DEBUG    [bd8cd35bcf2346ab948cf830785023cc] awx.analytics.job_lifecycle job-818890 finalize run
2022-06-08 09:01:08,443 WARNING  [bd8cd35bcf2346ab948cf830785023cc] awx.main.dispatch job 818890 (error) encountered an error (rc=None), please see task stdout for details.
2022-06-08 09:01:08,446 DEBUG    [bd8cd35bcf2346ab948cf830785023cc] awx.main.tasks Executing error task id 83173b34-9278-46e2-83cf-f6bac77638c5, subtasks: [{'type': 'job', 'id': 818890}]

In the UI the last modified is 11:00:20 AM (CET) or 9:00:20 UTC image

but we can see in the logs of awx_task that the job finished at 09:01:07.

AWX version

19.5.0

Select the relevant components

Installation method

kubernetes

Modifications

yes

Ansible version

2.12

Operating system

No response

Web browser

No response

Steps to reproduce

Launch job template in AWX, from time to time it failed with this zipfile.BadZipFile: File is not a zip file error

Expected results

Job launch correctly

Actual results

job failed with zipfile.BadZipFile error

Additional information

WE use custom awx EE.

fosterseth commented 2 years ago

We think this bug was fixed in receptor 1.2.3, please make sure this receptor version is present in your awx ee image

mick1627 commented 2 years ago

We think this bug was fixed in receptor 1.2.3, please make sure this receptor version is present in your awx ee image

I have quay.io/ansible/receptor:devel in my ee repo, and the receptor version is 1.2.0+gce6c620. I will update it to use quay.io/ansible/receptor:latest to have version 1.2.3.

I'm closing this issue and reopen it if I still have this issue with receptor 1.2.3.

mick1627 commented 2 years ago

I'm reopening this issue, I still have it after the update of AWX and our custom EE :

I'm using AWX 21.1.0 Custom EE with receptor 1.2.3.

receptor --version
1.2.3
shanemcd commented 2 years ago

Did you run the above command inside of the control plane EE or inside of a pod for a running job?

mick1627 commented 2 years ago

I am using the same docker image for the control plane EE and the pod where the job is running

shanemcd commented 2 years ago

Can you share any information about the workload? How many hosts, how many tasks, how much stdout, are you gathering a lot of facts? Anything you can share might be helpful.

mick1627 commented 2 years ago

It happens randomly in different kind of job template. For the last issue, it happens on a job part of a workflow. The job was launch on 1 host, about 15 tasks and no gather facts.

stanislav-zaprudskiy commented 2 years ago

This could be https://github.com/ansible/ansible-runner/issues/998

shanemcd commented 1 year ago

This might have been fixed with https://github.com/ansible/ansible-runner/pull/1161

coudenysj commented 11 months ago

We still see this error quite a lot on our setup.

2023-11-30 12:37:28,292 ERROR    [fa7348566d2e4873932a160255a1a0cd] awx.main.tasks.jobs job 161809 (running) Exception occurred while running task
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 605, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 318, in run
    res = self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 393, in _run_internal
    res = processor_future.result()
  File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/usr/lib64/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/utils/common.py", line 1199, in wrapper_cleanup_new_process
    return func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 461, in processor
    return ansible_runner.interface.run(
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/interface.py", line 213, in run
    r.run()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/streaming.py", line 361, in run
    self.artifacts_callback(data)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/streaming.py", line 335, in artifacts_callback
    unstream_dir(self._input, length, self.artifact_dir)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/utils/streaming.py", line 78, in unstream_dir
    with zipfile.ZipFile(tmp.name, "r") as archive:
  File "/usr/lib64/python3.9/zipfile.py", line 1266, in __init__
    self._RealGetContents()
  File "/usr/lib64/python3.9/zipfile.py", line 1333, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

ansible-runner --version 2.3.3

The code from https://github.com/ansible/ansible-runner/pull/1162/files is there, but we still get this error.

swiedernix commented 9 months ago

same problem came up in our setup: AWX 23.5.1 ansible_runner 2.3.4 k3s: v1.28.4+k3s2

coudenysj commented 9 months ago

Is there any suggestion to fix this issue, because we get this quite often:

/jobs/174889 2024-02-07T05:10:17.720127Z error (`zipfile.BadZipFile: File is not a zip file`)
/jobs/174888 2024-02-07T05:10:17.631557Z error (`zipfile.BadZipFile: File is not a zip file`)
/jobs/174887 2024-02-07T05:00:27.010690Z error (`zipfile.BadZipFile: File is not a zip file`)
/jobs/174879 2024-02-07T04:40:26.852807Z error (`zipfile.BadZipFile: File is not a zip file`)
/jobs/174872 2024-02-07T04:20:26.397516Z error (`zipfile.BadZipFile: File is not a zip file`)
/jobs/174871 2024-02-07T04:20:26.111226Z error (`zipfile.BadZipFile: File is not a zip file`)
/jobs/174867 2024-02-07T04:00:26.542494Z error (`zipfile.BadZipFile: File is not a zip file`)
/jobs/174864 2024-02-07T03:55:41.692424Z error (`zipfile.BadZipFile: File is not a zip file`)
/jobs/174857 2024-02-07T03:20:26.090661Z error (`zipfile.BadZipFile: File is not a zip file`)
/jobs/174856 2024-02-07T03:20:25.799524Z error (`zipfile.BadZipFile: File is not a zip file`)
/jobs/174854 2024-02-07T03:00:26.420650Z error (`zipfile.BadZipFile: File is not a zip file`)
/jobs/174851 2024-02-07T02:59:55.172001Z error (`zipfile.BadZipFile: File is not a zip file`)
/jobs/174847 2024-02-07T02:40:25.808439Z error (`zipfile.BadZipFile: File is not a zip file`)
/jobs/174845 2024-02-07T02:20:26.445563Z error (`zipfile.BadZipFile: File is not a zip file`)
/jobs/174841 2024-02-07T02:00:27.005234Z error (`zipfile.BadZipFile: File is not a zip file`)
/jobs/174840 2024-02-07T02:00:26.750985Z error (`zipfile.BadZipFile: File is not a zip file`)
/jobs/174839 2024-02-07T02:00:26.401924Z error (`zipfile.BadZipFile: File is not a zip file`)
/jobs/174802 2024-02-06T14:08:10.237412Z error (`zipfile.BadZipFile: File is not a zip file`)
/jobs/174778 2024-02-06T12:57:13.975563Z error (`zipfile.BadZipFile: File is not a zip file`)
/jobs/174745 2024-02-06T11:32:42.871325Z error (`zipfile.BadZipFile: File is not a zip file`)
/jobs/174665 2024-02-06T08:26:04.317739Z error (`zipfile.BadZipFile: File is not a zip file`)
/jobs/174653 2024-02-06T08:04:31.230036Z error (`zipfile.BadZipFile: File is not a zip file`)
/jobs/174619 2024-02-06T08:04:16.185793Z error (`zipfile.BadZipFile: File is not a zip file`)
/jobs/174585 2024-02-06T05:00:26.304256Z error (`zipfile.BadZipFile: File is not a zip file`)
/jobs/174563 2024-02-06T03:00:27.215729Z error (`zipfile.BadZipFile: File is not a zip file`)
/jobs/174559 2024-02-06T02:40:26.593133Z error (`zipfile.BadZipFile: File is not a zip file`)
/jobs/174552 2024-02-06T02:00:34.138571Z error (`zipfile.BadZipFile: File is not a zip file`)
blundey commented 7 months ago

Same issue on latest version -

BadZipFile("File is not a zip file") - on random job runs. We run the same job on 300+ hosts and it may occur once in 300 times...

Any devs got time to check this?

fosterseth commented 7 months ago

@blundey

usually this relates to issues around log rotation

you can bump up your max log container size for your cluster

you should also enable reconnect

https://github.com/ansible/receptor/pull/683