ansible / awx-ee

An Ansible execution environment for AWX project
https://quay.io/ansible/awx-ee

awx-ee ansible-runner settings override #80

Open DrackThor opened 3 years ago

DrackThor commented 3 years ago

Hi,

I am currently running AWX 19.2.2 with an execution environment built atop awx-ee:0.5.5. With it I connect to a Windows machine and run a PS script. After exactly 30 minutes the job aborts without any error. I assume this has something to do with the idle_timeout of ansible-runner. So far I could not find any other issue regarding timeouts, which is what led me to ansible-runner in the first place. Is it possible to override the env/settings (increase idle_timeout) when using an awx-ee-based execution environment?

Thanks in advance!

shanemcd commented 3 years ago

We would need to pass through the idle_timeout kwarg from AWX to Runner. Arguably this issue should live in the AWX repo.
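For context, a minimal sketch of what the setting looks like when Runner is driven directly, outside AWX. It assumes you control the private data directory yourself; ansible-runner reads its own settings, including idle_timeout, from an env/settings file in that directory. The helper name write_runner_settings and the 7200-second value are hypothetical and only for illustration. AWX generates its private data directory itself, so this is not a drop-in fix for the behaviour reported here.

```python
import json
import tempfile
from pathlib import Path


def write_runner_settings(private_data_dir, idle_timeout):
    """Write an env/settings file for ansible-runner and return its path.

    Hypothetical helper: only useful when you invoke Runner yourself
    against a private data directory that you control.
    """
    env_dir = Path(private_data_dir) / "env"
    env_dir.mkdir(parents=True, exist_ok=True)
    settings_path = env_dir / "settings"
    # JSON is valid YAML, so the stdlib json module is enough to write the file.
    settings = {"idle_timeout": idle_timeout}
    settings_path.write_text(json.dumps(settings, indent=2) + "\n")
    return settings_path


if __name__ == "__main__":
    # Example: allow up to two hours without output (value chosen arbitrarily).
    demo_dir = tempfile.mkdtemp()
    print(write_runner_settings(demo_dir, 7200))
```

Whether an execution environment launched by AWX would honour such a file depends on how AWX builds the private data directory, which is why the kwarg pass-through mentioned above is still needed.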

kladiv commented 3 years ago

Hi @shanemcd, I think I'm seeing similar behaviour on AWX 19.2.1 (EE 0.4.0).

I deployed a playbook that runs a task like the one below:

  raw: "ps -ef | grep -w /opt/xensource/sm/LVMoISCSISR | grep -v grep | grep -wq vdi_delete"
  register: quiesce_ps
  failed_when: false
  until: quiesce_ps.rc == 1
  retries: "{{ quiesce_wait_max_retries }}"
  delay: "{{ quiesce_wait_retries_delay }}"
  become: no
  delegate_to: "{{ xen_pool_master_inventory_hostname }}"

or (changed as a test, to check whether I get the same error):

  shell: >-
    RC=0;
    while [ $RC -eq 0 ]; do
      sleep 60;
      ps -ef | grep -w /opt/xensource/sm/LVMoISCSISR | grep -v grep | grep -wq vdi_delete;
      RC=$?;
    done
  register: quiesce_status
  async: 10800 # 3 hrs
  poll: 60 # 1 min
  become: no
  delegate_to: "{{ xen_pool_master_inventory_hostname }}"

Both the until/retries/delay task and the async/poll task fail the job without any error after about 4 hours. Every time I run it, it fails after 4 hours.

Another playbook task (it exports a big XenServer VM via the command module) fails the job after about 14 hours, also without any error.

I tried setting the AWX Job Timeout to a high value instead of zero (unlimited), but I got the same behaviour and the job still failed. Could it be related to the EE and Ansible Runner?

Thank you