Describe the bug
My playbook for multi-node cluster updates uses the following task to re-use the status check from the `version_update_single_node` role:

```yaml
- name: Use update check from SNS update role  # has inner and outer retry loops
  ansible.builtin.import_role:
    name: scale_computing.hypercore.version_update_single_node
    tasks_from: update_status_check.yml
```

However, the check does not survive the window when the node is actually down - there is no update-status response at all during the reboot. Do I need `ignore_unreachable: true` on the task above, or some kind of retry there? Or should the role itself be handling this? (Note that the upgrade is still running when the error below is thrown.)
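For reference, one way to make a status poll survive the reboot window (a sketch only - the endpoint path, credential variables, and timings below are assumptions, not taken from the collection) is to run the check from the control node with an `until` loop, so connection failures during the reboot simply count as another retry:

```yaml
# Sketch: poll the HyperCore REST API from localhost so a rebooting node
# just produces failed attempts that the until loop retries.
# The URL path and credential variables are hypothetical placeholders.
- name: Check update status - tolerate node reboot
  ansible.builtin.uri:
    url: "https://{{ inventory_hostname }}/rest/v1/Update"  # hypothetical path
    validate_certs: false
    force_basic_auth: true
    url_username: "{{ cluster_user }}"      # assumed variable
    url_password: "{{ cluster_password }}"  # assumed variable
  register: update_status
  delegate_to: localhost
  until: update_status.status is defined and update_status.status == 200
  retries: 100
  delay: 30
```

With `until`/`retries`, an attempt that fails outright (e.g. connection refused mid-reboot) is retried just like one whose condition evaluated false, and since the request is delegated to localhost, `ignore_unreachable` should not be needed for this pattern.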
Update: this may have been caused by having `ignore_errors: true` set in the initial playbook from which I called the role - it prevented the rescue block from executing.
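That `ignore_errors`/rescue interaction can be reproduced with a minimal generic playbook (illustrative only, unrelated to the collection's code): when a task inside a `block` has its failure ignored, the block never sees an error, so the `rescue` section never fires.

```yaml
# Minimal demo: ignore_errors on the inner task suppresses the rescue.
- hosts: localhost
  gather_facts: false
  tasks:
    - block:
        - name: Failing task whose error is ignored
          ansible.builtin.command: /bin/false
          ignore_errors: true  # drop this line and the rescue task runs
      rescue:
        - name: Never reached while ignore_errors is set above
          ansible.builtin.debug:
            msg: "rescue ran"
```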
```text
TASK [hypercore_version : apply desired version to cluster or SNS] ***
changed: [veb120a-01.lab.local]
Friday 18 August 2023  07:09:42 -0400 (0:00:04.309)       0:00:25.398 *

TASK [scale_computing.hypercore.version_update_single_node : Increment version_update_single_node_retry_count] ***
ok: [veb120a-01.lab.local]
Friday 18 August 2023  07:09:42 -0400 (0:00:00.063)       0:00:25.462 *

TASK [scale_computing.hypercore.version_update_single_node : Pause before checking update status - checks will report FAILED-RETRYING until update COMPLETE/TERMINATED] *
ok: [veb120a-01.lab.local -> localhost]
Friday 18 August 2023  07:10:43 -0400 (0:01:00.866)       0:01:26.329 ***

FAILED - RETRYING: [veb120a-01.lab.local]: Check update status - will report FAILED-RETRYING until update COMPLETE/TERMINATED (100 retries left).
FAILED - RETRYING: [veb120a-01.lab.local]: Check update status - will report FAILED-RETRYING until update COMPLETE/TERMINATED (99 retries left).
FAILED - RETRYING: [veb120a-01.lab.local]: Check update status - will report FAILED-RETRYING until update COMPLETE/TERMINATED (98 retries left).
FAILED - RETRYING: [veb120a-01.lab.local]: Check update status - will report FAILED-RETRYING until update COMPLETE/TERMINATED (97 retries left).
FAILED - RETRYING: [veb120a-01.lab.local]: Check update status - will report FAILED-RETRYING until update COMPLETE/TERMINATED (96 retries left).
FAILED - RETRYING: [veb120a-01.lab.local]: Check update status - will report FAILED-RETRYING until update COMPLETE/TERMINATED (95 retries left).
FAILED - RETRYING: [veb120a-01.lab.local]: Check update status - will report FAILED-RETRYING until update COMPLETE/TERMINATED (94 retries left).
FAILED - RETRYING: [veb120a-01.lab.local]: Check update status - will report FAILED-RETRYING until update COMPLETE/TERMINATED (93 retries left).
FAILED - RETRYING: [veb120a-01.lab.local]: Check update status - will report FAILED-RETRYING until update COMPLETE/TERMINATED (92 retries left).
FAILED - RETRYING: [veb120a-01.lab.local]: Check update status - will report FAILED-RETRYING until update COMPLETE/TERMINATED (91 retries left).
FAILED - RETRYING: [veb120a-01.lab.local]: Check update status - will report FAILED-RETRYING until update COMPLETE/TERMINATED (90 retries left).
FAILED - RETRYING: [veb120a-01.lab.local]: Check update status - will report FAILED-RETRYING until update COMPLETE/TERMINATED (89 retries left).

TASK [scale_computing.hypercore.version_update_single_node : Check update status - will report FAILED-RETRYING until update COMPLETE/TERMINATED] *****
fatal: [veb120a-01.lab.local]: FAILED! => {"msg": "The conditional check 'version_update_single_node_update_status.record != None and (\n    version_update_single_node_update_status.record.update_status == \"COMPLETE\" or\n    version_update_single_node_update_status.record.update_status == \"TERMINATING\"\n)' failed. The error was: error while evaluating conditional (version_update_single_node_update_status.record != None and (\n    version_update_single_node_update_status.record.update_status == \"COMPLETE\" or\n    version_update_single_node_update_status.record.update_status == \"TERMINATING\"\n)): 'dict object' has no attribute 'record'"}
...ignoring

PLAY RECAP ***
veb120a-01.lab.local : ok=13  changed=1  unreachable=0  failed=0  skipped=2  rescued=0  ignored=1
```
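The fatal message shows the `until` conditional dereferencing `.record` on a registered result that has no `record` key at all, which is what the lookup yields while the node is unreachable. A defensive variant of that conditional (a sketch of the idea using the variable names from the log, not the collection's actual fix) guards the attribute access first:

```yaml
# Hypothetical hardened condition: tolerate a result with no 'record' key.
until: >-
  version_update_single_node_update_status.record | default(None) != None
  and version_update_single_node_update_status.record.update_status
      in ["COMPLETE", "TERMINATING"]
```

Because Jinja2's `and` short-circuits, `update_status` is only read once `record` is known to be present and non-null, so a missing key becomes just another failed retry instead of a fatal error.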
To Reproduce
Call this role: https://github.com/ddemlow/ansible_edge_playbooks/blob/master/roles/hypercore_version/tasks/main.yml
Expected behavior
Update monitoring should continue through the entire cluster update, even while a node reboots.