ScaleComputing / HyperCoreAnsibleCollection

Official Ansible collection for Scale Computing SC//HyperCore (HC3) v1 API
GNU General Public License v3.0

:lady_beetle: Bug: update_status_check.yml is not surviving upgrade node reboot #261

Closed ddemlow closed 10 months ago

ddemlow commented 10 months ago

Describe the bug

My playbook for multi-node cluster updates uses this task to re-use the status check from the version_update_single_node role.

However, the check does not appear to survive the period when the node is actually down: there is no update response at all during a reboot. Do I need ignore_unreachable: true in the above task, or some kind of retry there? Or should the role be handling this? (Note: the upgrade is still running when the error below is thrown.)
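For readers following along, here is a minimal sketch of what re-using the role's status check with `ignore_unreachable` might look like. The original task is not shown in this issue, so the `include_role` form, the `tasks_from` value, and the `apply` block are all assumptions, not the playbook's actual content.

```yaml
# Hypothetical sketch: the actual task from the playbook is not shown in this issue.
- name: Re-use the update status check from version_update_single_node
  ansible.builtin.include_role:
    name: scale_computing.hypercore.version_update_single_node
    tasks_from: update_status_check.yml
    apply:
      # Assumption: apply ignore_unreachable to the included tasks so an
      # unreachable node during its reboot does not end the play for that host.
      ignore_unreachable: true
```

Whether this alone is enough depends on how the role handles failed checks internally, so treat it as a starting point rather than a fix.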

TASK [hypercore_version : apply desired version to cluster or SNS] ***
changed: [veb120a-01.lab.local]
Friday 18 August 2023 07:09:42 -0400 (0:00:04.309) 0:00:25.398 *

TASK [scale_computing.hypercore.version_update_single_node : Increment version_update_single_node_retry_count] ***
ok: [veb120a-01.lab.local]
Friday 18 August 2023 07:09:42 -0400 (0:00:00.063) 0:00:25.462 *

TASK [scale_computing.hypercore.version_update_single_node : Pause before checking update status - checks will report FAILED-RETRYING until update COMPLETE/TERMINATED] *
ok: [veb120a-01.lab.local -> localhost]
Friday 18 August 2023 07:10:43 -0400 (0:01:00.866) 0:01:26.329 ***
FAILED - RETRYING: [veb120a-01.lab.local]: Check update status - will report FAILED-RETRYING until update COMPLETE/TERMINATED (100 retries left).
FAILED - RETRYING: [veb120a-01.lab.local]: Check update status - will report FAILED-RETRYING until update COMPLETE/TERMINATED (99 retries left).
FAILED - RETRYING: [veb120a-01.lab.local]: Check update status - will report FAILED-RETRYING until update COMPLETE/TERMINATED (98 retries left).
FAILED - RETRYING: [veb120a-01.lab.local]: Check update status - will report FAILED-RETRYING until update COMPLETE/TERMINATED (97 retries left).
FAILED - RETRYING: [veb120a-01.lab.local]: Check update status - will report FAILED-RETRYING until update COMPLETE/TERMINATED (96 retries left).
FAILED - RETRYING: [veb120a-01.lab.local]: Check update status - will report FAILED-RETRYING until update COMPLETE/TERMINATED (95 retries left).
FAILED - RETRYING: [veb120a-01.lab.local]: Check update status - will report FAILED-RETRYING until update COMPLETE/TERMINATED (94 retries left).
FAILED - RETRYING: [veb120a-01.lab.local]: Check update status - will report FAILED-RETRYING until update COMPLETE/TERMINATED (93 retries left).
FAILED - RETRYING: [veb120a-01.lab.local]: Check update status - will report FAILED-RETRYING until update COMPLETE/TERMINATED (92 retries left).
FAILED - RETRYING: [veb120a-01.lab.local]: Check update status - will report FAILED-RETRYING until update COMPLETE/TERMINATED (91 retries left).
FAILED - RETRYING: [veb120a-01.lab.local]: Check update status - will report FAILED-RETRYING until update COMPLETE/TERMINATED (90 retries left).
FAILED - RETRYING: [veb120a-01.lab.local]: Check update status - will report FAILED-RETRYING until update COMPLETE/TERMINATED (89 retries left).

TASK [scale_computing.hypercore.version_update_single_node : Check update status - will report FAILED-RETRYING until update COMPLETE/TERMINATED] *****
fatal: [veb120a-01.lab.local]: FAILED! => {"msg": "The conditional check 'version_update_single_node_update_status.record != None and (\n version_update_single_node_update_status.record.update_status == \"COMPLETE\" or\n version_update_single_node_update_status.record.update_status == \"TERMINATING\"\n)' failed. The error was: error while evaluating conditional (version_update_single_node_update_status.record != None and (\n version_update_single_node_update_status.record.update_status == \"COMPLETE\" or\n version_update_single_node_update_status.record.update_status == \"TERMINATING\"\n)): 'dict object' has no attribute 'record'"}
...ignoring

PLAY RECAP ***
veb120a-01.lab.local : ok=13 changed=1 unreachable=0 failed=0 skipped=2 rescued=0 ignored=1
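The fatal error in the log says the registered result had no `record` key at all, a case that the `!= None` comparison does not guard against. Below is a minimal sketch of a more defensive `until` expression. The module name `version_update_status_info` is an assumption (the log does not show which module the task calls), and `retries: 100` is taken from the "100 retries left" message; this is not the role's actual task.

```yaml
# Sketch only: the module name is an assumption, not the role's actual task.
- name: Check update status, tolerating results that lack a record
  scale_computing.hypercore.version_update_status_info:
  register: version_update_single_node_update_status
  # "is defined" guards against a result with no record key (for example,
  # when the node did not answer); "is not none" covers an empty record.
  until: >-
    version_update_single_node_update_status.record is defined and
    version_update_single_node_update_status.record is not none and
    version_update_single_node_update_status.record.update_status in ["COMPLETE", "TERMINATING"]
  retries: 100
```

With the guard in place, a result without `record` simply evaluates the condition to false and counts as one more retry, instead of raising a conditional-evaluation error.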

To Reproduce

Call this role: https://github.com/ddemlow/ansible_edge_playbooks/blob/master/roles/hypercore_version/tasks/main.yml

Expected behavior

Update monitoring should continue through the entire cluster update, even when a node reboots.

ddemlow commented 10 months ago

It appears this may have been caused by having ignore_errors: true set in the initial playbook I called the role from, which prevented the rescue loop from executing.
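The interaction described here can be tried in isolation. The following is a self-contained sketch, assuming a play-level `ignore_errors: true` like the one mentioned above; all task names are illustrative and nothing in it comes from the role itself.

```yaml
# Self-contained demo; every name here is illustrative.
- name: Show how ignore_errors interacts with block/rescue
  hosts: localhost
  gather_facts: false
  ignore_errors: true   # inherited by every task in the play, including block tasks
  tasks:
    - name: Status check wrapped in block/rescue
      block:
        - name: Simulated failing status check
          ansible.builtin.fail:
            msg: node is rebooting
      rescue:
        - name: Retry handler
          ansible.builtin.debug:
            msg: rescue ran
```

Because the failure is ignored rather than treated as a task failure, the rescue section should not be triggered, which matches the `rescued=0 ignored=1` counts in the recap. Removing the play-level `ignore_errors` line should let the rescue task run.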