Cray-HPE / sat

System Admin Toolkit
https://cray-hpe.github.io/docs-sat/
MIT License

CRAYSAT-1859: Improve BOS failure handling in `sat bootsys` #241

haasken-hpe closed 3 months ago

haasken-hpe commented 3 months ago

Summary and Scope

If `sat bootsys` fails to create a BOS session for a BOS session template, it still attempts to check the status of the nonexistent session. Because the session ID is None at that point, the request for session information fails with a TypeError traceback.
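
A minimal sketch of the kind of guard described above, not the actual code in `sat/cli/bootsys/bos.py`: if session creation fails, log the error and skip the wait instead of polling with a session ID of None. The names `SessionCreationError`, `run_bos_operation`, `create_session`, and `wait_for_session` are hypothetical and exist only for illustration.

```python
import logging
from typing import Callable

LOGGER = logging.getLogger(__name__)


class SessionCreationError(Exception):
    """Hypothetical error raised when a BOS session cannot be created."""


def run_bos_operation(
    template: str,
    create_session: Callable[[str], str],
    wait_for_session: Callable[[str], bool],
) -> bool:
    """Return True only if a session was created and then completed.

    All names here are illustrative assumptions, not the sat API.
    """
    LOGGER.info("Starting boot operation on BOS session template: %s.", template)
    try:
        session_id = create_session(template)
    except SessionCreationError as err:
        # No session exists, so there is nothing to wait on.
        LOGGER.error("Operation 'boot' failed on BOS session template '%s': %s",
                     template, err)
        return False
    # Only poll for status once a real session ID is in hand.
    return wait_for_session(session_id)
```

With this shape, a failed creation returns early and the waiter is never invoked, so no status request is ever built from a None ID.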

Issues and Related PRs

Testing

Tested on:

Test description:

Tested on rocket by passing a nonexistent BOS session template to `sat bootsys boot --stage bos-operations`, both with and without this fix. With the fix, the traceback no longer occurs.

Also tested with a valid BOS session template to confirm that it still works as expected and waits on the BOS session that is created.

See output in comments.

Risks and Mitigations

This is a low-risk change: it only affects whether we attempt to wait on failed BOS sessions, and it reorders an info log message.

Pull Request Checklist

haasken-hpe commented 3 months ago

Here are my testing results.

First, here is the traceback that occurs with an invalid session template before this change:

ncn-m001:~ # sat bootsys boot --stage bos-operations --bos-templates haasken-does-not-exist
INFO: Using session templates provided by --bos-templates/bos_templates option: ['haasken-does-not-exist']
INFO: Started boot operation on BOS session template: haasken-does-not-exist.
INFO: Waiting up to 900 seconds for session to complete.
INFO: Waiting for BOS session None to reach target state complete. Session template: haasken-does-not-exist
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/sat/venv/lib/python3.9/site-packages/sat/cli/bootsys/bos.py", line 558, in run
    self.monitor_status()
  File "/sat/venv/lib/python3.9/site-packages/sat/cli/bootsys/bos.py", line 514, in monitor_status
    waiter.wait_for_completion()
  File "/sat/venv/lib/python3.9/site-packages/sat/waiting.py", line 169, in wait_for_completion
    self._wait_polling_loop()
  File "/sat/venv/lib/python3.9/site-packages/sat/waiting.py", line 148, in _wait_polling_loop
    self.completed = self.has_completed()
  File "/sat/venv/lib/python3.9/site-packages/sat/cli/bootsys/bos.py", line 249, in has_completed
    self.session_status = self.bos_client.get_session_status(
  File "/sat/venv/lib/python3.9/site-packages/sat/apiclient/bos.py", line 162, in get_session_status
    return self.get(self.session_path, session_id, 'status').json()
  File "/sat/venv/lib/python3.9/site-packages/csm_api_client/service/gateway.py", line 207, in get
    r = self._make_req(*args, req_type='GET', req_param=params, **kwargs)
  File "/sat/venv/lib/python3.9/site-packages/csm_api_client/service/gateway.py", line 153, in _make_req
    self.base_resource_path, '/'.join(args)), '', '', ''))
TypeError: sequence item 1: expected str instance, NoneType found
ERROR: Operation 'boot' failed on BOS session template 'haasken-does-not-exist': Failed to create BOS session: POST request to URL 'https://api-gw-service-nmn.local/apis/bos/v2/sessions' failed with status code 400: Bad Request
ERROR: Boot failed or timed out for session template: haasken-does-not-exist
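
The final TypeError in this traceback comes from Python's `str.join`, which rejects any non-string item. The session ID is the second path component, hence "sequence item 1". The snippet below reproduces that message in isolation; the `build_path` helper and its arguments are illustrative stand-ins, not the `csm_api_client` code.

```python
def build_path(*args):
    """Join path components with '/', as the '/'.join(args) call in the traceback does."""
    return '/'.join(args)


try:
    # None stands in for the missing BOS session ID.
    build_path('sessions', None, 'status')
except TypeError as err:
    message = str(err)
    print(message)  # sequence item 1: expected str instance, NoneType found
```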

Now here is the cleaned-up output with this change in place:

ncn-m001:~ # sat --version
sat 3.28.12
ncn-m001:~ # sat bootsys boot --stage bos-operations --bos-templates haasken-does-not-exist
INFO: Using session templates provided by --bos-templates/bos_templates option: ['haasken-does-not-exist']
INFO: Starting boot operation on BOS session template: haasken-does-not-exist.
INFO: Waiting up to 900 seconds for session to complete.
ERROR: Operation 'boot' failed on BOS session template 'haasken-does-not-exist': Failed to create BOS session: POST request to URL 'https://api-gw-service-nmn.local/apis/bos/v2/sessions' failed with status code 400: Bad Request
ERROR: Boot failed or timed out for session template: haasken-does-not-exist
ncn-m001:~ # echo $?
1

And here we can see it still works with a valid session template as well:

ncn-m001:~ # sat bootsys reboot --stage bos-operations --bos-templates compute-24.4.1.x86_64-596713 --bos-limit x9000c1s0b0n0
Proceed with reboot of compute nodes and UANs using BOS? [yes,no] yes
Proceeding with reboot of compute nodes and UANs using BOS.
INFO: Using session templates provided by --bos-templates/bos_templates option: ['compute-24.4.1.x86_64-596713']
INFO: Starting reboot operation on BOS session template: compute-24.4.1.x86_64-596713.
INFO: Waiting up to 1500 seconds for session to complete.
INFO: Waiting for BOS session a8988ad0-41b4-45b9-9e69-038635950974 to reach target state complete. Session template: compute-24.4.1.x86_64-596713
...
haasken-hpe commented 3 months ago

And here we can see the changed prompts from my other commit in this PR:

ncn-m001:/mnt/developer/haasken # sat bootsys shutdown --stage bos-operations --bos-template haasken-does-not-exist --bos-limit x9000c1s0b0n1
Proceed with shutdown of nodes using BOS? [yes,no] no
Will not proceed with shutdown of nodes using BOS. Exiting.
ncn-m001:/mnt/developer/haasken # sat bootsys reboot --stage bos-operations --bos-template haasken-does-not-exist --bos-limit x9000c1s0b0n1
Proceed with reboot of nodes using BOS? [yes,no] no
Will not proceed with reboot of nodes using BOS. Exiting.
haasken-hpe commented 3 months ago

/backport release/3.28