Cray-HPE / sat

System Admin Toolkit
https://cray-hpe.github.io/docs-sat/
MIT License

CRAYSAT-1888: Boot Kubernetes master and worker nodes simultaneously #251

Closed annapoorna-s-alt closed 2 months ago

annapoorna-s-alt commented 2 months ago

Summary and Scope

Modify sat bootsys boot --stage ncn-power to boot the master nodes (other than ncn-m001) and the worker nodes at the same time.
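The ordering described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `plan_boot_batches`, `boot_in_batches`, and the `power_on` callable are hypothetical names, and the two-batch layout (storage first, then managers and workers together) is inferred from the debug output posted in this thread.

```python
from concurrent.futures import ThreadPoolExecutor


def plan_boot_batches(ncns_by_role, excluded=("ncn-m001",)):
    """Return ordered batches of hostnames; nodes in a batch boot together.

    Hypothetical helper: storage nodes form the first batch, and managers
    and workers are merged into a single second batch.
    """
    def included(role):
        return sorted(n for n in ncns_by_role.get(role, []) if n not in excluded)

    batches = [included("storage"), included("managers") + included("workers")]
    # Drop empty batches so callers never have to handle them.
    return [batch for batch in batches if batch]


def boot_in_batches(batches, power_on):
    """Call power_on for every node in a batch concurrently.

    A batch must finish before the next one starts. power_on is a
    caller-supplied function (an assumption here), e.g. one that sends an
    IPMI power-on command and waits for SSH.
    """
    for batch in batches:
        if not batch:
            continue
        with ThreadPoolExecutor(max_workers=len(batch)) as pool:
            # list() waits for all calls and re-raises any exception.
            list(pool.map(power_on, batch))


if __name__ == "__main__":
    ncns = {
        "managers": ["ncn-m001", "ncn-m002", "ncn-m003"],
        "storage": ["ncn-s001", "ncn-s002", "ncn-s003"],
        "workers": ["ncn-w001", "ncn-w002", "ncn-w003", "ncn-w004"],
    }
    for batch in plan_boot_batches(ncns):
        print(batch)
```

Merging managers and workers into one batch means the stage waits once for the slowest node of the combined group instead of once per group.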

Issues and Related PRs

Resolves CRAYSAT-1888.

Testing


Tested on:

Rocket

Test description:

Run sat bootsys boot --stage ncn-power and confirm that it boots the master and worker nodes simultaneously. As a result, the stage should take less time to complete.
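To make the expected time saving concrete, here is a small back-of-the-envelope sketch. It assumes the previous flow waited for the masters before powering on the workers, and the boot times are made-up illustrative values, not measurements from Rocket. `stage_wait_seconds` is a hypothetical helper.

```python
def stage_wait_seconds(batches, boot_seconds):
    """Total wait when batches run one after another.

    Within a batch the nodes boot in parallel, so each batch costs as much
    as its slowest node.
    """
    return sum(max(boot_seconds[node] for node in batch) for batch in batches)


# Made-up boot times in seconds, for illustration only; not measurements.
boot_seconds = {"ncn-m002": 300, "ncn-m003": 320, "ncn-w001": 400, "ncn-w002": 410}
masters = ["ncn-m002", "ncn-m003"]
workers = ["ncn-w001", "ncn-w002"]

separate = stage_wait_seconds([masters, workers], boot_seconds)   # 320 + 410
combined = stage_wait_seconds([masters + workers], boot_seconds)  # max(...) = 410
print(separate, combined)
```

Under these assumptions the combined batch saves the full boot time of the faster group.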

Risks and Mitigations

Low risk, as we are only booting the worker and master nodes in parallel.

Pull Request Checklist

annapoorna-s-alt commented 2 months ago

Latest output

ncn-m001:~/annapoorna # sat --loglevel debug bootsys boot --stage ncn-power --ncn-boot-timeout 900
IPMI username: root
IPMI password:
The following Non-compute Nodes (NCNs) will be included in this operation:
managers:
  • ncn-m002
  • ncn-m003
storage:
  • ncn-s001
  • ncn-s002
  • ncn-s003
workers:
  • ncn-w001
  • ncn-w002
  • ncn-w003
  • ncn-w004

The following Non-compute Nodes (NCNs) will be excluded from this operation:
managers:
  • ncn-m001
storage: []
workers: []

Are the above NCN groupings and exclusions correct? [yes,no] yes
DEBUG: BEGIN: boot of other management NCNs
INFO: Starting console logging on ncn-s001,ncn-w001,ncn-m002,ncn-w003,ncn-s003,ncn-w004,ncn-s002,ncn-w002,ncn-m003.
DEBUG: Executing command "ipmitool -U root -E -H ncn-s001-mgmt -I lanplus sol deactivate" on host ncn-m001.
DEBUG: Command "ipmitool -U root -E -H ncn-s001-mgmt -I lanplus sol deactivate" exited with non-zero exit status: 1, stderr: b'Info: SOL payload already de-activated\n', stdout: b''
DEBUG: Executing command "ipmitool -U root -E -H ncn-w001-mgmt -I lanplus sol deactivate" on host ncn-m001.
DEBUG: Command "ipmitool -U root -E -H ncn-w001-mgmt -I lanplus sol deactivate" exited with non-zero exit status: 1, stderr: b'Info: SOL payload already de-activated\n', stdout: b''
DEBUG: Executing command "ipmitool -U root -E -H ncn-m002-mgmt -I lanplus sol deactivate" on host ncn-m001.
DEBUG: Command "ipmitool -U root -E -H ncn-m002-mgmt -I lanplus sol deactivate" exited with non-zero exit status: 1, stderr: b'Info: SOL payload already de-activated\n', stdout: b''
DEBUG: Executing command "ipmitool -U root -E -H ncn-w003-mgmt -I lanplus sol deactivate" on host ncn-m001.
DEBUG: Command "ipmitool -U root -E -H ncn-w003-mgmt -I lanplus sol deactivate" exited with non-zero exit status: 1, stderr: b'Info: SOL payload already de-activated\n', stdout: b''
DEBUG: Executing command "ipmitool -U root -E -H ncn-s003-mgmt -I lanplus sol deactivate" on host ncn-m001.
DEBUG: Command "ipmitool -U root -E -H ncn-s003-mgmt -I lanplus sol deactivate" exited with non-zero exit status: 1, stderr: b'Info: SOL payload already de-activated\n', stdout: b''
DEBUG: Executing command "ipmitool -U root -E -H ncn-w004-mgmt -I lanplus sol deactivate" on host ncn-m001.
DEBUG: Command "ipmitool -U root -E -H ncn-w004-mgmt -I lanplus sol deactivate" exited with non-zero exit status: 1, stderr: b'Info: SOL payload already de-activated\n', stdout: b''
DEBUG: Executing command "ipmitool -U root -E -H ncn-s002-mgmt -I lanplus sol deactivate" on host ncn-m001.
DEBUG: Command "ipmitool -U root -E -H ncn-s002-mgmt -I lanplus sol deactivate" exited with non-zero exit status: 1, stderr: b'Info: SOL payload already de-activated\n', stdout: b''
DEBUG: Executing command "ipmitool -U root -E -H ncn-w002-mgmt -I lanplus sol deactivate" on host ncn-m001.
DEBUG: Command "ipmitool -U root -E -H ncn-w002-mgmt -I lanplus sol deactivate" exited with non-zero exit status: 1, stderr: b'Info: SOL payload already de-activated\n', stdout: b''
DEBUG: Executing command "ipmitool -U root -E -H ncn-m003-mgmt -I lanplus sol deactivate" on host ncn-m001.
DEBUG: Command "ipmitool -U root -E -H ncn-m003-mgmt -I lanplus sol deactivate" exited with non-zero exit status: 1, stderr: b'Info: SOL payload already de-activated\n', stdout: b''
DEBUG: Executing command "screen -ls" on host ncn-m001.
DEBUG: Executing command "mkdir -p /var/log/cray/console_logs" on host ncn-m001.
DEBUG: Executing command "screen -L -Logfile /var/log/cray/console_logs/console-ncn-s001-mgmt.log -A -m -d -S SAT-console-ncn-s001-mgmt ipmitool -U root -E -H ncn-s001-mgmt -I lanplus sol activate" on host ncn-m001.
DEBUG: Executing command "screen -L -Logfile /var/log/cray/console_logs/console-ncn-w001-mgmt.log -A -m -d -S SAT-console-ncn-w001-mgmt ipmitool -U root -E -H ncn-w001-mgmt -I lanplus sol activate" on host ncn-m001.
DEBUG: Executing command "screen -L -Logfile /var/log/cray/console_logs/console-ncn-m002-mgmt.log -A -m -d -S SAT-console-ncn-m002-mgmt ipmitool -U root -E -H ncn-m002-mgmt -I lanplus sol activate" on host ncn-m001.
DEBUG: Executing command "screen -L -Logfile /var/log/cray/console_logs/console-ncn-w003-mgmt.log -A -m -d -S SAT-console-ncn-w003-mgmt ipmitool -U root -E -H ncn-w003-mgmt -I lanplus sol activate" on host ncn-m001.
DEBUG: Executing command "screen -L -Logfile /var/log/cray/console_logs/console-ncn-s003-mgmt.log -A -m -d -S SAT-console-ncn-s003-mgmt ipmitool -U root -E -H ncn-s003-mgmt -I lanplus sol activate" on host ncn-m001.
DEBUG: Executing command "screen -L -Logfile /var/log/cray/console_logs/console-ncn-w004-mgmt.log -A -m -d -S SAT-console-ncn-w004-mgmt ipmitool -U root -E -H ncn-w004-mgmt -I lanplus sol activate" on host ncn-m001.
DEBUG: Executing command "screen -L -Logfile /var/log/cray/console_logs/console-ncn-s002-mgmt.log -A -m -d -S SAT-console-ncn-s002-mgmt ipmitool -U root -E -H ncn-s002-mgmt -I lanplus sol activate" on host ncn-m001.
DEBUG: Executing command "screen -L -Logfile /var/log/cray/console_logs/console-ncn-w002-mgmt.log -A -m -d -S SAT-console-ncn-w002-mgmt ipmitool -U root -E -H ncn-w002-mgmt -I lanplus sol activate" on host ncn-m001.
DEBUG: Executing command "screen -L -Logfile /var/log/cray/console_logs/console-ncn-m003-mgmt.log -A -m -d -S SAT-console-ncn-m003-mgmt ipmitool -U root -E -H ncn-m003-mgmt -I lanplus sol activate" on host ncn-m001.
DEBUG: Waiting 5 seconds to ensure console logging screen sessions remain active.
DEBUG: Executing command "screen -ls" on host ncn-m001.
DEBUG: Console logging screen sessions remain active.
INFO: Powering on NCNs and waiting up to 900 seconds for them to be reachable via SSH: ncn-s001, ncn-s002, ncn-s003
DEBUG: Entered pre_wait_action with self.send_command: True.
INFO: Sending IPMI power on command to host ncn-s003
INFO: Sending IPMI power on command to host ncn-s001
INFO: Sending IPMI power on command to host ncn-s002
INFO: Powered on NCNs: ncn-s001, ncn-s002, ncn-s003
INFO: Unfreezing Ceph
INFO: Running command: ceph osd unset noout
INFO: Command output: noout is unset
INFO: Running command: ceph osd unset norecover
INFO: Command output: norecover is unset
INFO: Running command: ceph osd unset nobackfill
INFO: Command output: nobackfill is unset
DEBUG: BEGIN: wait for ceph health
INFO: Waiting up to 60 seconds for Ceph to become healthy after unfreeze
INFO: Checking Ceph health
INFO: Ceph is healthy.
DEBUG: END: wait for ceph health. Duration: 0:00:00.639443
INFO: Ceph unfreeze completed successfully on storage NCNs.
INFO: Checking whether ceph filesystem is mounted on /etc/cray/upgrade/csm.
INFO: ceph filesystem is already mounted on /etc/cray/upgrade/csm.
INFO: Checking whether fuse.s3fs filesystem is mounted on /var/opt/cray/sdu/collection-mount.
INFO: fuse.s3fs filesystem is already mounted on /var/opt/cray/sdu/collection-mount.
INFO: Checking whether fuse.s3fs filesystem is mounted on /var/opt/cray/config-data.
INFO: fuse.s3fs filesystem is already mounted on /var/opt/cray/config-data.
INFO: Successfully restarted 'cray-sdu-rda' service on ncn-m001
INFO: Powering on NCNs and waiting up to 900 seconds for them to be reachable via SSH: ncn-m002, ncn-m003, ncn-w001, ncn-w002, ncn-w003, ncn-w004
DEBUG: Entered pre_wait_action with self.send_command: True.
INFO: Sending IPMI power on command to host ncn-w001
INFO: Sending IPMI power on command to host ncn-m002
INFO: Sending IPMI power on command to host ncn-w003
INFO: Sending IPMI power on command to host ncn-w004
INFO: Sending IPMI power on command to host ncn-w002
INFO: Sending IPMI power on command to host ncn-m003
INFO: Powered on NCNs: ncn-m002, ncn-m003, ncn-w001, ncn-w002, ncn-w003, ncn-w004
INFO: Stopping console logging on ncn-s001,ncn-w001,ncn-m002,ncn-w003,ncn-s003,ncn-w004,ncn-s002,ncn-w002,ncn-m003.
DEBUG: Executing command "ipmitool -U root -E -H ncn-s001-mgmt -I lanplus sol deactivate" on host ncn-m001.
DEBUG: Executing command "ipmitool -U root -E -H ncn-w001-mgmt -I lanplus sol deactivate" on host ncn-m001.
DEBUG: Executing command "ipmitool -U root -E -H ncn-m002-mgmt -I lanplus sol deactivate" on host ncn-m001.
DEBUG: Executing command "ipmitool -U root -E -H ncn-w003-mgmt -I lanplus sol deactivate" on host ncn-m001.
DEBUG: Executing command "ipmitool -U root -E -H ncn-s003-mgmt -I lanplus sol deactivate" on host ncn-m001.
DEBUG: Executing command "ipmitool -U root -E -H ncn-w004-mgmt -I lanplus sol deactivate" on host ncn-m001.
DEBUG: Executing command "ipmitool -U root -E -H ncn-s002-mgmt -I lanplus sol deactivate" on host ncn-m001.
DEBUG: Executing command "ipmitool -U root -E -H ncn-w002-mgmt -I lanplus sol deactivate" on host ncn-m001.
DEBUG: Executing command "ipmitool -U root -E -H ncn-m003-mgmt -I lanplus sol deactivate" on host ncn-m001.
DEBUG: Executing command "screen -ls" on host ncn-m001.
DEBUG: Executing command "screen -XS 1234038.SAT-console-ncn-m003-mgmt quit" on host ncn-m001.
DEBUG: END: boot of other management NCNs. Duration: 0:00:36.609538
INFO: Succeeded with boot of other management NCNs.
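
To compare stage timings between runs, the "END: ... Duration:" entries in output like the above can be pulled out with a short script. This is a sketch only: `stage_durations` is a hypothetical helper, and it assumes durations are always printed as H:MM:SS with optional fractional seconds (it does not handle runs longer than a day).

```python
import re
from datetime import timedelta

# Matches entries such as:
#   DEBUG: END: wait for ceph health. Duration: 0:00:00.639443
DURATION_RE = re.compile(
    r"END: (?P<label>.+?)\. Duration: "
    r"(?P<h>\d+):(?P<m>\d{2}):(?P<s>\d{2}(?:\.\d+)?)"
)


def stage_durations(log_text):
    """Map each timed step's label to its duration as a timedelta."""
    return {
        m["label"]: timedelta(
            hours=int(m["h"]), minutes=int(m["m"]), seconds=float(m["s"])
        )
        for m in DURATION_RE.finditer(log_text)
    }


# Two entries copied from the output in this comment.
sample = (
    "DEBUG: END: wait for ceph health. Duration: 0:00:00.639443\n"
    "DEBUG: END: boot of other management NCNs. Duration: 0:00:36.609538\n"
)
for label, duration in stage_durations(sample).items():
    print(f"{label}: {duration.total_seconds():.1f}s")
```

Running the same extraction on a log captured before this change would give a direct before/after comparison for the ncn-power stage.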