IBM / ansible-power-hmc

Developer contributions for Ansible Automation on Power (HMC)
GNU General Public License v3.0

HMC_update_upgrade module finish with FAILED: Hmc not responding after reboot #84

Closed jurajhajka closed 2 years ago

jurajhajka commented 2 years ago

Describe the bug
The HMC_update_upgrade module finishes with FAILED: Hmc not responding after reboot

TASK [debug] **
Tuesday 23 August 2022 07:51:03 EDT (0:00:00.092) 0:00:02.326 **
ok: [vhmc_ansible] => missing_ifixes:

TASK [Installing missing ifixes] **
Tuesday 23 August 2022 07:51:03 EDT (0:00:00.067) 0:00:02.394 **
included: /home/ansau/project/fiserv-ansible/playbooks/hmc_update.yml for vhmc_ansible

TASK [Update the HMC to the V9R2M952 build level from sftp location] **
Tuesday 23 August 2022 07:51:03 EDT (0:00:00.105) 0:00:02.499 **
fatal: [vhmc_ansible]: FAILED! => changed=false
  msg: 'FAILED: Hmc not responding after reboot'
...ignoring

TASK [pause] **
Tuesday 23 August 2022 08:53:52 EDT (1:02:48.995) 1:02:51.495 **
[pause]

hscroot@vhmcansible:~> who -b
system boot Aug 23 11:53
hscroot@vhmcansible:~> lshmc -V
"version= Version: 9
 Release: 1
 Service Pack: 942
HMC Build level 2011270432
MH01759 - HMC V9R1 M920 [x86_64]
MH01787 - Required fix for HMC V9R1 M920 [x86_64]
MH01789 - HMC V9R1 Service Pack 1 Release (M921) [x86_64]
MH01800 - iFix for HMC V9R1 M921
MH01808 - iFix for HMC V9R1 M921
MH01810 - HMC V9R1 M930
MH01820 - iFix for HMC V9R1 M910+
MH01825 - iFix for HMC V9R1 M930
MH01857 - Save upgrade fix for HMC V9R1 M910+
MH01876 - HMC V9R1 M942
","base_version=V9R1
"

hscroot@vhmcansible:~>

Expected behavior
The module should reconnect to the rebooted HMC and check the versions.


Environment
HMC: tested with several versions of HMC code [V9R952, V9R1M910]
Python 3.7.12
OpenSSH_8.1p1, OpenSSL 1.0.2u


AnilVijayan commented 2 years ago

Can you say exactly which driver level the upgrade started from (I guess it was 942) and which level it targeted? Please share the playbook if possible. Did the HMC come back online after the failure? Is it pingable? Also, please share the output of the command lshmc -v

jurajhajka commented 2 years ago

I applied iFix MH01857 (for the 941 level). The HMC came back online; the output above was taken after the reboot and shows that the iFix is installed. Then I updated to the 950 level with the same result: the HMC came back at the correct updated level, but the playbook finished with the same error.

hscroot@vhmcansible:~> lshmc -V
"version= Version: 9
 Release: 2
 Service Pack: 950
HMC Build level 2010230054
","base_version=V9R2
"

AnilVijayan commented 2 years ago

We typically see the error "Hmc not responding after reboot" when the HMC is still not pingable after waiting 60 minutes following the update/upgrade. Can you confirm whether it really took that long to come back online after the reboot?
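The check described here amounts to polling the HMC until it answers or a deadline passes. The following is a minimal sketch of that idea, not the module's actual code: `wait_until_reachable`, its parameters, and the injectable `probe`, `clock`, and `sleep` callables are all hypothetical names chosen for illustration.

```python
import time


def wait_until_reachable(probe, timeout_s=60 * 60, interval_s=30,
                         clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it returns True or timeout_s seconds elapse.

    Every name here is hypothetical. probe is any zero-argument callable
    that returns True once the host answers (for example a ping wrapper).
    """
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated host that starts answering on the third probe.
    answers = iter([False, False, True])
    print(wait_until_reachable(lambda: next(answers),
                               timeout_s=5, interval_s=0))
```

With a loop of this shape, a probe that never reports success (for instance because it misreads the ping output) keeps polling until the full timeout expires even though the host is already up, which would be consistent with the roughly one-hour task duration in the log above.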

jurajhajka commented 2 years ago

MH01857 is a very small iFix, only a few KB. The vHMC was back online within 10 minutes. The upgrade to 950 also took less than 60 minutes.

jurajhajka commented 2 years ago

Some upgrades/patches could take more than 60 minutes.

AnilVijayan commented 2 years ago

Can you run the Python snippet below from the Ansible control node (the node on which the playbook is triggered)? Replace <hmc_ip> with the HMC IP address.

import re
import subprocess
def pingTest(i_host):
        pattern = re.compile(r"(\d) received")
        report = ("No response", "Partial Response", "Alive")
        cmd = "ping -c 2 " + i_host.strip()

        result = "No response"
        with subprocess.Popen(cmd, shell=True, executable="/bin/bash",
                              stdout=subprocess.PIPE,
                              stderr=subprocess.PIPE) as proc:

            stdout_value, stderr_value = proc.communicate()
            if isinstance(stdout_value, bytes):
                stdout_value = stdout_value.decode("ascii")

            igot = re.findall(pattern, stdout_value)
            if igot:
                result = report[int(igot[0])]

        return result

print(pingTest("<hmc_ip>"))

jurajhajka commented 2 years ago

test

AnilVijayan commented 2 years ago

It looks like the output of the ping command on your control node differs slightly from what is usually expected on a Linux machine. Instead of
2 packets transmitted, 2 received, 0% packet loss, time 1025ms
it gives
2 packets transmitted, 2 packets received, 0% packet loss, time 1025ms
so the pattern (\d) received never matches and the result stays at "No response".
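The mismatch can be reproduced without an HMC by running both patterns against the two summary lines quoted above. This is only an illustrative sketch; the constant names are made up for the example, and the two patterns are the ones that appear in the snippets in this thread.

```python
import re

# Summary lines as quoted in this thread (names are illustrative).
EXPECTED_SUMMARY = "2 packets transmitted, 2 received, 0% packet loss, time 1025ms"
REPORTED_SUMMARY = "2 packets transmitted, 2 packets received, 0% packet loss, time 1025ms"

original = re.compile(r"(\d) received")
relaxed = re.compile(r"(\d) (packets\s)?received")

for label, line in (("expected", EXPECTED_SUMMARY),
                    ("reported", REPORTED_SUMMARY)):
    # Columns: label, original pattern matched, relaxed pattern matched.
    print(label, bool(original.search(line)), bool(relaxed.search(line)))
```

This prints `expected True True` and `reported False True`: the original pattern only recognises the first format, while the pattern with the optional `packets` group accepts both.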

Can you please run the modified snippet below to confirm that?

import re
import subprocess
def pingTest(i_host):
        pattern = re.compile(r"(\d) (packets\s)?received")
        report = ("No response", "Partial Response", "Alive")
        cmd = "ping -c 2 " + i_host.strip()

        result = "No response"
        with subprocess.Popen(cmd, shell=True, executable="/bin/bash",
                              stdout=subprocess.PIPE,
                              stderr=subprocess.PIPE) as proc:

            stdout_value, stderr_value = proc.communicate()
            if isinstance(stdout_value, bytes):
                stdout_value = stdout_value.decode("ascii")

            igot = re.findall(pattern, stdout_value)
            if igot:
                result = report[int(igot[0][0])]

        return result

print(pingTest("<hmc_ip>"))

I'm curious which Linux flavour you are using on the control node.

jurajhajka commented 2 years ago

We are running the control node on AIX.

jurajhajka commented 2 years ago

ansau@a9tvap105:/home/ansau$ python hmc_test2.py
Alive
ansau@a9tvap105:/home/ansau$

Looks much better.

jurajhajka commented 2 years ago

diff hmc_test.py hmc_test2.py
4c4
< pattern = re.compile(r"(\d) received")
---
> pattern = re.compile(r"(\d) (packets\s)?received")
8c9
< with subprocess.Popen(cmd, shell=True, executable="/usr/bin/bash",
---
> with subprocess.Popen(cmd, shell=True, executable="/bin/bash",
18c19
< result = report[int(igot[0])]
---
> result = report[int(igot[0][0])]
23d23
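The change from `igot[0]` to `igot[0][0]` in the diff follows from how `re.findall` behaves: with one capture group it returns a list of strings, and with two or more it returns a list of tuples. A short sketch using the summary line quoted earlier in this thread; the non-capturing variant is shown only for comparison and is not what the snippet in this thread uses.

```python
import re

line = "2 packets transmitted, 2 packets received, 0% packet loss, time 1025ms"

# One capture group: findall returns a list of strings.
one_group = re.findall(r"(\d) packets received", line)
# Two capture groups: findall returns a list of tuples.
two_groups = re.findall(r"(\d) (packets\s)?received", line)
# A non-capturing group (?:...) keeps the result a list of strings.
non_capturing = re.findall(r"(\d) (?:packets\s)?received", line)

print(one_group)       # ['2']
print(two_groups)      # [('2', 'packets ')]
print(non_capturing)   # ['2']

# int() on the tuple itself would raise TypeError, hence the extra index.
print(int(two_groups[0][0]))  # 2
```

Either the extra index or a non-capturing group works; the first keeps the pattern unchanged from the one posted above, the second keeps the indexing unchanged.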

AnilVijayan commented 2 years ago

The fix was added in commit a9bd5b2a497e3. It will be available in the latest version, v1.6.0.

jurajhajka commented 2 years ago

Thank you.