Open Mistrblank opened 6 years ago
Hi, we hit an issue two nights in a row when applying two different fix packs back-to-back to a single Appliance from the same playbook. We eventually got some nasty system errors and had to revert to a previous Hypervisor snapshot.
It was only this evening that I realized I should let the first fixpack complete and let the playbook end, and only then navigate in the LMI and apply the second FP with another playbook; that worked OK. As you say, your suspicion that there is a premature "resume" in the roles/handlers’ logic before the Appliance is ready makes sense.
This is firmware and fixpack season it seems.
On 9.0.7 Appliances, applying 9070_IF1 causes the Appliance to reboot automatically. There is no need to force-reboot the Appliance within the install_fixpack role, as its current implementation does, which yields the following error:
[2019-09-18 08:48:09,612] [PID:32147 TID:139756887574336] [DEBUG] [ibmsecurity.appliance.ibmappliance] [_url():31] Issuing request to: https://stha9n0kq.iad.ca.inet:443/diagnostics/restart_shutdown/reboot\n[2019-09-18 08:48:09,615] [PID:32147 TID:139756887574336] [CRITICAL] [ibmsecurity.appliance.ibmappliance] [_process_connection_error():98] Failed to connect to server.\n", "msg": "('HTTP Return code: 502', 'Failed to connect to server')", "name": "ibmsecurity.isam.appliance.reboot"
The call to “/diagnostics/restart_shutdown/reboot” fails because the Appliance is already rebooting, so this API call seems to be redundant. Not sure whether the automatic reboot after a fixpack is installed has always been there or is a recent change in the latest firmware?
To resolve the fatal error this generates, I’ve replaced the “Reboot Appliance” handler call in the install_fixpack role with “Await Appliance LMI Response”, and that works well for us.
Not sure if this should become the default behavior for everyone?
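For illustration, the "await instead of reboot" idea comes down to polling until the LMI answers again. Below is a minimal Python sketch of that loop, not the actual handler or `ibmsecurity` code: `await_lmi_response` and its `probe` callable are hypothetical names, and the timeout and interval values are assumptions.

```python
import time
from typing import Callable


def await_lmi_response(
    probe: Callable[[], bool],
    timeout: float = 600.0,   # assumed upper bound for a fixpack reboot
    interval: float = 10.0,   # assumed pause between attempts
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `probe` until it returns True or `timeout` seconds elapse.

    `probe` is a hypothetical callable that returns True once the LMI
    answers. Network-level failures (OSError and its subclasses) are
    treated as "not ready yet" rather than as fatal errors.
    """
    deadline = clock() + timeout
    while True:
        try:
            if probe():
                return True
        except OSError:
            # Appliance is still rebooting: connection refused, reset, etc.
            pass
        if clock() >= deadline:
            return False
        sleep(interval)
```

The point of the design is that a connection failure during the automatic reboot is expected and retried, instead of being raised as it is when the explicit reboot call fails with the 502 above.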
During the fixpack installation procedure, there is a call to the "Commit Changes" handler. This call tends to fail with 404 errors. I suspect the issue is that there is no wait time after the installation of the fixpack to account for fixpacks that restart the LMI or the appliance itself.
I suspect that an "Await Appliance Commit LMI Response" handler may need to be added to the notification list. However, that will require reordering handlers/main.yml in start_config so that Await occurs before Commit, since the Ansible documentation indicates that handlers are executed in the order they are defined, not the order in which they are notified.
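To make the ordering concern concrete, here is a toy Python model of that rule (notified handlers run in definition order). It is only a sketch: `handler_run_order` is a hypothetical helper, not Ansible code, and the handler lists are assumed examples rather than the real contents of handlers/main.yml.

```python
from typing import Iterable, List


def handler_run_order(defined: List[str], notified: Iterable[str]) -> List[str]:
    """Return the notified handlers in the order they are defined.

    Toy model only: the order of `notified` is deliberately ignored, and
    each handler runs at most once however often it is notified.
    """
    pending = set(notified)
    return [name for name in defined if name in pending]


# Assumed current layout: Commit is defined before Await.
current = ["Commit Changes", "Await Appliance LMI Response"]
# Assumed reordered layout: Await is defined before Commit.
reordered = ["Await Appliance LMI Response", "Commit Changes"]

# The task notifies Await first, then Commit.
notify = ["Await Appliance LMI Response", "Commit Changes"]

print(handler_run_order(current, notify))
print(handler_run_order(reordered, notify))
```

Under this model, notifying Await first does not help while Commit is defined earlier; only reordering the definitions puts the wait ahead of the commit.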