Using install_firmware fails on start_config_wait_time

Mistrblank commented 5 years ago

Is this a variable I'm supposed to supply, or is it configured elsewhere? It's referenced in the start_config handlers. It's not obvious either way, but it is failing for me and I don't have any documentation of what that variable even does.

sygilber commented 5 years ago

Hi. You are not alone in getting this error from the hardcoded timeout in the start_config handler while running install_firmware in a playbook. I guess more and more of us are now trying to apply 9.0.5 on top of a previous version.

I began work with enthusiasm on a version of start_config in which no timeout would be left hardcoded in the role, so that we could increase the timeout to longer values such as 240 s or more when installing firmware. I did achieve that goal with the code change, only to find that I would get the same failures. I tried numerous times (reverting to the previous hypervisor snapshot between runs) and the results were not stable: sometimes it passed, sometimes not. So in the end I decided not to push this code to GitHub. We are promoting this code change in other environments and so far the failure has not happened again. I feel this is a race/timing condition, or it could be due to differences in Ansible version.

All this is to say that I feel there are numerous variables that can influence whether this condition occurs.

We will continue to pay attention while we get 9.0.5 rolled out everywhere first, and share observations/traces made along the process.

I know this does not resolve your issue, but the sole fact that you shared this behavior reassures us that we are not alone and that together we can crush it.

Mistrblank commented 5 years ago

Same situation; it makes me feel a little better that someone else is seeing similar results. For now, I've separated my firmware and fixpack plays from my core configuration playbooks, as they need to run in separate executions to keep these failures from stopping the remainder of the run.

sygilber commented 5 years ago

Hi, we deployed 9.0.5 in a higher environment and again hit a case where it complains with the same errors.
I am starting to think that there could be a timing/logic issue. Let me explain. The firmware image has to be uploaded to the appliance and applied (and RP configs and other things get upgraded, whether before or after the reboot I am unsure). If this takes longer than the hardcoded timeout in the start_config handler, is it possible that the wait_for Python code in the ibmsecurity lib "completes" while the appliance has not yet even initiated its full reboot? In other words, it could start probing for LMI URL availability and wrongly conclude that the appliance reboot is done, when the reboot has not been initiated yet or has only just begun. I have no trace to prove it. To get more serious about it we will need more time and more traces. Just sharing the idea for now in case it rings a bell for anyone else. Ram, does the code check whether the appliance actually rebooted (last boot date) before continuing?
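To make the suspected race concrete, here is a minimal Python sketch of the check being asked about: do not treat the reboot as done until the appliance reports a boot marker different from the one recorded before the install, and only then trust the LMI probe. This is not the ibmsecurity implementation. `wait_for_reboot`, `get_boot_marker` and `is_lmi_up` are hypothetical names, and how a boot marker (for example a last boot date) would actually be read from the appliance is an assumption left to the caller.

```python
import time
from typing import Callable, Optional


def wait_for_reboot(
    boot_marker_before: str,
    get_boot_marker: Callable[[], Optional[str]],  # hypothetical: returns None while unreachable
    is_lmi_up: Callable[[], bool],                 # hypothetical: True when the LMI URL answers
    timeout: float = 240.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True only after a *new* boot marker is seen and the LMI answers.

    Probing the LMI alone can succeed before the appliance has even started
    rebooting; comparing boot markers guards against that false positive.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        try:
            marker = get_boot_marker()
            rebooted = marker is not None and marker != boot_marker_before
            if rebooted and is_lmi_up():
                return True
        except OSError:
            # Connection errors are expected while the appliance is down.
            pass
        sleep(interval)
    return False
```

The caller would record `boot_marker_before` ahead of the firmware install and pass it in, so an LMI that is still answering from the old boot never counts as success.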

ram-ibm commented 5 years ago

Recent Fixpacks require a MANUAL reboot. The playbooks expect the reboot to happen automatically. Please be sure to install fixpack and reboot before proceeding. Reboot is very important.

ram-ibm commented 5 years ago

Need to fix the playbook / role to kick off a reboot manually.

ram-ibm commented 5 years ago

That is, after checking the code to see whether we are detecting completion of the fixpack with the new setup.

Mistrblank commented 5 years ago

So hardcoding the wait into my inventory or the playbook "fixes" the issue. But this should either be given a default value in start_config or be properly documented.

I just want to be careful, though: the issue I raised is actually with install_firmware, and you're referencing install_fixpack. The same issue crops up with fixpacks, though, and it's similarly handled by hardcoding a variable. Again, it's not intuitive if I have to manually trace through the code of every handler to determine the additional variables needed for every call I'm making (though this is potentially just a documentation issue, depending on how you want to handle it).

However, to your point, the reboot after every fixpack is also understood, and I have written dummy plays to kick off the reboot handler. But the code for install_fixpack still immediately calls a commit, which is doomed to fail because the appliance has not been rebooted before the commit. (Fixing this, by the way, will require some reworking of the handlers, which may not be possible because commit is the first handler in the role.) I have to either write my plays to bypass failures of install_fixpack because of this call that will never succeed, or break up the execution of my plays with a reboot play/playbook. None of this is ideal.

For example:

sygilber commented 5 years ago

Hi all, may I ask why a commit is required after applying a fixpack? Is it a requirement of the current code or of the appliance?

IBM-Security / isam-ansible-roles

Using install_firmware fails on start_config_wait_time #92