dell / dellemc-openmanage-ansible-modules

Dell OpenManage Ansible Modules
GNU General Public License v3.0

[BUG]: Module dellemc.openmanage.ome_firmware hangs on execution #494

Closed sbeyermann closed 1 year ago

sbeyermann commented 1 year ago

Bug Description

I am trying to upgrade the firmware of one specific Dell PowerEdge R750 server using the dellemc.openmanage collection. I use the dellemc.openmanage.ome_firmware module for this, with the task below. However, when the playbook runs, execution stalls at the point where the firmware update should start. It neither aborts with an error message nor terminates after the default 30s.

I have already used other modules such as dellemc.openmanage.ome_device_info, and they worked fine, so basic communication between my Ansible host and Dell OpenManage Enterprise should not be the issue.

Although I opened this as a bug report I definitely won't rule out a user error :smile:

Component or Module Name

dellemc.openmanage.ome_firmware from collection dellemc.openmanage:7.4.0

Ansible Version

ansible 2.14.4

Python Version

Python 3.9.2

iDRAC/OME/OME-M version

Dell OpenManage Enterprise 3.10.0

Operating System

Debian 11.6

Playbook Used

playbook-test-firmware.yml

  tasks:
  - name: Update firmware of one specific device
    dellemc.openmanage.ome_firmware:
      hostname: myome.company.com
      username: admin
      password: somepassword
      validate_certs: false
      baseline_name: R750General
      device_id:
        - 10165
    delegate_to: localhost

Logs

TASK [Update firmware of one specific device] ********************************************************************************************************************************************************************************************
task path: /scripts/ansible/pod/playbook-test-firmware.yml:43
<localhost> ESTABLISH LOCAL CONNECTION FOR USER: root
<localhost> EXEC /bin/sh -c 'echo ~root && sleep 0'
<localhost> EXEC /bin/sh -c '( umask 77 && mkdir -p "` echo /root/.ansible/tmp `"&& mkdir "` echo /root/.ansible/tmp/ansible-tmp-1681284969.390742-210-67254126161417 `" && echo ansible-tmp-1681284969.390742-210-67254126161417="` echo /root/.ansible/tmp/ansible-tmp-1681284969.390742-210-67254126161417 `" ) && sleep 0'
Using module file /root/.ansible/collections/ansible_collections/dellemc/openmanage/plugins/modules/ome_firmware.py
<localhost> PUT /root/.ansible/tmp/ansible-local-206iya4s_98/tmpfxbwszls TO /root/.ansible/tmp/ansible-tmp-1681284969.390742-210-67254126161417/AnsiballZ_ome_firmware.py
<localhost> EXEC /bin/sh -c 'chmod u+x /root/.ansible/tmp/ansible-tmp-1681284969.390742-210-67254126161417/ /root/.ansible/tmp/ansible-tmp-1681284969.390742-210-67254126161417/AnsiballZ_ome_firmware.py && sleep 0'
<localhost> EXEC /bin/sh -c '/usr/bin/python3 /root/.ansible/tmp/ansible-tmp-1681284969.390742-210-67254126161417/AnsiballZ_ome_firmware.py && sleep 0'

At this point the script simply hangs (it does not stop, terminate, or time out). I do see logins happening in the OME audit log, but no firmware update jobs are created on OME or in the iDRAC job queue of the target server.

Steps to Reproduce

This is not an intermittent issue. It happens every time I execute the above playbook/task.

I already tried multiple variations of the above task, with and without check_mode enabled:

Expected Behavior

The module dellemc.openmanage.ome_firmware should create a firmware update job on OME, which in turn upgrades the firmware of the target device (identified by ID or by service tag). If anything goes wrong, e.g. an unknown firmware baseline, it should display an appropriate error message.

Actual Behavior

The module dellemc.openmanage.ome_firmware hangs and does not do anything besides logging in to OME

Screenshots

No response

Additional Information

No response

anupamaloke commented 1 year ago

@sbeyermann, could you please share the playbooks that you used for creating the repository and the firmware baseline as well? Are you using Update Manager Plugin for creating the repository?

sbeyermann commented 1 year ago

@anupamaloke Unfortunately the repository and firmware baselines were not created with Ansible but manually. And yes, we used the Update Manager Plugin (version 1.4.0.336) for creating the repository.

anupamaloke commented 1 year ago

@sbeyermann, thank you for sharing this information. This is very helpful. Could you also please share the below information:

sbeyermann commented 1 year ago

@anupamaloke Thank you for your quick reply. Here is the other information you requested:

anupamaloke commented 1 year ago

@sbeyermann, thank you for sharing the details. It seems that there is an issue with Baselines. In the JSON output that you posted above for /api/UpdateService/Baselines, @odata.count reports 6 as the total number of baselines, but the actual list contains only 2 baselines, namely CompanyESX20230323 and R750General.

Due to this, in ome.py, the get_all_report_details method might be running into an infinite loop and that's why the playbook hangs. There is something fishy going on in OME.
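To illustrate the suspected failure mode, here is a minimal Python sketch. It is not the actual get_all_report_details code from ome.py; it only assumes a loop that keeps requesting pages until it has collected @odata.count items. The names make_fake_ome, fetch_page, naive_collect and collect_guarded are hypothetical, and the fake endpoint merely mimics the 6-versus-2 mismatch described above.

```python
from typing import Callable, List


def make_fake_ome(items: List[str], reported_count: int) -> Callable[[int, int], dict]:
    """Hypothetical stand-in for a paginated OME collection (not a real client)."""
    def fetch_page(skip: int, top: int) -> dict:
        # Always advertise reported_count, even if fewer items really exist.
        return {"@odata.count": reported_count, "value": items[skip:skip + top]}
    return fetch_page


def naive_collect(fetch_page, top: int = 50, max_requests: int = 1000) -> list:
    """Count-driven loop. max_requests exists only so this demo terminates."""
    collected: list = []
    total = None
    requests_made = 0
    while total is None or len(collected) < total:
        if requests_made >= max_requests:
            raise RuntimeError("still looping after %d requests" % max_requests)
        page = fetch_page(len(collected), top)
        requests_made += 1
        total = page["@odata.count"]
        collected.extend(page["value"])  # an empty page adds nothing, so no progress
    return collected


def collect_guarded(fetch_page, top: int = 50) -> list:
    """Same loop, but it stops as soon as the server returns an empty page."""
    collected: list = []
    total = None
    while total is None or len(collected) < total:
        page = fetch_page(len(collected), top)
        total = page["@odata.count"]
        batch = page["value"]
        if not batch:
            break  # fewer items than advertised; give up instead of spinning
        collected.extend(batch)
    return collected


if __name__ == "__main__":
    fake = make_fake_ome(["CompanyESX20230323", "R750General"], reported_count=6)
    print(collect_guarded(fake))
    try:
        naive_collect(fake, max_requests=20)
    except RuntimeError as exc:
        print("naive loop:", exc)
```

Without the artificial max_requests cap, the naive loop would never return for this input, which matches the hang seen in the playbook run.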

Could you please drop us an email at OpenManageAnsible@Dell.com so that we can engage support team to investigate the issue with the OME?

sbeyermann commented 1 year ago

@anupamaloke, thank you for your analysis. If I understand it correctly, an issue with our Dell OME installation prevents the dellemc.openmanage Ansible module from functioning correctly. Is that right?

Of course I will send an e-mail with the issue details to OpenManageAnsible@Dell.com so that we can create a support request at Dell for OME.

sbeyermann commented 1 year ago

@anupamaloke, just a quick update on this issue: I opened a service request with the Dell OpenManage Enterprise team and they were able to confirm your suspicion. There is a bug in the interaction between Dell OpenManage Enterprise 3.10.1 and Dell OpenManage Enterprise Update Manager plug-in 1.4.0.336. This leads to an increased @odata.count counter when repositories are getting added and deleted in Update Manager plug-in.

I was told that they are working on fixing this issue with the next release of Dell OpenManage Enterprise and/or the Dell OpenManage Enterprise Update Manager plug-in. There is no ETA yet, but it could be June/July 2023.

As soon as the new version is available I will update our environment and try again.

anupamaloke commented 1 year ago

@sbeyermann, thank you for sharing the update. We are also tracking this internally and will update accordingly.

vzovko commented 1 year ago

Same issue here with OME Version 3.10.1 (Build 51), Update Manager 1.4.0.336 and PowerEdge R650. The count doesn't match the actual number of elements in the array.

Waiting for the fix :)

sbeyermann commented 1 year ago

I've received feedback from Dell that the fix for this issue will probably be included in Dell OpenManage Enterprise 4.0 (ETA October 2023). But there seems to be a "private fix" available now that you can request by opening a ticket with Dell. I just did that and will report whether it works in our environment.

sbeyermann commented 1 year ago

I received the ominous "private fix" from Dell and have some more information. Basically it is a new version, 1.4.1.364, of the Dell OpenManage Enterprise Update Manager Plugin. I installed it in our environment, and afterwards the HTTP GET to https://<OME-IP>/api/UpdateService/Baselines returned the correct number of baselines in the @odata.count field.
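For anyone who wants to run the same check against their own OME, a small Python sketch along these lines could compare the reported count with the list actually returned. The helper name baseline_count_mismatch is hypothetical, and the sample payload is made up from the numbers mentioned earlier in this thread rather than copied from a real response. It also assumes the whole collection fits in a single response page, since a paginated response can legitimately contain fewer entries than @odata.count.

```python
import json


def baseline_count_mismatch(payload: dict) -> int:
    """Return the reported @odata.count minus the number of entries in 'value'."""
    reported = int(payload.get("@odata.count", 0))
    actual = len(payload.get("value", []))
    return reported - actual


if __name__ == "__main__":
    # Illustrative payload only: a count of 6 but just two baselines listed.
    sample = json.loads(
        '{"@odata.count": 6, "value": ['
        '{"Name": "CompanyESX20230323"}, {"Name": "R750General"}]}'
    )
    diff = baseline_count_mismatch(sample)
    if diff != 0:
        print("mismatch: count is off by", diff)
    else:
        print("count matches the returned list")
```

A result of 0 on the body returned by /api/UpdateService/Baselines would correspond to the corrected behaviour described above.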

With the new version of the Update Manager Plugin installed, I was able to successfully trigger the installation of firmware using the dellemc.openmanage.ome_firmware module.

@anupamaloke, I guess we can close the issue here. As soon as Dell releases OME 4.0 with the new Update Manager Plugin the issue will be resolved.