ManageIQ / manageiq

ManageIQ Open-Source Management Platform
https://manageiq.org
Apache License 2.0

User must retire a MIQ service twice to succeed #22556

Open · johny5v opened this issue 1 year ago

johny5v commented 1 year ago

Users must retire a service twice for it to succeed.

Recently we upgraded from Kasparov to Najdorf-1.3, and one type of our service retirement started to behave strangely. We have automatic provisioning for K8S clusters: the service template is based on an Ansible Tower job, which creates VMs on OpenStack via Terraform and installs the K8S clusters. Finally, when the MIQ OpenStack provider discovers them, we link the new VMs to the MIQ service.

For retirement we have modified /Cloud/VM/Retirement/StateMachines/VMRetirement/Default by adding an assertion step at the top. This step checks the VM's custom keys for a Terraform ID and skips the rest of the VM retirement state machine if such a key exists on the VM. The service retirement task then calls another Ansible Tower job, which performs a Terraform destroy. This worked fine until we migrated to Najdorf a month ago. Now when a user starts a retirement on a K8S service, the VM retirement tasks are spawned and processed, but nothing happens: the retirement request stays in the Active state forever, and the log ends with:

[----] I, [2023-06-07T11:30:12.769456 #2201:93a8]  INFO -- automation: Q-task_id([r50662_vm_retire_task_80773]) <AEMethod [/KBCZ-Openstack/Cloud/VM/Retirement/StateMachines/VMRetirement/update_retirement_status]> Ending
[----] I, [2023-06-07T11:30:12.769678 #2201:93a8]  INFO -- automation: Q-task_id([r50662_vm_retire_task_80773]) Method exited with rc=MIQ_OK
[----] I, [2023-06-07T11:30:12.770929 #2201:93a8]  INFO -- automation: Q-task_id([r50662_vm_retire_task_80773]) Next State=[]
[----] I, [2023-06-07T11:30:12.771374 #2201:93a8]  INFO -- automation: Q-task_id([r50662_vm_retire_task_80773]) Followed  Relationship [miqaedb:/Cloud/VM/Retirement/StateMachines/VMRetirement/SkipBecauseTerraform#create]
[----] I, [2023-06-07T11:30:12.772595 #2201:93a8]  INFO -- automation: Q-task_id([r50662_vm_retire_task_80773]) Followed  Relationship [miqaedb:/cloud/VM/Lifecycle/Retirement#create]
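For reference, a minimal sketch of the kind of assertion method described above, written against the $evm automate API. The custom-key name (terraform_id) and the target state (FinishRetirement) are assumptions for illustration, not the poster's actual values; adjust both to match your state machine:

# Assertion step at the top of the VMRetirement state machine (sketch).
# If the VM carries a Terraform ID custom key, jump ahead to the final
# bookkeeping state so the provider-specific retirement steps are skipped;
# the actual teardown is done later by the Ansible Tower / Terraform job.
vm = $evm.root['vm']
raise 'VM not found in $evm.root' if vm.nil?

terraform_id = vm.custom_get('terraform_id') # assumed key name
if terraform_id
  $evm.log(:info, "VM #{vm.name} is Terraform-managed (#{terraform_id}); skipping VM retirement steps")
  $evm.root['ae_next_state'] = 'FinishRetirement' # assumed target state
end
$evm.root['ae_result'] = 'ok'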

Repeated attempts lead to the same result. So I replaced the Cloud VM retirement state machine with my own version, which performs only the FinishRetirement step. Now when a user tries to retire this service for the first time, all of its VMs are changed to the Retired state in MIQ (although the request still stays in the Active state forever). Then, on a subsequent retirement attempt, the service retirement task is finally spawned in the log, the VMs are physically destroyed via the Ansible Tower Terraform job, and the MIQ service is retired. This second retirement request finishes as a success. I can provide the automation logs for both requests if needed.
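When a request gets stuck like this, its state and the state of its subtasks can be inspected from a Rails console on the appliance. This is a generic diagnostic sketch, not a procedure from the thread; the models (MiqRequest, ServiceRetireRequest, miq_request_tasks) exist in ManageIQ, but the filter values may need adjusting for your version:

# rails console from the vmdb directory on the appliance.
# List service retire requests that never left the active state,
# along with each of their subtasks.
stuck = MiqRequest.where(:type => "ServiceRetireRequest", :request_state => "active")
stuck.each do |req|
  puts "request #{req.id}: #{req.description} state=#{req.request_state} status=#{req.status}"
  req.miq_request_tasks.each do |task|
    puts "  task #{task.id} (#{task.type}) state=#{task.state} status=#{task.status}"
  end
end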

Any help or idea would be appreciated. Thank you.

jbarson47 commented 1 year ago

In case it's useful for investigating the issue, we are seeing the same behavior in our Najdorf deployment for service retirements, with the same final log lines from the VM retire tasks for the Service Retire request ID. However, filtering the logs by request ID AND "service_retire", we see the following entries before the VM retires, which likely explain why the request is "dying" after the VM retire tasks complete (though, as mentioned, the service retire request stays active in perpetuity and does not error out by itself).

ERROR -- evm: Q-task_id([r856059_service_retire_request_856059])
/var/www/miq/vmdb/app/models/miq_request_task.rb:125:in `task_check_on_delivery'
/var/www/miq/vmdb/app/models/miq_retire_task.rb:18:in `deliver_to_automate'
/var/www/miq/vmdb/app/models/service_retire_task.rb:53:in `tap'
/var/www/miq/vmdb/app/models/service_retire_task.rb:53:in `block in create_retire_subtasks'
/opt/manageiq/manageiq-gemset/gems/activerecord-6.0.5.1/lib/active_record/relation/delegation.rb:88:in `each'
/opt/manageiq/manageiq-gemset/gems/activerecord-6.0.5.1/lib/active_record/relation/delegation.rb:88:in `each'
/var/www/miq/vmdb/app/models/service_retire_task.rb:39:in `collect'
/var/www/miq/vmdb/app/models/service_retire_task.rb:39:in `create_retire_subtasks'
/var/www/miq/vmdb/app/models/service_retire_task.rb:33:in `block in after_request_task_create'
/opt/manageiq/manageiq-gemset/gems/activerecord-6.0.5.1/lib/active_record/relation/delegation.rb:88:in `each'
/opt/manageiq/manageiq-gemset/gems/activerecord-6.0.5.1/lib/active_record/relation/delegation.rb:88:in `each'
/var/www/miq/vmdb/app/models/service_retire_task.rb:30:in `after_request_task_create'
/var/www/miq/vmdb/app/models/miq_request.rb:493:in `create_request_task'
/var/www/miq/vmdb/app/models/miq_request.rb:465:in `block in create_request_tasks'
/var/www/miq/vmdb/app/models/miq_request.rb:464:in `each'
/var/www/miq/vmdb/app/models/miq_request.rb:464:in `create_request_tasks'
/var/www/miq/vmdb/app/models/miq_queue.rb:484:in `block in dispatch_method'
/usr/share/ruby/timeout.rb:95:in `block in timeout'
/usr/share/ruby/timeout.rb:33:in `block in catch'
/usr/share/ruby/timeout.rb:33:in `catch'
/usr/share/ruby/timeout.rb:33:in `catch'
/usr/share/ruby/timeout.rb:110:in `timeout'
/var/www/miq/vmdb/app/models/miq_queue.rb:482:in `dispatch_method'
/var/www/miq/vmdb/app/models/miq_queue.rb:459:in `block in deliver'
/var/www/miq/vmdb/app/models/user.rb:382:in `w
ERROR -- evm: Q-task_id([r856059_service_retire_request_856059]) [RuntimeError]: Service Retire request is already being processed Method:[block (2 levels) in <class:LogProxy>]
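For context, the error above comes out of a delivery-time guard: per the trace, task_check_on_delivery raises when the retire task is dispatched to automate while the work is still considered active, so the stuck request never makes progress. Below is a simplified, self-contained sketch of that pattern, not the actual ManageIQ source; the class names and the ACTIVE_STATES set are illustrative assumptions:

# Simplified reconstruction of a "check on delivery" guard (illustrative only).
class FakeRequest
  ACTIVE_STATES = %w[queued pending active].freeze # assumed set of states
  attr_reader :state, :description

  def initialize(state, description)
    @state, @description = state, description
  end
end

class FakeRetireTask
  def initialize(request)
    @request = request
  end

  # Mirrors the shape of task_check_on_delivery in the trace above:
  # refuse to deliver to automate while the request is still active.
  def deliver_to_automate
    if FakeRequest::ACTIVE_STATES.include?(@request.state)
      raise "#{@request.description} request is already being processed"
    end
    puts "delivering #{@request.description} to automate"
  end
end

FakeRetireTask.new(FakeRequest.new("active", "Service Retire")).deliver_to_automate
# => RuntimeError: Service Retire request is already being processed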
miq-bot commented 10 months ago

This issue has been automatically marked as stale because it has not been updated for at least 3 months.

If you can still reproduce this issue on the current release or on master, please reply with all of the information you have about it in order to keep the issue open.

Thank you for all your contributions! More information about the ManageIQ triage process can be found in the triage process documentation.
