ManageIQ / manageiq

Mark workers associated with failed systemd units as stopped #23182

Closed by agrare 1 month ago

agrare commented 2 months ago

If we start a systemd unit and it fails, the miq_worker record associated with it can be left in "creating" and never cleaned up.

When we stop and clean up any failed systemd units, we should also mark any associated miq_worker records as stopped so that the clean_worker_records method can clean them up.

INFO -- evm: MIQ(MiqServer::WorkerManagement::Systemd#cleanup_failed_systemd_services) Disabling failed unit files: [opentofu-runner.service]
INFO -- evm: MIQ(MiqServer::WorkerManagement::Systemd#cleanup_failed_systemd_services) Stopping worker records for failed units: [opentofu-runner.service]
INFO -- evm: MIQ(MiqServer::WorkerManagement::Systemd#clean_worker_records) SQL Record for Worker [OpentofuWorker] with ID: [71], PID: [], GUID: [46e4cdf4-22b8-426>
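
For readers who want the idea without the ActiveRecord details, here is a rough, self-contained sketch of the behavior described above. It is not the code from this PR: `WorkerRecord`, its `unit_name` field, and both helper methods are hypothetical stand-ins, and the status strings are assumed from the description ("creating", "stopped").

```ruby
# Hypothetical stand-in for an miq_worker row. The real model is an
# ActiveRecord class; unit_name is an assumed field for this sketch.
WorkerRecord = Struct.new(:id, :unit_name, :status, keyword_init: true)

# Mark every record tied to a failed unit as "stopped".
# Returns the records whose status was changed.
def stop_workers_for_failed_units(workers, failed_unit_names)
  workers
    .select { |w| failed_unit_names.include?(w.unit_name) && w.status != "stopped" }
    .each { |w| w.status = "stopped" }
end

# Simplified cleanup pass: keep only records that are not stopped.
def clean_stopped_records(workers)
  workers.reject { |w| w.status == "stopped" }
end

workers = [
  WorkerRecord.new(id: 71, unit_name: "opentofu-runner.service", status: "creating"),
  WorkerRecord.new(id: 72, unit_name: "generic.service", status: "started") # made-up second worker
]

stop_workers_for_failed_units(workers, ["opentofu-runner.service"])
remaining = clean_stopped_records(workers)
puts remaining.map(&:id).inspect # prints [72]
```

Without the first step, the record for the failed unit would stay in "creating" and the cleanup pass would never pick it up, which is the situation this change is meant to avoid.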

TODO

Fixes https://github.com/ManageIQ/manageiq-providers-embedded_terraform/issues/59

miq-bot commented 1 month ago

Checked commits https://github.com/agrare/manageiq/compare/2906f85a40d41ecf5f3e68c38372bb54bc9ca5e9~...728e223638e3dd99332c48a934ed0f5dd608cf59 with ruby 3.1.5, rubocop 1.56.3, haml-lint 0.51.0, and yamllint
2 files checked, 0 offenses detected
Everything looks fine. :trophy:

agrare commented 1 month ago

Okay, I ran a live test on a master appliance build with this applied. I enabled the embedded_terraform role first and set the container_image later, and confirmed that the failed workers are marked stopped and later deleted. Once the container_image setting is set properly, the next time the worker starts up it pulls the correct image. Taking out of WIP.

Fryguy commented 1 month ago

Backported to radjabov in commit e6e6c81e8cceafbbb2be8ee4852c8aaf8bf23867.

commit e6e6c81e8cceafbbb2be8ee4852c8aaf8bf23867
Author: Jason Frey <fryguy9@gmail.com>
Date:   Fri Sep 27 16:04:07 2024 -0400

    Merge pull request #23182 from agrare/mark_workers_for_failed_units_stopped

    Mark workers associated with failed systemd units as stopped

    (cherry picked from commit de72e9e6b5d67e724113fd6852ec31867fada811)