apache / cloudstack

Apache CloudStack is an opensource Infrastructure as a Service (IaaS) cloud computing platform
https://cloudstack.apache.org/
Apache License 2.0

Shutdown expunged resources cleanup executor properly, and allow other components to configure/start/stop on error #9723

Closed sureshanaparti closed 1 month ago

sureshanaparti commented 2 months ago

Description

This PR shuts down the expunged resources cleanup executor only when the executor object is available (that is, when the config expunged.resources.purge.enabled is true), allows other components to configure/start/stop when one component hits an error, and adds some logs in the component lifecycle classes.
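A minimal sketch of the "continue on error" idea, assuming a simplified lifecycle loop. The `Component` interface and the `LifecycleSketch` class are hypothetical names for illustration and are not CloudStack's actual lifecycle classes; the point is only that the try/catch sits inside the loop, so one failing bean cannot abort the rest.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a component with a lifecycle; not CloudStack's interface.
interface Component {
    String name();
    void stop() throws Exception;
}

final class LifecycleSketch {
    /**
     * Stops every component in order. A failure in one component is logged
     * and recorded, but the remaining components are still stopped.
     * Returns the names of the components that failed to stop.
     */
    static List<String> stopAll(List<Component> components) {
        List<String> failed = new ArrayList<>();
        for (Component c : components) {
            try {
                System.out.println("stopping bean " + c.name());
                c.stop();
            } catch (Exception e) {
                // Caught per component, so the loop moves on to the next bean.
                System.out.println("Error on stopping bean " + c.name() + " - " + e.getMessage());
                failed.add(c.name());
            }
        }
        return failed;
    }
}
```

With a single try/catch wrapped around the whole loop instead, the first exception would end the iteration and every later component would be left running.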

Noticed this exception with custom logs; the remaining components fail to stop after this exception.

WARN  [o.a.c.s.l.CloudStackExtendedLifeCycleStart] (SpringContextShutdownHook:null) (logid:) Error on stopping beans - null
java.lang.NullPointerException
        at org.apache.cloudstack.resource.ResourceCleanupServiceImpl.stop(ResourceCleanupServiceImpl.java:584)
        at org.apache.cloudstack.spring.lifecycle.CloudStackExtendedLifeCycle$2.with(CloudStackExtendedLifeCycle.java:105)
        at org.apache.cloudstack.spring.lifecycle.CloudStackExtendedLifeCycle.with(CloudStackExtendedLifeCycle.java:159)
        at org.apache.cloudstack.spring.lifecycle.CloudStackExtendedLifeCycle.stopBeans(CloudStackExtendedLifeCycle.java:101)
        at org.apache.cloudstack.spring.lifecycle.CloudStackExtendedLifeCycleStart.stop(CloudStackExtendedLifeCycleStart.java:32)
        at org.apache.cloudstack.spring.lifecycle.AbstractSmartLifeCycle.stop(AbstractSmartLifeCycle.java:49)
        at org.springframework.context.support.DefaultLifecycleProcessor.doStop(DefaultLifecycleProcessor.java:234)
        at org.springframework.context.support.DefaultLifecycleProcessor.access$300(DefaultLifecycleProcessor.java:54)
        at org.springframework.context.support.DefaultLifecycleProcessor$LifecycleGroup.stop(DefaultLifecycleProcessor.java:373)
        at org.springframework.context.support.DefaultLifecycleProcessor.stopBeans(DefaultLifecycleProcessor.java:206)
        at org.springframework.context.support.DefaultLifecycleProcessor.onClose(DefaultLifecycleProcessor.java:129)
        at org.springframework.context.support.AbstractApplicationContext.doClose(AbstractApplicationContext.java:1069)
        at org.springframework.context.support.AbstractApplicationContext$1.run(AbstractApplicationContext.java:993)
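The trace above is consistent with `stop()` dereferencing an executor that was never created because purging was disabled. A minimal sketch of the guard, assuming that cause; `CleanupServiceSketch`, its fields, and `hasExecutor()` are hypothetical and do not mirror the real `ResourceCleanupServiceImpl`:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;

// Hypothetical service; class and field names are illustrative only.
final class CleanupServiceSketch {
    private final boolean purgeEnabled;
    private ScheduledExecutorService executor; // stays null when purging is disabled

    CleanupServiceSketch(boolean purgeEnabled) {
        this.purgeEnabled = purgeEnabled;
    }

    boolean start() {
        // The executor is only created when the feature is enabled.
        if (purgeEnabled) {
            executor = Executors.newSingleThreadScheduledExecutor();
        }
        return true;
    }

    boolean stop() {
        // Without this null check, stop() throws a NullPointerException
        // whenever start() skipped creating the executor.
        if (executor != null) {
            executor.shutdownNow();
            executor = null;
        }
        return true;
    }

    boolean hasExecutor() {
        return executor != null;
    }
}
```

Calling `stop()` on an instance built with `purgeEnabled = false` returns normally instead of throwing, which is the behaviour the description asks for.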

Fixes #9722

Types of changes

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

Bug Severity

Screenshots (if appropriate):

Mgmt2 service stopped =>

[Screenshot: MgmtServers_4 20 0 0-SNAPSHOT_Fixed]

How Has This Been Tested?

Manually tested management server start & stop.

2024-09-23 18:20:51,111 INFO  [o.a.c.s.l.CloudStackExtendedLifeCycle] (SpringContextShutdownHook:[]) (logid:) stopping bean ClusterServiceServletAdapter
2024-09-23 18:20:51,111 INFO  [o.a.c.s.l.CloudStackExtendedLifeCycle] (SpringContextShutdownHook:[]) (logid:) stopping bean ClusterManagerImpl

[root@ol8 ~]# cat stopping-beans-check.txt | wc -l
665

stopping-beans-check.txt
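The same check can be sketched in code: pull the bean name out of each "stopping bean" log line and count the matches. This assumes the log format shown in the two sample lines above; `StoppingBeansCheck` and its regular expression are hypothetical helpers, not part of CloudStack.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StoppingBeansCheck {
    // Matches the bean name that follows "stopping bean " in a log line.
    private static final Pattern STOPPING = Pattern.compile("stopping bean (\\S+)");

    /** Returns the bean names, in log order, for every "stopping bean" line. */
    static List<String> stoppedBeans(List<String> logLines) {
        List<String> names = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = STOPPING.matcher(line);
            if (m.find()) {
                names.add(m.group(1));
            }
        }
        return names;
    }
}
```

Comparing the size of the returned list against the number of registered beans is one way to confirm that no component was skipped during shutdown.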

How did you try to break this feature and the system with this change?

sureshanaparti commented 2 months ago

@blueorangutan package

blueorangutan commented 2 months ago

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

codecov[bot] commented 2 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 4.48%. Comparing base (b1f683d) to head (a6eb0b0). Report is 44 commits behind head on main.

:exclamation: There is a different number of reports uploaded between BASE (b1f683d) and HEAD (a6eb0b0). Click for more details.

HEAD has 1 upload less than BASE

| Flag | BASE (b1f683d) | HEAD (a6eb0b0) |
|------|------|------|
|unittests|1|0|

Additional details and impacted files

```diff
@@              Coverage Diff              @@
##               main    #9723       +/-   ##
============================================
- Coverage     15.77%    4.48%   -11.30%
============================================
  Files          5621      392     -5229
  Lines        491564    32154   -459410
  Branches      61174     5672    -55502
============================================
- Hits          77562     1441    -76121
+ Misses       405545    30707   -374838
+ Partials       8457        6     -8451
```

| [Flag](https://app.codecov.io/gh/apache/cloudstack/pull/9723/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Coverage Δ | |
|---|---|---|
| [uitests](https://app.codecov.io/gh/apache/cloudstack/pull/9723/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `4.48% <ø> (+0.43%)` | :arrow_up: |
| [unittests](https://app.codecov.io/gh/apache/cloudstack/pull/9723/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `?` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

blueorangutan commented 2 months ago

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11181

sureshanaparti commented 2 months ago

@blueorangutan test

blueorangutan commented 2 months ago

@sureshanaparti a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

sureshanaparti commented 2 months ago

@blueorangutan package

sureshanaparti commented 2 months ago

@blueorangutan package

blueorangutan commented 2 months ago

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan commented 2 months ago

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11182

DaanHoogland commented 2 months ago

@sureshanaparti, this looks like a good cleanup. I wonder what it fixes other than just the looks of the code, though. We have two shutdown issues:

  1. prolonged time of shutdown
  2. no status update for shutdown MSes

Does this address both, @sureshanaparti? (I can see you showed some evidence for the second.)

sureshanaparti commented 2 months ago

> @sureshanaparti, this looks like a good cleanup. I wonder what it fixes other than just the looks of the code, though. We have two shutdown issues:
>
>   1. prolonged time of shutdown
>   2. no status update for shutdown MSes
>
> Does this address both, @sureshanaparti? (I can see you showed some evidence for the second.)

@DaanHoogland this updates the MS status to Down when the service is stopped or shut down. (It doesn't address the prolonged shutdown time.)

blueorangutan commented 2 months ago

[SF] Trillian test result (tid-11539)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 62071 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9723-t11539-kvm-ol8.zip
Smoke tests completed. 134 look OK, 2 have errors, 5 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
|---|---|---|---|
| ContextSuite context=TestISOUsage>:setup | Error | 0.00 | test_usage.py |
| test_03_secured_to_nonsecured_vm_migration | Error | 375.65 | test_vm_life_cycle.py |
| test_04_nonsecured_to_secured_vm_migration | Error | 282.02 | test_vm_life_cycle.py |
| all_test_vpc_redundant | Skipped | --- | test_vpc_redundant.py |
| all_test_vpc_router_nics | Skipped | --- | test_vpc_router_nics.py |
| all_test_vpc_vpn | Skipped | --- | test_vpc_vpn.py |
| all_test_webhook_delivery | Skipped | --- | test_webhook_delivery.py |
| all_test_webhook_lifecycle | Skipped | --- | test_webhook_lifecycle.py |

DaanHoogland commented 1 month ago

@blueorangutan test keepEnv

blueorangutan commented 1 month ago

@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

blueorangutan commented 1 month ago

[SF] Trillian test result (tid-11551)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 125751 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9723-t11551-kvm-ol8.zip
Smoke tests completed. 133 look OK, 5 have errors, 3 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
|---|---|---|---|
| ContextSuite context=TestClusterDRS>:setup | Error | 0.00 | test_cluster_drs.py |
| ContextSuite context=TestISOUsage>:setup | Error | 0.00 | test_usage.py |
| test_01_secure_vm_migration | Error | 135.10 | test_vm_life_cycle.py |
| test_01_secure_vm_migration | Error | 135.11 | test_vm_life_cycle.py |
| test_12_start_vm_multiple_volumes_allocated | Error | 1109.52 | test_vm_life_cycle.py |
| test_12_start_vm_multiple_volumes_allocated | Error | 1109.53 | test_vm_life_cycle.py |
| test_13_destroy_and_expunge_vm | Error | 32.81 | test_vm_life_cycle.py |
| test_14_destroy_vm_delete_protection | Error | 38.62 | test_vm_life_cycle.py |
| ContextSuite context=TestVMLifeCycle>:teardown | Error | 81.55 | test_vm_life_cycle.py |
| ContextSuite context=TestCreateVolume>:setup | Error | 0.00 | test_volumes.py |
| ContextSuite context=TestVolumeEncryption>:setup | Error | 0.00 | test_volumes.py |
| ContextSuite context=TestVolumes>:setup | Error | 0.00 | test_volumes.py |
| test_01_create_redundant_VPC_2tiers_4VMs_4IPs_4PF_ACL | Error | 41204.43 | test_vpc_redundant.py |
| test_02_redundant_VPC_default_routes | Error | 50.88 | test_vpc_redundant.py |
| test_03_create_redundant_VPC_1tier_2VMs_2IPs_2PF_ACL_reboot_routers | Error | 172.41 | test_vpc_redundant.py |
| test_04_rvpc_network_garbage_collector_nics | Error | 67.11 | test_vpc_redundant.py |
| test_05_rvpc_multi_tiers | Error | 218.98 | test_vpc_redundant.py |
| ContextSuite context=TestVPCRedundancy>:teardown | Error | 326.28 | test_vpc_redundant.py |
| all_test_vm_strict_host_tags | Skipped | --- | test_vm_strict_host_tags.py |
| all_test_vnf_templates | Skipped | --- | test_vnf_templates.py |
| all_test_vpc_ipv6 | Skipped | --- | test_vpc_ipv6.py |

blueorangutan commented 1 month ago

[SF] Trillian test result (tid-11554)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 59928 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9723-t11554-kvm-ol8.zip
Smoke tests completed. 122 look OK, 1 have errors, 18 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
|---|---|---|---|
| ContextSuite context=TestISOUsage>:setup | Error | 0.00 | test_usage.py |
| ContextSuite context=TestNatRuleUsage>:setup | Error | 973.44 | test_usage.py |
| ContextSuite context=TestPublicIPUsage>:setup | Error | 1990.90 | test_usage.py |
| ContextSuite context=TestSnapshotUsage>:setup | Error | 2414.42 | test_usage.py |
| ContextSuite context=TestTemplateUsage>:setup | Error | 2540.51 | test_usage.py |
| ContextSuite context=TestVmUsage>:setup | Error | 2627.83 | test_usage.py |
| ContextSuite context=TestVolumeUsage>:setup | Error | 2792.50 | test_usage.py |
| ContextSuite context=TestVpnUsage>:setup | Error | 2859.77 | test_usage.py |
| all_test_vm_autoscaling | Skipped | --- | test_vm_autoscaling.py |
| all_test_vm_deployment_planner | Skipped | --- | test_vm_deployment_planner.py |
| all_test_vm_life_cycle | Skipped | --- | test_vm_life_cycle.py |
| all_test_vm_lifecycle_unmanage_import | Skipped | --- | test_vm_lifecycle_unmanage_import.py |
| all_test_vm_schedule | Skipped | --- | test_vm_schedule.py |
| all_test_vm_snapshot_kvm | Skipped | --- | test_vm_snapshot_kvm.py |
| all_test_vm_snapshots | Skipped | --- | test_vm_snapshots.py |
| all_test_vm_strict_host_tags | Skipped | --- | test_vm_strict_host_tags.py |
| all_test_vnf_templates | Skipped | --- | test_vnf_templates.py |
| all_test_volumes | Skipped | --- | test_volumes.py |
| all_test_vpc_ipv6 | Skipped | --- | test_vpc_ipv6.py |
| all_test_vpc_redundant | Skipped | --- | test_vpc_redundant.py |
| all_test_vpc_router_nics | Skipped | --- | test_vpc_router_nics.py |
| all_test_vpc_vpn | Skipped | --- | test_vpc_vpn.py |
| all_test_webhook_delivery | Skipped | --- | test_webhook_delivery.py |
| all_test_webhook_lifecycle | Skipped | --- | test_webhook_lifecycle.py |
| all_test_host_maintenance | Skipped | --- | test_host_maintenance.py |
| all_test_hostha_kvm | Skipped | --- | test_hostha_kvm.py |

sureshanaparti commented 1 month ago

@blueorangutan package

blueorangutan commented 1 month ago

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan commented 1 month ago

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11214

sureshanaparti commented 1 month ago

@blueorangutan test

blueorangutan commented 1 month ago

@sureshanaparti a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

DaanHoogland commented 1 month ago

Tested. Both the status update and the prolonged shutdown time have been fixed by this.

blueorangutan commented 1 month ago

[SF] Trillian test result (tid-11558)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 66748 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9723-t11558-kvm-ol8.zip
Smoke tests completed. 139 look OK, 2 have errors, 0 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
|---|---|---|---|
| ContextSuite context=TestISOUsage>:setup | Error | 0.00 | test_usage.py |
| test_01_migrate_VM_and_root_volume | Error | 136.04 | test_vm_life_cycle.py |

rohityadavcloud commented 1 month ago

cc @JoaoJandre - this has been tested, should we merge this?