apache / cloudstack

Apache CloudStack is an opensource Infrastructure as a Service (IaaS) cloud computing platform
https://cloudstack.apache.org/
Apache License 2.0

Prevent network disruption on adding a VPC tier for redundant VRs #9251

Closed vishesh92 closed 2 weeks ago

vishesh92 commented 2 weeks ago

Description

This PR fixes https://github.com/apache/cloudstack/issues/8108

Currently, adding a new network tier to a VPC causes a brief network disruption because keepalived's config is reloaded on both VRs. This patch avoids the disruption by performing the operations in this order:

  1. Stop keepalived on all the backup VRs. This prevents a backup VR from becoming master while keepalived on the primary VR is being reloaded.
  2. Update the config and reload on all primary VRs.
  3. Update the config and restart on all backup VRs.
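
The ordering above can be sketched as follows. This is only an illustration of the sequence, not the code in this PR: the `Router` type, `apply_tier_config`, and the three callbacks are hypothetical names, and the real change lives in the Java management server and the VR scripts.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Router:
    """Hypothetical stand-in for a redundant VR."""
    name: str
    is_primary: bool


def apply_tier_config(
    routers: Iterable[Router],
    stop: Callable[[Router], None],
    update_and_reload: Callable[[Router], None],
    update_and_restart: Callable[[Router], None],
) -> None:
    """Apply a new keepalived config in the order described above."""
    routers = list(routers)
    backups = [r for r in routers if not r.is_primary]
    primaries = [r for r in routers if r.is_primary]

    # 1. Stop keepalived on every backup so none can take over mid-reload.
    for router in backups:
        stop(router)
    # 2. Update the config and reload keepalived on every primary.
    for router in primaries:
        update_and_reload(router)
    # 3. Update the config and start keepalived again on every backup.
    for router in backups:
        update_and_restart(router)


# Usage: record the order of operations instead of touching real routers.
events: List[Tuple[str, str]] = []
apply_tier_config(
    [Router("vr-primary", True), Router("vr-backup", False)],
    stop=lambda r: events.append(("stop", r.name)),
    update_and_reload=lambda r: events.append(("reload", r.name)),
    update_and_restart=lambda r: events.append(("restart", r.name)),
)
print(events)
```

Passing the operations in as callbacks keeps the sketch independent of how commands actually reach the VRs.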

Logs before patch on Primary VR

tail -f /var/log/messages | grep keepalived
Jun 13 11:53:31 systemvm Keepalived_vrrp[10528]: (inside_network) Entering BACKUP STATE
Jun 13 11:54:55 systemvm Keepalived_vrrp[10528]: (inside_network) Backup received priority 0 advertisement
Jun 13 11:54:56 systemvm Keepalived_vrrp[10528]: (inside_network) Entering MASTER STATE
Jun 13 12:14:00 systemvm Keepalived[10527]: Reloading ...
Jun 13 12:14:00 systemvm Keepalived[10527]: Opening file '/etc/keepalived/keepalived.conf'.
Jun 13 12:14:00 systemvm Keepalived_vrrp[10528]: Reloading
Jun 13 12:14:00 systemvm Keepalived_vrrp[10528]: Opening file '/etc/keepalived/keepalived.conf'.
Jun 13 12:14:00 systemvm Keepalived_vrrp[10528]: VRRP_Script(heartbeat) considered successful on reload
Jun 13 12:14:00 systemvm Keepalived_vrrp[10528]: SECURITY VIOLATION - scripts are being executed but script_security not enabled.
Jun 13 12:14:03 systemvm Keepalived_vrrp[10528]: (inside_network) received an unexpected ip number count 2, expected 3!
Jun 13 12:14:04 systemvm Keepalived_vrrp[10528]: (inside_network) received an unexpected ip number count 2, expected 3!
Jun 13 12:14:05 systemvm Keepalived_vrrp[10528]: (inside_network) IPSEC-AH : sequence number 4009 already processed. Packet dropped. Local(4009)
Jun 13 12:14:07 systemvm Keepalived_vrrp[10528]: (inside_network) received an unexpected ip number count 2, expected 3!
Jun 13 12:14:07 systemvm Keepalived_vrrp[10528]: (inside_network) Received advert from 172.31.1.88 with lower priority 100, ours 100, forcing new election
Jun 13 12:14:07 systemvm Keepalived_vrrp[10528]: (inside_network) IPSEC-AH : Syncing seq_num - Increment seq

Logs before patch on Backup VR

Jun 13 11:54:56 systemvm Keepalived_vrrp[10628]: (inside_network) Entering BACKUP STATE
Jun 13 12:05:18 systemvm Keepalived_vrrp[10628]: (inside_network) IPSEC-AH : sequence number 3479 already processed. Packet dropped. Local(3479)
Jun 13 12:14:00 systemvm Keepalived_vrrp[10628]: (inside_network) received an unexpected ip number count 3, expected 2!
Jun 13 12:14:01 systemvm Keepalived_vrrp[10628]: (inside_network) received an unexpected ip number count 3, expected 2!
Jun 13 12:14:02 systemvm Keepalived_vrrp[10628]: (inside_network) received an unexpected ip number count 3, expected 2!
Jun 13 12:14:03 systemvm Keepalived_vrrp[10628]: (inside_network) received an unexpected ip number count 3, expected 2!
Jun 13 12:14:03 systemvm Keepalived_vrrp[10628]: (inside_network) Entering MASTER STATE
Jun 13 12:14:05 systemvm Keepalived_vrrp[10628]: (inside_network) IPSEC-AH : sequence number 4009 already processed. Packet dropped. Local(4009)
Jun 13 12:14:06 systemvm Keepalived_vrrp[10628]: (inside_network) received an unexpected ip number count 3, expected 2!
Jun 13 12:14:07 systemvm Keepalived[10627]: Reloading ...
Jun 13 12:14:07 systemvm Keepalived[10627]: Opening file '/etc/keepalived/keepalived.conf'.
Jun 13 12:14:07 systemvm Keepalived_vrrp[10628]: Reloading
Jun 13 12:14:07 systemvm Keepalived_vrrp[10628]: Opening file '/etc/keepalived/keepalived.conf'.
Jun 13 12:14:07 systemvm Keepalived_vrrp[10628]: VRRP_Script(heartbeat) considered successful on reload
Jun 13 12:14:07 systemvm Keepalived_vrrp[10628]: SECURITY VIOLATION - scripts are being executed but script_security not enabled.
Jun 13 12:14:07 systemvm Keepalived_vrrp[10628]: (inside_network) Master received advert from 172.31.1.236 with same priority 100 but higher IP address than ours
Jun 13 12:14:07 systemvm Keepalived_vrrp[10628]: (inside_network) Entering BACKUP STATE

Logs after patch on Primary VR

Jun 14 10:23:04 systemvm Keepalived[116549]: Reloading ...
Jun 14 10:23:04 systemvm Keepalived[116549]: Opening file '/etc/keepalived/keepalived.conf'.
Jun 14 10:23:04 systemvm Keepalived_vrrp[116550]: Reloading
Jun 14 10:23:04 systemvm Keepalived_vrrp[116550]: Opening file '/etc/keepalived/keepalived.conf'.
Jun 14 10:23:04 systemvm Keepalived_vrrp[116550]: VRRP_Script(heartbeat) considered successful on reload
Jun 14 10:23:04 systemvm Keepalived_vrrp[116550]: SECURITY VIOLATION - scripts are being executed but script_security not enabled.

Logs after patch on Backup VR

Jun 14 10:23:00 systemvm Keepalived[133456]: Stopping
Jun 14 10:23:01 systemvm Keepalived_vrrp[133462]: Stopped
Jun 14 10:23:01 systemvm Keepalived[133456]: Stopped Keepalived v2.1.5 (07/13,2020)
Jun 14 10:23:08 systemvm Keepalived[134884]: Starting Keepalived v2.1.5 (07/13,2020)
Jun 14 10:23:08 systemvm Keepalived[134884]: WARNING - keepalived was build for newer Linux 5.10.70, running on Linux 5.10.0-26-amd64 #1 SMP Debian 5.10.197-1 (2023-09-29)
Jun 14 10:23:08 systemvm Keepalived[134884]: Command line: '/usr/sbin/keepalived' '--dont-fork'
Jun 14 10:23:08 systemvm Keepalived[134884]: Opening file '/etc/keepalived/keepalived.conf'.
Jun 14 10:23:08 systemvm Keepalived[134884]: NOTICE: setting config option max_auto_priority should result in better keepalived performance
Jun 14 10:23:08 systemvm Keepalived[134884]: Starting VRRP child process, pid=134885
Jun 14 10:23:08 systemvm Keepalived_vrrp[134885]: Registering Kernel netlink reflector
Jun 14 10:23:08 systemvm Keepalived_vrrp[134885]: Registering Kernel netlink command channel
Jun 14 10:23:08 systemvm Keepalived_vrrp[134885]: Opening file '/etc/keepalived/keepalived.conf'.
Jun 14 10:23:08 systemvm Keepalived_vrrp[134885]: WARNING - default user 'keepalived_script' for script execution does not exist - please create.
Jun 14 10:23:08 systemvm Keepalived_vrrp[134885]: SECURITY VIOLATION - scripts are being executed but script_security not enabled.
Jun 14 10:23:08 systemvm Keepalived_vrrp[134885]: Registering gratuitous ARP shared channel
Jun 14 10:23:08 systemvm Keepalived_vrrp[134885]: VRRP_Script(heartbeat) succeeded
Jun 14 10:23:08 systemvm Keepalived_vrrp[134885]: (inside_network) Entering BACKUP STATE
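
One way to compare captures like the ones above is to scan them for VRRP state transitions. The snippet below is a small sketch that assumes the log line format shown in this description; `state_transitions` and `became_master` are hypothetical helper names, and the sample lines are copied from the backup VR logs above.

```python
import re
from typing import Iterable, List, Tuple

# Matches lines such as "... (inside_network) Entering MASTER STATE".
STATE_RE = re.compile(
    r"\((?P<instance>[^)]+)\) Entering (?P<state>MASTER|BACKUP|FAULT) STATE"
)


def state_transitions(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (vrrp_instance, state) for every state change in the log."""
    found = []
    for line in lines:
        match = STATE_RE.search(line)
        if match:
            found.append((match.group("instance"), match.group("state")))
    return found


def became_master(lines: Iterable[str]) -> bool:
    """True if any VRRP instance entered MASTER state in these lines."""
    return any(state == "MASTER" for _, state in state_transitions(lines))


# Lines copied from the backup VR logs before and after the patch.
before_backup = [
    "Jun 13 12:14:03 systemvm Keepalived_vrrp[10628]: (inside_network) Entering MASTER STATE",
    "Jun 13 12:14:07 systemvm Keepalived_vrrp[10628]: (inside_network) Entering BACKUP STATE",
]
after_backup = [
    "Jun 14 10:23:08 systemvm Keepalived_vrrp[134885]: VRRP_Script(heartbeat) succeeded",
    "Jun 14 10:23:08 systemvm Keepalived_vrrp[134885]: (inside_network) Entering BACKUP STATE",
]

print(became_master(before_backup))
print(became_master(after_backup))
```

A backup VR that enters MASTER state while a tier is being added indicates the unwanted failover this PR is meant to prevent.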

vishesh92 commented 2 weeks ago

@blueorangutan package

blueorangutan commented 2 weeks ago

@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan commented 2 weeks ago

Packaging result [SF]: ✖️ el7 ✖️ el8 ✖️ el9 ✖️ debian ✖️ suse15. SL-JID 9935

codecov[bot] commented 2 weeks ago

Codecov Report

Attention: Patch coverage is 0% with 95 lines in your changes missing coverage. Please review.

Project coverage is 14.95%. Comparing base (034a5c8) to head (241d962). Report is 11 commits behind head on 4.19.

| Files | Patch % | Lines |
|---|---|---|
| ...cloud/network/element/VpcVirtualRouterElement.java | 0.00% | 33 Missing :warning: |
| .../router/VpcVirtualNetworkApplianceManagerImpl.java | 0.00% | 33 Missing :warning: |
| ...cloudstack/agent/routing/ManageServiceCommand.java | 0.00% | 16 Missing :warning: |
| ...esource/virtualnetwork/VirtualRoutingResource.java | 0.00% | 13 Missing :warning: |

Additional details and impacted files

```diff
@@             Coverage Diff              @@
##               4.19    #9251       +/-   ##
============================================
+ Coverage      4.29%   14.95%   +10.66%
- Complexity        0    11012   +11012
============================================
  Files           363     5379    +5016
  Lines         29374   470006  +440632
  Branches       5138    58140   +53002
============================================
+ Hits           1261    70284   +69023
- Misses        27970   391938  +363968
- Partials        143     7784    +7641
```

| [Flag](https://app.codecov.io/gh/apache/cloudstack/pull/9251/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Coverage Δ | |
|---|---|---|
| [uitests](https://app.codecov.io/gh/apache/cloudstack/pull/9251/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `4.29% <ø> (-0.01%)` | :arrow_down: |
| [unittests](https://app.codecov.io/gh/apache/cloudstack/pull/9251/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `15.66% <0.00%> (?)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#carryforward-flags-in-the-pull-request-comment) to find out more.

vishesh92 commented 2 weeks ago

@blueorangutan package

blueorangutan commented 2 weeks ago

@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan commented 2 weeks ago

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 9936

vishesh92 commented 2 weeks ago

@blueorangutan package

blueorangutan commented 2 weeks ago

@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan commented 2 weeks ago

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 9949

weizhouapache commented 2 weeks ago

@blueorangutan package

blueorangutan commented 2 weeks ago

@weizhouapache a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan commented 2 weeks ago

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 9957

vishesh92 commented 2 weeks ago

@blueorangutan test

blueorangutan commented 2 weeks ago

@vishesh92 a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

blueorangutan commented 2 weeks ago

[SF] Trillian test result (tid-10454)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 43308 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9251-t10454-kvm-centos7.zip
Smoke tests completed. 130 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
|---|---|---|---|
| test_02_trigger_shutdown | Failure | 341.63 | test_safe_shutdown.py |

vishesh92 commented 2 weeks ago

@blueorangutan package

blueorangutan commented 2 weeks ago

@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan commented 2 weeks ago

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 9974

DaanHoogland commented 2 weeks ago

@blueorangutan test

blueorangutan commented 2 weeks ago

@DaanHoogland a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

blueorangutan commented 2 weeks ago

[SF] Trillian test result (tid-10466)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 42653 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9251-t10466-kvm-centos7.zip
Smoke tests completed. 130 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
|---|---|---|---|
| test_02_trigger_shutdown | Failure | 341.72 | test_safe_shutdown.py |

DaanHoogland commented 2 weeks ago

tested in a lab env: