apache / cloudstack

Apache CloudStack is an open-source Infrastructure as a Service (IaaS) cloud computing platform
https://cloudstack.apache.org/
Apache License 2.0
1.83k stars 1.07k forks

Fix for race when automatically assigning IP to Vms #9240

Closed abh1sar closed 5 days ago

abh1sar commented 2 weeks ago

Description

Fixes: #7907

This PR fixes the issue where two VMs can be assigned the same IP if they are created at the same time. NetworkOrchestrator.allocateNic() calls guru.allocate(), which returns a free IP address. NetworkOrchestrator.allocateNic() then calls _nicDao.persist().

But guru.allocate() can return the same IP to two VMs if the first VM's NIC hasn't been persisted to the DB yet. Doing the whole thing in a transaction might be costly.

So the fix is to check whether the IP returned by guru.allocate() is already assigned, just before persisting the NicVO in a transaction. The check is done only for cases where IPv4 allocation might race.
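The race and the re-check described above can be illustrated with a small, self-contained Java sketch. It is not the PR's code: `NicStore`, `pickFreeIp`, `persistIfUnassigned` and the retry loop in `allocateNic` are hypothetical stand-ins for guru.allocate(), the NIC table and the transactional check, and whether the real change retries or fails the deployment on a conflict is an assumption made here only for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

public class NicIpRaceSketch {

    /** Hypothetical stand-in for the NIC table: the IPv4 addresses already persisted. */
    public static final class NicStore {
        private final Set<String> persistedIps = new HashSet<>();

        // Analogue of guru.allocate() (assumption): returns the first IP that has no
        // persisted NIC. It does not reserve the IP, so two callers can get the same one.
        public synchronized Optional<String> pickFreeIp(List<String> pool) {
            return pool.stream().filter(ip -> !persistedIps.contains(ip)).findFirst();
        }

        // Analogue of "check just before persisting, in a transaction" (assumption):
        // the check and the insert happen as one atomic step.
        // Returns false if another NIC took the IP in the meantime.
        public synchronized boolean persistIfUnassigned(String ip) {
            return persistedIps.add(ip);
        }
    }

    // Allocation with the re-check. Retrying on conflict is this sketch's choice.
    public static String allocateNic(NicStore store, List<String> pool) {
        for (int attempt = 0; attempt <= pool.size(); attempt++) {
            String ip = store.pickFreeIp(pool)
                    .orElseThrow(() -> new IllegalStateException("no free IP in pool"));
            if (store.persistIfUnassigned(ip)) {
                return ip;
            }
        }
        throw new IllegalStateException("could not allocate an IP after retries");
    }

    public static void main(String[] args) {
        List<String> pool = Arrays.asList("10.1.1.10", "10.1.1.11");
        NicStore store = new NicStore();

        // Interleave the steps by hand to force the race window:
        // both VMs pick an IP before either one persists its NIC.
        String ipForVm1 = store.pickFreeIp(pool).orElseThrow(IllegalStateException::new);
        String ipForVm2 = store.pickFreeIp(pool).orElseThrow(IllegalStateException::new);
        System.out.println("same candidate IP: " + ipForVm1.equals(ipForVm2));

        // The re-check lets only the first persist succeed.
        System.out.println("vm1 persisted: " + store.persistIfUnassigned(ipForVm1));
        System.out.println("vm2 persisted: " + store.persistIfUnassigned(ipForVm2));

        // The losing VM allocates again and gets the next free address.
        System.out.println("vm2 retry got: " + allocateNic(store, pool));
    }
}
```

Forcing the interleaving by hand, as `main` does, is similar in spirit to testing the change by setting values in a debugger: the duplicate candidate is produced deliberately, and only the check-before-persist step decides which NIC keeps the address.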

How Has This Been Tested?

It wasn't possible to reproduce the actual race, so I tested by manually setting values with the debugger and verifying that the code does what it is supposed to.

codecov[bot] commented 2 weeks ago

Codecov Report

Attention: Patch coverage is 0% with 48 lines in your changes missing coverage. Please review.

Project coverage is 14.98%. Comparing base (b2ef53b) to head (1fb61ae). Report is 70 commits behind head on 4.19.

| Files | Patch % | Lines |
|---|---|---|
| ...tack/engine/orchestration/NetworkOrchestrator.java | 0.00% | 41 Missing :warning: |
| api/src/main/java/com/cloud/vm/NicProfile.java | 0.00% | 6 Missing :warning: |
| .../java/com/cloud/network/guru/GuestNetworkGuru.java | 0.00% | 1 Missing :warning: |

Additional details and impacted files

```diff
@@             Coverage Diff              @@
##               4.19    #9240      +/-   ##
============================================
+ Coverage     14.96%   14.98%   +0.01%
- Complexity    11013    11048      +35
============================================
  Files          5377     5389      +12
  Lines        469567   470615    +1048
  Branches      60162    57503    -2659
============================================
+ Hits          70285    70517     +232
- Misses       391498   392263     +765
- Partials       7784     7835      +51
```

| [Flag](https://app.codecov.io/gh/apache/cloudstack/pull/9240/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Coverage Δ | |
|---|---|---|
| [uitests](https://app.codecov.io/gh/apache/cloudstack/pull/9240/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `4.28% <ø> (-0.02%)` | :arrow_down: |
| [unittests](https://app.codecov.io/gh/apache/cloudstack/pull/9240/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `15.69% <0.00%> (+0.01%)` | :arrow_up: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#carryforward-flags-in-the-pull-request-comment) to find out more.

DaanHoogland commented 2 weeks ago

> Check will be done only for cases where IPv4 allocation might race.

Right now, this means when allocating for the GuestNetworkGuru. Can you look at @hsato03's implementation (together with him) to see if it can be unified?

rohityadavcloud commented 1 week ago

@abh1sar can you review and address the outstanding comments? Also, could you please run packaging and smoke tests for your own PRs that are ready for review? @blueorangutan package

blueorangutan commented 1 week ago

@rohityadavcloud a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan commented 1 week ago

Packaging result [SF]: ✖️ el7 ✔️ el8 ✔️ el9 ✖️ debian ✔️ suse15. SL-JID 10097

sureshanaparti commented 1 week ago

@blueorangutan package

blueorangutan commented 1 week ago

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan commented 1 week ago

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 10101

abh1sar commented 1 week ago

@blueorangutan package

blueorangutan commented 1 week ago

@abh1sar a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan commented 1 week ago

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 10135

abh1sar commented 1 week ago

@blueorangutan test

blueorangutan commented 1 week ago

@abh1sar a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

blueorangutan commented 6 days ago

[SF] Trillian test result (tid-10631)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 41624 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9240-t10631-kvm-centos7.zip
Smoke tests completed. 131 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
|---|---|---|---|

weizhouapache commented 6 days ago

I tried to reproduce the issue on an env without this PR

image

2 VMs failed, 4 VMs succeeded (3 VMs have the same IP)

With this PR (on another env)