Closed vdombrovski closed 4 months ago
thanks @vdombrovski , this will (obviously) be hard to reproduce in a lab environment. the true/false flag of the databag should ensure that IPs are removed, so it not being merged properly on unassigment is the issue as you already stated.
Do you have further logging from the VR that would help identifying the culprit?
(I'll try it in a lab env, but have little hope it will be reliably reproducible)
@DaanHoogland not sure if this helps, in /var/log/cloud.log I'm getting the same log repeated for all IPs:
2024-04-04 04:43:29,474 INFO Add: rule=-I VPN_XX.YY.25.156 -m state --state RELATED,ESTABLISHED -j ACCEPT table=mangle
2024-04-04 04:43:29,475 INFO Executing: iptables -t mangle -I VPN_XX.YY.25.156 -m state --state RELATED,ESTABLISHED -j ACCEPT
2024-04-04 04:43:30,158 INFO Executing: iptables -t mangle -D VPN_XX.YY.25.156 -m state --state RELATED,ESTABLISHED -j ACCEPT
2024-04-04 04:43:30,162 INFO Delete rule -D VPN_XX.YY.25.156 -m state --state RELATED,ESTABLISHED -j ACCEPT from table mangle
2024-04-04 04:43:36,448 INFO Address found in DataBag ==> {u'public_ip': u'XX.YY.25.156', u'one_to_one_nat': False, u'nic_dev_id': 2, u'network': u'XX.YY.25.0/24', u'first_i_p': False, u'netmask': u'255.255.255.0', u'source_nat': False, u'broadcast': u'XX.YY.25.255', u'add': True, u'is_private_gateway': False, u'nw_type': u'public', u'device': u'eth2', u'vif_mac_address': u'1e:00:63:00:04:b5', u'cidr': u'XX.YY.25.156/24', u'gateway': u'XX.YY.25.1', u'new_nic': False, u'size': u'24'}
2024-04-04 04:43:36,448 INFO Address XX.YY.25.156/24 on device eth2 already configured
2024-04-04 04:43:36,461 INFO Executing: arping -c 1 -I eth2 -A -U -s XX.YY.25.156 XX.YY.25.1
Regarding DataBag logs specifically (again in /var/log/cloud.log), on some instances we are getting this WARNING, without any other explanation:
2024-04-23 13:11:27,089 WARNING Unable to find and process databag for file: ip_associations.json.a81b47b8-2a91-42a2-a529-1de42849049c, for json type=ip_associations
I will setup a "brute-force" type test case that mirrors what I've described in "Steps to reproduce" using either CMK or Terraform; I'll let you know if I can get it to reproduce.
oh, that,
2024-04-23 13:11:27,089 WARNING Unable to find and process databag for file: ip_associations.json.a81b47b8-2a91-42a2-a529-1de42849049c, for json type=ip_associations
, sounds like a weird timing issue, as if two merge deamons are tryimg to process the file at the same time and one has already moved it...???
and it seems here,
2024-04-04 04:43:30,162 INFO Delete rule -D VPN_XX.YY.25.156 -m state --state RELATED,ESTABLISHED -j ACCEPT from table mangle
2024-04-04 04:43:36,448 INFO Address found in DataBag ==> {u'public_ip': u'XX.YY.25.156', u'one_to_one_nat': False, u'nic_dev_id': 2, u'network': u'XX.YY.25.0/24', u'first_i_p': False, u'netmask': u'255.255.255.0', u'source_nat': False, u'broadcast': u'XX.YY.25.255', u'add': True, u'is_private_gateway': False, u'nw_type': u'public', u'device': u'eth2', u'vif_mac_address': u'1e:00:63:00:04:b5', u'cidr': u'XX.YY.25.156/24', u'gateway': u'XX.YY.25.1', u'new_nic': False, u'size': u'24'}
, that almost 0.3 second the after the deletion it is added again.
did a quick test, everything looks good this may need intensive testing to reproduce the issue
@DaanHoogland here is how I successfully reproduced the issue on our QA platform:
56 bytes from 1e:00:4d:00:00:c1 (XX.YY.17.30): index=959 time=275.810 usec
56 bytes from 1e:00:6b:00:00:c2 (XX.YY.17.30): index=960 time=337.038 usec
You were right, this is a timing issue. Here is the raw cmk code to reproduce:
Terminal 1:
while true; do bash ./test1.sh; done
Terminal 2:
while true; do bash ./test2.sh; done
#test1.sh
cmk associate ipaddress projectid=05019f40-8552-4856-9cf1-02ae1b7e2621 zoneid=3adfd657-7e4e-4613-99da-49b3a0ef9db7 ipaddress=XX.YY.17.30 account=qa networkid=c6a4bc87-93cc-499c-8fdc-4e44b4d0fca0 # Network A
cmk createLoadBalancerRule algorithm=roundrobin publicipid=f1988f51-c9e5-46e4-9393-2a873b2610e0 publicport=22 privateport=22 name=vm-a
cmk disassociate ipaddress id=f1988f51-c9e5-46e4-9393-2a873b2610e0
cmk associate ipaddress projectid=05019f40-8552-4856-9cf1-02ae1b7e2621 zoneid=3adfd657-7e4e-4613-99da-49b3a0ef9db7 ipaddress=XX.YY.17.30 account=qa networkid=7e2aa0f1-c536-4162-be43-aea277ae80aa # network B
cmk createLoadBalancerRule algorithm=roundrobin publicipid=f1988f51-c9e5-46e4-9393-2a873b2610e0 publicport=22 privateport=22 name=vm-b
cmk disassociate ipaddress id=f1988f51-c9e5-46e4-9393-2a873b2610e0
#test2.sh
cmk associate ipaddress projectid=05019f40-8552-4856-9cf1-02ae1b7e2621 zoneid=3adfd657-7e4e-4613-99da-49b3a0ef9db7 ipaddress=XX.YY.17.30 account=qa networkid=c6a4bc87-93cc-499c-8fdc-4e44b4d0fca0 # Network A
cmk createLoadBalancerRule algorithm=roundrobin publicipid=f1988f51-c9e5-46e4-9393-2a873b2610e0 publicport=23 privateport=23 name=vm-a
cmk createLoadBalancerRule algorithm=roundrobin publicipid=f1988f51-c9e5-46e4-9393-2a873b2610e0 publicport=24 privateport=24 name=vm-a
cmk disassociate ipaddress id=f1988f51-c9e5-46e4-9393-2a873b2610e0
cmk associate ipaddress projectid=05019f40-8552-4856-9cf1-02ae1b7e2621 zoneid=3adfd657-7e4e-4613-99da-49b3a0ef9db7 ipaddress=XX.YY.17.30 account=qa networkid=7e2aa0f1-c536-4162-be43-aea277ae80aa # network B
cmk createLoadBalancerRule algorithm=roundrobin publicipid=f1988f51-c9e5-46e4-9393-2a873b2610e0 publicport=23 privateport=23 name=vm-b
cmk createLoadBalancerRule algorithm=roundrobin publicipid=f1988f51-c9e5-46e4-9393-2a873b2610e0 publicport=24 privateport=24 name=vm-b
cmk disassociate ipaddress id=f1988f51-c9e5-46e4-9393-2a873b2610e0
Now I understand that handling concurrency isn't really a priority for this project (see another issue from last year: https://github.com/apache/cloudstack/issues/7907)
However, with the introduction of more vectors of automation (Terraform, Kubernetes), I believe that these types of issues might become really common. In our case, the IP conflicts have impacted more than 10 IPs total on a pool of 400 IPs, which is 2.5%.
Hi
We need that to be fix, how can we help ?
Best
Don't be putting the project down like that, concurrency is one of the main pillars of the project. We have not always been too good at it (as a project), admitted. Manpower vs priorities is the issue here, i think.
Thanks for the investigation @vdombrovski ! Can you do a test on 4.19 , or preferably main as well ?
@bragonznx , code investigation, design, PR, testing (big demand on that) are all needed. What can you offer ?
cheers guys.
@DaanHoogland as I said, this doesn't seem to be a priority at the current time, even though it might be one of the main pillars of the project. I'm not judging here, I'm simply stating the facts based on the change logs. I also know that people working on this project have put a lot of effort recently into enabling alternative L3-L7 solutions that might be more robust than the virtual router (such as the support of VNF appliance and Tungsten Fabric's successor), which might in term enable production-grade uses without the use of proprietary components.
Regarding testing the new release, I am not able to perform these tests on other versions as we don't have a QA platform in those releases (we have put updates on hold following all the migration issues that were reported by people on the users mailing list). We are currently waiting for the 4.18.2.0 to be released as our next upgrade candidate. Maybe is there a particular commit / diff / change from the tree that catches your eye that would help in solving this issue?
In the meantime, what could prevent this bug from occurring in a lab environment? Maybe this is something that a concurrent test case might catch?
We are currently waiting for the 4.18.2.0 to be released as our next upgrade candidate.
it has been released ;)
As for how to prevent it, I have no ready answer. When creating the situation at fault, does cloudstack in its DB have any proof of the double assignment?
@DaanHoogland yes correct, released after I wrote my comment, which means we will surely plan our tests now.
As for proof inside the DB, I've ran more tests, and I can most definitely introduce inconsistencies. For example, my last test got an IP with allocated=NULL but state=Allocated, which in practice means this ip is now unusable until I manually reset it in the database (it cannot be assigned, it cannot be released).
select * from user_ip_address where public_ip_address="XX.YY.17.30" \G;
*************************** 1. row ***************************
id: 107
uuid: f1988f51-c9e5-46e4-9393-2a873b2610e0
account_id: NULL
domain_id: NULL
public_ip_address: XX.YY.17.30
data_center_id: 1
source_nat: 0
allocated: NULL
vlan_db_id: 2
one_to_one_nat: 0
vm_id: NULL
state: Allocated
mac_address: 139
source_network_id: 200
network_id: 225
physical_network_id: 200
is_system: 0
vpc_id: NULL
dnat_vmip: NULL
is_portable: 0
display: 1
removed: NULL
created: NULL
rule_state: NULL
forsystemvms: 0
Note that all these cases are introduced without any modifications to my original test; I'm simply changing the timings of when I start test1.sh and test2.sh
I'll run some more tests, and report any additional anomalies I find.
Hello
Do we have any news on this ? I believe it's too important breaking production, to wait for a sponsor to cover this patch
Best regards
cc @weizhouapache @sureshanaparti to triage
@vdombrovski @bragonznx have you tried to reproduce this issue against CloudStack 4.18.2.0 or 4.19.0.1 ? Anyone is welcome to propose a fix.
@DaanHoogland here is how I successfully reproduced the issue on our QA platform:
56 bytes from 1e:00:4d:00:00:c1 (XX.YY.17.30): index=959 time=275.810 usec 56 bytes from 1e:00:6b:00:00:c2 (XX.YY.17.30): index=960 time=337.038 usec
You were right, this is a timing issue. Here is the raw cmk code to reproduce:
- Create 2 isolated networks, network-a and network-b (mine are ids c6a4bc87-93cc-499c-8fdc-4e44b4d0fca0 and 7e2aa0f1-c536-4162-be43-aea277ae80aa
- You don't need to add any VMs, only playing with virtual routers is enough to cause the issue
- Target a specific IP address, in my case XX.YY.17.30 (id=f1988f51-c9e5-46e4-9393-2a873b2610e0)
- Create 2 scripts as follows (make sure cmk works properly on your machine, I'm using 6.2.0)
- Run both scripts in parallel in a while true loop as follows:
Terminal 1:
while true; do bash ./test1.sh; done
Terminal 2:
while true; do bash ./test2.sh; done
- Test with arping on the IP (arping XX.YY.17.30), it should take around 2-3 mins on a platform with 0 load.
#test1.sh cmk associate ipaddress projectid=05019f40-8552-4856-9cf1-02ae1b7e2621 zoneid=3adfd657-7e4e-4613-99da-49b3a0ef9db7 ipaddress=XX.YY.17.30 account=qa networkid=c6a4bc87-93cc-499c-8fdc-4e44b4d0fca0 # Network A cmk createLoadBalancerRule algorithm=roundrobin publicipid=f1988f51-c9e5-46e4-9393-2a873b2610e0 publicport=22 privateport=22 name=vm-a cmk disassociate ipaddress id=f1988f51-c9e5-46e4-9393-2a873b2610e0 cmk associate ipaddress projectid=05019f40-8552-4856-9cf1-02ae1b7e2621 zoneid=3adfd657-7e4e-4613-99da-49b3a0ef9db7 ipaddress=XX.YY.17.30 account=qa networkid=7e2aa0f1-c536-4162-be43-aea277ae80aa # network B cmk createLoadBalancerRule algorithm=roundrobin publicipid=f1988f51-c9e5-46e4-9393-2a873b2610e0 publicport=22 privateport=22 name=vm-b cmk disassociate ipaddress id=f1988f51-c9e5-46e4-9393-2a873b2610e0
#test2.sh cmk associate ipaddress projectid=05019f40-8552-4856-9cf1-02ae1b7e2621 zoneid=3adfd657-7e4e-4613-99da-49b3a0ef9db7 ipaddress=XX.YY.17.30 account=qa networkid=c6a4bc87-93cc-499c-8fdc-4e44b4d0fca0 # Network A cmk createLoadBalancerRule algorithm=roundrobin publicipid=f1988f51-c9e5-46e4-9393-2a873b2610e0 publicport=23 privateport=23 name=vm-a cmk createLoadBalancerRule algorithm=roundrobin publicipid=f1988f51-c9e5-46e4-9393-2a873b2610e0 publicport=24 privateport=24 name=vm-a cmk disassociate ipaddress id=f1988f51-c9e5-46e4-9393-2a873b2610e0 cmk associate ipaddress projectid=05019f40-8552-4856-9cf1-02ae1b7e2621 zoneid=3adfd657-7e4e-4613-99da-49b3a0ef9db7 ipaddress=XX.YY.17.30 account=qa networkid=7e2aa0f1-c536-4162-be43-aea277ae80aa # network B cmk createLoadBalancerRule algorithm=roundrobin publicipid=f1988f51-c9e5-46e4-9393-2a873b2610e0 publicport=23 privateport=23 name=vm-b cmk createLoadBalancerRule algorithm=roundrobin publicipid=f1988f51-c9e5-46e4-9393-2a873b2610e0 publicport=24 privateport=24 name=vm-b cmk disassociate ipaddress id=f1988f51-c9e5-46e4-9393-2a873b2610e0
Now I understand that handling concurrency isn't really a priority for this project (see another issue from last year: #7907)
However, with the introduction of more vectors of automation (Terraform, Kubernetes), I believe that these types of issues might become really common. In our case, the IP conflicts have impacted more than 10 IPs total on a pool of 400 IPs, which is 2.5%.
thanks @vdombrovski
I adapted the two scripts you shared, and run them in parallel for around 2 hours with a public IP, two isolated networks, two vms
I did not face any issue.
I tested with latest main branch (4.20-SNAPSHOT), not 4.17.2
Can you share the full management-server.log for further investigation ?
@weizhouapache because it's a timing issue, you need to make sure that both scripts run with a delay, otherwise you are getting a favorable case where both operations target network A, then both target network B etc..
Try modifying test2 to first target network B, and add a random sleep after disassociating the IP; Both of these changes greatly help with reproducting the issue (I could reproduce again now in < 1min), like this (leave test1 the exact same):
#test2.sh
cmk associate ipaddress projectid=05019f40-8552-4856-9cf1-02ae1b7e2621 zoneid=3adfd657-7e4e-4613-99da-49b3a0ef9db7 ipaddress=XX.YY.17.30 account=qa networkid=7e2aa0f1-c536-4162-be43-aea277ae80aa # network B
cmk createLoadBalancerRule algorithm=roundrobin publicipid=f1988f51-c9e5-46e4-9393-2a873b2610e0 publicport=23 privateport=23 name=vm-b
cmk createLoadBalancerRule algorithm=roundrobin publicipid=f1988f51-c9e5-46e4-9393-2a873b2610e0 publicport=24 privateport=24 name=vm-b
cmk disassociate ipaddress id=f1988f51-c9e5-46e4-9393-2a873b2610e0
sleep $((RANDOM % 10))
cmk associate ipaddress projectid=05019f40-8552-4856-9cf1-02ae1b7e2621 zoneid=3adfd657-7e4e-4613-99da-49b3a0ef9db7 ipaddress=XX.YY.17.30 account=qa networkid=c6a4bc87-93cc-499c-8fdc-4e44b4d0fca0 # Network A
cmk createLoadBalancerRule algorithm=roundrobin publicipid=f1988f51-c9e5-46e4-9393-2a873b2610e0 publicport=23 privateport=23 name=vm-a
cmk createLoadBalancerRule algorithm=roundrobin publicipid=f1988f51-c9e5-46e4-9393-2a873b2610e0 publicport=24 privateport=24 name=vm-a
cmk disassociate ipaddress id=f1988f51-c9e5-46e4-9393-2a873b2610e0
sleep $((RANDOM % 10))
As requested, here are the relevant logs from the manager, filtered for assignment+release of IP addresses, at the exact time the conflict occurs:
2024-05-02 14:32:41,747 DEBUG [c.c.a.ApiServlet] (qtp1750498848-6674:ctx-b0b01e22) (logid:ace12149) ===START=== X.Y.191.80 -- GET account=test-admin&apiKey=REDACTED&command=associateIpAddress&ipaddress=XX.YY.17.30&networkid=c6a4bc87-93cc-499c-8fdc-4e44b4d0fca0&projectid=05019f40-8552-4856-9cf1-02ae1b7e2621&response=json&zoneid=3adfd657-7e4e-4613-99da-49b3a0ef9db7&signature=pcavcaveiIHGjw%2FlWwFeH3mX63Q%3D
2024-05-02 14:32:41,805 DEBUG [c.c.a.ApiServlet] (qtp1750498848-6674:ctx-b0b01e22 ctx-083bb8ea ctx-1ad4ec66) (logid:ace12149) ===END=== X.Y.191.80 -- GET account=test-admin&apiKey=REDACTED&command=associateIpAddress&ipaddress=XX.YY.17.30&networkid=c6a4bc87-93cc-499c-8fdc-4e44b4d0fca0&projectid=05019f40-8552-4856-9cf1-02ae1b7e2621&response=json&zoneid=3adfd657-7e4e-4613-99da-49b3a0ef9db7&signature=pcavcaveiIHGjw%2FlWwFeH3mX63Q%3D
2024-05-02 14:32:41,837 DEBUG [c.c.n.IpAddressManagerImpl] (API-Job-Executor-30:ctx-4262738c job-2777 ctx-cd5a7ce3) (logid:eb86be5e) Successfully associated ip address XX.YY.17.30 to network Network {"id": 225, "name": "network-A", "uuid": "c6a4bc87-93cc-499c-8fdc-4e44b4d0fca0", "networkofferingid": 9}
2024-05-02 14:32:41,850 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-30:ctx-4262738c job-2777 ctx-cd5a7ce3) (logid:eb86be5e) Complete async job-2777, jobStatus: SUCCEEDED, resultCode: 0, result: org.apache.cloudstack.api.response.IPAddressResponse/ipaddress/{"id":"f1988f51-c9e5-46e4-9393-2a873b2610e0","ipaddress":"XX.YY.17.30","allocated":"2024-05-02T14:32:41+0200","zoneid":"3adfd657-7e4e-4613-99da-49b3a0ef9db7","zonename":"ZONE","issourcenat":"false","projectid":"05019f40-8552-4856-9cf1-02ae1b7e2621","project":"testcpu","domainid":"e8858164-61cb-47ae-b05e-21163381e5ff","domain":"RND","forvirtualnetwork":"true","vlanid":"5c4f2de6-4555-4f8c-8af2-f2fd325bf101","vlanname":"vlan://untagged","isstaticnat":"false","issystem":"false","associatednetworkid":"c6a4bc87-93cc-499c-8fdc-4e44b4d0fca0","associatednetworkname":"network-A","networkid":"08fb8c61-c810-487b-b931-1911a72a0f3e","state":"Allocated","physicalnetworkid":"cc445c30-6484-47e5-b13b-ffdd6554c02b","tags":[],"isportable":"false","fordisplay":"true","hasannotations":"false"}
2024-05-02 14:32:42,902 DEBUG [c.c.a.ApiServlet] (qtp1750498848-6814:ctx-826cfc0d) (logid:08064c3d) ===START=== X.Y.191.80 -- GET account=test-admin&apiKey=REDACTED&command=associateIpAddress&ipaddress=XX.YY.17.30&networkid=c6a4bc87-93cc-499c-8fdc-4e44b4d0fca0&projectid=05019f40-8552-4856-9cf1-02ae1b7e2621&response=json&zoneid=3adfd657-7e4e-4613-99da-49b3a0ef9db7&signature=pcavcaveiIHGjw%2FlWwFeH3mX63Q%3D
2024-05-02 14:32:42,932 DEBUG [c.c.a.ApiServlet] (qtp1750498848-6814:ctx-826cfc0d ctx-96b1e3fd ctx-d8a477c6) (logid:08064c3d) ===END=== X.Y.191.80 -- GET account=test-admin&apiKey=REDACTED&command=associateIpAddress&ipaddress=XX.YY.17.30&networkid=c6a4bc87-93cc-499c-8fdc-4e44b4d0fca0&projectid=05019f40-8552-4856-9cf1-02ae1b7e2621&response=json&zoneid=3adfd657-7e4e-4613-99da-49b3a0ef9db7&signature=pcavcaveiIHGjw%2FlWwFeH3mX63Q%3D
2024-05-02 14:32:48,278 DEBUG [c.c.a.ApiServlet] (qtp1750498848-6705:ctx-3cafb699) (logid:b65f7570) ===START=== X.Y.191.80 -- GET apiKey=REDACTED&command=disassociateIpAddress&id=f1988f51-c9e5-46e4-9393-2a873b2610e0&response=json&signature=02Sf59NUDKw%2FKg%2Fju1gg%2B1229mQ%3D
2024-05-02 14:32:48,301 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (qtp1750498848-6705:ctx-3cafb699 ctx-59398c9e ctx-8f5a2116) (logid:b65f7570) submit async job-2780, details: AsyncJobVO: {id:2780, userId: 2, accountId: 2, instanceType: IpAddress, instanceId: 107, cmd: org.apache.cloudstack.api.command.user.address.DisassociateIPAddrCmd, cmdInfo: {"apiKey":"REDACTED","signature":"02Sf59NUDKw/Kg/ju1gg+1229mQ\u003d","response":"json","ctxUserId":"2","httpmethod":"GET","ctxStartEventId":"70154","id":"f1988f51-c9e5-46e4-9393-2a873b2610e0","ctxDetails":"{\"interface com.cloud.network.IpAddress\":\"f1988f51-c9e5-46e4-9393-2a873b2610e0\"}","ctxAccountId":"2","uuid":"f1988f51-c9e5-46e4-9393-2a873b2610e0","cmdEventType":"NET.IPRELEASE"}, cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0, result: null, initMsid: 198690281380792, completeMsid: null, lastUpdated: null, lastPolled: null, created: null, removed: null}
2024-05-02 14:32:48,301 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-33:ctx-7c0ac951 job-2780) (logid:94060328) Executing AsyncJobVO: {id:2780, userId: 2, accountId: 2, instanceType: IpAddress, instanceId: 107, cmd: org.apache.cloudstack.api.command.user.address.DisassociateIPAddrCmd, cmdInfo: {"apiKey":"REDACTED","signature":"02Sf59NUDKw/Kg/ju1gg+1229mQ\u003d","response":"json","ctxUserId":"2","httpmethod":"GET","ctxStartEventId":"70154","id":"f1988f51-c9e5-46e4-9393-2a873b2610e0","ctxDetails":"{\"interface com.cloud.network.IpAddress\":\"f1988f51-c9e5-
46e4-9393-2a873b2610e0\"}","ctxAccountId":"2","uuid":"f1988f51-c9e5-46e4-9393-2a873b2610e0","cmdEventType":"NET.IPRELEASE"}, cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0, result: null, initMsid: 198690281380792, completeMsid: null, lastUpdated: null, lastPolled: null, created: null, removed: null}
2024-05-02 14:32:48,302 DEBUG [c.c.a.ApiServlet] (qtp1750498848-6705:ctx-3cafb699 ctx-59398c9e ctx-8f5a2116) (logid:b65f7570) ===END=== X.Y.191.80 -- GET apiKey=REDACTED&command=disassociateIpAddress&id=f1988f51-c9e5-46e4-9393-2a873b2610e0&response=json&signature=02Sf59NUDKw%2FKg%2Fju1gg%2B1229mQ%3D
2024-05-02 14:32:51,347 DEBUG [c.c.a.ApiServlet] (qtp1750498848-6807:ctx-ac7656e2) (logid:ae5760e6) ===START=== X.Y.191.80 -- GET apiKey=REDACTED&command=disassociateIpAddress&id=f1988f51-c9e5-46e4-9393-2a873b2610e0&response=json&signature=02Sf59NUDKw%2FKg%2Fju1gg%2B1229mQ%3D
2024-05-02 14:32:51,362 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (qtp1750498848-6807:ctx-ac7656e2 ctx-07d762b2 ctx-c606fd19) (logid:ae5760e6) submit async job-2782, details: AsyncJobVO: {id:2782, userId: 2, accountId: 2, instanceType: IpAddress, instanceId: 107, cmd: org.apache.cloudstack.api.command.user.address.DisassociateIPAddrCmd, cmdInfo: {"apiKey":"REDACTED","signature":"02Sf59NUDKw/Kg/ju1gg+1229mQ\u003d","response":"json","ctxUserId":"2","httpmethod":"GET","ctxStartEventId":"70161","id":"f1988f51-c9e5-46e4-9393-2a873b2610e0","ctxDetails":"{\"interface com.cloud.network.IpAddress\":\"f1988f51-c9e5-46e4-9393-2a873b2610e0\"}","ctxAccountId":"2","uuid":"f1988f51-c9e5-46e4-9393-2a873b2610e0","cmdEventType":"NET.IPRELEASE"}, cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0, result: null, initMsid: 198690281380792, completeMsid: null, lastUpdated: null, lastPolled: null, created: null, removed: null}
2024-05-02 14:32:51,362 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-35:ctx-1248a5f3 job-2782) (logid:bb1f9468) Executing AsyncJobVO: {id:2782, userId: 2, accountId: 2, instanceType: IpAddress, instanceId: 107, cmd: org.apache.cloudstack.api.command.user.address.DisassociateIPAddrCmd, cmdInfo: {"apiKey":"REDACTED","signature":"02Sf59NUDKw/Kg/ju1gg+1229mQ\u003d","response":"json","ctxUserId":"2","httpmethod":"GET","ctxStartEventId":"70161","id":"f1988f51-c9e5-46e4-9393-2a873b2610e0","ctxDetails":"{\"interface com.cloud.network.IpAddress\":\"f1988f51-c9e5-46e4-9393-2a873b2610e0\"}","ctxAccountId":"2","uuid":"f1988f51-c9e5-46e4-9393-2a873b2610e0","cmdEventType":"NET.IPRELEASE"}, cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0, result: null, initMsid: 198690281380792, completeMsid: null, lastUpdated: null, lastPolled: null, created: null, removed: null}
2024-05-02 14:32:51,363 DEBUG [c.c.a.ApiServlet] (qtp1750498848-6807:ctx-ac7656e2 ctx-07d762b2 ctx-c606fd19) (logid:ae5760e6) ===END=== X.Y.191.80 -- GET apiKey=REDACTED&command=disassociateIpAddress&id=f1988f51-c9e5-46e4-9393-2a873b2610e0&response=json&signature=02Sf59NUDKw%2FKg%2Fju1gg%2B1229mQ%3D
2024-05-02 14:33:07,514 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-33:ctx-7c0ac951 job-2780) (logid:94060328) Complete async job-2780, jobStatus: FAILED, resultCode: 530, result: org.apache.cloudstack.api.response.ExceptionResponse/null/{"uuidList":[],"errorcode":"530","errortext":"Failed to disassociate IP address"}
2024-05-02 14:32:41,747 DEBUG [c.c.a.ApiServlet] (qtp1750498848-6674:ctx-b0b01e22) (logid:ace12149) ===START=== X.Y.191.80 -- GET account=test-admin&apiKey=REDACTED&command=associateIpAddress&ipaddress=XX.YY.17.30&networkid=c6a4bc87-93cc-499c-8fdc-4e44b4d0fca0&projectid=05019f40-8552-4856-9cf1-02ae1b7e2621&response=json&zoneid=3adfd657-7e4e-4613-99da-49b3a0ef9db7&signature=pcavcaveiIHGjw%2FlWwFeH3mX63Q%3D
2024-05-02 14:32:41,805 DEBUG [c.c.a.ApiServlet] (qtp1750498848-6674:ctx-b0b01e22 ctx-083bb8ea ctx-1ad4ec66) (logid:ace12149) ===END=== X.Y.191.80 -- GET account=test-admin&apiKey=REDACTED&command=associateIpAddress&ipaddress=XX.YY.17.30&networkid=c6a4bc87-93cc-499c-8fdc-4e44b4d0fca0&projectid=05019f40-8552-4856-9cf1-02ae1b7e2621&response=json&zoneid=3adfd657-7e4e-4613-99da-49b3a0ef9db7&signature=pcavcaveiIHGjw%2FlWwFeH3mX63Q%3D
2024-05-02 14:32:41,837 DEBUG [c.c.n.IpAddressManagerImpl] (API-Job-Executor-30:ctx-4262738c job-2777 ctx-cd5a7ce3) (logid:eb86be5e) Successfully associated ip address XX.YY.17.30 to network Network {"id": 225, "name": "network-A", "uuid": "c6a4bc87-93cc-499c-8fdc-4e44b4d0fca0", "networkofferingid": 9}
2024-05-02 14:32:41,850 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-30:ctx-4262738c job-2777 ctx-cd5a7ce3) (logid:eb86be5e) Complete async job-2777, jobStatus: SUCCEEDED, resultCode: 0, result: org.apache.cloudstack.api.response.IPAddressResponse/ipaddress/{"id":"f1988f51-c9e5-46e4-9393-2a873b2610e0","ipaddress":"XX.YY.17.30","allocated":"2024-05-02T14:32:41+0200","zoneid":"3adfd657-7e4e-4613-99da-49b3a0ef9db7","zonename":"ZONE","issourcenat":"false","projectid":"05019f40-8552-4856-9cf1-02ae1b7e2621","project":"testcpu","domainid":"e8858164-61cb-47ae-b05e-21163381e5ff","domain":"RND","forvirtualnetwork":"true","vlanid":"5c4f2de6-4555-4f8c-8af2-f2fd325bf101","vlanname":"vlan://untagged","isstaticnat":"false","issystem":"false","associatednetworkid":"c6a4bc87-93cc-499c-8fdc-4e44b4d0fca0","associatednetworkname":"network-A","networkid":"08fb8c61-c810-487b-b931-1911a72a0f3e","state":"Allocated","physicalnetworkid":"cc445c30-6484-47e5-b13b-ffdd6554c02b","tags":[],"isportable":"false","fordisplay":"true","hasannotations":"false"}
2024-05-02 14:32:42,902 DEBUG [c.c.a.ApiServlet] (qtp1750498848-6814:ctx-826cfc0d) (logid:08064c3d) ===START=== X.Y.191.80 -- GET account=test-admin&apiKey=REDACTED&command=associateIpAddress&ipaddress=XX.YY.17.30&networkid=c6a4bc87-93cc-499c-8fdc-4e44b4d0fca0&projectid=05019f40-8552-4856-9cf1-02ae1b7e2621&response=json&zoneid=3adfd657-7e4e-4613-99da-49b3a0ef9db7&signature=pcavcaveiIHGjw%2FlWwFeH3mX63Q%3D
2024-05-02 14:32:42,932 DEBUG [c.c.a.ApiServlet] (qtp1750498848-6814:ctx-826cfc0d ctx-96b1e3fd ctx-d8a477c6) (logid:08064c3d) ===END=== X.Y.191.80 -- GET account=test-admin&apiKey=REDACTED&command=associateIpAddress&ipaddress=XX.YY.17.30&networkid=c6a4bc87-93cc-499c-8fdc-4e44b4d0fca0&projectid=05019f40-8552-4856-9cf1-02ae1b7e2621&response=json&zoneid=3adfd657-7e4e-4613-99da-49b3a0ef9db7&signature=pcavcaveiIHGjw%2FlWwFeH3mX63Q%3D
2024-05-02 14:32:48,278 DEBUG [c.c.a.ApiServlet] (qtp1750498848-6705:ctx-3cafb699) (logid:b65f7570) ===START=== X.Y.191.80 -- GET apiKey=REDACTED&command=disassociateIpAddress&id=f1988f51-c9e5-46e4-9393-2a873b2610e0&response=json&signature=02Sf59NUDKw%2FKg%2Fju1gg%2B1229mQ%3D
2024-05-02 14:32:48,301 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (qtp1750498848-6705:ctx-3cafb699 ctx-59398c9e ctx-8f5a2116) (logid:b65f7570) submit async job-2780, details: AsyncJobVO: {id:2780, userId: 2, accountId: 2, instanceType: IpAddress, instanceId: 107, cmd: org.apache.cloudstack.api.command.user.address.DisassociateIPAddrCmd, cmdInfo: {"apiKey":"REDACTED","signature":"02Sf59NUDKw/Kg/ju1gg+1229mQ\u003d","response":"json","ctxUserId":"2","httpmethod":"GET","ctxStartEventId":"70154","id":"f1988f51-c9e5-46e4-9393-2a873b2610e0","ctxDetails":"{\"interface com.cloud.network.IpAddress\":\"f1988f51-c9e5-46e4-9393-2a873b2610e0\"}","ctxAccountId":"2","uuid":"f1988f51-c9e5-46e4-9393-2a873b2610e0","cmdEventType":"NET.IPRELEASE"}, cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0, result: null, initMsid: 198690281380792, completeMsid: null, lastUpdated: null, lastPolled: null, created: null, removed: null}
2024-05-02 14:32:48,301 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-33:ctx-7c0ac951 job-2780) (logid:94060328) Executing AsyncJobVO: {id:2780, userId: 2, accountId: 2, instanceType: IpAddress, instanceId: 107, cmd: org.apache.cloudstack.api.command.user.address.DisassociateIPAddrCmd, cmdInfo: {"apiKey":"REDACTED","signature":"02Sf59NUDKw/Kg/ju1gg+1229mQ\u003d","response":"json","ctxUserId":"2","httpmethod":"GET","ctxStartEventId":"70154","id":"f1988f51-c9e5-46e4-9393-2a873b2610e0","ctxDetails":"{\"interface com.cloud.network.IpAddress\":\"f1988f51-c9e5-46e4-9393-2a873b2610e0\"}","ctxAccountId":"2","uuid":"f1988f51-c9e5-46e4-9393-2a873b2610e0","cmdEventType":"NET.IPRELEASE"}, cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0, result: null, initMsid: 198690281380792, completeMsid: null, lastUpdated: null, lastPolled: null, created: null, removed: null}
2024-05-02 14:32:48,302 DEBUG [c.c.a.ApiServlet] (qtp1750498848-6705:ctx-3cafb699 ctx-59398c9e ctx-8f5a2116) (logid:b65f7570) ===END=== X.Y.191.80 -- GET apiKey=REDACTED&command=disassociateIpAddress&id=f1988f51-c9e5-46e4-9393-2a873b2610e0&response=json&signature=02Sf59NUDKw%2FKg%2Fju1gg%2B1229mQ%3D
2024-05-02 14:32:51,347 DEBUG [c.c.a.ApiServlet] (qtp1750498848-6807:ctx-ac7656e2) (logid:ae5760e6) ===START=== X.Y.191.80 -- GET apiKey=REDACTED&command=disassociateIpAddress&id=f1988f51-c9e5-46e4-9393-2a873b2610e0&response=json&signature=02Sf59NUDKw%2FKg%2Fju1gg%2B1229mQ%3D
2024-05-02 14:32:51,362 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (qtp1750498848-6807:ctx-ac7656e2 ctx-07d762b2 ctx-c606fd19) (logid:ae5760e6) submit async job-2782, details: AsyncJobVO: {id:2782, userId: 2, accountId: 2, instanceType: IpAddress, instanceId: 107, cmd: org.apache.cloudstack.api.command.user.address.DisassociateIPAddrCmd, cmdInfo: {"apiKey":"REDACTED","signature":"02Sf59NUDKw/Kg/ju1gg+1229mQ\u003d","response":"json","ctxUserId":"2","httpmethod":"GET","ctxStartEventId":"70161","id":"f1988f51-c9e5-46e4-9393-2a873b2610e0","ctxDetails":"{\"interface com.cloud.network.IpAddress\":\"f1988f51-c9e5-46e4-9393-2a873b2610e0\"}","ctxAccountId":"2","uuid":"f1988f51-c9e5-46e4-9393-2a873b2610e0","cmdEventType":"NET.IPRELEASE"}, cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0, result: null, initMsid: 198690281380792, completeMsid: null, lastUpdated: null, lastPolled: null, created: null, removed: null}
2024-05-02 14:32:51,362 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-35:ctx-1248a5f3 job-2782) (logid:bb1f9468) Executing AsyncJobVO: {id:2782, userId: 2, accountId: 2, instanceType: IpAddress, instanceId: 107, cmd: org.apache.cloudstack.api.command.user.address.DisassociateIPAddrCmd, cmdInfo: {"apiKey":"REDACTED","signature":"02Sf59NUDKw/Kg/ju1gg+1229mQ\u003d","response":"json","ctxUserId":"2","httpmethod":"GET","ctxStartEventId":"70161","id":"f1988f51-c9e5-46e4-9393-2a873b2610e0","ctxDetails":"{\"interface com.cloud.network.IpAddress\":\"f1988f51-c9e5-46e4-9393-2a873b2610e0\"}","ctxAccountId":"2","uuid":"f1988f51-c9e5-46e4-9393-2a873b2610e0","cmdEventType":"NET.IPRELEASE"}, cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0, result: null, initMsid: 198690281380792, completeMsid: null, lastUpdated: null, lastPolled: null, created: null, removed: null}
2024-05-02 14:32:51,363 DEBUG [c.c.a.ApiServlet] (qtp1750498848-6807:ctx-ac7656e2 ctx-07d762b2 ctx-c606fd19) (logid:ae5760e6) ===END=== X.Y.191.80 -- GET apiKey=REDACTED&command=disassociateIpAddress&id=f1988f51-c9e5-46e4-9393-2a873b2610e0&response=json&signature=02Sf59NUDKw%2FKg%2Fju1gg%2B1229mQ%3D
@rohityadavcloud do you have any fix of this in the changelog from 4.18 or 4.19 ? Because if it's not the case, I don't see the point you mentioned at all
We will reproduce this behavior on a 4.19 if need in due time, but for now we have 4 or 5 platform in 4.17 who can't be migrate because 4.18 didn't passed our testing scenario yet : each time we had regression
Best
We haven't any java development capability : we only have Python developer in our company. But I'm open to any proposal to fix this very annoying issue.
@vdombrovski @bragonznx after some hours, I am able to reproduce the issue twice
it looks like the issue is caused by concurrent actions
The action failed but the IP is marked as Free, but actually it is still configured in the VRs
To confirm it, can you check the management server log
egrep 'Failed to release | Unable to revoke ' /var/log/cloudstack/management/management-server.log
root@mgt-01:~# zgrep 'Failed to release | Unable to revoke ' /var/log/cloudstack/management/management-server.log.*
root@mgt-01:~#
root@mgt-01:~# egrep 'Failed to release | Unable to revoke ' /var/log/cloudstack/management/management-server.log
root@mgt-01:~#
zgrep 'Failed to release | Unable to revoke ' /var/log/cloudstack/management/management-server.log.*
@bragonznx
can you use zegrep
instead ?
zegrep 'Failed to release | Unable to revoke ' /var/log/cloudstack/management/management-server.log.*
root@mgt-01:~# zegrep 'Failed to release | Unable to revoke ' /var/log/cloudstack/management/management-server.log.*
root@mgt-01:~#
no matter the kind of grep you want to use, there is no occurrence in our production logs.
root@mgt-01:~# zegrep 'Failed to release | Unable to revoke ' /var/log/cloudstack/management/management-server.log.*
root@mgt-01:~#
no matter the kind of grep you want to use, there is no occurrence in our production logs.
@bragonznx if you have multiple management servers, can you check the logs on other management servers ?
zegrep 'Failed to release | Unable to revoke ' /var/log/cloudstack/management/management-server.log.*
is the pipe symbol within the single quotes a problem in this command?
maybe try
zegrep 'Failed to release.*Unable to revoke' /var/log/cloudstack/management/management-server.log.*
instead?
root@mgt-01:~# zegrep 'Failed to release.Unable to revoke' /var/log/cloudstack/management/management-server.log.
root@mgt-01:~#
root@mgt-01:~# zegrep 'Failed to release.Unable to revoke' /var/log/cloudstack/management/management-server.log.
root@mgt-01:~#
@bragonznx Can you check other management servers if any?
It's on management server where @vdombrovski did all the tests, so it should be here....
It's on management server where @vdombrovski did all the tests, so it should be here....
@bragonznx it would be good to upload the management server logs. if not possible, can you please search Exception/error in the logs ?
zegrep -i "exception|error" /var/log/cloudstack/management/management-server.log.*
What do you want to proof ? Because we have both got the issue of this IP conflicts issue. I can't spend my time digging in Exception/error, I have too much of them.
If you need me to copy past millions of "/var/log/cloudstack/management/management-server.log.2024-01-05.gz:2024-01-05 02:55:12,904 ERROR [c.c.s.StatsCollector] (StatsCollector-5:ctx-24d81e3e) (logid:81fce2ab)" you have to explain me the value, because for me it's around 0. We don't have to proof again the issue, everyone in the discussion know that it's real, and the issue is here, Vladimir replicated it several times, and you managed to do it on your side too.
What do you want to proof ? Because we have both got the issue of this IP conflicts issue. I can't spend my time digging in Exception/error, I have too much of them.
If you need me to copy past millions of "/var/log/cloudstack/management/management-server.log.2024-01-05.gz:2024-01-05 02:55:12,904 ERROR [c.c.s.StatsCollector] (StatsCollector-5:ctx-24d81e3e) (logid:81fce2ab)" you have to explain me the value, because for me it's around 0. We don't have to proof again the issue, everyone in the discussion know that it's real, and the issue is here, Vladimir replicated it several times, and you managed to do it on your side too.
@bragonznx
yes, I have reproduced the issue.
However, it looks like the root cause is different from yours, because I saw some errors like
Failed to release resources for ip address
and Unable to revoke all the firewall rules for ip
sometimes in my log, but you did not find it in your logs.
I did not expect you dig into the management server logs. I just requested you to share some management server log as the logs might give some clue what the root cause is.
It is up to you upload or not upload the management server logs (of the days when @vdombrovski tested).
What do you want to proof ?
@bragonznx , if you really want to provide "proof", it would need to be a fix. This is open source and Wei is trying to figure it out , out of interest.
Because we have both got the issue of this IP conflicts issue. I can't spend my time digging in Exception/error, I have too much of them.
Neither can anybody else spend the time for you, if you are not willing.
If you need me to copy past millions of "/var/log/cloudstack/management/management-server.log.2024-01-05.gz:2024-01-05 02:55:12,904 ERROR [c.c.s.StatsCollector] (StatsCollector-5:ctx-24d81e3e) (logid:81fce2ab)" you have to explain me the value, because for me it's around 0. We don't have to proof again the issue, everyone in the discussion know that it's real, and the issue is here, Vladimir replicated it several times, and you managed to do it on your side too.
Nobody is obliged to care about your issue either! Again, this is open source and interested people are willing to help you solve it, but please pay them respect.
"Nobody is obligated to care about your issue. However, in open source, there are individuals who are genuinely interested in helping you resolve it. It's important to show them respect."
Can you point out where we have been disrespectful? I personally spent a lot of time digging within overall all our cloudstack platform to give you the fact that the logs are full with nothing you are looking about ! How it is isrespectful, can you highlight me ?
We have provided all the necessary materials, evidence, scripts, and steps to reproduce the issue across multiple tickets.
Both of us rely heavily on CloudStack, an "open-source project", to generate revenue through our company's services. I am capable of changing projects if necessary; after all, we use libvirt extensively, and I'm well-versed in it. So, we as a company aren't stuck. I can pivot in just two minutes and chart a new roadmap to go out.
On your end, you offer CloudStack development services and expertise, but there's an issue online that portrays you in a negative light. It appears as if you're dismissing the efforts of those who are trying to highlight design flaws within CloudStack. Again we aren't developpers company we are a cloud company and you provide a cloud solution, so when a cloud provider with many platform in production give you a lot of evidences of miss behavior, keep in mind that we aren't doing this on our week end !
How do you think the community will perceive this ?
Your behavior could be seen as belittling users, especially regarding a major asynchronous design flaw. Whenever someone searches for a solution to a public IP issue, they will come across this problem, and it will be widely circulated.
We're willing to fund a patch, but let's refrain from accusing others of disrespect. Using the open-source flag as a shield here is risky. You have my contact if you need to reach me, we already bought professional services ...
Best regards without any disrespect !
Hello @bragonznx,
Thank you and @vdombrovski for reporting the situation and providing solid and concrete information for us to reproduce the issue; and I am sorry that you guys are not having a good experience while reporting an issue.
@hsato03 and I were able to reproduce the situation and we are now analyzing the workflows to fix it. We hope to open a PR shortly to share our findings and tests
@hsato03 and I were able to reproduce the situation and we are now analyzing the workflows to fix it. We hope to open a PR shortly to share our findings and tests
@GutoVeronezi can you share the details of the issue you encountered ?
if you get the error Failed to release
and Unable to revoke
in the management server log, the issue you faced might be same as what I reproduced, but not same as the reporter faced.
Hello @weizhouapache, sorry for the late reply.
I ran the script again to fetch your logs, I can confirm that:
For example, I'm getting these logs:
tail -f /var/log/cloudstack/management/management-server.log | grep -E "Failed to release|Unable to revoke"
2024-05-13 11:35:04,879 WARN [c.c.n.IpAddressManagerImpl] (API-Job-Executor-87:ctx-e73f179d job-2834 ctx-d655a156) (logid:f8ef96fd) Unable to revoke all the firewall rules for ip id=107 as a part of ip release
2024-05-13 11:35:22,448 WARN [c.c.n.IpAddressManagerImpl] (API-Job-Executor-87:ctx-e73f179d job-2834 ctx-d655a156) (logid:f8ef96fd) Failed to release resources for ip address id=107
2024-05-13 11:35:22,483 WARN [c.c.n.NetworkServiceImpl] (API-Job-Executor-87:ctx-e73f179d job-2834 ctx-d655a156) (logid:f8ef96fd) Failed to release public ip address id=107
2024-05-13 11:37:15,709 WARN [c.c.n.IpAddressManagerImpl] (API-Job-Executor-112:ctx-95282fbe job-2859 ctx-4f1f38f7) (logid:ee6bbf3d) Unable to revoke all the firewall rules for ip id=107 as a part of ip release
2024-05-13 11:37:33,802 WARN [c.c.n.IpAddressManagerImpl] (API-Job-Executor-112:ctx-95282fbe job-2859 ctx-4f1f38f7) (logid:ee6bbf3d) Failed to release resources for ip address id=107
2024-05-13 11:37:33,841 WARN [c.c.n.NetworkServiceImpl] (API-Job-Executor-112:ctx-95282fbe job-2859 ctx-4f1f38f7) (logid:ee6bbf3d) Failed to release public ip address id=107
2024-05-13 11:38:43,386 WARN [c.c.n.IpAddressManagerImpl] (API-Job-Executor-4:ctx-a99870ee job-2876 ctx-15237616) (logid:5632f157) Unable to revoke all the firewall rules for ip id=107 as a part of ip release
2024-05-13 11:38:53,873 WARN [c.c.n.IpAddressManagerImpl] (API-Job-Executor-4:ctx-a99870ee job-2876 ctx-15237616) (logid:5632f157) Failed to release resources for ip address id=107
2024-05-13 11:39:00,601 WARN [c.c.n.NetworkServiceImpl] (API-Job-Executor-4:ctx-a99870ee job-2876 ctx-15237616) (logid:5632f157) Failed to release public ip address id=107
2024-05-13 11:41:42,867 WARN [c.c.n.IpAddressManagerImpl] (API-Job-Executor-41:ctx-06acb0e1 job-2913 ctx-63672a38) (logid:477fad2b) Unable to revoke all the firewall rules for ip id=107 as a part of ip release
2024-05-13 11:42:00,821 WARN [c.c.n.IpAddressManagerImpl] (API-Job-Executor-41:ctx-06acb0e1 job-2913 ctx-63672a38) (logid:477fad2b) Failed to release resources for ip address id=107
2024-05-13 11:42:00,848 WARN [c.c.n.NetworkServiceImpl] (API-Job-Executor-41:ctx-06acb0e1 job-2913 ctx-63672a38) (logid:477fad2b) Failed to release public ip address id=107
However, there is no IP conflict introduced when this log appears (arping returns a single MAC address).
Similarly, when the IP conflict appears, this log is not present at all (at least in the 3-4 last iterations which lead to the conflict in my tests).
I think there are 2 separate issues here:
Hopefully these details help this investigation.
thanks @vdombrovski same as suspected, I think your issue is different as what I faced in my testing (should be fixed by #9059).
@hsato03 are you working on this? I see you assigned but also an open PR by @weizhouapache : #9059
@hsato03 are you working on this? I see you assigned but also an open PR by @weizhouapache : #9059
@DaanHoogland The issue I fixed in #9059 is not exactly the same as @vdombrovski reported.
@hsato03 @GutoVeronezi said they have reproduced the issue, let's wait for their PR.
Hello @bragonznx @vdombrovski
We reproduced the issue using the script you provided in this discussion and also discovered that this situation also happens when creating a remote access VPN. The situation we found was happening due to race conditions, as you guys stated in the issue. We created PR #9234 to improve some IP-related APIs that were prone to race conditions. Could you guys check and try the PR? Also, if you find anything else or your situation happens to be different, just let me know and I will gladly try to help you.
@bragonznx @vdombrovski
We also discovered that the public IP can be duplicated when different VRs have the same control IP. Could you check the database to see that there are no duplicate control IPs on your VRs?
Hi @vdombrovski Can you test the PR #9234 & confirm if the IP conflicts / race conditions are resolved or not.
Hello @sureshanaparti, I would love to test this patch, however I can't apply it against the 4.17.2.0 branch:
git apply 9234.patch
error: patch failed: server/src/main/java/com/cloud/network/IpAddressManagerImpl.java:725
error: server/src/main/java/com/cloud/network/IpAddressManagerImpl.java: patch does not apply
error: patch failed: server/src/main/java/com/cloud/network/lb/LoadBalancingRulesManagerImpl.java:1814
error: server/src/main/java/com/cloud/network/lb/LoadBalancingRulesManagerImpl.java: patch does not apply
Like described before, I do not currently have a running 4.19 Cloudstack setup because of migration issues, so my tests are limited to the 4.17 release
Thanks @vdombrovski, Can you review the PR (and check if it covers the IP conflict fixes).
Hello @vdombrovski,
In order for you to test, I made the backport of #9234 to the 4.17.2 version in my repository. You can clone it by using the following command:
git clone -b 4.17.2-fix-ip-race-conditions git@github.com:hsato03/cloudstack.git
Hello @hsato03,
Thank you for the backport, I've tested the patch, this looks good, no more IP conflicts detected, and all rules still apply properly as before (as much as possible at least). I definitely see some NPEs now though, in case this is something you are trying to prevent, see examples below (might be out of scope for this issue).
Anyway thank you for the amazing work on this subject.
2024-06-27 17:23:09,634 WARN [o.a.c.a.c.u.l.CreateLoadBalancerRuleCmd] (API-Job-Executor-98:ctx-caca6b4e job-3390 ctx-cbaca4cc) (logid:518a1188) Failed to create LB rule due to exception
java.lang.NullPointerException
at com.cloud.network.firewall.FirewallManagerImpl.applyRules(FirewallManagerImpl.java:594)
at com.cloud.network.firewall.FirewallManagerImpl.applyFirewallRules(FirewallManagerImpl.java:700)
at com.cloud.network.firewall.FirewallManagerImpl.applyIngressFirewallRules(FirewallManagerImpl.java:671)
at jdk.internal.reflect.GeneratedMethodAccessor227.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344)
at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:198)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:97)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:215)
at com.sun.proxy.$Proxy258.applyIngressFirewallRules(Unknown Source)
at org.apache.cloudstack.api.command.user.loadbalancer.CreateLoadBalancerRuleCmd.execute(CreateLoadBalancerRuleCmd.java:280)
at com.cloud.api.ApiDispatcher.dispatch(ApiDispatcher.java:163)
at com.cloud.api.ApiAsyncJobDispatcher.runJob(ApiAsyncJobDispatcher.java:106)
at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:620)
at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:48)
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:55)
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:102)
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:52)
at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:45)
at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.run(AsyncJobManagerImpl.java:568)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
and
2024-06-27 17:23:13,637 ERROR [c.c.a.ApiAsyncJobDispatcher] (API-Job-Executor-99:ctx-f17bb0ec job-3391) (logid:61dae7ad) Unexpected exception while executing org.apache.cloudstack.api.command.user.address.DisassociateIPAddrCmd
java.lang.NullPointerException
at com.cloud.network.lb.LoadBalancingRulesManagerImpl.deleteLoadBalancerRule(LoadBalancingRulesManagerImpl.java:1484)
at com.cloud.network.lb.LoadBalancingRulesManagerImpl.removeAllLoadBalanacersForIp(LoadBalancingRulesManagerImpl.java:2031)
at jdk.internal.reflect.GeneratedMethodAccessor231.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344)
at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:198)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:97)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:215)
at com.sun.proxy.$Proxy178.removeAllLoadBalanacersForIp(Unknown Source)
at com.cloud.network.IpAddressManagerImpl.cleanupIpResources(IpAddressManagerImpl.java:682)
at com.cloud.network.IpAddressManagerImpl.disassociatePublicIpAddress(IpAddressManagerImpl.java:716)
at com.cloud.network.NetworkServiceImpl.releaseIpAddressInternal(NetworkServiceImpl.java:1119)
at com.cloud.network.NetworkServiceImpl.releaseIpAddress(NetworkServiceImpl.java:1063)
at jdk.internal.reflect.GeneratedMethodAccessor282.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344)
at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:198)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
at org.apache.cloudstack.network.contrail.management.EventUtils$EventInterceptor.invoke(EventUtils.java:107)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:175)
at com.cloud.event.ActionEventInterceptor.invoke(ActionEventInterceptor.java:52)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:175)
at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:97)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:215)
at com.sun.proxy.$Proxy170.releaseIpAddress(Unknown Source)
at org.apache.cloudstack.api.command.user.address.DisassociateIPAddrCmd.execute(DisassociateIPAddrCmd.java:80)
at com.cloud.api.ApiDispatcher.dispatch(ApiDispatcher.java:163)
at com.cloud.api.ApiAsyncJobDispatcher.runJob(ApiAsyncJobDispatcher.java:106)
at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:620)
at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:48)
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:55)
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:102)
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:52)
at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:45)
at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.run(AsyncJobManagerImpl.java:568)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
@vdombrovski Thank you for testing and pointing that out. I will address these NPEs in #9234.
I'm glad this patch helped you.
Thanks @vdombrovski @hsato03. Fixed in #9234.
ISSUE TYPE
COMPONENT NAME
CLOUDSTACK VERSION
Note: I believe this impacts releases 4.18 and 4.19 also as no relevant code changes have been made to VR code
CONFIGURATION
OS / ENVIRONMENT
Ubuntu Focal 20.04
SUMMARY
Our workflow makes it so public IPs can be assigned/unassigned/reassigned to different networks, possibly multiple times per day (mostly, but not only via https://github.com/apache/cloudstack-kubernetes-provider)
After a while, we started seeing that many unused IP addresses (according to ACS database) were actually answering to L2 ARPing. In parallel, some of our IPs are getting 2 different MACs from the ping:
Our investigation has shown these conflicts on around 10 ips in two of our /24 subnets, which looks too much to be a conicidence.
We have confirmed that both MAC addresses belong to alive and healthy virtual routers:
These routers both correctly mount the IP addresses:
Inside the "databag" of the so called "illegitimate" router causing the conflict, we can see the IP declared (sometimes with flag add: true, sometimes with add: false):
To my understanding, this file is read/merged whenever a new IP state change is requested; however it's not merged properly on unassignment.
Obviously, deleting/recreating all the virtual routers would solve the issue as this cache would be cleared; however because this causes massive disruptions on the service we would like to make sure the root cause is fixed first.
STEPS TO REPRODUCE
No specific command line here, but I think trying to bounce a public IP between 2 guest routers (ofc inside separate VPCs if these are VPC networks) several time would results in this issue.
EXPECTED RESULTS
The IP is properly garbage collected on the router that doesn't hold it anymore
ACTUAL RESULTS
IP address conflcit