apache / cloudstack

Apache CloudStack is an opensource Infrastructure as a Service (IaaS) cloud computing platform
https://cloudstack.apache.org/
Apache License 2.0
2.1k stars 1.11k forks source link

IP address conflicts held by public IPs of virtual routers #8967

Closed vdombrovski closed 4 months ago

vdombrovski commented 6 months ago
ISSUE TYPE
COMPONENT NAME
VR
CLOUDSTACK VERSION
4.17.2

Note: I believe this impacts releases 4.18 and 4.19 also as no relevant code changes have been made to VR code

CONFIGURATION
OS / ENVIRONMENT

Ubuntu Focal 20.04

SUMMARY

Our workflow makes it so public IPs can be assigned/unassigned/reassigned to different networks, possibly multiple times per day (mostly, but not only via https://github.com/apache/cloudstack-kubernetes-provider)

After a while, we started seeing that many unused IP addresses (according to ACS database) were actually answering to L2 ARPing. In parallel, some of our IPs are getting 2 different MACs from the ping:

ARPING XX.YY.25.156
60 bytes from 1e:00:fe:00:04:ac (XX.YY.25.156): index=0 time=306.714 usec
60 bytes from 1e:00:63:00:04:b5 (XX.YY.25.156): index=1 time=342.809 usec

Our investigation has shown these conflicts on around 10 ips in two of our /24 subnets, which looks too much to be a conicidence.

We have confirmed that both MAC addresses belong to alive and healthy virtual routers:

 [cloud]> select instance_id from nics where mac_address="1e:00:63:00:04:b5";
+-------------+
| instance_id |
+-------------+
|        5654 |
+-------------+
1 row in set (0.010 sec)

[cloud]> select instance_id from nics where mac_address="1e:00:fe:00:04:ac";
+-------------+
| instance_id |
+-------------+
|        5624 |
+-------------+
1 row in set (0.007 sec)

These routers both correctly mount the IP addresses:

ip a | grep XX.YY.25.156
    inet XX.YY.25.156/24 brd XX.YY.25.255 scope global secondary eth2

Inside the "databag" of the so called "illegitimate" router causing the conflict, we can see the IP declared (sometimes with flag add: true, sometimes with add: false):

# /etc/cloudstack/ips.json
    {
      "add": true,
      "broadcast": "XX.YY.25.255",
      "cidr": "XX.YY.25.156/24",
      "device": "eth2",
      "first_i_p": false,
      "gateway": "XX.YY.25.1",
      "is_private_gateway": false,
      "netmask": "255.255.255.0",
      "network": "XX.YY.25.0/24",
      "new_nic": false,
      "nic_dev_id": 2,
      "nw_type": "public",
      "one_to_one_nat": false,
      "public_ip": "XX.YY.25.156",
      "size": "24",
      "source_nat": false,
      "vif_mac_address": "1e:00:63:00:04:b5"
    }

To my understanding, this file is read/merged whenever a new IP state change is requested; however it's not merged properly on unassignment.

Obviously, deleting/recreating all the virtual routers would solve the issue as this cache would be cleared; however because this causes massive disruptions on the service we would like to make sure the root cause is fixed first.

STEPS TO REPRODUCE

No specific command line here, but I think trying to bounce a public IP between 2 guest routers (ofc inside separate VPCs if these are VPC networks) several time would results in this issue.

  1. Create 2 guest networks A and B
  2. Create 2 VMS, one inside network A, the other inside network B
  3. Associate an IP address to network A, and create a loadbalancer rule for some port of VM A (should also work with port-forwarding and static NAT)
  4. Release the IP, associate it to network B, and create a loadbalancer rule for some port of VM A (should also work with port-forwarding and static NAT)
  5. Repeat steps 3-4 until you get an IP address conflict (or until you see invalid entries in the databag)
EXPECTED RESULTS

The IP is properly garbage collected on the router that doesn't hold it anymore

ACTUAL RESULTS

IP address conflcit

DaanHoogland commented 6 months ago

thanks @vdombrovski , this will (obviously) be hard to reproduce in a lab environment. the true/false flag of the databag should ensure that IPs are removed, so it not being merged properly on unassigment is the issue as you already stated.

Do you have further logging from the VR that would help identifying the culprit?

(I'll try it in a lab env, but have little hope it will be reliably reproducible)

vdombrovski commented 6 months ago

@DaanHoogland not sure if this helps, in /var/log/cloud.log I'm getting the same log repeated for all IPs:

2024-04-04 04:43:29,474 INFO     Add: rule=-I VPN_XX.YY.25.156 -m state --state RELATED,ESTABLISHED -j ACCEPT table=mangle
2024-04-04 04:43:29,475 INFO     Executing: iptables -t mangle -I VPN_XX.YY.25.156 -m state --state RELATED,ESTABLISHED -j ACCEPT
2024-04-04 04:43:30,158 INFO     Executing: iptables -t mangle -D VPN_XX.YY.25.156 -m state --state RELATED,ESTABLISHED -j ACCEPT
2024-04-04 04:43:30,162 INFO     Delete rule -D VPN_XX.YY.25.156 -m state --state RELATED,ESTABLISHED -j ACCEPT from table mangle
2024-04-04 04:43:36,448 INFO     Address found in DataBag ==> {u'public_ip': u'XX.YY.25.156', u'one_to_one_nat': False, u'nic_dev_id': 2, u'network': u'XX.YY.25.0/24', u'first_i_p': False, u'netmask': u'255.255.255.0', u'source_nat': False, u'broadcast': u'XX.YY.25.255', u'add': True, u'is_private_gateway': False, u'nw_type': u'public', u'device': u'eth2', u'vif_mac_address': u'1e:00:63:00:04:b5', u'cidr': u'XX.YY.25.156/24', u'gateway': u'XX.YY.25.1', u'new_nic': False, u'size': u'24'}
2024-04-04 04:43:36,448 INFO     Address XX.YY.25.156/24 on device eth2 already configured
2024-04-04 04:43:36,461 INFO     Executing: arping -c 1 -I eth2 -A -U -s XX.YY.25.156 XX.YY.25.1

Regarding DataBag logs specifically (again in /var/log/cloud.log), on some instances we are getting this WARNING, without any other explanation:

2024-04-23 13:11:27,089 WARNING  Unable to find and process databag for file: ip_associations.json.a81b47b8-2a91-42a2-a529-1de42849049c, for json type=ip_associations

I will setup a "brute-force" type test case that mirrors what I've described in "Steps to reproduce" using either CMK or Terraform; I'll let you know if I can get it to reproduce.

DaanHoogland commented 6 months ago

oh, that,

2024-04-23 13:11:27,089 WARNING  Unable to find and process databag for file: ip_associations.json.a81b47b8-2a91-42a2-a529-1de42849049c, for json type=ip_associations

, sounds like a weird timing issue, as if two merge deamons are tryimg to process the file at the same time and one has already moved it...???

and it seems here,

2024-04-04 04:43:30,162 INFO     Delete rule -D VPN_XX.YY.25.156 -m state --state RELATED,ESTABLISHED -j ACCEPT from table mangle
2024-04-04 04:43:36,448 INFO     Address found in DataBag ==> {u'public_ip': u'XX.YY.25.156', u'one_to_one_nat': False, u'nic_dev_id': 2, u'network': u'XX.YY.25.0/24', u'first_i_p': False, u'netmask': u'255.255.255.0', u'source_nat': False, u'broadcast': u'XX.YY.25.255', u'add': True, u'is_private_gateway': False, u'nw_type': u'public', u'device': u'eth2', u'vif_mac_address': u'1e:00:63:00:04:b5', u'cidr': u'XX.YY.25.156/24', u'gateway': u'XX.YY.25.1', u'new_nic': False, u'size': u'24'}

, that almost 0.3 second the after the deletion it is added again.

weizhouapache commented 6 months ago

did a quick test, everything looks good this may need intensive testing to reproduce the issue

vdombrovski commented 6 months ago

@DaanHoogland here is how I successfully reproduced the issue on our QA platform:

56 bytes from 1e:00:4d:00:00:c1 (XX.YY.17.30): index=959 time=275.810 usec
56 bytes from 1e:00:6b:00:00:c2 (XX.YY.17.30): index=960 time=337.038 usec

You were right, this is a timing issue. Here is the raw cmk code to reproduce:

Terminal 1:

while true; do bash ./test1.sh; done

Terminal 2:

while true; do bash ./test2.sh; done
#test1.sh

cmk associate ipaddress projectid=05019f40-8552-4856-9cf1-02ae1b7e2621 zoneid=3adfd657-7e4e-4613-99da-49b3a0ef9db7 ipaddress=XX.YY.17.30 account=qa networkid=c6a4bc87-93cc-499c-8fdc-4e44b4d0fca0 # Network A

cmk createLoadBalancerRule algorithm=roundrobin publicipid=f1988f51-c9e5-46e4-9393-2a873b2610e0 publicport=22 privateport=22 name=vm-a

cmk disassociate ipaddress id=f1988f51-c9e5-46e4-9393-2a873b2610e0

cmk associate ipaddress projectid=05019f40-8552-4856-9cf1-02ae1b7e2621 zoneid=3adfd657-7e4e-4613-99da-49b3a0ef9db7 ipaddress=XX.YY.17.30 account=qa networkid=7e2aa0f1-c536-4162-be43-aea277ae80aa # network B

cmk createLoadBalancerRule algorithm=roundrobin publicipid=f1988f51-c9e5-46e4-9393-2a873b2610e0 publicport=22 privateport=22 name=vm-b

cmk disassociate ipaddress id=f1988f51-c9e5-46e4-9393-2a873b2610e0
#test2.sh

cmk associate ipaddress projectid=05019f40-8552-4856-9cf1-02ae1b7e2621 zoneid=3adfd657-7e4e-4613-99da-49b3a0ef9db7 ipaddress=XX.YY.17.30 account=qa networkid=c6a4bc87-93cc-499c-8fdc-4e44b4d0fca0 # Network A

cmk createLoadBalancerRule algorithm=roundrobin publicipid=f1988f51-c9e5-46e4-9393-2a873b2610e0 publicport=23 privateport=23 name=vm-a

cmk createLoadBalancerRule algorithm=roundrobin publicipid=f1988f51-c9e5-46e4-9393-2a873b2610e0 publicport=24 privateport=24 name=vm-a

cmk disassociate ipaddress id=f1988f51-c9e5-46e4-9393-2a873b2610e0

cmk associate ipaddress projectid=05019f40-8552-4856-9cf1-02ae1b7e2621 zoneid=3adfd657-7e4e-4613-99da-49b3a0ef9db7 ipaddress=XX.YY.17.30 account=qa networkid=7e2aa0f1-c536-4162-be43-aea277ae80aa # network B

cmk createLoadBalancerRule algorithm=roundrobin publicipid=f1988f51-c9e5-46e4-9393-2a873b2610e0 publicport=23 privateport=23 name=vm-b

cmk createLoadBalancerRule algorithm=roundrobin publicipid=f1988f51-c9e5-46e4-9393-2a873b2610e0 publicport=24 privateport=24 name=vm-b

cmk disassociate ipaddress id=f1988f51-c9e5-46e4-9393-2a873b2610e0

Now I understand that handling concurrency isn't really a priority for this project (see another issue from last year: https://github.com/apache/cloudstack/issues/7907)

However, with the introduction of more vectors of automation (Terraform, Kubernetes), I believe that these types of issues might become really common. In our case, the IP conflicts have impacted more than 10 IPs total on a pool of 400 IPs, which is 2.5%.

bragonznx commented 6 months ago

Hi

We need that to be fix, how can we help ?

Best

DaanHoogland commented 6 months ago

Don't be putting the project down like that, concurrency is one of the main pillars of the project. We have not always been too good at it (as a project), admitted. Manpower vs priorities is the issue here, i think.

Thanks for the investigation @vdombrovski ! Can you do a test on 4.19 , or preferably main as well ?

@bragonznx , code investigation, design, PR, testing (big demand on that) are all needed. What can you offer ?

cheers guys.

vdombrovski commented 6 months ago

@DaanHoogland as I said, this doesn't seem to be a priority at the current time, even though it might be one of the main pillars of the project. I'm not judging here, I'm simply stating the facts based on the change logs. I also know that people working on this project have put a lot of effort recently into enabling alternative L3-L7 solutions that might be more robust than the virtual router (such as the support of VNF appliance and Tungsten Fabric's successor), which might in term enable production-grade uses without the use of proprietary components.

Regarding testing the new release, I am not able to perform these tests on other versions as we don't have a QA platform in those releases (we have put updates on hold following all the migration issues that were reported by people on the users mailing list). We are currently waiting for the 4.18.2.0 to be released as our next upgrade candidate. Maybe is there a particular commit / diff / change from the tree that catches your eye that would help in solving this issue?

In the meantime, what could prevent this bug from occurring in a lab environment? Maybe this is something that a concurrent test case might catch?

DaanHoogland commented 6 months ago

We are currently waiting for the 4.18.2.0 to be released as our next upgrade candidate.

it has been released ;)

As for how to prevent it, I have no ready answer. When creating the situation at fault, does cloudstack in its DB have any proof of the double assignment?

vdombrovski commented 6 months ago

@DaanHoogland yes correct, released after I wrote my comment, which means we will surely plan our tests now.

As for proof inside the DB, I've ran more tests, and I can most definitely introduce inconsistencies. For example, my last test got an IP with allocated=NULL but state=Allocated, which in practice means this ip is now unusable until I manually reset it in the database (it cannot be assigned, it cannot be released).

select * from user_ip_address where public_ip_address="XX.YY.17.30" \G;
*************************** 1. row ***************************
                 id: 107
               uuid: f1988f51-c9e5-46e4-9393-2a873b2610e0
         account_id: NULL
          domain_id: NULL
  public_ip_address: XX.YY.17.30
     data_center_id: 1
         source_nat: 0
          allocated: NULL
         vlan_db_id: 2
     one_to_one_nat: 0
              vm_id: NULL
              state: Allocated
        mac_address: 139
  source_network_id: 200
         network_id: 225
physical_network_id: 200
          is_system: 0
             vpc_id: NULL
          dnat_vmip: NULL
        is_portable: 0
            display: 1
            removed: NULL
            created: NULL
         rule_state: NULL
       forsystemvms: 0

Note that all these cases are introduced without any modifications to my original test; I'm simply changing the timings of when I start test1.sh and test2.sh

I'll run some more tests, and report any additional anomalies I find.

bragonznx commented 6 months ago

Hello

Do we have any news on this ? I believe it's too important breaking production, to wait for a sponsor to cover this patch

Best regards

rohityadavcloud commented 6 months ago

cc @weizhouapache @sureshanaparti to triage

rohityadavcloud commented 6 months ago

@vdombrovski @bragonznx have you tried to reproduce this issue against CloudStack 4.18.2.0 or 4.19.0.1 ? Anyone is welcome to propose a fix.

weizhouapache commented 6 months ago

@DaanHoogland here is how I successfully reproduced the issue on our QA platform:

56 bytes from 1e:00:4d:00:00:c1 (XX.YY.17.30): index=959 time=275.810 usec
56 bytes from 1e:00:6b:00:00:c2 (XX.YY.17.30): index=960 time=337.038 usec

You were right, this is a timing issue. Here is the raw cmk code to reproduce:

  • Create 2 isolated networks, network-a and network-b (mine are ids c6a4bc87-93cc-499c-8fdc-4e44b4d0fca0 and 7e2aa0f1-c536-4162-be43-aea277ae80aa
  • You don't need to add any VMs, only playing with virtual routers is enough to cause the issue
  • Target a specific IP address, in my case XX.YY.17.30 (id=f1988f51-c9e5-46e4-9393-2a873b2610e0)
  • Create 2 scripts as follows (make sure cmk works properly on your machine, I'm using 6.2.0)
  • Run both scripts in parallel in a while true loop as follows:

Terminal 1:

while true; do bash ./test1.sh; done

Terminal 2:

while true; do bash ./test2.sh; done
  • Test with arping on the IP (arping XX.YY.17.30), it should take around 2-3 mins on a platform with 0 load.
#test1.sh

cmk associate ipaddress projectid=05019f40-8552-4856-9cf1-02ae1b7e2621 zoneid=3adfd657-7e4e-4613-99da-49b3a0ef9db7 ipaddress=XX.YY.17.30 account=qa networkid=c6a4bc87-93cc-499c-8fdc-4e44b4d0fca0 # Network A

cmk createLoadBalancerRule algorithm=roundrobin publicipid=f1988f51-c9e5-46e4-9393-2a873b2610e0 publicport=22 privateport=22 name=vm-a

cmk disassociate ipaddress id=f1988f51-c9e5-46e4-9393-2a873b2610e0

cmk associate ipaddress projectid=05019f40-8552-4856-9cf1-02ae1b7e2621 zoneid=3adfd657-7e4e-4613-99da-49b3a0ef9db7 ipaddress=XX.YY.17.30 account=qa networkid=7e2aa0f1-c536-4162-be43-aea277ae80aa # network B

cmk createLoadBalancerRule algorithm=roundrobin publicipid=f1988f51-c9e5-46e4-9393-2a873b2610e0 publicport=22 privateport=22 name=vm-b

cmk disassociate ipaddress id=f1988f51-c9e5-46e4-9393-2a873b2610e0
#test2.sh

cmk associate ipaddress projectid=05019f40-8552-4856-9cf1-02ae1b7e2621 zoneid=3adfd657-7e4e-4613-99da-49b3a0ef9db7 ipaddress=XX.YY.17.30 account=qa networkid=c6a4bc87-93cc-499c-8fdc-4e44b4d0fca0 # Network A

cmk createLoadBalancerRule algorithm=roundrobin publicipid=f1988f51-c9e5-46e4-9393-2a873b2610e0 publicport=23 privateport=23 name=vm-a

cmk createLoadBalancerRule algorithm=roundrobin publicipid=f1988f51-c9e5-46e4-9393-2a873b2610e0 publicport=24 privateport=24 name=vm-a

cmk disassociate ipaddress id=f1988f51-c9e5-46e4-9393-2a873b2610e0

cmk associate ipaddress projectid=05019f40-8552-4856-9cf1-02ae1b7e2621 zoneid=3adfd657-7e4e-4613-99da-49b3a0ef9db7 ipaddress=XX.YY.17.30 account=qa networkid=7e2aa0f1-c536-4162-be43-aea277ae80aa # network B

cmk createLoadBalancerRule algorithm=roundrobin publicipid=f1988f51-c9e5-46e4-9393-2a873b2610e0 publicport=23 privateport=23 name=vm-b

cmk createLoadBalancerRule algorithm=roundrobin publicipid=f1988f51-c9e5-46e4-9393-2a873b2610e0 publicport=24 privateport=24 name=vm-b

cmk disassociate ipaddress id=f1988f51-c9e5-46e4-9393-2a873b2610e0

Now I understand that handling concurrency isn't really a priority for this project (see another issue from last year: #7907)

However, with the introduction of more vectors of automation (Terraform, Kubernetes), I believe that these types of issues might become really common. In our case, the IP conflicts have impacted more than 10 IPs total on a pool of 400 IPs, which is 2.5%.

thanks @vdombrovski
I adapted the two scripts you shared, and run them in parallel for around 2 hours with a public IP, two isolated networks, two vms I did not face any issue. I tested with latest main branch (4.20-SNAPSHOT), not 4.17.2

Can you share the full management-server.log for further investigation ?

vdombrovski commented 6 months ago

@weizhouapache because it's a timing issue, you need to make sure that both scripts run with a delay, otherwise you are getting a favorable case where both operations target network A, then both target network B etc..

Try modifying test2 to first target network B, and add a random sleep after disassociating the IP; Both of these changes greatly help with reproducting the issue (I could reproduce again now in < 1min), like this (leave test1 the exact same):

#test2.sh

cmk associate ipaddress projectid=05019f40-8552-4856-9cf1-02ae1b7e2621 zoneid=3adfd657-7e4e-4613-99da-49b3a0ef9db7 ipaddress=XX.YY.17.30 account=qa networkid=7e2aa0f1-c536-4162-be43-aea277ae80aa # network B

cmk createLoadBalancerRule algorithm=roundrobin publicipid=f1988f51-c9e5-46e4-9393-2a873b2610e0 publicport=23 privateport=23 name=vm-b

cmk createLoadBalancerRule algorithm=roundrobin publicipid=f1988f51-c9e5-46e4-9393-2a873b2610e0 publicport=24 privateport=24 name=vm-b

cmk disassociate ipaddress id=f1988f51-c9e5-46e4-9393-2a873b2610e0

sleep $((RANDOM % 10))

cmk associate ipaddress projectid=05019f40-8552-4856-9cf1-02ae1b7e2621 zoneid=3adfd657-7e4e-4613-99da-49b3a0ef9db7 ipaddress=XX.YY.17.30 account=qa networkid=c6a4bc87-93cc-499c-8fdc-4e44b4d0fca0 # Network A

cmk createLoadBalancerRule algorithm=roundrobin publicipid=f1988f51-c9e5-46e4-9393-2a873b2610e0 publicport=23 privateport=23 name=vm-a

cmk createLoadBalancerRule algorithm=roundrobin publicipid=f1988f51-c9e5-46e4-9393-2a873b2610e0 publicport=24 privateport=24 name=vm-a

cmk disassociate ipaddress id=f1988f51-c9e5-46e4-9393-2a873b2610e0

sleep $((RANDOM % 10))

As requested, here are the relevant logs from the manager, filtered for assignment+release of IP addresses, at the exact time the conflict occurs:

2024-05-02 14:32:41,747 DEBUG [c.c.a.ApiServlet] (qtp1750498848-6674:ctx-b0b01e22) (logid:ace12149) ===START===  X.Y.191.80 -- GET  account=test-admin&apiKey=REDACTED&command=associateIpAddress&ipaddress=XX.YY.17.30&networkid=c6a4bc87-93cc-499c-8fdc-4e44b4d0fca0&projectid=05019f40-8552-4856-9cf1-02ae1b7e2621&response=json&zoneid=3adfd657-7e4e-4613-99da-49b3a0ef9db7&signature=pcavcaveiIHGjw%2FlWwFeH3mX63Q%3D
2024-05-02 14:32:41,805 DEBUG [c.c.a.ApiServlet] (qtp1750498848-6674:ctx-b0b01e22 ctx-083bb8ea ctx-1ad4ec66) (logid:ace12149) ===END===  X.Y.191.80 -- GET  account=test-admin&apiKey=REDACTED&command=associateIpAddress&ipaddress=XX.YY.17.30&networkid=c6a4bc87-93cc-499c-8fdc-4e44b4d0fca0&projectid=05019f40-8552-4856-9cf1-02ae1b7e2621&response=json&zoneid=3adfd657-7e4e-4613-99da-49b3a0ef9db7&signature=pcavcaveiIHGjw%2FlWwFeH3mX63Q%3D
2024-05-02 14:32:41,837 DEBUG [c.c.n.IpAddressManagerImpl] (API-Job-Executor-30:ctx-4262738c job-2777 ctx-cd5a7ce3) (logid:eb86be5e) Successfully associated ip address XX.YY.17.30 to network Network {"id": 225, "name": "network-A", "uuid": "c6a4bc87-93cc-499c-8fdc-4e44b4d0fca0", "networkofferingid": 9}
2024-05-02 14:32:41,850 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-30:ctx-4262738c job-2777 ctx-cd5a7ce3) (logid:eb86be5e) Complete async job-2777, jobStatus: SUCCEEDED, resultCode: 0, result: org.apache.cloudstack.api.response.IPAddressResponse/ipaddress/{"id":"f1988f51-c9e5-46e4-9393-2a873b2610e0","ipaddress":"XX.YY.17.30","allocated":"2024-05-02T14:32:41+0200","zoneid":"3adfd657-7e4e-4613-99da-49b3a0ef9db7","zonename":"ZONE","issourcenat":"false","projectid":"05019f40-8552-4856-9cf1-02ae1b7e2621","project":"testcpu","domainid":"e8858164-61cb-47ae-b05e-21163381e5ff","domain":"RND","forvirtualnetwork":"true","vlanid":"5c4f2de6-4555-4f8c-8af2-f2fd325bf101","vlanname":"vlan://untagged","isstaticnat":"false","issystem":"false","associatednetworkid":"c6a4bc87-93cc-499c-8fdc-4e44b4d0fca0","associatednetworkname":"network-A","networkid":"08fb8c61-c810-487b-b931-1911a72a0f3e","state":"Allocated","physicalnetworkid":"cc445c30-6484-47e5-b13b-ffdd6554c02b","tags":[],"isportable":"false","fordisplay":"true","hasannotations":"false"}
2024-05-02 14:32:42,902 DEBUG [c.c.a.ApiServlet] (qtp1750498848-6814:ctx-826cfc0d) (logid:08064c3d) ===START===  X.Y.191.80 -- GET  account=test-admin&apiKey=REDACTED&command=associateIpAddress&ipaddress=XX.YY.17.30&networkid=c6a4bc87-93cc-499c-8fdc-4e44b4d0fca0&projectid=05019f40-8552-4856-9cf1-02ae1b7e2621&response=json&zoneid=3adfd657-7e4e-4613-99da-49b3a0ef9db7&signature=pcavcaveiIHGjw%2FlWwFeH3mX63Q%3D
2024-05-02 14:32:42,932 DEBUG [c.c.a.ApiServlet] (qtp1750498848-6814:ctx-826cfc0d ctx-96b1e3fd ctx-d8a477c6) (logid:08064c3d) ===END===  X.Y.191.80 -- GET  account=test-admin&apiKey=REDACTED&command=associateIpAddress&ipaddress=XX.YY.17.30&networkid=c6a4bc87-93cc-499c-8fdc-4e44b4d0fca0&projectid=05019f40-8552-4856-9cf1-02ae1b7e2621&response=json&zoneid=3adfd657-7e4e-4613-99da-49b3a0ef9db7&signature=pcavcaveiIHGjw%2FlWwFeH3mX63Q%3D
2024-05-02 14:32:48,278 DEBUG [c.c.a.ApiServlet] (qtp1750498848-6705:ctx-3cafb699) (logid:b65f7570) ===START===  X.Y.191.80 -- GET  apiKey=REDACTED&command=disassociateIpAddress&id=f1988f51-c9e5-46e4-9393-2a873b2610e0&response=json&signature=02Sf59NUDKw%2FKg%2Fju1gg%2B1229mQ%3D
2024-05-02 14:32:48,301 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (qtp1750498848-6705:ctx-3cafb699 ctx-59398c9e ctx-8f5a2116) (logid:b65f7570) submit async job-2780, details: AsyncJobVO: {id:2780, userId: 2, accountId: 2, instanceType: IpAddress, instanceId: 107, cmd: org.apache.cloudstack.api.command.user.address.DisassociateIPAddrCmd, cmdInfo: {"apiKey":"REDACTED","signature":"02Sf59NUDKw/Kg/ju1gg+1229mQ\u003d","response":"json","ctxUserId":"2","httpmethod":"GET","ctxStartEventId":"70154","id":"f1988f51-c9e5-46e4-9393-2a873b2610e0","ctxDetails":"{\"interface com.cloud.network.IpAddress\":\"f1988f51-c9e5-46e4-9393-2a873b2610e0\"}","ctxAccountId":"2","uuid":"f1988f51-c9e5-46e4-9393-2a873b2610e0","cmdEventType":"NET.IPRELEASE"}, cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0, result: null, initMsid: 198690281380792, completeMsid: null, lastUpdated: null, lastPolled: null, created: null, removed: null}
2024-05-02 14:32:48,301 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-33:ctx-7c0ac951 job-2780) (logid:94060328) Executing AsyncJobVO: {id:2780, userId: 2, accountId: 2, instanceType: IpAddress, instanceId: 107, cmd: org.apache.cloudstack.api.command.user.address.DisassociateIPAddrCmd, cmdInfo: {"apiKey":"REDACTED","signature":"02Sf59NUDKw/Kg/ju1gg+1229mQ\u003d","response":"json","ctxUserId":"2","httpmethod":"GET","ctxStartEventId":"70154","id":"f1988f51-c9e5-46e4-9393-2a873b2610e0","ctxDetails":"{\"interface com.cloud.network.IpAddress\":\"f1988f51-c9e5-
46e4-9393-2a873b2610e0\"}","ctxAccountId":"2","uuid":"f1988f51-c9e5-46e4-9393-2a873b2610e0","cmdEventType":"NET.IPRELEASE"}, cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0, result: null, initMsid: 198690281380792, completeMsid: null, lastUpdated: null, lastPolled: null, created: null, removed: null}
2024-05-02 14:32:48,302 DEBUG [c.c.a.ApiServlet] (qtp1750498848-6705:ctx-3cafb699 ctx-59398c9e ctx-8f5a2116) (logid:b65f7570) ===END===  X.Y.191.80 -- GET  apiKey=REDACTED&command=disassociateIpAddress&id=f1988f51-c9e5-46e4-9393-2a873b2610e0&response=json&signature=02Sf59NUDKw%2FKg%2Fju1gg%2B1229mQ%3D
2024-05-02 14:32:51,347 DEBUG [c.c.a.ApiServlet] (qtp1750498848-6807:ctx-ac7656e2) (logid:ae5760e6) ===START===  X.Y.191.80 -- GET  apiKey=REDACTED&command=disassociateIpAddress&id=f1988f51-c9e5-46e4-9393-2a873b2610e0&response=json&signature=02Sf59NUDKw%2FKg%2Fju1gg%2B1229mQ%3D
2024-05-02 14:32:51,362 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (qtp1750498848-6807:ctx-ac7656e2 ctx-07d762b2 ctx-c606fd19) (logid:ae5760e6) submit async job-2782, details: AsyncJobVO: {id:2782, userId: 2, accountId: 2, instanceType: IpAddress, instanceId: 107, cmd: org.apache.cloudstack.api.command.user.address.DisassociateIPAddrCmd, cmdInfo: {"apiKey":"REDACTED","signature":"02Sf59NUDKw/Kg/ju1gg+1229mQ\u003d","response":"json","ctxUserId":"2","httpmethod":"GET","ctxStartEventId":"70161","id":"f1988f51-c9e5-46e4-9393-2a873b2610e0","ctxDetails":"{\"interface com.cloud.network.IpAddress\":\"f1988f51-c9e5-46e4-9393-2a873b2610e0\"}","ctxAccountId":"2","uuid":"f1988f51-c9e5-46e4-9393-2a873b2610e0","cmdEventType":"NET.IPRELEASE"}, cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0, result: null, initMsid: 198690281380792, completeMsid: null, lastUpdated: null, lastPolled: null, created: null, removed: null}
2024-05-02 14:32:51,362 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-35:ctx-1248a5f3 job-2782) (logid:bb1f9468) Executing AsyncJobVO: {id:2782, userId: 2, accountId: 2, instanceType: IpAddress, instanceId: 107, cmd: org.apache.cloudstack.api.command.user.address.DisassociateIPAddrCmd, cmdInfo: {"apiKey":"REDACTED","signature":"02Sf59NUDKw/Kg/ju1gg+1229mQ\u003d","response":"json","ctxUserId":"2","httpmethod":"GET","ctxStartEventId":"70161","id":"f1988f51-c9e5-46e4-9393-2a873b2610e0","ctxDetails":"{\"interface com.cloud.network.IpAddress\":\"f1988f51-c9e5-46e4-9393-2a873b2610e0\"}","ctxAccountId":"2","uuid":"f1988f51-c9e5-46e4-9393-2a873b2610e0","cmdEventType":"NET.IPRELEASE"}, cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0, result: null, initMsid: 198690281380792, completeMsid: null, lastUpdated: null, lastPolled: null, created: null, removed: null}
2024-05-02 14:32:51,363 DEBUG [c.c.a.ApiServlet] (qtp1750498848-6807:ctx-ac7656e2 ctx-07d762b2 ctx-c606fd19) (logid:ae5760e6) ===END===  X.Y.191.80 -- GET  apiKey=REDACTED&command=disassociateIpAddress&id=f1988f51-c9e5-46e4-9393-2a873b2610e0&response=json&signature=02Sf59NUDKw%2FKg%2Fju1gg%2B1229mQ%3D
2024-05-02 14:33:07,514 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-33:ctx-7c0ac951 job-2780) (logid:94060328) Complete async job-2780, jobStatus: FAILED, resultCode: 530, result: org.apache.cloudstack.api.response.ExceptionResponse/null/{"uuidList":[],"errorcode":"530","errortext":"Failed to disassociate IP address"}
2024-05-02 14:32:41,747 DEBUG [c.c.a.ApiServlet] (qtp1750498848-6674:ctx-b0b01e22) (logid:ace12149) ===START===  X.Y.191.80 -- GET  account=test-admin&apiKey=REDACTED&command=associateIpAddress&ipaddress=XX.YY.17.30&networkid=c6a4bc87-93cc-499c-8fdc-4e44b4d0fca0&projectid=05019f40-8552-4856-9cf1-02ae1b7e2621&response=json&zoneid=3adfd657-7e4e-4613-99da-49b3a0ef9db7&signature=pcavcaveiIHGjw%2FlWwFeH3mX63Q%3D
2024-05-02 14:32:41,805 DEBUG [c.c.a.ApiServlet] (qtp1750498848-6674:ctx-b0b01e22 ctx-083bb8ea ctx-1ad4ec66) (logid:ace12149) ===END===  X.Y.191.80 -- GET  account=test-admin&apiKey=REDACTED&command=associateIpAddress&ipaddress=XX.YY.17.30&networkid=c6a4bc87-93cc-499c-8fdc-4e44b4d0fca0&projectid=05019f40-8552-4856-9cf1-02ae1b7e2621&response=json&zoneid=3adfd657-7e4e-4613-99da-49b3a0ef9db7&signature=pcavcaveiIHGjw%2FlWwFeH3mX63Q%3D
2024-05-02 14:32:41,837 DEBUG [c.c.n.IpAddressManagerImpl] (API-Job-Executor-30:ctx-4262738c job-2777 ctx-cd5a7ce3) (logid:eb86be5e) Successfully associated ip address XX.YY.17.30 to network Network {"id": 225, "name": "network-A", "uuid": "c6a4bc87-93cc-499c-8fdc-4e44b4d0fca0", "networkofferingid": 9}
2024-05-02 14:32:41,850 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-30:ctx-4262738c job-2777 ctx-cd5a7ce3) (logid:eb86be5e) Complete async job-2777, jobStatus: SUCCEEDED, resultCode: 0, result: org.apache.cloudstack.api.response.IPAddressResponse/ipaddress/{"id":"f1988f51-c9e5-46e4-9393-2a873b2610e0","ipaddress":"XX.YY.17.30","allocated":"2024-05-02T14:32:41+0200","zoneid":"3adfd657-7e4e-4613-99da-49b3a0ef9db7","zonename":"ZONE","issourcenat":"false","projectid":"05019f40-8552-4856-9cf1-02ae1b7e2621","project":"testcpu","domainid":"e8858164-61cb-47ae-b05e-21163381e5ff","domain":"RND","forvirtualnetwork":"true","vlanid":"5c4f2de6-4555-4f8c-8af2-f2fd325bf101","vlanname":"vlan://untagged","isstaticnat":"false","issystem":"false","associatednetworkid":"c6a4bc87-93cc-499c-8fdc-4e44b4d0fca0","associatednetworkname":"network-A","networkid":"08fb8c61-c810-487b-b931-1911a72a0f3e","state":"Allocated","physicalnetworkid":"cc445c30-6484-47e5-b13b-ffdd6554c02b","tags":[],"isportable":"false","fordisplay":"true","hasannotations":"false"}
2024-05-02 14:32:42,902 DEBUG [c.c.a.ApiServlet] (qtp1750498848-6814:ctx-826cfc0d) (logid:08064c3d) ===START===  X.Y.191.80 -- GET  account=test-admin&apiKey=REDACTED&command=associateIpAddress&ipaddress=XX.YY.17.30&networkid=c6a4bc87-93cc-499c-8fdc-4e44b4d0fca0&projectid=05019f40-8552-4856-9cf1-02ae1b7e2621&response=json&zoneid=3adfd657-7e4e-4613-99da-49b3a0ef9db7&signature=pcavcaveiIHGjw%2FlWwFeH3mX63Q%3D
2024-05-02 14:32:42,932 DEBUG [c.c.a.ApiServlet] (qtp1750498848-6814:ctx-826cfc0d ctx-96b1e3fd ctx-d8a477c6) (logid:08064c3d) ===END===  X.Y.191.80 -- GET  account=test-admin&apiKey=REDACTED&command=associateIpAddress&ipaddress=XX.YY.17.30&networkid=c6a4bc87-93cc-499c-8fdc-4e44b4d0fca0&projectid=05019f40-8552-4856-9cf1-02ae1b7e2621&response=json&zoneid=3adfd657-7e4e-4613-99da-49b3a0ef9db7&signature=pcavcaveiIHGjw%2FlWwFeH3mX63Q%3D
2024-05-02 14:32:48,278 DEBUG [c.c.a.ApiServlet] (qtp1750498848-6705:ctx-3cafb699) (logid:b65f7570) ===START===  X.Y.191.80 -- GET  apiKey=REDACTED&command=disassociateIpAddress&id=f1988f51-c9e5-46e4-9393-2a873b2610e0&response=json&signature=02Sf59NUDKw%2FKg%2Fju1gg%2B1229mQ%3D
2024-05-02 14:32:48,301 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (qtp1750498848-6705:ctx-3cafb699 ctx-59398c9e ctx-8f5a2116) (logid:b65f7570) submit async job-2780, details: AsyncJobVO: {id:2780, userId: 2, accountId: 2, instanceType: IpAddress, instanceId: 107, cmd: org.apache.cloudstack.api.command.user.address.DisassociateIPAddrCmd, cmdInfo: {"apiKey":"REDACTED","signature":"02Sf59NUDKw/Kg/ju1gg+1229mQ\u003d","response":"json","ctxUserId":"2","httpmethod":"GET","ctxStartEventId":"70154","id":"f1988f51-c9e5-46e4-9393-2a873b2610e0","ctxDetails":"{\"interface com.cloud.network.IpAddress\":\"f1988f51-c9e5-46e4-9393-2a873b2610e0\"}","ctxAccountId":"2","uuid":"f1988f51-c9e5-46e4-9393-2a873b2610e0","cmdEventType":"NET.IPRELEASE"}, cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0, result: null, initMsid: 198690281380792, completeMsid: null, lastUpdated: null, lastPolled: null, created: null, removed: null}
2024-05-02 14:32:48,301 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-33:ctx-7c0ac951 job-2780) (logid:94060328) Executing AsyncJobVO: {id:2780, userId: 2, accountId: 2, instanceType: IpAddress, instanceId: 107, cmd: org.apache.cloudstack.api.command.user.address.DisassociateIPAddrCmd, cmdInfo: {"apiKey":"REDACTED","signature":"02Sf59NUDKw/Kg/ju1gg+1229mQ\u003d","response":"json","ctxUserId":"2","httpmethod":"GET","ctxStartEventId":"70154","id":"f1988f51-c9e5-46e4-9393-2a873b2610e0","ctxDetails":"{\"interface com.cloud.network.IpAddress\":\"f1988f51-c9e5-46e4-9393-2a873b2610e0\"}","ctxAccountId":"2","uuid":"f1988f51-c9e5-46e4-9393-2a873b2610e0","cmdEventType":"NET.IPRELEASE"}, cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0, result: null, initMsid: 198690281380792, completeMsid: null, lastUpdated: null, lastPolled: null, created: null, removed: null}
2024-05-02 14:32:48,302 DEBUG [c.c.a.ApiServlet] (qtp1750498848-6705:ctx-3cafb699 ctx-59398c9e ctx-8f5a2116) (logid:b65f7570) ===END===  X.Y.191.80 -- GET  apiKey=REDACTED&command=disassociateIpAddress&id=f1988f51-c9e5-46e4-9393-2a873b2610e0&response=json&signature=02Sf59NUDKw%2FKg%2Fju1gg%2B1229mQ%3D
2024-05-02 14:32:51,347 DEBUG [c.c.a.ApiServlet] (qtp1750498848-6807:ctx-ac7656e2) (logid:ae5760e6) ===START===  X.Y.191.80 -- GET  apiKey=REDACTED&command=disassociateIpAddress&id=f1988f51-c9e5-46e4-9393-2a873b2610e0&response=json&signature=02Sf59NUDKw%2FKg%2Fju1gg%2B1229mQ%3D
2024-05-02 14:32:51,362 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (qtp1750498848-6807:ctx-ac7656e2 ctx-07d762b2 ctx-c606fd19) (logid:ae5760e6) submit async job-2782, details: AsyncJobVO: {id:2782, userId: 2, accountId: 2, instanceType: IpAddress, instanceId: 107, cmd: org.apache.cloudstack.api.command.user.address.DisassociateIPAddrCmd, cmdInfo: {"apiKey":"REDACTED","signature":"02Sf59NUDKw/Kg/ju1gg+1229mQ\u003d","response":"json","ctxUserId":"2","httpmethod":"GET","ctxStartEventId":"70161","id":"f1988f51-c9e5-46e4-9393-2a873b2610e0","ctxDetails":"{\"interface com.cloud.network.IpAddress\":\"f1988f51-c9e5-46e4-9393-2a873b2610e0\"}","ctxAccountId":"2","uuid":"f1988f51-c9e5-46e4-9393-2a873b2610e0","cmdEventType":"NET.IPRELEASE"}, cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0, result: null, initMsid: 198690281380792, completeMsid: null, lastUpdated: null, lastPolled: null, created: null, removed: null}
2024-05-02 14:32:51,362 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-35:ctx-1248a5f3 job-2782) (logid:bb1f9468) Executing AsyncJobVO: {id:2782, userId: 2, accountId: 2, instanceType: IpAddress, instanceId: 107, cmd: org.apache.cloudstack.api.command.user.address.DisassociateIPAddrCmd, cmdInfo: {"apiKey":"REDACTED","signature":"02Sf59NUDKw/Kg/ju1gg+1229mQ\u003d","response":"json","ctxUserId":"2","httpmethod":"GET","ctxStartEventId":"70161","id":"f1988f51-c9e5-46e4-9393-2a873b2610e0","ctxDetails":"{\"interface com.cloud.network.IpAddress\":\"f1988f51-c9e5-46e4-9393-2a873b2610e0\"}","ctxAccountId":"2","uuid":"f1988f51-c9e5-46e4-9393-2a873b2610e0","cmdEventType":"NET.IPRELEASE"}, cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0, result: null, initMsid: 198690281380792, completeMsid: null, lastUpdated: null, lastPolled: null, created: null, removed: null}
2024-05-02 14:32:51,363 DEBUG [c.c.a.ApiServlet] (qtp1750498848-6807:ctx-ac7656e2 ctx-07d762b2 ctx-c606fd19) (logid:ae5760e6) ===END===  X.Y.191.80 -- GET  apiKey=REDACTED&command=disassociateIpAddress&id=f1988f51-c9e5-46e4-9393-2a873b2610e0&response=json&signature=02Sf59NUDKw%2FKg%2Fju1gg%2B1229mQ%3D
bragonznx commented 6 months ago

@rohityadavcloud do you have any fix of this in the changelog from 4.18 or 4.19 ? Because if it's not the case, I don't see the point you mentioned at all

We will reproduce this behavior on a 4.19 if need in due time, but for now we have 4 or 5 platform in 4.17 who can't be migrate because 4.18 didn't passed our testing scenario yet : each time we had regression

Best

bragonznx commented 6 months ago

We haven't any java development capability : we only have Python developer in our company. But I'm open to any proposal to fix this very annoying issue.

weizhouapache commented 6 months ago

@vdombrovski @bragonznx after some hours, I am able to reproduce the issue twice

it looks like the issue is caused by concurrent actions

The action failed but the IP is marked as Free, but actually it is still configured in the VRs

To confirm it, can you check the management server log

egrep 'Failed to release | Unable to revoke ' /var/log/cloudstack/management/management-server.log
bragonznx commented 6 months ago

root@mgt-01:~# zgrep 'Failed to release | Unable to revoke ' /var/log/cloudstack/management/management-server.log.*

root@mgt-01:~#

root@mgt-01:~# egrep 'Failed to release | Unable to revoke ' /var/log/cloudstack/management/management-server.log

root@mgt-01:~#

weizhouapache commented 6 months ago

zgrep 'Failed to release | Unable to revoke ' /var/log/cloudstack/management/management-server.log.*

@bragonznx can you use zegrep instead ?

zegrep 'Failed to release | Unable to revoke ' /var/log/cloudstack/management/management-server.log.*
bragonznx commented 6 months ago

root@mgt-01:~# zegrep 'Failed to release | Unable to revoke ' /var/log/cloudstack/management/management-server.log.*

root@mgt-01:~#

no matter the kind of grep you want to use, there is no occurrence in our production logs.

weizhouapache commented 6 months ago

root@mgt-01:~# zegrep 'Failed to release | Unable to revoke ' /var/log/cloudstack/management/management-server.log.*

root@mgt-01:~#

no matter the kind of grep you want to use, there is no occurrence in our production logs.

@bragonznx if you have multiple management servers, can you check the logs on other management servers ?

DaanHoogland commented 6 months ago

zegrep 'Failed to release | Unable to revoke ' /var/log/cloudstack/management/management-server.log.*

is the pipe symbol within the single quotes a problem in this command?

maybe try

zegrep 'Failed to release.*Unable to revoke' /var/log/cloudstack/management/management-server.log.*

instead?

bragonznx commented 6 months ago

root@mgt-01:~# zegrep 'Failed to release.Unable to revoke' /var/log/cloudstack/management/management-server.log.

root@mgt-01:~#

weizhouapache commented 6 months ago

root@mgt-01:~# zegrep 'Failed to release.Unable to revoke' /var/log/cloudstack/management/management-server.log.

root@mgt-01:~#

@bragonznx Can you check other management servers if any?

bragonznx commented 6 months ago

It's on management server where @vdombrovski did all the tests, so it should be here....

weizhouapache commented 6 months ago

It's on management server where @vdombrovski did all the tests, so it should be here....

@bragonznx it would be good to upload the management server logs. if not possible, can you please search Exception/error in the logs ?

zegrep -i "exception|error" /var/log/cloudstack/management/management-server.log.*

bragonznx commented 6 months ago

What do you want to proof ? Because we have both got the issue of this IP conflicts issue. I can't spend my time digging in Exception/error, I have too much of them.

If you need me to copy past millions of "/var/log/cloudstack/management/management-server.log.2024-01-05.gz:2024-01-05 02:55:12,904 ERROR [c.c.s.StatsCollector] (StatsCollector-5:ctx-24d81e3e) (logid:81fce2ab)" you have to explain me the value, because for me it's around 0. We don't have to proof again the issue, everyone in the discussion know that it's real, and the issue is here, Vladimir replicated it several times, and you managed to do it on your side too.

weizhouapache commented 6 months ago

What do you want to proof ? Because we have both got the issue of this IP conflicts issue. I can't spend my time digging in Exception/error, I have too much of them.

If you need me to copy past millions of "/var/log/cloudstack/management/management-server.log.2024-01-05.gz:2024-01-05 02:55:12,904 ERROR [c.c.s.StatsCollector] (StatsCollector-5:ctx-24d81e3e) (logid:81fce2ab)" you have to explain me the value, because for me it's around 0. We don't have to proof again the issue, everyone in the discussion know that it's real, and the issue is here, Vladimir replicated it several times, and you managed to do it on your side too.

@bragonznx yes, I have reproduced the issue. However, it looks like the root cause is different from yours, because I saw some errors like Failed to release resources for ip address and Unable to revoke all the firewall rules for ip sometimes in my log, but you did not find it in your logs.

I did not expect you dig into the management server logs. I just requested you to share some management server log as the logs might give some clue what the root cause is.

It is up to you upload or not upload the management server logs (of the days when @vdombrovski tested).

DaanHoogland commented 6 months ago

What do you want to proof ?

@bragonznx , if you really want to provide "proof", it would need to be a fix. This is open source and Wei is trying to figure it out , out of interest.

Because we have both got the issue of this IP conflicts issue. I can't spend my time digging in Exception/error, I have too much of them.

Neither can anybody else spend the time for you, if you are not willing.

If you need me to copy past millions of "/var/log/cloudstack/management/management-server.log.2024-01-05.gz:2024-01-05 02:55:12,904 ERROR [c.c.s.StatsCollector] (StatsCollector-5:ctx-24d81e3e) (logid:81fce2ab)" you have to explain me the value, because for me it's around 0. We don't have to proof again the issue, everyone in the discussion know that it's real, and the issue is here, Vladimir replicated it several times, and you managed to do it on your side too.

Nobody is obliged to care about your issue either! Again, this is open source and interested people are willing to help you solve it, but please pay them respect.

bragonznx commented 6 months ago

"Nobody is obligated to care about your issue. However, in open source, there are individuals who are genuinely interested in helping you resolve it. It's important to show them respect."

Can you point out where we have been disrespectful? I personally spent a lot of time digging within overall all our cloudstack platform to give you the fact that the logs are full with nothing you are looking about ! How it is isrespectful, can you highlight me ?

We have provided all the necessary materials, evidence, scripts, and steps to reproduce the issue across multiple tickets.

Both of us rely heavily on CloudStack, an "open-source project", to generate revenue through our company's services. I am capable of changing projects if necessary; after all, we use libvirt extensively, and I'm well-versed in it. So, we as a company aren't stuck. I can pivot in just two minutes and chart a new roadmap to go out.

On your end, you offer CloudStack development services and expertise, but there's an issue online that portrays you in a negative light. It appears as if you're dismissing the efforts of those who are trying to highlight design flaws within CloudStack. Again we aren't developpers company we are a cloud company and you provide a cloud solution, so when a cloud provider with many platform in production give you a lot of evidences of miss behavior, keep in mind that we aren't doing this on our week end !

How do you think the community will perceive this ?

Your behavior could be seen as belittling users, especially regarding a major asynchronous design flaw. Whenever someone searches for a solution to a public IP issue, they will come across this problem, and it will be widely circulated.

We're willing to fund a patch, but let's refrain from accusing others of disrespect. Using the open-source flag as a shield here is risky. You have my contact if you need to reach me, we already bought professional services ...

Best regards without any disrespect !

GutoVeronezi commented 6 months ago

Hello @bragonznx,

Thank you and @vdombrovski for reporting the situation and providing solid and concrete information for us to reproduce the issue; and I am sorry that you guys are not having a good experience while reporting an issue.

@hsato03 and I were able to reproduce the situation and we are now analyzing the workflows to fix it. We hope to open a PR shortly to share our findings and tests

weizhouapache commented 6 months ago

@hsato03 and I were able to reproduce the situation and we are now analyzing the workflows to fix it. We hope to open a PR shortly to share our findings and tests

@GutoVeronezi can you share the details of the issue you encountered ?

if you get the error Failed to release and Unable to revoke in the management server log, the issue you faced might be same as what I reproduced, but not same as the reporter faced.

vdombrovski commented 6 months ago

Hello @weizhouapache, sorry for the late reply.

I ran the script again to fetch your logs, I can confirm that:

  1. These logs do sometimes appear when running the script (see below)
  2. These logs are absolutely not related to the original issue

For example, I'm getting these logs:

tail -f /var/log/cloudstack/management/management-server.log | grep -E "Failed to release|Unable to revoke"                                                         
2024-05-13 11:35:04,879 WARN  [c.c.n.IpAddressManagerImpl] (API-Job-Executor-87:ctx-e73f179d job-2834 ctx-d655a156) (logid:f8ef96fd) Unable to revoke all the firewall rules for ip id=107 as a part of ip release
2024-05-13 11:35:22,448 WARN  [c.c.n.IpAddressManagerImpl] (API-Job-Executor-87:ctx-e73f179d job-2834 ctx-d655a156) (logid:f8ef96fd) Failed to release resources for ip address id=107
2024-05-13 11:35:22,483 WARN  [c.c.n.NetworkServiceImpl] (API-Job-Executor-87:ctx-e73f179d job-2834 ctx-d655a156) (logid:f8ef96fd) Failed to release public ip address id=107
2024-05-13 11:37:15,709 WARN  [c.c.n.IpAddressManagerImpl] (API-Job-Executor-112:ctx-95282fbe job-2859 ctx-4f1f38f7) (logid:ee6bbf3d) Unable to revoke all the firewall rules for ip id=107 as a part of ip release
2024-05-13 11:37:33,802 WARN  [c.c.n.IpAddressManagerImpl] (API-Job-Executor-112:ctx-95282fbe job-2859 ctx-4f1f38f7) (logid:ee6bbf3d) Failed to release resources for ip address id=107
2024-05-13 11:37:33,841 WARN  [c.c.n.NetworkServiceImpl] (API-Job-Executor-112:ctx-95282fbe job-2859 ctx-4f1f38f7) (logid:ee6bbf3d) Failed to release public ip address id=107
2024-05-13 11:38:43,386 WARN  [c.c.n.IpAddressManagerImpl] (API-Job-Executor-4:ctx-a99870ee job-2876 ctx-15237616) (logid:5632f157) Unable to revoke all the firewall rules for ip id=107 as a part of ip release
2024-05-13 11:38:53,873 WARN  [c.c.n.IpAddressManagerImpl] (API-Job-Executor-4:ctx-a99870ee job-2876 ctx-15237616) (logid:5632f157) Failed to release resources for ip address id=107
2024-05-13 11:39:00,601 WARN  [c.c.n.NetworkServiceImpl] (API-Job-Executor-4:ctx-a99870ee job-2876 ctx-15237616) (logid:5632f157) Failed to release public ip address id=107
2024-05-13 11:41:42,867 WARN  [c.c.n.IpAddressManagerImpl] (API-Job-Executor-41:ctx-06acb0e1 job-2913 ctx-63672a38) (logid:477fad2b) Unable to revoke all the firewall rules for ip id=107 as a part of ip release
2024-05-13 11:42:00,821 WARN  [c.c.n.IpAddressManagerImpl] (API-Job-Executor-41:ctx-06acb0e1 job-2913 ctx-63672a38) (logid:477fad2b) Failed to release resources for ip address id=107
2024-05-13 11:42:00,848 WARN  [c.c.n.NetworkServiceImpl] (API-Job-Executor-41:ctx-06acb0e1 job-2913 ctx-63672a38) (logid:477fad2b) Failed to release public ip address id=107

However, there is no IP conflict introduced when this log appears (arping returns a single MAC address).

Similarly, when the IP conflict appears, this log is not present at all (at least in the 3-4 last iterations which lead to the conflict in my tests).

I think there are 2 separate issues here:

  1. The manager itself is vulnerable to race-conditions, leading to your observations in the logs, and to the fact an IP can become deadlocked in the database
  2. The provisioning logic which governs the assignment of public IP addresses / firewall rules etc... is itself flawed, allowing 2 VRs to deploy a stale version of the configuration, which causes the IP conflict even from a clean database state.

Hopefully these details help this investigation.

weizhouapache commented 6 months ago

thanks @vdombrovski same as suspected, I think your issue is different as what I faced in my testing (should be fixed by #9059).

DaanHoogland commented 5 months ago

@hsato03 are you working on this? I see you assigned but also an open PR by @weizhouapache : #9059

weizhouapache commented 5 months ago

@hsato03 are you working on this? I see you assigned but also an open PR by @weizhouapache : #9059

@DaanHoogland The issue I fixed in #9059 is not exactly the same as @vdombrovski reported.

@hsato03 @GutoVeronezi said they have reproduced the issue, let's wait for their PR.

hsato03 commented 5 months ago

Hello @bragonznx @vdombrovski

We reproduced the issue using the script you provided in this discussion and also discovered that this situation also happens when creating a remote access VPN. The situation we found was happening due to race conditions, as you guys stated in the issue. We created PR #9234 to improve some IP-related APIs that were prone to race conditions. Could you guys check and try the PR? Also, if you find anything else or your situation happens to be different, just let me know and I will gladly try to help you.

hsato03 commented 5 months ago

@bragonznx @vdombrovski

We also discovered that the public IP can be duplicated when different VRs have the same control IP. Could you check the database to see that there are no duplicate control IPs on your VRs?

sureshanaparti commented 4 months ago

Hi @vdombrovski Can you test the PR #9234 & confirm if the IP conflicts / race conditions are resolved or not.

vdombrovski commented 4 months ago

Hello @sureshanaparti, I would love to test this patch, however I can't apply it against the 4.17.2.0 branch:

git apply 9234.patch
error: patch failed: server/src/main/java/com/cloud/network/IpAddressManagerImpl.java:725
error: server/src/main/java/com/cloud/network/IpAddressManagerImpl.java: patch does not apply
error: patch failed: server/src/main/java/com/cloud/network/lb/LoadBalancingRulesManagerImpl.java:1814
error: server/src/main/java/com/cloud/network/lb/LoadBalancingRulesManagerImpl.java: patch does not apply

Like described before, I do not currently have a running 4.19 Cloudstack setup because of migration issues, so my tests are limited to the 4.17 release

sureshanaparti commented 4 months ago

Thanks @vdombrovski, Can you review the PR (and check if it covers the IP conflict fixes).

hsato03 commented 4 months ago

Hello @vdombrovski,

In order for you to test, I made the backport of #9234 to the 4.17.2 version in my repository. You can clone it by using the following command:

git clone -b 4.17.2-fix-ip-race-conditions git@github.com:hsato03/cloudstack.git
vdombrovski commented 4 months ago

Hello @hsato03,

Thank you for the backport, I've tested the patch, this looks good, no more IP conflicts detected, and all rules still apply properly as before (as much as possible at least). I definitely see some NPEs now though, in case this is something you are trying to prevent, see examples below (might be out of scope for this issue).

Anyway thank you for the amazing work on this subject.

2024-06-27 17:23:09,634 WARN  [o.a.c.a.c.u.l.CreateLoadBalancerRuleCmd] (API-Job-Executor-98:ctx-caca6b4e job-3390 ctx-cbaca4cc) (logid:518a1188) Failed to create LB rule due to exception
java.lang.NullPointerException
        at com.cloud.network.firewall.FirewallManagerImpl.applyRules(FirewallManagerImpl.java:594)
        at com.cloud.network.firewall.FirewallManagerImpl.applyFirewallRules(FirewallManagerImpl.java:700)
        at com.cloud.network.firewall.FirewallManagerImpl.applyIngressFirewallRules(FirewallManagerImpl.java:671)
        at jdk.internal.reflect.GeneratedMethodAccessor227.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:198)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
        at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:97)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
        at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:215)
        at com.sun.proxy.$Proxy258.applyIngressFirewallRules(Unknown Source)
        at org.apache.cloudstack.api.command.user.loadbalancer.CreateLoadBalancerRuleCmd.execute(CreateLoadBalancerRuleCmd.java:280)
        at com.cloud.api.ApiDispatcher.dispatch(ApiDispatcher.java:163)
        at com.cloud.api.ApiAsyncJobDispatcher.runJob(ApiAsyncJobDispatcher.java:106)
        at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:620)
        at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:48)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:55)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:102)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:52)
        at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:45)
        at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.run(AsyncJobManagerImpl.java:568)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)

and

2024-06-27 17:23:13,637 ERROR [c.c.a.ApiAsyncJobDispatcher] (API-Job-Executor-99:ctx-f17bb0ec job-3391) (logid:61dae7ad) Unexpected exception while executing org.apache.cloudstack.api.command.user.address.DisassociateIPAddrCmd
java.lang.NullPointerException
        at com.cloud.network.lb.LoadBalancingRulesManagerImpl.deleteLoadBalancerRule(LoadBalancingRulesManagerImpl.java:1484)
        at com.cloud.network.lb.LoadBalancingRulesManagerImpl.removeAllLoadBalanacersForIp(LoadBalancingRulesManagerImpl.java:2031)
        at jdk.internal.reflect.GeneratedMethodAccessor231.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:198)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
        at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:97)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
        at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:215)
        at com.sun.proxy.$Proxy178.removeAllLoadBalanacersForIp(Unknown Source)
        at com.cloud.network.IpAddressManagerImpl.cleanupIpResources(IpAddressManagerImpl.java:682)
        at com.cloud.network.IpAddressManagerImpl.disassociatePublicIpAddress(IpAddressManagerImpl.java:716)
        at com.cloud.network.NetworkServiceImpl.releaseIpAddressInternal(NetworkServiceImpl.java:1119)
        at com.cloud.network.NetworkServiceImpl.releaseIpAddress(NetworkServiceImpl.java:1063)
        at jdk.internal.reflect.GeneratedMethodAccessor282.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:198)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
        at org.apache.cloudstack.network.contrail.management.EventUtils$EventInterceptor.invoke(EventUtils.java:107)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:175)
        at com.cloud.event.ActionEventInterceptor.invoke(ActionEventInterceptor.java:52)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:175)
        at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:97)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
        at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:215)
        at com.sun.proxy.$Proxy170.releaseIpAddress(Unknown Source)
        at org.apache.cloudstack.api.command.user.address.DisassociateIPAddrCmd.execute(DisassociateIPAddrCmd.java:80)
        at com.cloud.api.ApiDispatcher.dispatch(ApiDispatcher.java:163)
        at com.cloud.api.ApiAsyncJobDispatcher.runJob(ApiAsyncJobDispatcher.java:106)
        at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:620)
        at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:48)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:55)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:102)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:52)
        at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:45)
        at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.run(AsyncJobManagerImpl.java:568)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
hsato03 commented 4 months ago

@vdombrovski Thank you for testing and pointing that out. I will address these NPEs in #9234.

I'm glad this patch helped you.

sureshanaparti commented 4 months ago

Thanks @vdombrovski @hsato03. Fixed in #9234.