Closed: soreana closed this issue 7 months ago.
Any suggestions appreciated.
@soreana what hypervisor do you use? Can you check whether all management servers and hypervisors are upgraded to 4.18.1.0?
Can you ssh into the backup router (from the KVM host) and share /var/log/cloud.log? I suspect you are facing the issue fixed by https://github.com/apache/cloudstack/pull/7953
@weizhouapache Thanks for the prompt reply.
Yes, both the management servers and the hypervisors are updated to 4.18.1.0.
However, it doesn't seem to be the bug you mentioned. According to the CloudStack management server, the job for creating a new virtual router is created, but the hypervisor never receives the command to start the new router VM.
Here are the last messages from the management server regarding starting the new VM:
2023-10-09 10:49:49,488 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4140
2023-10-09 10:49:49,510 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4141
2023-10-09 10:49:49,531 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4142
2023-10-09 10:49:49,552 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4143
2023-10-09 10:49:49,574 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4144
2023-10-09 10:49:49,596 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4145
2023-10-09 10:49:49,616 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4146
2023-10-09 10:49:49,637 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4147
2023-10-09 10:49:49,678 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4148
2023-10-09 10:49:49,699 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4149
2023-10-09 10:49:49,721 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4150
2023-10-09 10:49:50,028 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4140
2023-10-09 10:49:50,060 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4140
2023-10-09 10:49:50,087 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4140
2023-10-09 10:49:50,140 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4141
2023-10-09 10:49:50,172 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4141
2023-10-09 10:49:50,200 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4141
2023-10-09 10:49:50,253 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4142
2023-10-09 10:49:50,285 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4142
2023-10-09 10:49:50,313 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4142
2023-10-09 10:49:50,363 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4143
2023-10-09 10:49:50,394 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4143
2023-10-09 10:49:50,421 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4143
2023-10-09 10:49:50,472 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4144
2023-10-09 10:49:50,503 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4144
2023-10-09 10:49:50,533 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4144
2023-10-09 10:49:50,581 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4145
2023-10-09 10:49:50,612 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4145
2023-10-09 10:49:50,641 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4145
2023-10-09 10:49:50,692 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4146
2023-10-09 10:49:50,721 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4146
2023-10-09 10:49:50,750 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4146
2023-10-09 10:49:50,808 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4147
2023-10-09 10:49:50,842 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4147
2023-10-09 10:49:50,874 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4147
2023-10-09 10:49:50,929 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4148
2023-10-09 10:49:50,960 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4148
2023-10-09 10:49:50,989 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4148
2023-10-09 10:49:51,041 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4149
2023-10-09 10:49:51,075 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4149
2023-10-09 10:49:51,102 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4149
2023-10-09 10:49:51,151 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4150
2023-10-09 10:49:51,182 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4150
2023-10-09 10:49:51,211 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4150
2023-10-09 10:49:51,242 DEBUG [c.c.n.r.VpcVirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Found 0 static routes to apply as a part of vpc route VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start
2023-10-09 10:49:51,250 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Creating monitoring services on VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start...
2023-10-09 10:57:21,381 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Resending ipAssoc, port forwarding, load balancing rules as a part of Virtual router start
2023-10-09 10:57:21,389 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Found 0 firewall Egress rule(s) to apply as a part of domR VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start.
2023-10-09 10:57:21,389 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Found 0 Ipv6 firewall rule(s) to apply as a part of domR VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start.
2023-10-09 10:57:21,397 DEBUG [c.c.n.r.VpcVirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Found 3 network ACLs to apply as a part of VPC VR VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start for guest network id=4140
2023-10-09 10:57:21,415 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4140
2023-10-09 10:57:21,441 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Creating monitoring services on VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start...
2023-10-09 11:04:49,515 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Reapplying dhcp entries as a part of domR VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start...
2023-10-09 11:04:49,525 DEBUG [c.c.n.r.CommandSetupHelper] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Creating dhcp entry for vm VM instance {"id":27189,"instanceName":"i-877-27189-VM","type":"User","uuid":"f34641cc-1885-430e-9eef-0f2560527f69"} on domR VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"}.
2023-10-09 11:04:49,539 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Reapplying vm data (userData and metaData) entries as a part of domR VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start...
2023-10-09 11:04:49,546 DEBUG [c.c.n.r.CommandSetupHelper] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Creating user data entry for vm VM instance {"id":27189,"instanceName":"i-877-27189-VM","type":"User","uuid":"f34641cc-1885-430e-9eef-0f2560527f69"} on domR VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"}
2023-10-09 11:04:49,581 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Resending ipAssoc, port forwarding, load balancing rules as a part of Virtual router start
2023-10-09 11:04:49,590 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Found 0 firewall Egress rule(s) to apply as a part of domR VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start.
2023-10-09 11:04:49,590 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Found 0 Ipv6 firewall rule(s) to apply as a part of domR VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start.
2023-10-09 11:04:49,600 DEBUG [c.c.n.r.VpcVirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Found 3 network ACLs to apply as a part of VPC VR VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start for guest network id=4141
2023-10-09 11:04:49,625 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4141
2023-10-09 11:04:49,654 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Creating monitoring services on VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start...
2023-10-09 11:12:18,234 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Reapplying dhcp entries as a part of domR VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start...
2023-10-09 11:12:18,245 DEBUG [c.c.n.r.CommandSetupHelper] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Creating dhcp entry for vm VM instance {"id":27190,"instanceName":"i-877-27190-VM","type":"User","uuid":"4aa04dae-091d-4f78-bc32-af5da5a1ca2d"} on domR VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"}.
2023-10-09 11:12:18,262 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Reapplying vm data (userData and metaData) entries as a part of domR VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start...
2023-10-09 11:12:18,270 DEBUG [c.c.n.r.CommandSetupHelper] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Creating user data entry for vm VM instance {"id":27190,"instanceName":"i-877-27190-VM","type":"User","uuid":"4aa04dae-091d-4f78-bc32-af5da5a1ca2d"} on domR VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"}
2023-10-09 11:12:18,307 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Resending ipAssoc, port forwarding, load balancing rules as a part of Virtual router start
2023-10-09 11:12:18,315 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Found 0 firewall Egress rule(s) to apply as a part of domR VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start.
2023-10-09 11:12:18,315 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Found 0 Ipv6 firewall rule(s) to apply as a part of domR VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start.
2023-10-09 11:12:18,325 DEBUG [c.c.n.r.VpcVirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Found 3 network ACLs to apply as a part of VPC VR VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start for guest network id=4142
2023-10-09 11:12:18,345 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4142
2023-10-09 11:12:18,372 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Creating monitoring services on VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start...
After this line there is no more activity on this job (the new router VM stays in "Starting").
The hypervisor chosen for the router VM has no mention of the router's name in agent.log:
# grep r-27201-VM /var/log/cloudstack/agent/agent.log
#
It looks like the management server doesn't even send the VM creation command to the hypervisor.
An hour later the job ends with the following message:
(VPC restart test ) Unable to serialize: Job is cancelled as it has been blocking others for too long
The running router VM (the one in Running state while CloudStack is trying to create the second router VM) indeed shows the lines from the bug you mentioned:
root@r-27188-VM:~# cat /var/log/cloud.log | grep md5
Mon 09 Oct 2023 08:39:31 AM UTC Scripts checksum detected: oldmd5=da16d6a2533fd19960304152dd78e02c newmd5=da16d6a2533fd19960304152dd78e02c
Mon 09 Oct 2023 08:39:39 AM UTC Scripts checksum detected: oldmd5=da16d6a2533fd19960304152dd78e02c newmd5=28d602db87f545cced0664df0ed9d437
However, the timestamps in the log correspond to the VPC creation time (I created this test VPC at 09 Oct 2023 08:39:10).
I'm still trying to figure out the cause of the bug by removing VPC elements and restarting. So far I suspect that the large number of tiers (the VPC has 11 tiers) is the cause, but this is yet to be confirmed.
I will post an update here with the reproduction scenario once I have it.
@soreana do other VPCs have the same issue?
@weizhouapache When the VPC has a lot of tiers (around 10 to 11), the issue happens.
It is strange that the primary router works but the backup does not; there is no difference between them on the management server side. (The primary/backup role is determined by keepalived running inside the VRs.)
can you try the steps below
@weizhouapache Thanks for the comment. They didn't help.
I've updated the issue description with more information. Any help appreciated 🙏
The primary router keeps working because it is still the one created under 4.17. A restart with cleanup first tries to recreate the backup router, which never gets created.
When you delete both primary and backup and then restart with cleanup, no router is created at all.
@kriegsmanj Oh, I thought the primary router had been recreated successfully ... This seems to be a critical issue.
As @soreana said, you created another VPC with the same configuration and it works, right?
@weizhouapache
We did several tests:
We created another VPC with the same configuration on the same platform: it also fails to recreate VR when restarting with cleanup.
We created a VPC with the same configuration on a different CloudStack installation that has a lot less load on it: we could not reproduce the issue there. However, we noticed that it takes approximately 20 minutes before the VR instance even appears in the hypervisor's "virsh list" output. Once it appeared on the hypervisor, it reported as online in CloudStack within a couple of minutes.
We tried reproducing this behavior on 4.17: 4.17 does not seem to be affected. The same VPC configuration fully restarts in 9 minutes.
Do you have any suggestion?
@soreana @kriegsmanj
I created a VPC with 12 tiers (each tier has a VM), and restart with cleanup worked well. There is no such long time interval between these two log lines:
2023-10-09 11:04:49,654 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Creating monitoring services on VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start...
2023-10-09 11:12:18,234 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Reapplying dhcp entries as a part of domR VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start...
I did not see any code changes between 4.17 and 4.18 which might cause the issue.
@soreana @kriegsmanj any update on this issue ?
We have tried to reproduce the bug multiple times under different circumstances. Unfortunately, there is no reliable reproduction scenario so far. Here are our findings:
We continue our investigation.
@weizhouapache We have been performing more tests and now suspect that an ACL with multiple rules could be the cause of this delay in deploying routers. A VPC with 6 tiers and 7 custom ACL rules took more than 30 minutes to create, while the same VPC with the default ACL rule took about 15 minutes (roughly half the time).
Can you please explain the process of applying ACL rules within a VPC and where it takes so long? Also, how could we speed this process up, to verify whether it helps?
Thanks in advance!
@soreana @kriegsmanj
can you search "vr_cfg.sh" in the management-server.log and agent.log ?
Is your issue the same as #8201?
@weizhouapache Sorry about the delay, it was a busy day.
I checked for vr_cfg.sh in the management server and agent logs. There was nothing in the management server log, and only three agents have vr_cfg.sh in their logs.
node1:
2023-11-08 13:49:12,258 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.1.144 -c /var/cache/cloud/VR-10c9e64e-6ad9-4d9b-8d24-fe1496b59146.cfg
2023-11-08 13:49:15,310 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing script in VR: vr_cfg.sh
2023-11-08 13:49:15,488 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.1.144 -c /var/cache/cloud/VR-e7dfd01c-8c30-43be-bb1d-38229c8c1edf.cfg
2023-11-08 13:49:18,352 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing script in VR: vr_cfg.sh
2023-11-08 13:49:18,524 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.1.144 -c /var/cache/cloud/VR-5d36d38c-a1ce-49af-9e01-15d4f19054fa.cfg
2023-11-08 13:49:21,423 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing script in VR: vr_cfg.sh
2023-11-08 13:49:21,601 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.1.144 -c /var/cache/cloud/VR-67f79b72-a3ed-4d62-a6e3-919a7f3dfd92.cfg
2023-11-08 13:49:24,574 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing script in VR: vr_cfg.sh
2023-11-08 13:49:24,760 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.1.144 -c /var/cache/cloud/VR-a3318f66-9666-41e3-91b9-3ede5ab05962.cfg
2023-11-08 13:49:27,821 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing script in VR: vr_cfg.sh
2023-11-08 13:49:28,005 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.1.144 -c /var/cache/cloud/VR-b1fd4fbd-c7db-4970-b17a-9eb374543fa9.cfg
2023-11-08 13:49:31,281 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing script in VR: vr_cfg.sh
2023-11-08 13:49:31,450 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.1.144 -c /var/cache/cloud/VR-0fcabd27-eeee-4061-aeac-663daf9775f0.cfg
2023-11-08 13:49:34,570 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing script in VR: vr_cfg.sh
2023-11-08 13:49:34,776 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.1.144 -c /var/cache/cloud/VR-216b0865-9cd1-45d3-a313-650e7d96d299.cfg
2023-11-08 13:49:38,086 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing script in VR: vr_cfg.sh
node2:
2023-11-08 14:25:55,894 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.3.248 -c /var/cache/cloud/VR-b5d6ddb5-b5e8-4366-b8d3-ab4772c22b66.cfg
2023-11-08 14:25:58,480 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing script in VR: vr_cfg.sh
2023-11-08 14:25:58,670 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.3.248 -c /var/cache/cloud/VR-fc6ccf8d-1d74-4766-9a96-46308e1eada7.cfg
2023-11-08 14:26:01,337 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing script in VR: vr_cfg.sh
2023-11-08 14:26:01,507 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.3.248 -c /var/cache/cloud/VR-3acdb46e-5d06-4744-b324-8c18bc407a74.cfg
2023-11-08 14:26:04,199 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing script in VR: vr_cfg.sh
2023-11-08 14:26:04,373 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.3.248 -c /var/cache/cloud/VR-75c0d306-2a39-44a5-95d7-bcd207433203.cfg
2023-11-08 14:26:07,147 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing script in VR: vr_cfg.sh
2023-11-08 14:26:07,315 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.3.248 -c /var/cache/cloud/VR-09efdd5f-60fe-4ac1-b588-43ccb8ac8a98.cfg
2023-11-08 14:26:10,067 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing script in VR: vr_cfg.sh
2023-11-08 14:26:10,233 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.3.248 -c /var/cache/cloud/VR-927f2fe4-389a-4275-91e3-17ab8c569861.cfg
2023-11-08 14:26:13,000 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing script in VR: vr_cfg.sh
2023-11-08 14:26:13,170 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.3.248 -c /var/cache/cloud/VR-32c655e4-44f8-4e90-9a27-218b07f76c17.cfg
2023-11-08 14:26:15,955 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing script in VR: vr_cfg.sh
2023-11-08 14:26:16,126 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.3.248 -c /var/cache/cloud/VR-6d6d1c86-235e-4c5c-a0d1-6cd1d3f54852.cfg
2023-11-08 14:26:18,982 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing script in VR: vr_cfg.sh
node3:
2023-11-08 11:02:13,577 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-1:null) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.3.155 -c /var/cache/cloud/VR-eaaf92a8-5a49-4c58-bebf-ffae9cefa301.cfg
2023-11-08 11:02:22,879 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-1:null) Executing script in VR: vr_cfg.sh
2023-11-08 11:02:30,521 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.3.155 -c /var/cache/cloud/VR-d3be0b78-a04d-4437-89f2-40a83650edd0.cfg
2023-11-08 11:02:38,899 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing script in VR: vr_cfg.sh
The IPs are as follows:
- 169.254.1.144: primary router IP of VPC-01
- 169.254.3.248: backup router IP of VPC-01
- 169.254.3.155: router IP of an isolated network
@weizhouapache
An update on this issue: this bug only occurs for VPCs on large platforms "with history", i.e. platforms that have accumulated a lot of VMs, NICs, etc., both live and removed.
The mechanics of the process:
With some trusty s_logger.debug() calls I was able to narrow the VR start procedure down to a specific method that introduces the big delay: getRouterHealthChecksConfig().
More specifically, it's this call inside the method that introduces the delay: userVmJoinDao.search(scvm, null). It is executed in a loop, once per virtual router NIC. Each call to search() is expensive, so getRouterHealthChecksConfig() takes 10-20 seconds to complete.
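The costly pattern can be illustrated with a minimal, self-contained sketch (the class, counter, and method bodies below are stand-ins I made up for illustration, not the real CloudStack code; only the shape of the loop matches):

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in illustrating why getRouterHealthChecksConfig() is slow:
// an expensive lookup is repeated once per router NIC.
public class PerNicSearchSketch {
    static int searchCalls = 0;

    // Stand-in for userVmJoinDao.search(scvm, null): on large platforms
    // each call is a heavy database query.
    static List<String> search(long nicId) {
        searchCalls++;
        return new ArrayList<>();
    }

    static void getRouterHealthChecksConfig(int nicCount) {
        // One expensive search per NIC: with N NICs and seconds per query,
        // the whole method ends up taking on the order of 10-20 seconds.
        for (long nicId = 0; nicId < nicCount; nicId++) {
            search(nicId);
        }
    }

    public static void main(String[] args) {
        getRouterHealthChecksConfig(12);
        System.out.println(searchCalls); // prints "12"
    }
}
```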
But there is more: getRouterHealthChecksConfig() is executed for each VPC tier as part of the createMonitorServiceCommand() call during the VR startup process.
createMonitorServiceCommand() is in turn called from the finalizeMonitorService() method.
And finalizeMonitorService() is executed twice here: 1, 2
So eventually getRouterHealthChecksConfig() is executed more than a hundred times during VR startup, and each execution adds its 10-20 seconds to the process.
What do I propose: getRouterHealthChecksConfig() seems to return the same result per VR object. That means it could be executed once at the beginning of VR initialisation instead of deep inside the foreach loops. But my Java knowledge unfortunately isn't good enough to actually implement it.
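The proposed hoisting can be sketched as follows (again a simplified, hypothetical model with made-up names and counters; the real patch threads the config through finalizeMonitorService() and createMonitorServiceCommand()):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the proposed fix: compute the health-check config once per
// VR start and pass it down the call stack, instead of recomputing it
// inside the per-tier loop.
public class HealthCheckConfigSketch {
    static int expensiveCalls = 0;

    // Stand-in for getRouterHealthChecksConfig(router): in CloudStack
    // this runs several expensive DAO searches per router NIC.
    static Map<String, String> getRouterHealthChecksConfig(long routerId) {
        expensiveCalls++;
        Map<String, String> data = new HashMap<>();
        data.put("router", Long.toString(routerId));
        return data;
    }

    // Before: each tier recomputes the config.
    static void finalizeAllTiersBefore(long routerId, int tiers) {
        for (int i = 0; i < tiers; i++) {
            finalizeMonitorService(getRouterHealthChecksConfig(routerId));
        }
    }

    // After: compute once, pass the result down as an argument.
    static void finalizeAllTiersAfter(long routerId, int tiers) {
        Map<String, String> cfg = getRouterHealthChecksConfig(routerId);
        for (int i = 0; i < tiers; i++) {
            finalizeMonitorService(cfg);
        }
    }

    static void finalizeMonitorService(Map<String, String> cfg) {
        // uses the precomputed config; no database access here
    }

    public static void main(String[] args) {
        finalizeAllTiersBefore(1L, 10);
        int before = expensiveCalls;
        expensiveCalls = 0;
        finalizeAllTiersAfter(1L, 10);
        System.out.println(before + "," + expensiveCalls); // prints "10,1"
    }
}
```

With 10 tiers, the expensive call drops from 10 executions to 1, which is the same effect the actual patch below achieves for the VR startup path.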
@phsm Great, thanks for the update
maybe @soreana can pick this up ?
@weizhouapache Hi Wei,
I've actually managed to implement the patch myself.
The test looks promising: without this patch, the 10-tier VPC took 61 minutes to fully restart; with the patch applied, the same VPC took 9 minutes.
Can you take a look at it? If it looks good to you, I can make a PR with it.
diff --git a/server/src/main/java/com/cloud/network/router/VirtualNetworkApplianceManagerImpl.java b/server/src/main/java/com/cloud/network/router/VirtualNetworkApplianceManagerImpl.java
index f7bfb1c4af..3d236c0a13 100644
--- a/server/src/main/java/com/cloud/network/router/VirtualNetworkApplianceManagerImpl.java
+++ b/server/src/main/java/com/cloud/network/router/VirtualNetworkApplianceManagerImpl.java
@@ -1623,7 +1623,7 @@ Configurable, StateListener<VirtualMachine.State, VirtualMachine.Event, VirtualM
}
private SetMonitorServiceCommand createMonitorServiceCommand(DomainRouterVO router, List<MonitorServiceTO> services,
- boolean reconfigure, boolean deleteFromProcessedCache) {
+ boolean reconfigure, boolean deleteFromProcessedCache, Map<String, String> routerHealthCheckConfig) {
final SetMonitorServiceCommand command = new SetMonitorServiceCommand(services);
command.setAccessDetail(NetworkElementCommand.ROUTER_IP, _routerControlHelper.getRouterControlIp(router.getId()));
command.setAccessDetail(NetworkElementCommand.ROUTER_NAME, router.getInstanceName());
@@ -1641,7 +1641,7 @@ Configurable, StateListener<VirtualMachine.State, VirtualMachine.Event, VirtualM
}
command.setAccessDetail(SetMonitorServiceCommand.ROUTER_HEALTH_CHECKS_EXCLUDED, excludedTests);
- command.setHealthChecksConfig(getRouterHealthChecksConfig(router));
+ command.setHealthChecksConfig(routerHealthCheckConfig);
command.setReconfigureAfterUpdate(reconfigure);
command.setDeleteFromProcessedCache(deleteFromProcessedCache); // As part of updating
return command;
@@ -1666,7 +1666,7 @@ Configurable, StateListener<VirtualMachine.State, VirtualMachine.Event, VirtualM
s_logger.info("Updating data for router health checks for router " + router.getUuid());
Answer origAnswer = null;
try {
- SetMonitorServiceCommand command = createMonitorServiceCommand(router, null, true, true);
+ SetMonitorServiceCommand command = createMonitorServiceCommand(router, null, true, true, getRouterHealthChecksConfig(router));
origAnswer = _agentMgr.easySend(router.getHostId(), command);
} catch (final Exception e) {
s_logger.error("Error while sending update data for health check to router: " + router.getInstanceName(), e);
@@ -1891,7 +1891,7 @@ Configurable, StateListener<VirtualMachine.State, VirtualMachine.Event, VirtualM
.append(generateKeyValuePairOrEmptyString("server.maxqueue", serverMaxqueue));
}
- private Map<String, String> getRouterHealthChecksConfig(final DomainRouterVO router) {
+ protected Map<String, String> getRouterHealthChecksConfig(final DomainRouterVO router) {
Map<String, String> data = new HashMap<>();
List<DomainRouterJoinVO> routerJoinVOs = domainRouterJoinDao.searchByIds(router.getId());
StringBuilder vmsData = new StringBuilder();
@@ -2464,7 +2464,7 @@ Configurable, StateListener<VirtualMachine.State, VirtualMachine.Event, VirtualM
if (reprogramGuestNtwks) {
finalizeIpAssocForNetwork(cmds, router, provider, guestNetworkId, null);
finalizeNetworkRulesForNetwork(cmds, router, provider, guestNetworkId);
- finalizeMonitorService(cmds, profile, router, provider, guestNetworkId, true);
+ finalizeMonitorService(cmds, profile, router, provider, guestNetworkId, true, getRouterHealthChecksConfig(router));
}
finalizeUserDataAndDhcpOnStart(cmds, router, provider, guestNetworkId);
@@ -2478,7 +2478,7 @@ Configurable, StateListener<VirtualMachine.State, VirtualMachine.Event, VirtualM
}
protected void finalizeMonitorService(final Commands cmds, final VirtualMachineProfile profile, final DomainRouterVO router, final Provider provider,
- final long networkId, boolean onStart) {
+ final long networkId, boolean onStart, Map<String, String> routerHealthCheckConfig) {
final NetworkOffering offering = _networkOfferingDao.findById(_networkDao.findById(networkId).getNetworkOfferingId());
if (offering.isRedundantRouter()) {
// service monitoring is currently not added in RVR
@@ -2528,7 +2528,7 @@ Configurable, StateListener<VirtualMachine.State, VirtualMachine.Event, VirtualM
}
// As part of aggregate command we don't need to reconfigure if onStart and persist in processed cache. Subsequent updates are not needed.
- SetMonitorServiceCommand command = createMonitorServiceCommand(router, servicesTO, !onStart, false);
+ SetMonitorServiceCommand command = createMonitorServiceCommand(router, servicesTO, !onStart, false, routerHealthCheckConfig);
command.setAccessDetail(NetworkElementCommand.ROUTER_GUEST_IP, _routerControlHelper.getRouterIpInNetwork(networkId, router.getId()));
if (!isMonitoringServicesEnabled) {
command.setAccessDetail(SetMonitorServiceCommand.ROUTER_MONITORING_ENABLED, isMonitoringServicesEnabled.toString());
diff --git a/server/src/main/java/com/cloud/network/router/VpcVirtualNetworkApplianceManagerImpl.java b/server/src/main/java/com/cloud/network/router/VpcVirtualNetworkApplianceManagerImpl.java
index 18801eb01f..27958126e7 100644
--- a/server/src/main/java/com/cloud/network/router/VpcVirtualNetworkApplianceManagerImpl.java
+++ b/server/src/main/java/com/cloud/network/router/VpcVirtualNetworkApplianceManagerImpl.java
@@ -482,8 +482,9 @@ public class VpcVirtualNetworkApplianceManagerImpl extends VirtualNetworkApplian
throw new CloudRuntimeException("Cannot find related provider of virtual router provider: " + vrProvider.getType().toString());
}
+ Map<String, String> routerHealthCheckConfig = getRouterHealthChecksConfig(domainRouterVO);
if (reprogramGuestNtwks && publicNics.size() > 0) {
- finalizeMonitorService(cmds, profile, domainRouterVO, provider, publicNics.get(0).second().getId(), true);
+ finalizeMonitorService(cmds, profile, domainRouterVO, provider, publicNics.get(0).second().getId(), true, routerHealthCheckConfig);
}
for (final Pair<Nic, Network> nicNtwk : guestNics) {
@@ -495,7 +496,7 @@ public class VpcVirtualNetworkApplianceManagerImpl extends VirtualNetworkApplian
if (reprogramGuestNtwks) {
finalizeIpAssocForNetwork(cmds, domainRouterVO, provider, guestNetworkId, vlanMacAddress);
finalizeNetworkRulesForNetwork(cmds, domainRouterVO, provider, guestNetworkId);
- finalizeMonitorService(cmds, profile, domainRouterVO, provider, guestNetworkId, true);
+ finalizeMonitorService(cmds, profile, domainRouterVO, provider, guestNetworkId, true, routerHealthCheckConfig);
}
finalizeUserDataAndDhcpOnStart(cmds, domainRouterVO, provider, guestNetworkId);
@@ -554,7 +555,7 @@ public class VpcVirtualNetworkApplianceManagerImpl extends VirtualNetworkApplian
finalizeNetworkRulesForNetwork(cmds, router, provider, networkId);
}
- finalizeMonitorService(cmds, getVirtualMachineProfile(router), router, provider, networkId, false);
+ finalizeMonitorService(cmds, getVirtualMachineProfile(router), router, provider, networkId, false, getRouterHealthChecksConfig(router));
return _nwHelper.sendCommandsToRouter(router, cmds);
}
A summary of the changes:
- Changed getRouterHealthChecksConfig() to protected, as VpcVirtualNetworkApplianceManagerImpl now calls it.
- Moved the getRouterHealthChecksConfig() call up the execution stack, into finalizeCommandsOnStart(); the health-check config is now passed down the stack as a method argument.
- Adjusted the other methods that use getRouterHealthChecksConfig().
@phsm looks good. can you create a PR ?
fixed by #8385
ISSUE TYPE
COMPONENT NAME
CLOUDSTACK VERSION
CONFIGURATION
Advance zone with isolated network
SUMMARY
One of our CloudStack environments was upgraded from 4.17.2 to 4.18.1 last week. Then we restarted a big VPC with 11 tiers and ACLs that have 8 rules in them. After restarting that VPC, we noticed the following behavior. We checked these possibilities; they weren't the cause:
STEPS TO REPRODUCE
EXPECTED RESULTS
ACTUAL RESULTS
ADDITIONAL INFORMATION
We noticed that the hypervisor was never even asked to provision the actual libvirt domain of the VR. It looks like the management server holds this job until it gets cancelled by some timeout.
Watching the management server log, grepping by the VR instance name, shows the following behavior: