apache / cloudstack

Apache CloudStack is an open source Infrastructure as a Service (IaaS) cloud computing platform
https://cloudstack.apache.org/
Apache License 2.0

Autoscale instance creation issue #9503

Open hari1822 opened 1 month ago

hari1822 commented 1 month ago
ISSUE TYPE
COMPONENT NAME
AutoScale Instance Groups
CLOUDSTACK VERSION
CloudStack 4.19.1.0
CONFIGURATION
OS / ENVIRONMENT
SUMMARY

When creating an autoscale group with the maximum number of instances set to 3, CloudStack creates more instances than that limit. The created instances end up in the Stopped state, and some in the Error state. Nearly 1000 instances were created. The issue occurs when the created instances are in the Stopped or Error state.

STEPS TO REPRODUCE
Create the autoscale policy with the following settings:
Public port: 20
Private port: 20
Min members: 2
Max members: 3
Available Instances: 1
Polling interval (in sec): 60
Expunge Instance grace period (in sec): 60
Scaleup Policy:
Counter: VM CPU - average percentage
Operator: Greater than
Threshold: 40
ScaleDown policy:
Counter: VM CPU - average percentage
Operator: Less than
Threshold: 40
EXPECTED RESULTS
The number of instances created should be limited to the given maximum value, even if the instances are not in the Running state.
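
The expected behaviour can be written as a small guard. This is a minimal sketch for illustration only, not CloudStack's actual implementation; `may_scale_up` and the state strings are assumptions. The point is that every member of the group counts against the configured maximum, whatever state it is in.

```python
from typing import Iterable


def may_scale_up(member_states: Iterable[str], max_members: int) -> bool:
    """Return True only if one more instance still fits within max_members.

    Hypothetical helper: every existing member counts toward the limit,
    including instances that are Stopped or in Error.
    """
    current = sum(1 for _ in member_states)
    return current < max_members


# With max members set to 3, as in the steps to reproduce:
print(may_scale_up(["Running", "Stopped"], 3))           # True: only 2 members exist
print(may_scale_up(["Running", "Stopped", "Error"], 3))  # False: limit already reached
```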
ACTUAL RESULTS
Instances keep being created beyond the given maximum value.
The disable option also does not work for the created autoscale policy.
![Screen Shot 2024-08-08 at 15 08 27-fullpage](https://github.com/user-attachments/assets/0d32bf50-8f41-4756-982e-71b508533f41)
![Screen Shot 2024-08-08 at 15 07 10-fullpage](https://github.com/user-attachments/assets/ca26bcd2-4595-4e36-a14f-e58a1475ef20)
![Screen Shot 2024-08-08 at 15 06 35-fullpage](https://github.com/user-attachments/assets/b4ebb35f-90c9-452d-bb79-1e8375144800)
boring-cyborg[bot] commented 1 month ago

Thanks for opening your first issue here! Be sure to follow the issue template!

btzq commented 1 month ago

Hey @hari1822, my team faced this issue before, resulting in 30K Autoscale VMs being created.

Our finding was that this happens when the VM cannot be created successfully from the VM template. But CloudStack tries really hard to start up a VM, so it keeps retrying in a loop, forever.

I created a ticket here reporting the issue, suggesting better handling to avoid infinite loops: https://github.com/apache/cloudstack/issues/9318
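
One way to picture the kind of handling suggested there is a cap on failed deployment attempts. The sketch below is only an illustration under stated assumptions, not code from CloudStack or from the linked ticket; `deploy_with_cap`, `deploy_once`, and `max_failures` are hypothetical names.

```python
from typing import Callable


def deploy_with_cap(deploy_once: Callable[[], bool], max_failures: int = 3) -> bool:
    """Try to deploy one instance, giving up after max_failures failed attempts.

    deploy_once is a hypothetical callable that returns True on success.
    Returning False tells the caller to stop scaling instead of retrying forever.
    """
    for _ in range(max_failures):
        if deploy_once():
            return True
    return False


# Simulate a template that never boots: the loop stops after 3 attempts.
attempts = []


def broken_template() -> bool:
    attempts.append(1)
    return False


print(deploy_with_cap(broken_template), len(attempts))  # False 3
```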

Anyways, in our case, what was causing the issue was either:

We were using Linstor as the SDS Storage, but Linbit managed to resolve the issue for us and we were able to create Autoscale VMs without issues ever since.

What storage are you using?

hari1822 commented 1 month ago

When this problem arises, we are unable to delete or disable the auto scale group, and the scaling occurs within the given interval.

btzq commented 1 month ago

@hari1822, okay, that's new for us.

When we encountered this issue, we were able to disable the Autoscale Group.

We then did one of the 2 options:

But note, in Option 1, we encountered the UI crashing a few times and the DB going to 100%.

In Option 2, we felt it was okay because the VM itself was not created yet, just a record of the attempt.

I have 2 Questions

hari1822 commented 1 month ago

NFS is used for storage.

While trying to delete the Autoscale Group, we get: "Failed to remove the load balancer rule". If we try to delete the load balancer rule, we get: "Unable to remove the loadbalancer rule".

2024-08-07 12:39:48,283 DEBUG [o.a.c.n.t.BasicNetworkTopology] (API-Job-Executor-23:ctx-7e771bbf job-7704 ctx-cd0c46cb) (logid:4ed65a97) Router r-817-VM is in Stopped, so not sending apply ip association commands to the backend
2024-08-07 12:39:48,292 DEBUG [o.a.c.n.t.BasicNetworkTopology] (API-Job-Executor-23:ctx-7e771bbf job-7704 ctx-cd0c46cb) (logid:4ed65a97) APPLYING LOAD BALANCING RULES
2024-08-07 12:39:48,293 DEBUG [o.a.c.n.t.BasicNetworkTopology] (API-Job-Executor-23:ctx-7e771bbf job-7704 ctx-cd0c46cb) (logid:4ed65a97) Router r-817-VM is in Stopped, so not sending apply loadbalancing rules commands to the backend
2024-08-08 00:03:19,922 DEBUG [c.c.h.x.r.CitrixResourceBase] (DirectAgent-139:ctx-669704e5) (logid:6bee0f59) Trying to connect to 169.254.233.95 attempt 68 of 100
2024-08-08 00:03:20,817 ERROR [c.c.u.s.SshHelper] (DirectAgent-298:ctx-a647b11e) (logid:cd57d95b) SSH execution of command /opt/cloud/bin/router_proxy.sh update_config.py 169.254.0.49 vm_dhcp_entry.json.48497187-69e7-48cd-b6a7-f3214d3d0015 has an error status code in return. Result output:
2024-08-08 00:03:20,818 DEBUG [c.c.a.r.v.VirtualRoutingResource] (DirectAgent-298:ctx-a647b11e) (logid:cd57d95b) Processing ScriptConfigItem, executing update_config.py vm_dhcp_entry.json.48497187-69e7-48cd-b6a7-f3214d3d0015 took 7114ms
2024-08-08 00:03:20,818 DEBUG [c.c.a.m.DirectAgentAttache] (DirectAgent-298:ctx-a647b11e) (logid:cd57d95b) Seq 1-4266597696980161048: Response Received:
2024-08-08 00:03:20,818 DEBUG [c.c.a.t.Request] (DirectAgent-298:ctx-a647b11e) (logid:cd57d95b) Seq 1-4266597696980161048: Processing:  { Ans: , MgmtId: 275890944841813, via: 1(wolfapp2-xen), Ver: v1, Flags: 10, [{"com.cloud.agent.api.routing.GroupAnswer":{"results":["null - failed: ","null - failed: "],"result":"false","wait":"0","bypassHostMaintenance":"false"}}] }
2024-08-08 00:03:20,818 DEBUG [c.c.a.t.Request] (Work-Job-Executor-112:ctx-d48372fe job-7705/job-8169 ctx-772376bc) (logid:cd57d95b) Seq 1-4266597696980161048: Received:  { Ans: , MgmtId: 275890944841813, via: 1(wolfapp2-xen), Ver: v1, Flags: 10, { GroupAnswer } }
2024-08-08 00:03:20,818 WARN  [c.c.v.VirtualMachineManagerImpl] (Work-Job-Executor-112:ctx-d48372fe job-7705/job-8169 ctx-772376bc) (logid:cd57d95b) Unable to contact resource.
com.cloud.exception.ResourceUnavailableException: Resource [DataCenter:1] is unreachable: Unable to apply dhcp entry on router
btzq commented 1 month ago

@hari1822 I see.

I'm not any good at reading logs, but it looks like your Virtual Router is stopped? If so, I think that's a bigger issue you should look into first.
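
For anyone scanning a longer management-server log for the same symptom, here is a minimal sketch that pulls out the routers reported as stopped. It assumes the message format seen in the log excerpt in this thread ("Router r-817-VM is in Stopped, ..."); the function name `stopped_routers` is made up for illustration.

```python
import re

# Matches messages such as "Router r-817-VM is in Stopped, so not sending ..."
STOPPED_ROUTER = re.compile(r"Router (r-\d+-VM) is in Stopped")


def stopped_routers(log_lines):
    """Return the sorted, de-duplicated router names reported as Stopped."""
    found = set()
    for line in log_lines:
        match = STOPPED_ROUTER.search(line)
        if match:
            found.add(match.group(1))
    return sorted(found)


# Message parts only (timestamps and thread info omitted for brevity).
sample = [
    "Router r-817-VM is in Stopped, so not sending apply ip association commands to the backend",
    "APPLYING LOAD BALANCING RULES",
    "Router r-817-VM is in Stopped, so not sending apply loadbalancing rules commands to the backend",
]
print(stopped_routers(sample))  # ['r-817-VM']
```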

Do you have other VPCs and Virtual Routers working okay?

And for Autoscale, remember you need to disable the autoscale group first before being able to delete it.

If you try to delete an Autoscale Group that is still enabled, it will throw an error.

Only when an Autoscale Group is disabled can you delete it or make changes to the load balancer, etc.
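
The disable-then-delete order can be sketched as a small planning function. This is a hedged illustration, not an official client: `deletion_plan` is a hypothetical helper and the state strings are assumptions, while `disableAutoScaleVmGroup` and `deleteAutoScaleVmGroup` are the CloudStack API commands the two steps correspond to. It only builds the list of calls; it does not contact a management server.

```python
from typing import Dict, List


def deletion_plan(group_id: str, state: str) -> List[Dict[str, str]]:
    """Return the API calls, in order, needed to delete an autoscale group.

    state is assumed to be the group's current state as a string such as
    "enabled" or "disabled" (compared case-insensitively).
    """
    calls = []
    if state.lower() != "disabled":
        # An enabled group must be disabled before it can be deleted.
        calls.append({"command": "disableAutoScaleVmGroup", "id": group_id})
    calls.append({"command": "deleteAutoScaleVmGroup", "id": group_id})
    return calls


# "group-uuid" is a placeholder, not a real ID.
for call in deletion_plan("group-uuid", "Enabled"):
    print(call["command"])
```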

hari1822 commented 1 month ago

@btzq Will look into it