apache / cloudstack

Apache CloudStack is an open-source Infrastructure as a Service (IaaS) cloud computing platform
https://cloudstack.apache.org/
Apache License 2.0

No. of CPU Cores in Compute Cluster doesn't display correctly when Cluster has oversubscription more than 1 #9693

Open btzq opened 1 month ago

btzq commented 1 month ago
ISSUE TYPE
COMPONENT NAME
Compute Cluster UI
CLOUDSTACK VERSION
4.19.1.1
CONFIGURATION
OS / ENVIRONMENT
SUMMARY

We have 2 Compute Clusters:

Referring to the screenshots below, Cluster 1 shows the correct No. of CPU Cores allocated.

For Cluster 2, however, the figure looks inaccurate: the number allocated exceeds what is available, which results in a UI error.

Screenshot 2024-09-17 at 10 23 55 PM

Screenshot 2024-09-17 at 10 24 05 PM

The bigger problem is that both clusters have exceeded the allocation threshold, yet no notification was sent, and CloudStack did not stop users from creating new virtual machines in the clusters.

Without this, admins cannot ensure sufficient N+1 capacity in the event of a node failure.

Screenshot 2024-09-17 at 10 34 14 PM

In summary, there are 3 Issues:

STEPS TO REPRODUCE
EXPECTED RESULTS
- No. of Allocated CPU Cores should display the full available amount after oversubscription.
- Admin should get a notification when the global setting threshold (cluster.cpu.allocated.capacity.notificationthreshold) is exceeded.
- User should not be able to create resources in the cluster once the global setting threshold (cluster.cpu.allocated.capacity.disablethreshold) is exceeded.
ACTUAL RESULTS
- No. of Allocated CPU Cores does not display the total available cores correctly after oversubscription.
- Admin did not get a notification when the global setting threshold (cluster.cpu.allocated.capacity.notificationthreshold) was exceeded.
- User is still able to create resources in the cluster after the global setting threshold (cluster.cpu.allocated.capacity.disablethreshold) is exceeded.
weizhouapache commented 1 month ago

@btzq the # of CPU cores does not take the overprovisioning factor into consideration.

btzq commented 1 month ago

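@weizhouapache I see, is this expected?

And if it doesn't take overprovisioning into account, does this mean the global settings will not work as intended for clusters with overprovisioning?

  • (cluster.cpu.allocated.capacity.notificationthreshold)
  • (cluster.cpu.allocated.capacity.disablethreshold)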

weizhouapache commented 1 month ago

> @weizhouapache i see, is this expected?
>
> And if it doesnt take overprovisioning into account, does this mean the global settings will not work as intended for clusters with overprovisioning?
>
>   • (cluster.cpu.allocated.capacity.notificationthreshold)
>   • (cluster.cpu.allocated.capacity.disablethreshold)

@btzq The CPU capacity used in resource calculation and VM allocation is: CPU cores × CPU speed × overprovisioning factor. It does consider the overprovisioning factor, so there are no issues.
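For readers following along, the arithmetic described here (cores × speed × overprovisioning factor) can be sketched in a few lines of Python. This is not CloudStack code: the function names, the host figures, and the threshold values passed in are hypothetical and only illustrate how a capacity expressed in MHz, rather than a core count, would be compared against a notification and a disable threshold.

```python
def cluster_cpu_capacity_mhz(hosts, overprovisioning_factor):
    """Total CPU capacity in MHz: sum of cores * speed per host, times the factor."""
    raw_mhz = sum(cores * speed_mhz for cores, speed_mhz in hosts)
    return raw_mhz * overprovisioning_factor


def allocation_state(allocated_mhz, capacity_mhz, notify_threshold, disable_threshold):
    """Classify the allocated fraction against two thresholds (values in 0..1)."""
    fraction = allocated_mhz / capacity_mhz
    if fraction > disable_threshold:
        return fraction, "disable"
    if fraction > notify_threshold:
        return fraction, "notify"
    return fraction, "ok"


# Hypothetical cluster: 3 hosts, each with 64 cores at 2000 MHz, factor 2.0.
hosts = [(64, 2000)] * 3
capacity = cluster_cpu_capacity_mhz(hosts, 2.0)

# Hypothetical load: 500 vCPUs, each from an offering of 1000 MHz.
allocated = 500 * 1000

# Threshold values here are illustrative assumptions, not read from any deployment.
fraction, state = allocation_state(allocated, capacity, 0.75, 0.85)
print(f"{capacity:.0f} MHz total, {fraction:.1%} allocated, state={state}")
```

Note that in this sketch 500 vCPUs exceed the 384 oversubscribed cores, yet the MHz-based fraction stays below both thresholds because the offering speed is half the host speed. That is the same mismatch between the two figures that the thread is discussing.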

btzq commented 1 month ago

@weizhouapache Does this mean that:

  • (cluster.cpu.allocated.capacity.notificationthreshold)
  • (cluster.cpu.allocated.capacity.disablethreshold)

only trigger if the 'CPU' field (not # of CPU Cores) exceeds the threshold?

But not all CPUs are 2,000 MHz. The AMD 9554 is 3.1 GHz. And in a mixed cluster, it becomes even more complicated?

And when a node fails, how does CloudStack determine which remaining nodes the VM should fail over to? Is it based on 'CPU' or on '# of CPU Cores'?

weizhouapache commented 1 month ago

> @weizhouapache Does this mean that:
>
>   • (cluster.cpu.allocated.capacity.notificationthreshold)
>   • (cluster.cpu.allocated.capacity.disablethreshold)
>
> Only triggers if the 'CPU' Field (Not # of CPU Cores) exceed the threshold?
>
> But not all CPUs are 2,000Mhz. AMD 9554 is 3.1Ghz. And in the scenario of a mix cluster, it becomes even more complicated?
>
> And when a node fails, how does cloudstack determine which remaining nodes the VM should failover to? Is it based on 'CPU'? Or '# of CPU Cores'?

All operations are based on "CPU". A host with a faster CPU (in MHz) is considered to have more CPU resources than a host with a slower CPU.

The # of CPU Cores is only returned in the listCapacity response and displayed on the dashboard. That's all. It was introduced in commit 088cca2b. Nothing will change even if we remove the capacity type CAPACITY_TYPE_CPU_CORE and the related code.
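As an illustration of why a mixed cluster looks different in MHz terms than in core terms, here is a small Python sketch. It is only an assumption-laden model, not how CloudStack's allocators are implemented: the `Host` class, the host names, the overprovisioning factor, and the `survives_single_failure` helper are all hypothetical. It shows one way an operator might reason about N+1 headroom when capacity is counted in MHz.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    cores: int
    speed_mhz: int

    def capacity_mhz(self, factor: float) -> float:
        """Capacity in MHz: cores * speed * overprovisioning factor."""
        return self.cores * self.speed_mhz * factor


def survives_single_failure(hosts, allocated_mhz, factor):
    """True if the allocation still fits after losing the largest host."""
    capacities = [h.capacity_mhz(factor) for h in hosts]
    return allocated_mhz <= sum(capacities) - max(capacities)


# Hypothetical mixed cluster: equal core counts, different clock speeds.
hosts = [
    Host("host-a", 64, 2000),
    Host("host-b", 64, 3100),
    Host("host-c", 64, 2000),
]
factor = 2.0

for h in hosts:
    print(h.name, h.cores, "cores ->", h.capacity_mhz(factor), "MHz")

print("N+1 ok at 500000 MHz:", survives_single_failure(hosts, 500_000, factor))
print("N+1 ok at 520000 MHz:", survives_single_failure(hosts, 520_000, factor))
```

All three hosts report the same number of cores, but the 3100 MHz host contributes noticeably more capacity, so losing it removes more headroom than losing either of the others.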

btzq commented 1 month ago

@weizhouapache I went through this explanation:

https://github.com/apache/cloudstack/issues/6743

In this case, would it make sense for us to set all CPUs to 1.0 GHz? This would mean all instances have the same share.

That way, the 'CPU' field will display the same value as the number of cores allocated and remaining.

We just have to make sure that 'CPU Cap' in the Compute Offering is disabled? That way, the number of remaining CPUs is the same as the # of Cores left, and there will be no change to guest VM performance?

weizhouapache commented 1 month ago

> @weizhouapache I went through this explanation:
>
> https://github.com/apache/cloudstack/issues/6743
>
> In this case, would it make sense for us to set 1.0Ghz all CPU? This would mean all instance have the same share.

Yes, I think so. Actually, I think we should have a global setting to indicate whether the CPU speed must be the same, and the value of the CPU speed. In many use cases, the CPU speed is totally useless.

> That way, the 'CPU' field will display the same value as the number of core allocated and remaining.
>
> We just have to make sure that 'CPU Cap' in the Compute Offering is disabled? That way, the number of remaining CPU is the same as the # of Cores left, and there will be no change to the guest VM performances?

btzq commented 1 month ago

Hey @weizhouapache, I've checked and it seems the GHz metric comes from the individual host.

Currently, our hosts are all set to 2.00 GHz.

How do we change it to 1.00 GHz? Do we have to change it manually in the DB, or is there a way to do it via the API/UI or similar?

Screenshot 2024-09-30 at 11 12 32 AM

weizhouapache commented 1 month ago

> Hey @weizhouapache , ive checked and it seems the GHZ metric comes from the individual host.
>
> Currently, our hosts are all set to 2.00Ghz.
>
> How do we change it to 1.00Ghz? Do we have to manually change in the DB? or is there a way to do it via API/UI or such?

@btzq The CPU/RAM values are reported to the management server and updated in the DB each time cloudstack-agent is restarted on the KVM hosts. You can update the DB, but you would have to rerun the update every few minutes.

If all your hosts have the same CPU speed (in MHz, no matter what the value is), I think the change is unnecessary.

btzq commented 1 month ago

@weizhouapache I see. We just checked as well, and it seems the MHz value is reported by the KVM hypervisor. The lowest we can set it to is 1.5 GHz, which doesn't meet our requirements... We just want to accurately know the total Available and Allocated Cores (after Hyperthreading + Oversubscription Ratio).

We are going to try getting metrics via Prometheus into our Zabbix server, then set triggers there. Hopefully we can get the info required from each cluster to create the necessary triggers...

But this will only be a temporary solution. I think CloudStack should handle this natively so operators can always ensure N+1 for failovers/node failures.

btzq commented 4 weeks ago

Hi @weizhouapache ,

I have thought of another workaround but would like to check with you.

All of our Compute Offerings are currently set to 1000 MHz. Is it possible to change them to 2000 MHz?

If we could, the 'CPU Allocated' metric in each cluster would show the right %. We would just have to ignore the MHz displayed in the cluster.

This solution only works if the MHz of the physical servers within the cluster and the MHz in the Compute Offerings run on the cluster are the same.
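The condition in this workaround can be checked with a bit of arithmetic. The Python sketch below is illustrative only: the function names and the cluster figures are hypothetical, and it assumes each vCPU counts as one core and that every VM uses the same offering speed. It compares the allocated fraction measured in MHz with the allocated fraction measured in oversubscribed cores.

```python
def allocated_fraction_mhz(vcpus, offering_mhz, host_cores, host_mhz, factor):
    """Allocated share of cluster CPU measured in MHz."""
    return (vcpus * offering_mhz) / (host_cores * host_mhz * factor)


def allocated_fraction_cores(vcpus, host_cores, factor):
    """Allocated share of cluster CPU measured in oversubscribed cores."""
    return vcpus / (host_cores * factor)


# Hypothetical cluster: 192 physical cores at 2000 MHz, factor 2.0,
# with 300 vCPUs currently allocated.
cores, host_mhz, factor, vcpus = 192, 2000, 2.0, 300

by_cores = allocated_fraction_cores(vcpus, cores, factor)
matched = allocated_fraction_mhz(vcpus, 2000, cores, host_mhz, factor)
mismatched = allocated_fraction_mhz(vcpus, 1000, cores, host_mhz, factor)

print(f"by cores:          {by_cores:.2%}")
print(f"2000 MHz offering: {matched:.2%}")
print(f"1000 MHz offering: {mismatched:.2%}")
```

When the offering speed equals the host speed, the speed terms cancel and the MHz-based percentage equals the core-based one. With a 1000 MHz offering on 2000 MHz hosts, the MHz-based percentage is half the core-based figure, which is why the two numbers diverge in the setup described in this thread.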