Network interface speed

horazont commented 3 years ago

See SovereignCloudStack/Docs#98 for my definition of Prio 1-4.

Prio 3: As a Cloud Operator, I want to know when a network interface drops below the specified maximum speed on that link, as that indicates bad cabling, a bad driver, or another (hardware) issue and may cause customer impact (reduced performance).
Prio 2: As a Cloud Operator, I want to know when the network interfaces of a compute node are saturated, as that causes degradation of performance.
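
For illustration, the Prio 3 check could be sketched roughly as follows. This is a minimal sketch, not an agreed implementation: it assumes a Linux host where the negotiated speed is exposed in Mbit/s under `/sys/class/net/<iface>/speed`, and the function names and the `expected_mbps` parameter are hypothetical. An unknown speed (link down, virtual interface) is deliberately not reported as degraded, on the assumption that a down link is covered by a separate alert.

```python
from pathlib import Path
from typing import Optional


def read_link_speed_mbps(iface: str, sysfs_root: str = "/sys/class/net") -> Optional[int]:
    """Return the negotiated link speed in Mbit/s, or None if it is unknown.

    The kernel reports -1 (or fails the read) when the link is down or the
    driver does not expose a speed, so both cases map to None here.
    """
    try:
        raw = (Path(sysfs_root) / iface / "speed").read_text().strip()
        speed = int(raw)
    except (OSError, ValueError):
        return None
    return speed if speed > 0 else None


def link_is_degraded(actual_mbps: Optional[int], expected_mbps: int) -> bool:
    """True if the link negotiated a speed below the expected maximum.

    expected_mbps is a hypothetical per-interface setting supplied by the
    operator, e.g. 10000 for a 10 Gbit/s link.
    """
    return actual_mbps is not None and actual_mbps < expected_mbps
```

A 10 Gbit/s port that negotiated only 1 Gbit/s would then be flagged via `link_is_degraded(read_link_speed_mbps("eth0"), 10000)`, where `eth0` is just a placeholder interface name.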

garloff commented 3 years ago

On the second piece:

Isn't it normal for a NIC to be saturated from time to time? Even if you do some bandwidth management (e.g. with Linux HTB traffic control), you still allow the full bandwidth to a single VM (if the others are silent) or to the VMs together. So I would expect saturated NICs to be rather normal for a highly utilized compute host.
There is typically nothing an operator could do short-term, except maybe live-migrating VMs to less busy hosts. This might be a possibility if such an event happens very rarely. I still wonder whether you would want to be woken up during the night for this. And assuming I'm right that this is a fairly frequent thing, we'd either have automated initiation of live migrations or ignore this. (We could however think about a prio 3 alarm if the workload management system does not find a good target host... If a live migration failed, we need to check whether the VM has survived. If not, this would be a prio 1 thing.)
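
To make the "saturated from time to time" point concrete, here is a rough sketch of how short bursts could be told apart from sustained saturation. It is only an illustration under assumptions: utilization is derived from two samples of a byte counter (on Linux, for example, `/sys/class/net/<iface>/statistics/tx_bytes`), the function names are hypothetical, and the 90% threshold and the five-sample window are made-up defaults, not values proposed in this thread. Counter resets are not handled.

```python
from typing import Sequence


def utilization(bytes_before: int, bytes_after: int,
                interval_s: float, link_mbps: int) -> float:
    """Fraction of link capacity used between two byte-counter samples."""
    bits_sent = (bytes_after - bytes_before) * 8
    capacity_bits = interval_s * link_mbps * 1_000_000
    return bits_sent / capacity_bits


def sustained_saturation(utilizations: Sequence[float],
                         threshold: float = 0.9,
                         min_consecutive: int = 5) -> bool:
    """True only if utilization stays at or above the threshold for
    min_consecutive samples in a row, so brief bursts are ignored."""
    run = 0
    for u in utilizations:
        run = run + 1 if u >= threshold else 0
        if run >= min_consecutive:
            return True
    return False
```

With a check like this, a single busy sample would not raise anything, while a long run of saturated samples could feed a low-priority alert or an automated live-migration decision.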

horazont commented 3 years ago

You’re right that Prio 3 or 4 may be much more fitting for that one.
