Kubernetes Enhancement Proposal: Cluster Network Topology Standardization
Summary
This document proposes a standard for declaring cluster network topology in Kubernetes, representing the hierarchy of nodes, switches, and interconnects. In this context, a switch can refer to a physical network device or to a group of such devices that are located close together and serve a similar function.
Motivation
Understanding the cluster network topology is essential for optimizing the placement of workloads that require intensive inter-node communication. Currently, there is no standardized way to represent this information in Kubernetes, which makes it difficult to develop control plane components and applications that can leverage topology awareness.
This information might be useful for various components and features, including:
Pod affinity sections in deployment and pod specs
Kueue network-aware scheduling
Future development of native scheduler plugins for topology-aware scheduling
Cluster Topology Sources
Cluster topology information can be derived from various sources:
Provided directly by a Cloud Service Provider (CSP)
Extracted from a CSP using specialized tools like "topograph"
Manually set up by cluster administrators
A combination of the above methods to ensure comprehensive coverage
Proposal
We propose new node label and annotation types to capture network topology information:
Network Topology Label
Format:
network.topology.kubernetes.io/<nw-switch-type>: <switch-name>
<nw-switch-type>: Logical type of the network switch (can be one of the reserved names or a custom name)
Reserved names: accelerator, block, datacenter, zone
<switch-name>: Unique identifier for the switch
Network QoS Annotation
Format:
network.qos.kubernetes.io/switches: <QoS>
<QoS>: A JSON object where each key is a switch name (matching the network topology label) with a value containing:
distance: Numerical value representing the distance in hops from the node to the switch, required
latency: Link latency (e.g., 200 ms), optional
bandwidth: Link bandwidth (e.g., 100 Gbps), optional
This structure can be easily extended with additional network QoS metrics in the future.
Reserved Network Types
We have introduced reserved network types to better accommodate common network hierarchies. These reserved network types include the following predefined names and characteristics:
accelerator: Network interconnect for direct accelerator communication (e.g., Multi-node NVLink interconnect between NVIDIA GPUs)
block: Rack-level switches connecting hosts in one or more racks as a block.
datacenter: Spine-level switches connecting multiple blocks inside a datacenter.
zone: Zonal switches connecting multiple datacenters inside an availability zone.
When using reserved network types, Network QoS Annotations become optional. In the absence of these annotations, it is assumed that performance within each network layer is uniform.
The scheduler will prioritize switches in the order in which the reserved types are listed above (accelerator first, then block, datacenter, and zone), providing a standardized approach for network-aware scheduling across a range of configurations.
If provided, Network QoS Annotations can be used to refine and enhance the details of link performance, enabling more precise scheduling decisions.
Example of Network Topology Labels with reserved network types:
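A minimal sketch of how such labels might appear on a node, following the label format defined in this proposal. The node name and the switch names (nvl1, sw11, sw21, sw31) are hypothetical placeholders.

```yaml
# Hypothetical node; switch names are placeholders, not taken from a real cluster.
apiVersion: v1
kind: Node
metadata:
  name: node-1
  labels:
    # One label per reserved network type, from the closest layer to the widest.
    network.topology.kubernetes.io/accelerator: nvl1
    network.topology.kubernetes.io/block: sw11
    network.topology.kubernetes.io/datacenter: sw21
    network.topology.kubernetes.io/zone: sw31
```

Under this sketch, two nodes that carry the same value for the block label would sit under the same rack-level switch group.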
Example of Network QoS Annotations that complements the example above:
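A sketch of what the annotation might look like for a node whose topology labels name the hypothetical switches nvl1, sw11, sw21, and sw31. Kubernetes annotation values are strings, so the JSON object is carried as a string here. The proposal does not specify how latency and bandwidth values are encoded, so the quoted strings with units are an assumption, as are all the numbers.

```yaml
# Hypothetical values; each JSON key is a switch name used in the node's topology labels.
metadata:
  annotations:
    network.qos.kubernetes.io/switches: |
      {
        "nvl1": {"distance": 1, "latency": "2 us",   "bandwidth": "100 Gbps"},
        "sw11": {"distance": 2, "latency": "50 us",  "bandwidth": "40 Gbps"},
        "sw21": {"distance": 3, "latency": "500 us", "bandwidth": "20 Gbps"},
        "sw31": {"distance": 4}
      }
```

The last entry carries only distance, which is the one required field; latency and bandwidth are optional.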
Extensibility and Future-Proofing
This proposal is designed with extensibility in mind, enabling the use of custom network types. This ensures that the standard can adapt to future advancements in cluster networking without requiring significant overhauls.
For custom network types, Network QoS Annotations are required, with distance being the minimum mandatory metric. Specifying latency and bandwidth is optional, but including them can offer a more detailed view of link performance, enabling more efficient scheduling decisions.
Example of network topology with custom network types
Node Labels:
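As an illustration only, a node using custom network types might carry labels like the following. The type names area and center, and the switch names sw13 and sw22, are hypothetical and are not defined by this proposal.

```yaml
# Hypothetical custom network types ("area", "center") and switch names.
metadata:
  labels:
    network.topology.kubernetes.io/area: sw13
    network.topology.kubernetes.io/center: sw22
```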
Node Annotations:
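A corresponding sketch of the annotation for a node attached to the hypothetical custom-type switches sw13 and sw22. Because custom types have no predefined order, every switch entry carries the required distance field. The value encodings for latency and bandwidth are assumptions.

```yaml
# Hypothetical values; distance is required for custom network types.
metadata:
  annotations:
    network.qos.kubernetes.io/switches: |
      {
        "sw13": {"distance": 1, "latency": "50 us", "bandwidth": "40 Gbps"},
        "sw22": {"distance": 2}
      }
```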