antrea-io / antrea

Kubernetes networking based on Open vSwitch
https://antrea.io
Apache License 2.0

Support automatic failover for Egress #2128

Closed tnqn closed 3 years ago

tnqn commented 3 years ago

Describe what you are trying to solve

Currently the Egress feature requires that the Egress IP be assigned manually to an arbitrary interface of one Node, and users have to reassign the IP to another Node if the previous one becomes unavailable. This is not ideal for production, where automatic failover is desired. This proposal describes a solution that supports automatic failover for Egress.

Describe how your solution impacts user flows

When creating an Egress, users can specify its failover policy, which has two options: "None" and "Auto". With "None", the Egress works just like before. With "Auto", Antrea takes care of selecting a Node from the eligible Nodes and assigning the Egress's IP to that Node's network interface, and moves the IP to another Node if the current Node becomes unavailable.

Describe the main design/architecture of your solution

API change

Add a new field FailoverPolicy which allows users to specify whether the Egress IP should fail over to another Node when the Node that holds it becomes unavailable.

+type EgressFailoverPolicyType string
+
+const (
+       // EgressFailoverPolicyTypeNone disables failover for this Egress. The
+       // Egress IP must be assigned to an arbitrary interface of one Node, and be
+       // reassigned to another Node manually in case of Node failure.
+       EgressFailoverPolicyTypeNone EgressFailoverPolicyType = "None"
+       // EgressFailoverPolicyTypeAuto enables auto failover for this Egress. The
+       // Egress IP must not be assigned to any interface of any Node in advance.
+       EgressFailoverPolicyTypeAuto EgressFailoverPolicyType = "Auto"
+)
+
 // EgressSpec defines the desired state for Egress.
 type EgressSpec struct {
        // AppliedTo selects Pods to which the Egress will be applied.
        AppliedTo AppliedTo `json:"appliedTo"`
        // EgressIP specifies the SNAT IP address for the selected workloads.
        EgressIP string `json:"egressIP"`
+       // FailoverPolicy specifies if the Egress IP should failover to another Node
+       // when the Node that holds it becomes unavailable.
+       // The default value is "None".
+       FailoverPolicy *EgressFailoverPolicyType `json:"failoverPolicy"`
+}

The reason the proposal keeps the "None" failover policy is that there may be use cases such as:

Egress Node failure detection

To support automatic failover, we first need to detect Node failure. There are two kinds of solutions:

  1. Rely on the K8s control plane: watch Node status via the K8s API and treat a Node as failed once kube-controller-manager marks it not ready.
  2. Let antrea-agents detect each other's liveness directly, using a dedicated protocol between the agents.

In practice, the first approach typically takes around 40 seconds to detect a Node failure. It relies heavily on kube-controller-manager's node-monitor-grace-period option, which defaults to 40 seconds. Besides, it requires antrea-controller to be healthy so that it can observe the failure and initiate the failover, i.e. reassign the Egress IPs that were assigned to the failed Node to other Nodes.
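
For reference, a minimal sketch of how the first approach could decide Node health from the NodeReady condition (assuming the standard corev1 types; the helper name is illustrative):

import corev1 "k8s.io/api/core/v1"

// isNodeReady reports whether a Node's NodeReady condition is True. With the
// first approach, antrea-controller would treat a Node whose condition turns
// False or Unknown (only after node-monitor-grace-period) as failed and
// reassign its Egress IPs.
func isNodeReady(node *corev1.Node) bool {
    for _, c := range node.Status.Conditions {
        if c.Type == corev1.NodeReady {
            return c.Status == corev1.ConditionTrue
        }
    }
    return false
}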

The second approach relies on neither the K8s control plane nor antrea-controller. The antrea-agents talk to each other directly using a dedicated protocol. memberlist is a Go library that implements this based on a gossip protocol. Typically it can detect a Node failure within 3 seconds with the default configuration, and this can be tuned. The library is quite reusable; the code that creates and joins a cluster looks like below:

// The Node join/leave events will be notified via this channel.
nodeEventCh := make(chan memberlist.NodeEvent, 1024)

mconfig := memberlist.DefaultLANConfig()
mconfig.Name = nodeName
mconfig.BindAddr = "0.0.0.0"
mconfig.BindPort = 10351
mconfig.Events = &memberlist.ChannelEventDelegate{Ch: nodeEventCh}
// Create the local member ("ml" avoids shadowing the memberlist package).
ml, err := memberlist.Create(mconfig)
if err != nil {
    klog.Errorf("failed to create memberlist: %v", err)
    return nil, err
}

// Collect the addresses of all Nodes to join the cluster.
var nodeIPs []string
nodes, _ := nodeLister.List(labels.Everything())
for _, node := range nodes {
    nodeIP, err := noderoute.GetNodeAddr(node)
    if err != nil {
        klog.Errorf("failed to get address of Node %s: %v", node.Name, err)
        continue
    }
    nodeIPs = append(nodeIPs, nodeIP.String())
}

// Join the cluster; numSuccess is the number of members successfully contacted.
numSuccess, err := ml.Join(nodeIPs)
if err != nil {
    klog.Errorf("failed to join memberlist cluster: %v", err)
}
klog.Infof("Joined memberlist cluster by contacting %d Nodes", numSuccess)

memberlist is currently used by MetalLB and Consul as their member failure detection solution.

To provide faster failover and reduce the dependency on the control plane, we propose to use the second approach to implement failure detection.
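
For illustration, a minimal sketch of how an agent could consume the join/leave notifications delivered on nodeEventCh from the snippet above; reconcileEgresses is a hypothetical helper (not part of the proposal's API) that recomputes Egress ownership:

go func() {
    for e := range nodeEventCh {
        switch e.Event {
        case memberlist.NodeJoin:
            klog.Infof("Node %s joined the memberlist cluster", e.Node.Name)
        case memberlist.NodeLeave:
            klog.Infof("Node %s left the memberlist cluster", e.Node.Name)
        }
        // Recompute the owner Node of every failover-enabled Egress
        // (hypothetical helper).
        reconcileEgresses()
    }
}()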

Egress Node selection and IP assignment

Based on the above Node failure detection mechanism, each agent can get the list of healthy Egress Nodes. For each Egress, the agent calculates its owner Node in a deterministic way (hashing), and if the Node the agent runs on is the owner of the Egress, it assigns the Egress IP to its network interface. In theory, each Egress Node will own num_egresses/num_nodes Egresses on average.

When determining an Egress's owner Node, simply hashing the Egress and mapping it to a Node via a modulo operation would cause nearly all Egresses to be remapped when a Node joins or leaves the cluster, which would disrupt most of the cluster's Egress traffic. To avoid this, we propose to use consistent hashing for Node selection. In this way, only about num_egresses/num_nodes Egresses are remapped when a Node joins or leaves the cluster. One way of doing this is to follow how MetalLB determines whether a Node should announce a Service's ingress IP: the agent hashes the Egress together with each eligible Node, sorts the hashes, and selects the Node with the first hash as the owner.
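
A minimal sketch of this selection (a form of rendezvous hashing); the Egress name and the list of active eligible Node names are assumed inputs:

import (
    "crypto/sha256"
    "encoding/binary"
)

// hashEgressNode combines an Egress name and a Node name into a single value.
func hashEgressNode(egressName, nodeName string) uint64 {
    sum := sha256.Sum256([]byte(egressName + "#" + nodeName))
    return binary.BigEndian.Uint64(sum[:8])
}

// ownerNode returns the eligible Node with the highest hash for the Egress.
// When a Node joins or leaves, only the Egresses whose highest hash changes
// (about num_egresses/num_nodes of them) are remapped.
func ownerNode(egressName string, activeNodes []string) string {
    var owner string
    var best uint64
    for _, n := range activeNodes {
        if h := hashEgressNode(egressName, n); owner == "" || h > best {
            owner, best = n, h
        }
    }
    return owner
}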

Then each agent will do the following:

  1. Join the memberlist cluster on startup if it is an eligible Egress Node.
  2. Get the active members (Nodes) of the memberlist cluster, iterate over all Egresses that enable failover, and calculate their owner Nodes.
  3. If the agent's own Node is an Egress's owner, assign the Egress IP to the network interface that has an IP in the same subnet as the Egress IP; otherwise, remove the Egress IP from the Node's network interfaces if it is present (see the sketch below).
  4. Announce the Egress IPs via ARP (IPv4) / NDP (IPv6) to refresh the neighbors' caches.
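
A rough sketch of step 3, assuming the vishvananda/netlink library is used for address management (the gratuitous ARP/NDP announcement of step 4 is omitted here):

import (
    "fmt"
    "net"

    "github.com/vishvananda/netlink"
)

// assignEgressIP adds egressIP to the first interface whose existing subnet
// contains it. A non-owner Node would call netlink.AddrDel symmetrically.
func assignEgressIP(egressIP net.IP) error {
    links, err := netlink.LinkList()
    if err != nil {
        return err
    }
    for _, link := range links {
        addrs, err := netlink.AddrList(link, netlink.FAMILY_ALL)
        if err != nil {
            continue
        }
        for _, addr := range addrs {
            if addr.IPNet == nil || !addr.IPNet.Contains(egressIP) {
                continue
            }
            // Reuse the interface subnet's prefix length for the Egress IP.
            newAddr := &netlink.Addr{IPNet: &net.IPNet{IP: egressIP, Mask: addr.IPNet.Mask}}
            return netlink.AddrReplace(link, newAddr)
        }
    }
    return fmt.Errorf("no interface has an IP in the same subnet as %s", egressIP)
}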

Limit Egress Nodes

In some cases, users may want only certain Nodes to be Egress Nodes. To support this, there could be 3 options:

Status report (Optional)

To expose an Egress's owner Node to users, we could add an EgressStatus field to the Egress API:

// EgressStatus represents the current status of an Egress.
type EgressStatus struct {
    // The name of the Node that holds the Egress IP.
    NodeName string `json:"nodeName"`
}

When an agent assigns an Egress's IP to its own Node, it should update the Egress's status with its own Node name. In addition, it could create an Event associated with the Egress, so users can see the migration history from the K8s API.
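
A sketch of recording such an Event with client-go's event recorder; the Egress CRD types would need to be registered in the scheme, and the reason/message strings are illustrative:

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/client-go/kubernetes/scheme"
    "k8s.io/client-go/tools/record"
)

// newEgressEventRecorder returns an EventRecorder the agent can use to attach
// Events to Egress objects (event sinks are omitted for brevity).
func newEgressEventRecorder() record.EventRecorder {
    broadcaster := record.NewBroadcaster()
    return broadcaster.NewRecorder(scheme.Scheme, corev1.EventSource{Component: "antrea-agent"})
}

// Example usage after assigning the IP, where "egress" is the Egress object
// obtained from the informer cache:
// recorder.Eventf(egress, corev1.EventTypeNormal, "EgressIPAssigned",
//     "Assigned Egress IP %s to Node %s", egressIP, nodeName)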

Work breakdown

Test plan

TBD

antoninbas commented 3 years ago

@tnqn this looks great to me, thanks for the detailed proposal. A few notes:

tnqn commented 3 years ago

@antoninbas Thanks for the quick review.

  • does FailoverPolicy have to be a pointer? Or can it be like ExternalTrafficPolicy in the K8s ServiceSpec?

I was considering compatibility for Egresses that have already been persisted in etcd without the field. Will check how this case is handled in K8s. Thanks for the suggestion.

  • the proposal doesn't give any information about IP address advertisement in case of failover? I suppose we need to send a gratuitous ARP message at least?

Yes, we need ARP announcements for IPv4 and NDP for IPv6. Will add this to the design.

  • for "Limit Egress Nodes", my favorite option would be 2: Have a per-Egress NodeSelector in the Spec of Egress.. I feel like this is the most natural / idiomatic. It works well for the use case I have in mind (cluster with Nodes over multiple AZs, where Nodes are labelled with their AZ; I want to bind an Egress IP to a specific AZ).

I recall @jianjuns had a concern that it would require users to duplicate NodeSelector for all Egresses. We could discuss more about it here or in the community meeting.

  • a bit of an orthogonal question: do we get any cool feature "for free" if we use the memberlist library? I was thinking of exposing inter-Node latency to the user as a dashboard. Can memberlist provide that latency information to us?

Good idea. Will check it and get back to you.

tnqn commented 3 years ago

@jianjuns do you think "Have a per-Egress NodeSelector in the Spec of Egress" can work for the cases you mentioned, where a subset of Nodes can hold some IPs and another subset of Nodes can hold other IPs? Or do you have concerns about duplicating labelSelectors across Egresses? In some way, specifying a labelSelector (if it's quite simple) is the same as specifying the name of a NodePool?

jianjuns commented 3 years ago

@tnqn : as we discussed in the community meeting, to be most flexible, I think we should separate the Egress IP - Node association. Originally I was thinking of an IPPool and a NodePool CRD (see #667), with one of them referring to the other for the association. Then an Egress IP belonging to a pool should be assigned only to a Node in the associated Node pool. The same design can be reused for other IP allocation cases, like LB VIPs, and maybe routable Pod IP allocation.

I think we will eventually need to provide this flexibility, as I have seen that it is a typical deployment pattern for one K8s cluster to span multiple Node network segments. I am fine with starting from a single global NodePool, but considering the future extension, I would suggest the following:

  1. We can go with one of your options 1) and 3) to define the global NodePool, and maybe we can say it is the default pool. Then, later when we support multiple NodePools, Egress IPs without an explicit Node association can still go to the default pool.
  2. I would remove FailoverPolicy from Egress, and decide whether or not to auto-failover based on whether a default pool is configured.

Let me know what you think.

jianjuns commented 3 years ago

@jianjuns do you think "Have a per-Egress NodeSelector in the Spec of Egress" can work for the cases you mentioned, where a subset of Nodes can hold some IPs and another subset of Nodes can hold other IPs? Or do you have concerns about duplicating labelSelectors across Egresses? In some way, specifying a labelSelector (if it's quite simple) is the same as specifying the name of a NodePool?

I would associate IPs (not Egresses) with Nodes. The reasons are: 1) indeed, the relationship is between IPs and Nodes; 2) later we might do Egress IP allocation for Egresses; 3) multiple Egresses might share one Egress IP; 4) later we might support other IP assignment cases like LB VIPs, and might like a common way to define the IP - Node mapping.

tnqn commented 3 years ago

@jianjuns I agree that the IP - Node association makes more sense. About the default NodePool and removing FailoverPolicy: must it be created/configured to enable auto-failover? I thought that for users who just have a basic flat network, it's redundant to ask them to do such configuration. Do you think we could just use the whole cluster as the default pool if we are going to have NodePool resources later? For a basic topology, users can just create the Egress and expect all Nodes to be Egress Nodes. For an advanced topology, users can create a NodePool and an IPPool and expect the IP to be configured on the proper Nodes.

jianjuns commented 3 years ago

But I do not think failover policy should be configured on Egress.

I thought once you configure a NodePool (a single pool for now), it means you want to enable auto-failover (otherwise why even configure it?). Do you agree?

tnqn commented 3 years ago

I meant not creating the default NodePool explicitly, just like all Nodes can be candidates if you don't specify any NodeSelector for a Pod. In the way you proposed, do users have to create a pool before they can get the failover feature? The knob sounds a little obscure to me: if I have only static IPs and don't want Antrea to migrate the IPs I have configured on Nodes, I shouldn't create a NodePool; if I have floating IPs and want Antrea to auto-assign and auto-failover the IPs, I should create a NodePool even if it just selects all Nodes.

jianjuns commented 3 years ago

I feel it might be a little obscure, but not too bad in my mind. You want auto-assignment and failover, then you should define a scope for that, even if it is the whole cluster?

What is your proposal then? I do not like to associate failover policy with Egress for the reasons described above.

tnqn commented 3 years ago

@jianjuns to associate Egress IPs with Nodes, what do you think about this approach? Add an IPPool CRD, which defines the subnet, IP ranges, and associated Nodes:

// EgressIPPool defines an IP Pool that Egresses can allocate IPs from.
type EgressIPPool struct {
    // The CIDR of this IP pool, e.g. 10.10.10.0/24.
    CIDR         string
    // The IP ranges of this IP pool, e.g. 10.10.10.2-10.10.10.20, 10.10.10.30-10.10.10.50, 10.10.10.60.
    IPRanges     []IPRange
    // The Nodes that the Egress IPs can be assigned to. If empty, it means all Nodes.
    NodeSelector *metav1.LabelSelector
}

type IPRange struct {
    Start string
    End   string
}

Then we could add an optional pool field to EgressSpec to indicate which IPPool the EgressIP should be allocated from, and failover will be enabled when the IP pool is specified:

// EgressSpec defines the desired state for Egress.
type EgressSpec struct {
    // AppliedTo selects Pods to which the Egress will be applied.
    AppliedTo AppliedTo `json:"appliedTo"`
    // EgressIP specifies the SNAT IP address for the selected workloads.
    EgressIP string `json:"egressIP"`
    // EgressIPPool specifies the IP Pool that the EgressIP should be allocated from.
    // If it's set, the allocated EgressIP will be assigned to a Node specified by the Pool and failover to another Node
    // if the assigned Node becomes unavailable.
    // If it's set with EgressIP, the EgressIP must be in the IP Pool.
    EgressIPPool *EgressIPPool `json:"egressIPPool"`
}

Currently I don't see the need for a NodePool CRD; it seems a NodeSelector can solve it, which is a common way to define a pool of Nodes in K8s and is simpler to use. Please let me know what you think. If you think the other IP allocation cases can reuse the pool, we can remove the "Egress" prefix from the name of the IP pool.

jianjuns commented 3 years ago

@tnqn : do you mean we will support Egress IP allocation? I thought it will not be in the next release?

For the IPPool CRD, I like to make it generic, so it can be used by other use cases too, like LB VIP allocation. LB VIPs share common requirements with Egress IPs, so your definition should work. However, it is not very clear to me whether we will use the same IPPool to support allocating Pod IPs (e.g. for secondary interfaces, or a pool per Namespace/Service). If so, we should add subnet information to the pool. And should we still add a NodeSelector to a pool? I mean there can be two models: 1) IPPool is just a pool of IPs, but it can be associated with objects in some other way (e.g. annotating the Namespace); 2) IPPool selects the associated objects. What are your thoughts here?

Another thing is, if we set up a Node election group because an Egress refers to an IPPool which selects these Nodes, it sounds indirect? Your EgressIPPool way sounds better, but I still like the idea of a generic IPPool CRD for multiple use cases. What do you think?

tnqn commented 3 years ago

@tnqn : do you mean we will support Egress IP allocation? I thought it will not be in the next release?

Yes, I want to make auto-failover enablement more explicit to users when they create an Egress. Defining a standalone IPPool or NodePool but not associating them with the Egress sounds obscure to me, and it seems we would reuse the Pool resources for other features, which would make it even more obscure.

For the IPPool CRD, I like to make it generic, so it can be used by other use cases too, like LB VIP allocation. LB VIPs share common requirements with Egress IPs, so your definition should work. However, it is not very clear to me whether we will use the same IPPool to support allocating Pod IPs (e.g. for secondary interfaces, or a pool per Namespace/Service). If so, we should add subnet information to the pool. And should we still add a NodeSelector to a pool? I mean there can be two models: 1) IPPool is just a pool of IPs, but it can be associated with objects in some other way (e.g. annotating the Namespace); 2) IPPool selects the associated objects. What are your thoughts here?

I was thinking of something like a SubnetPool to support Pod IP allocation. It sounds complex to support individual IP allocation and subnet allocation with a single resource. It might also be hard to track a pool's usage/references if we let multiple other resources refer to it to claim their association. For the Egress feature, if we don't add a NodeSelector to the pool, do you mean creating a NodePool resource and letting the NodePool refer to the IPPool?

Another thing is, if we set up a Node election group because an Egress refers to an IPPool which selects these Nodes, it sounds indirect? Your EgressIPPool way sounds better, but I still like the idea of a generic IPPool CRD for multiple use cases. What do you think?

Could you elaborate more on this question? I thought you preferred to associate the IP with the Node, not the Egress with the Node? https://github.com/antrea-io/antrea/issues/2128#issuecomment-827318435.

jianjuns commented 3 years ago

I feel it is OK if we want to do only failover first. As you proposed originally, we can have a NodeSelector parameter in the ConfigMap, and we can use a name like DefaultEgressNodes, which should make it clear enough that Egress IPs can be assigned to these Nodes.

For Pod IP allocation, I feel it is not really different from Egress IP or VIP allocation. It is still a pool of IPs, just that the IPs are associated with a subnet and gateway (which are not really relevant to IP allocation, just extra information appended).

For my last comment, I meant that if we use a generic IPPool CRD which includes a NodeSelector, then the trigger for "creating a failover Node group with these Nodes" would be "the IPPool is referred to by an Egress", which sounds a little indirect to me. But maybe it is not too bad either. I'd like to hear your thoughts.

tnqn commented 3 years ago

@jianjuns Is a static Egress IP that doesn't need failover a valid case to you? And is a combination of static Egress IPs and failover-able Egress IPs in the same cluster a valid case? For example, some Pods may just use a specified Node's primary IP as the Egress IP, while others use floating IPs. The FailoverPolicy or EgressIPPool field added to EgressSpec is meant to make them work together. Otherwise, starting from failover only with a NodeSelector config is definitely simpler.

Since IPPools could have different NodeSelectors, I think we don't really create a failover group for each pool, but a global group for all IPPools. An agent assigns the Egress IP to itself only when it has the maximum hash value across the Nodes that are selected by the IPPool's NodeSelector and are active. Does this make sense to you and address the concern?
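
A sketch of restricting the candidate set to the Nodes matched by an IPPool's NodeSelector, using standard apimachinery helpers; the owner would then be the eligible Node with the maximum hash as described above:

import (
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/labels"
)

// eligibleNodes returns the names of active Nodes matched by the pool's
// NodeSelector; a nil selector means all active Nodes are eligible.
func eligibleNodes(activeNodes []*corev1.Node, nodeSelector *metav1.LabelSelector) ([]string, error) {
    selector := labels.Everything()
    if nodeSelector != nil {
        var err error
        selector, err = metav1.LabelSelectorAsSelector(nodeSelector)
        if err != nil {
            return nil, err
        }
    }
    var names []string
    for _, node := range activeNodes {
        if selector.Matches(labels.Set(node.Labels)) {
            names = append(names, node.Name)
        }
    }
    return names, nil
}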

jianjuns commented 3 years ago

And is a combination of static Egress IPs and failover-able Egress IPs in the same cluster a valid case?

I feel it is not very important to support this case. Of course I am not saying it's not good to have per-IPPool settings; I am just saying it might be OK to start from a global setting, if we plan to develop this feature across releases.

I got what you mean by a single global group. Originally I thought smaller groups could be configured based on failure domains, and could scale better (but there can be complexity, like one Node being in multiple groups). What do you think?

tnqn commented 3 years ago

And is a combination of static Egress IPs and failover-able Egress IPs in the same cluster a valid case?

I feel it is not very important to support this case. Of course I am not saying it's not good to have per-IPPool settings; I am just saying it might be OK to start from a global setting, if we plan to develop this feature across releases.

We may not be able to catch 1.1 anyway, so I assume we have some time to support the per-IPPool setting, and it doesn't sound complex to implement. Apart from the API change, I imagine we just need antrea-controller to maintain the IP usage of IPPools in memory and allocate an IP from the specified IPPool if an Egress doesn't have an IP specified.
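
For illustration, a minimal sketch of such in-memory allocation from an IPPool's IPRanges, using the IPRange type proposed above; the "used" map tracking already-assigned IPs is an assumed in-memory structure kept by antrea-controller:

import (
    "fmt"
    "net"
)

// nextIP returns ip + 1; it works for both IPv4 and IPv6.
func nextIP(ip net.IP) net.IP {
    next := make(net.IP, len(ip))
    copy(next, ip)
    for i := len(next) - 1; i >= 0; i-- {
        next[i]++
        if next[i] != 0 {
            break
        }
    }
    return next
}

// allocateIP returns the first free IP in the pool's IPRanges and marks it used.
func allocateIP(ranges []IPRange, used map[string]bool) (net.IP, error) {
    for _, r := range ranges {
        start, end := net.ParseIP(r.Start), net.ParseIP(r.End)
        if r.End == "" {
            // A single-IP range, e.g. "10.10.10.60".
            end = start
        }
        if start == nil || end == nil {
            continue
        }
        for ip := start; !ip.Equal(nextIP(end)); ip = nextIP(ip) {
            if !used[ip.String()] {
                used[ip.String()] = true
                return ip, nil
            }
        }
    }
    return nil, fmt.Errorf("no free IP left in the pool")
}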

I got what you mean by a single global group. Originally I thought smaller groups could be configured based on failure domains, and could scale better (but there can be complexity, like one Node being in multiple groups). What do you think?

Smaller groups would also introduce extra traffic if a Node can be in multiple groups, and it would be complicated to negotiate the port used by each group. A global group doesn't have to include all Nodes. It could be this: a Node joins the global group when it's selected by any IPPool.

antoninbas commented 3 years ago

@tnqn should we close this issue?

tnqn commented 3 years ago

@tnqn should we close this issue?

Yes, updated task lists and closing this.