berops / claudie

Cloud-agnostic managed Kubernetes
https://docs.claudie.io/
Apache License 2.0

Feature: POC for LoadBalancing #54

Closed bernardhalas closed 2 years ago

bernardhalas commented 3 years ago

Child tasks from #44.

Motivation:

We need to find a feasible LB setup for multi-cloud and hybrid-cloud deployments. Before we start implementing the LBs in the platform, we want to run a POC of the architectural setup.

Description:

This task is open to figure out how to deploy LBs for the K8s API and Ingress controller(s), then to run a POC of such a setup with basic tests of how it behaves. Once we find a working mode, we should assess whether the LB architecture will work for hybrid-cloud setups as well.

Exit criteria:

MarioUhrik commented 3 years ago

During the meeting, we've discussed a particular potential solution to this problem with @miroslavkohutik and @bernardhalas. You may use this idea for the purposes of this task.

There are generally 2 ways:

  1. utilizing "load balancer as a service" offerings of the public cloud providers, in unison with their networking offerings, to hook up our Wireguardian VPN with such an LB
  2. creating our own improvised load balancers out of virtual machines, e.g. using HAProxy

We prefer the second option at the moment, for simplicity and adaptability. This could technically work for both on-premise and public cloud machines, but we'll probably make it a requirement for them to have public IP addresses. In addition, to guarantee sufficient availability of such an LB solution, we're thinking there should be (at least) 2 such machines under one IP address, so that the "LB cluster" stays resilient when some of its LB machines fail.

The "LB clusters" will have one or more "roles" - e.g. "apiserver" and/or "ingress". There could probably be an arbitrary number of such "LB clusters" per K8s cluster (Zero to N). From the user perspective, later on, this could be customizable in the input config, very similarly to how we were thinking nodepools should work.
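To make the "LB cluster" idea more concrete, here is a minimal Python sketch of how such clusters and their roles could be modelled and sanity-checked. The names `Role`, `LBCluster`, `validate` and `clusters_for_role` are hypothetical and are not taken from Claudie; the "at least 2 machines" rule simply mirrors the availability idea described above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Set


class Role(Enum):
    """Roles an LB cluster can take on (names are illustrative)."""
    APISERVER = "apiserver"
    INGRESS = "ingress"


@dataclass
class LBCluster:
    """Hypothetical model of one LB cluster attached to a K8s cluster."""
    name: str
    roles: Set[Role]
    machines: List[str] = field(default_factory=list)  # public IPs of the LB VMs

    def validate(self) -> List[str]:
        """Return human-readable problems; an empty list means no problem found."""
        problems = []
        if not self.roles:
            problems.append(f"{self.name}: at least one role is required")
        if len(self.machines) < 2:
            problems.append(f"{self.name}: at least 2 machines are needed for availability")
        return problems


def clusters_for_role(clusters: List[LBCluster], role: Role) -> List[LBCluster]:
    """Select the LB clusters (zero to N) that serve the given role."""
    return [c for c in clusters if role in c.roles]
```

A K8s cluster with no LB clusters at all is simply an empty list here, which matches the "Zero to N" idea.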

bernardhalas commented 3 years ago

I'd like to add one more option to the list. Theoretically, we could rely on DNS-based round-robin load balancing. If we tune it well, we can sustain an outage of a single LB from the pool of LBs within a single cloud-provider-domain. For more context, refer here.
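As a rough illustration of why DNS round-robin can survive the loss of a single LB, the sketch below resolves every A record for a name and picks the first address that accepts a TCP connection. This is an assumed client-side behaviour, not something every client implements; `resolve_all`, `tcp_reachable` and `pick_lb` are hypothetical helper names.

```python
import socket
from typing import Callable, Iterable, List, Optional


def resolve_all(host: str, port: int) -> List[str]:
    """Return every IPv4 address published for host (one per A record)."""
    infos = socket.getaddrinfo(host, port, family=socket.AF_INET, type=socket.SOCK_STREAM)
    # Keep the resolver's order but drop duplicates.
    return list(dict.fromkeys(info[4][0] for info in infos))


def tcp_reachable(ip: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to ip:port can be opened within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_lb(addresses: Iterable[str], port: int,
            probe: Callable[[str, int], bool] = tcp_reachable) -> Optional[str]:
    """Return the first address whose probe succeeds, or None if all are down."""
    for ip in addresses:
        if probe(ip, port):
            return ip
    return None
```

With real DNS this would be used as `pick_lb(resolve_all(name, 80), 80)`; the `probe` parameter exists so the selection logic can be exercised without network access.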

MarioUhrik commented 3 years ago

I've created a proposal for the architecture of the standard Claudie LB solution. FYI @bernardhalas @miroslavkohutik @samuelstolicny @borkaz

Claudie_LB.md claudie_lb_example_input.txt

claudie_lb claudie_lb_drawio.zip

miroslavkohutik commented 3 years ago

I have finished setting up and testing the POC. Overall I would consider this POC to be a success. The following is a report of my work.

Two mirror infrastructures were created in GCP and Hetzner Cloud, as described in the image below.

[image: diagram of the two mirror infrastructures]

A free domain, claudie-test.tk, was registered with four A records targeting the four LB machines. DNS-level load balancing appears to be done in a random fashion, with session affinity on some client devices (the least session affinity was observed on Windows 10 devices using curl). The LB machines run Nginx in load balancing mode to balance traffic between the cluster machines. The cluster machines run basic Nginx instances with distinct index.html files. Nginx load balancing is done in a round-robin fashion, as expected. When using curl or a web browser, LB machine failures are undetectable from the client's perspective. As long as at least one LB is active, LB outages do not affect new client requests whatsoever (except for DNS affinity).

LB testing was also done using socat, a multipurpose relay tool. TCP and UDP ports 80 and 9090 were tested.

Example test using socat:

  1. Run socat on a cluster machine to listen on IPv4 UDP port 80: socat - udp4-listen:80,reuseaddr,fork
  2. Run socat on the client to send UDP packets to claudie-test.tk on port 80: socat - udp:claudie-test.tk:80
     A two-way connection is established: text sent from the client shows up on the terminal of one of the cluster machines, and vice versa.
  3. Kill socat on the client and repeat step 2; a connection is established to a different cluster machine.

Load balancing tests using socat were successful, with both TCP and UDP traffic being load-balanced evenly between the four cluster machines. Note that socat manifests strong DNS affinity - connections originating from one client always use the same load balancing node, regardless of the node's status.

Infrastructure and other relevant files can be found here, VM IP addresses are in the ansible inventory file (log in as root).

MarioUhrik commented 2 years ago

I consider this task successfully completed, and the proposed architecture validated for implementation into Claudie.

@miroslavkohutik , on second thought, let's wait for @bernardhalas to review it as well. He can do it within the next 2 weeks.

Please remember that we should still delete the PoC infrastructure. Let's do that before moving this task to "done".

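So:

  • wait for Bernard's approval
  • delete the PoC infrastructure
  • move the task to the "Done" section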

FYI @borkaz

miroslavkohutik commented 2 years ago

As discussed today, I'll describe the functional tests I have performed:

Test Nginx load balancing in default configuration (Nginx instances on LB machines listen on port 80 and load balance the traffic to cluster machines, Nginx instances on cluster machines listen on port 80 and display distinct web pages).

  1. Run curl claudie-test.tk several times; each time you should get a response from a different cluster machine
  2. Kill Nginx on one of the cluster machines
  3. Run curl claudie-test.tk several times again; each time you should get a response from a different cluster machine, but only from the cluster machines with running Nginx (i.e. no Connection refused response)
  4. Repeat steps 2 and 3 until no active Nginx instances remain on the cluster machines. Immediately after killing the final cluster machine's Nginx instance, you should start getting empty responses
  5. Revive Nginx on one or more cluster machines; you should immediately start getting responses again

Configure Nginx on the LB machines to listen on a different port (e.g. 9090) and repeat the previous test with curl claudie-test.tk:<port>. The same results are expected.
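The repeated curl runs in the Nginx tests above could be automated along these lines. This is only a sketch: `fetch_body` and `tally` are hypothetical helpers, and it assumes each cluster machine serves a page whose body identifies it, as in the POC. Failed requests are counted under the empty string, which corresponds to the "empty responses" case.

```python
import http.client
import urllib.request
from collections import Counter
from typing import Callable


def fetch_body(url: str, timeout: float = 3.0) -> str:
    """Return the response body, or an empty string on connection/HTTP errors."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace").strip()
    except (OSError, http.client.HTTPException):
        # URLError and HTTPError are subclasses of OSError.
        return ""


def tally(url: str, attempts: int = 20,
          fetch: Callable[[str], str] = fetch_body) -> Counter:
    """Request url `attempts` times and count how often each distinct body was seen."""
    return Counter(fetch(url) for _ in range(attempts))
```

For example, `tally("http://claudie-test.tk:9090")` would show which cluster machines answered and how often. After killing Nginx on one machine, its page should disappear from the counts.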

Test DNS load balancing:

  1. Reconfigure Nginx on the LB machines to display distinct static web pages instead of load balancing
  2. Session affinity is useful here. Open claudie-test.tk in your browser or run curl claudie-test.tk several times to make sure session affinity is present. Note the LB machine that responds
  3. Kill Nginx on the responding LB machine
  4. Refresh the browser tab or run curl again; you should immediately get a response from a different LB machine
  5. Repeat steps 3 and 4 until no active Nginx instances remain. Immediately after killing the final Nginx instance, you should stop getting any responses
  6. Revive Nginx on one or more LB machines; you should immediately start getting responses again

MarioUhrik commented 2 years ago

Great work, @miroslavkohutik ! :tada:

bernardhalas commented 2 years ago

Thanks for the work done within this POC. Amazing job. A few remarks:

@miroslavkohutik :

Note that socat manifests strong DNS affinity - connections originating from one client always use the same load balancing node, regardless of the node's status.

What specifically do you mean by regardless of the node's status?

@samuelstolicny : As per:

So:

  • wait for Bernard's approval
  • delete the PoC infrastructure
  • move the task to the "Done" section

I think we should mark this task "Done" only once the above 2 items are completed.

miroslavkohutik commented 2 years ago

@bernardhalas thanks for asking about that particular note; on second look, it's not entirely accurate. Socat picks a random DNS IP and uses it as long as the IP is reachable, even if the Nginx load balancer on that particular machine is down. It will pick another IP once the original IP becomes unreachable.
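The behaviour described above can be modelled with a small sketch. This is a model of the observation only, not of socat's actual implementation; `StickyResolver` and its `host_reachable` callback are hypothetical. The key point is that stickiness depends on whether the host is reachable, not on whether the LB service on it is up.

```python
import random
from typing import Callable, Iterable, Optional


class StickyResolver:
    """Pick a random address and keep it while its host stays reachable."""

    def __init__(self, addresses: Iterable[str],
                 host_reachable: Callable[[str], bool],
                 rng: Optional[random.Random] = None):
        self.addresses = list(addresses)
        self.host_reachable = host_reachable
        self.rng = rng or random.Random()
        self.current: Optional[str] = None

    def target(self) -> Optional[str]:
        """Return the address to use for the next connection, or None."""
        # Stay on the current address as long as the machine itself is reachable,
        # even if the load balancer process on it has stopped.
        if self.current is not None and self.host_reachable(self.current):
            return self.current
        # Otherwise pick a new address at random among the reachable hosts.
        candidates = [ip for ip in self.addresses if self.host_reachable(ip)]
        self.current = self.rng.choice(candidates) if candidates else None
        return self.current
```

Under this model, stopping Nginx on an LB machine does not move the client to another LB; only the machine itself becoming unreachable does.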