VanderChen opened this issue 2 years ago
Weekly Updating:
`ulimit -n` was used to break the file descriptor limitation in last week's experiments. However, there is still an IO error.
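For reference, a minimal sketch (not Distrinet code) of checking and raising the per-process file descriptor limit from Python; it has the same effect as `ulimit -n` for the current process only:

```python
# Sketch only: inspect and raise the soft RLIMIT_NOFILE of this process.
# The hard limit itself can only be raised via `ulimit -n` / limits.conf
# before the process starts.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"current fd limits: soft={soft}, hard={hard}")

# Lift the soft limit up to the hard limit so more ssh connections can be opened.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```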
We continued to push the size of the Distrinet cluster this week and located the bottleneck. Creating too many SSH connections leads to three types of errors:

1. `Too many open files` (solved by `ulimit -n`)
2. `OSError: [Errno 5] Input/Output error`
3. `[Errno 113] Connect call failed` (from the `asyncssh` Python module)

We analyzed the source code and found that:
```
+--------+                     +-----------------------+                       +-----------+
| client | --- ssh via eth --> | master (jump bastion) | --- ssh via vxlan --> | container |
+--------+                     +-----------------------+                       +-----------+
```
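A minimal sketch of this two-hop path using the `asyncssh` module, with placeholder hostnames, addresses, and credentials (Distrinet's actual code differs); it mainly illustrates that every container needs its own connection tunnelled through the master:

```python
import asyncio
import asyncssh

async def connect_to_container(container_addr: str) -> asyncssh.SSHClientConnection:
    # First hop: ssh over the physical network to the master (jump bastion).
    bastion = await asyncssh.connect("master.example", username="root")
    # Second hop: ssh to the container's vxlan address, tunnelled through the bastion.
    return await asyncssh.connect(container_addr, username="root", tunnel=bastion)

async def main() -> None:
    conn = await connect_to_container("10.0.0.10")  # placeholder container address
    result = await conn.run("hostname", check=True)
    print(result.stdout)
    conn.close()

asyncio.run(main())
```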
To solve these problems, we are going to try two ways:

1. Let the client ssh into the containers directly via the admin bridge, instead of jumping through the master:

```
+--------+                          +-----------+
| client | --- ssh via admin br --> | container |
+--------+                          +-----------+
```

2. Set `kernel.pty.max = 10240` in `/etc/sysctl.conf` (see the sketch below).
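A small sketch of checking how many ptys are in use against the limit that the `kernel.pty.max` setting raises; the paths are the standard Linux procfs entries, not anything Distrinet-specific. Each ssh session that allocates a terminal consumes one pty on the worker:

```python
# Compare allocated ptys against the kernel limit (sysctl kernel.pty.nr / kernel.pty.max).
def read_int(path: str) -> int:
    with open(path) as f:
        return int(f.read().strip())

pty_max = read_int("/proc/sys/kernel/pty/max")  # limit raised to 10240 via sysctl.conf
pty_nr = read_int("/proc/sys/kernel/pty/nr")    # ptys currently allocated
print(f"ptys in use: {pty_nr} / {pty_max}")
```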
The ssh connections used in Distrinet can be divided into two types:
```
+------------------+                     +-----------------------+
| client           |    temporary ssh    | worker                |
|                  +-------------------->| +--------+ +--------+ |
|                  |                     | |docker 1| |docker 2| |
|  +---------------+    persistent ssh   | +--------+ +--------+ |
|  | Mininet CLI   +-------------------->|                       |
+--+---------------+                     +-----------------------+
```
Both temporary and persistent ssh connections are created for each container (i.e., number of connections == 2 * (number of vhosts + number of switches)). Moreover, in the original Distrinet, these connections are always kept alive. The persistent connections are used by the Mininet CLI to execute Mininet commands (e.g., `pingall`, `iperf`), and they stay alive until the cluster is shut down. However, the number of connections limits the size of the cluster.
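To make the formula above concrete, a small sketch with illustrative (not measured) topology sizes, showing how quickly the connection count passes the default `ulimit -n` of 1024 on the client:

```python
# connections = 2 * (vhosts + switches): one temporary + one persistent per node.
def ssh_connections(vhosts: int, switches: int) -> int:
    return 2 * (vhosts + switches)

# Illustrative numbers only.
print(ssh_connections(vhosts=500, switches=50))      # 1100, already above the default 1024
print(ssh_connections(vhosts=10_000, switches=100))  # 20200
```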
However, when we tested the current design, we found that there are still too many connections. When we start a cluster with 10k vhosts, the connections created successfully at an early stage are lost while more connections are being created, which leads to a crash of the cluster.
A possible way to solve this is to discard the Mininet CLI. The functions that configure the normal path can be replaced by one-off temporary ssh connections. The other part (e.g., `pingall`, `iperf`) can be replaced by executing commands in Docker manually (see the sketch below). Thus, all of the ssh connections will be in one-off mode, which benefits the scale.
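A minimal sketch of the proposed one-off mode, assuming placeholder worker and container names; the ssh connection exists only for the duration of a single `docker exec`, so no file descriptor is held open between commands:

```python
import asyncio
import asyncssh

async def run_once(worker: str, container: str, cmd: str) -> str:
    # Open the ssh connection, run one command inside the container, then close it.
    async with asyncssh.connect(worker, username="root") as conn:
        result = await conn.run(f"docker exec {container} {cmd}", check=True)
        return result.stdout

# Example: replace a Mininet CLI ping by a manual ping between two vhost containers.
print(asyncio.run(run_once("worker1.example", "vhost1", "ping -c 1 10.0.0.2")))
```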
@VanderChen Question about the persistent ssh from the Mininet CLI: besides executing Mininet commands, what other functions need this persistent ssh? If executing Mininet commands is the only purpose of this ssh link and we don't need to run any Mininet command in our environment, then we can discard the ssh connection for large-scale deployment.
@cj-chung At the cluster starting stage, after the temporary ssh connections create the admin path, the persistent ssh connections are responsible for communicating with the ONOS controller, starting the switches, and enabling the normal path. This part is integrated into Mininet and needs much effort to decouple.
After creating the normal path, the only remaining purpose is running Mininet commands (e.g., `pingall`).
To explore the upper bound of the Distrinet cluster size, the following experiments were completed.
Physical resource limitation

- one KVM virtual machine with 64 GB memory + 600 GB HDD + 32-thread CPU
- one bare-metal server with 600 GB memory + 600 GB SSD + 88-thread CPU

In the bare-metal experiment, resource utilization is below 10%.
Conclusion: physical resources are not the cluster size bottleneck.
SSH limitation
Due to the implementation of Distrinet, the client keeps ssh connections to each vhost. The number of ssh connections may constrain cluster scalability.
Conclusion: the ssh limitation is not the bottleneck, at least under the current cluster size.
SDN controller limitation
The SDN controller is currently considered to be the main bottleneck of the cluster. Some attempts were made to solve this problem. Results are shown in the resource part; in short, there is no significant improvement.

- RYU: 1 master with a RYU cluster + 2 workers reaches 800 vhosts. 10 replicas are set up in the k8s cluster, and we found that there is workload on each pod. However, since RYU does not support horizontal scaling, we are not sure that this really achieves load balancing, i.e., distributed computing.
- ONOS: the ONOS controller natively supports horizontal scaling, and the k8s cluster is set up as an ONOS K8S Cluster. We can set up a larger Distrinet cluster, but the ONOS cluster crashes when the client executes `pingall`.