futurewei-cloud / Distrinet

Distributed Network emulator, based on Mininet
MIT License

Explore the upper size of Distrinet cluster #45

Open VanderChen opened 2 years ago

VanderChen commented 2 years ago

To explore the upper bound on the size of a Distrinet cluster, the following experiments were completed.

Physical resources limitation

| server type | server size | cluster size |
| --- | --- | --- |
| kvm | 1 (1 master) | <= 500 vhosts |
| kvm | 2 (1 master + 1 worker) | < 700 vhosts |
| kvm | 3 (1 master + 2 workers) | < 800 vhosts |
| bare metal | 3 (1 master + 2 workers) | < 800 vhosts |

Each KVM server: 64 GB memory + 600 GB HDD + 32-thread CPU. Each bare metal server: 600 GB memory + 600 GB SSD + 88-thread CPU.

In the bare metal experiment, resource utilization is below 10%.

Conclusion: Physical resources are not the cluster size bottleneck.

SSH limitation

Due to the implementation of Distrinet, the client keeps an SSH connection to each vhost. The number of SSH connections may therefore constrain cluster scalability.

  1. The default SSH connection limit is 1024, which is larger than the cluster size (800 vhosts).
  2. Distrinet is able to set up more than 1000 vhosts without the RYU controller.

Conclusion: the SSH limit is not the bottleneck, at least at the current cluster size.
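For reference, the 1024 figure above most likely refers to the client-side open file descriptor limit rather than to an sshd option; that reading is an assumption. A minimal sketch to inspect both the client-side limit and the server-side sshd settings (the sshd path is a common default and may differ):

```python
import resource
import subprocess

# Client side: each SSH connection consumes file descriptors, so the soft
# RLIMIT_NOFILE value (often 1024 by default on Linux) caps how many
# connections one client process can hold open at once.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"client open-file limit: soft={soft}, hard={hard}")

# Server side: `sshd -T` dumps the effective sshd configuration, including
# maxsessions and maxstartups. Run it on the master/worker nodes; it
# usually requires root, and the sshd path varies per distribution.
try:
    out = subprocess.run(["/usr/sbin/sshd", "-T"],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if line.startswith(("maxsessions", "maxstartups")):
            print("sshd:", line)
except FileNotFoundError:
    print("sshd binary not found at /usr/sbin/sshd")
```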

SDN controller limitation

The SDN controller is currently considered to be the main bottleneck of the cluster. Several attempts were made to address this problem.

  1. More compute resources for a single RYU controller.

Results are shown in the physical resources part above. In short, there was no significant improvement.

  2. Replace it with a RYU k8s cluster

1 master with a RYU cluster + 2 workers reaches 800 vhosts.

We set up 10 replicas in the k8s cluster and observed workload on each pod. However, since RYU does not support horizontal scaling, we are not sure whether this really achieves load balancing, i.e., distributed computing.

  3. Replace it with an ONOS k8s cluster

The ONOS controller natively supports horizontal scaling, and the k8s cluster is set up as an ONOS K8S Cluster. With ONOS we can set up a larger Distrinet cluster, but the ONOS cluster crashes when the client executes pingall.
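For context, the sketch below shows the standard Mininet way of attaching an emulated topology to an external controller through RemoteController; Distrinet exposes a Mininet-compatible API, but the controller address here is only a stand-in for the k8s Service IP of the RYU or ONOS deployment, and the tiny topology is purely illustrative:

```python
from mininet.net import Mininet
from mininet.node import RemoteController
from mininet.topo import SingleSwitchTopo

# Placeholder address: the ClusterIP/NodePort exposed by the RYU or ONOS
# deployment in k8s. OpenFlow controllers usually listen on 6653.
CONTROLLER_IP = "10.0.0.100"
CONTROLLER_PORT = 6653

# controller=None suppresses the default local controller so that the
# external (clustered) controller handles all switches.
net = Mininet(topo=SingleSwitchTopo(k=2), controller=None)
net.addController("c0", controller=RemoteController,
                  ip=CONTROLLER_IP, port=CONTROLLER_PORT)
net.start()
net.pingAll()   # the step that currently crashes the ONOS cluster at scale
net.stop()
```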

VanderChen commented 2 years ago

TODO:

  1. Try the ONOS controller on bare metal.
  2. Try a single-replica ONOS cluster.
VanderChen commented 2 years ago

Weekly update:

Weekly Conclusion

TODO:

VanderChen commented 2 years ago

Weekly update:

We continued to push the size of the Distrinet cluster this week and located the bottleneck. Too many SSH connections are created, which leads to three types of errors.

We analyzed the source code and found the following connection chain:

+ -------- +                      + --------------------- +                        + --------- +
|  client  |  --- ssh via eth --> | master (jump bastion) | --- ssh via vxlan -->  | container |
+ -------- +                      + --------------------- +                        + --------- +

To solve these problems, we are going to try two approaches:

  1. Close the SSH connection after command execution and re-connect before the next command (see the sketch after the diagram below).
  2. Add an admin path to the client host and SSH to the containers directly:
+ -------- +                           + --------- + 
|  client  |  --- ssh via admin br --> | container |
+ -------- +                           + --------- +       
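A minimal sketch of the one-off connection pattern from the first approach, assuming asyncssh is available; the host address, username, and key handling are placeholders, and Distrinet's actual SSH layer may differ:

```python
import asyncio
import asyncssh

async def run_once(host, command, username="root", client_keys=None):
    """One-off mode: open a connection, run a single command, close immediately."""
    async with asyncssh.connect(host, username=username,
                                client_keys=client_keys,
                                known_hosts=None) as conn:
        result = await conn.run(command, check=False)
        return result.exit_status, result.stdout, result.stderr

# Example: run one command on a container reachable through the admin path.
# "10.1.0.5" is a placeholder admin-bridge address.
status, out, err = asyncio.run(run_once("10.1.0.5", "ip addr show"))
print(status, out)
```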
VanderChen commented 2 years ago

Weekly update:

Discussion:

The SSH connections used in Distrinet can be divided into two types.

+------------------+                  +-----------------------+
| client           | temporary ssh    | worker                |
|                  +----------------->| +--------+ +--------+ |
|                  |                  | |docker 1| |docker 2| |
|  +---------------+ persistent ssh   | +--------+ +--------+ |
|  |  Mininet CLI  +----------------->|                       |
+--+---------------+                  +-----------------------+

Original design

Both a temporary and a persistent SSH connection are created for each container (i.e., number of connections == 2 * (number of vhosts + number of switches)). Moreover, in the original Distrinet, these connections are always kept alive.

However, the number of connections limits the size of the cluster.
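For a sense of scale, a small sketch of the connection count under the original design; the switch counts are placeholders, only the vhost figures come from the experiments above:

```python
def connection_count(n_vhosts, n_switches):
    # Original design: one temporary + one persistent SSH connection per node,
    # all kept alive for the lifetime of the cluster.
    return 2 * (n_vhosts + n_switches)

# 800 vhosts (largest stable run) vs. 10k vhosts (the failing run below),
# assuming for illustration one switch per 10 vhosts.
for vhosts in (800, 10_000):
    print(vhosts, "vhosts ->", connection_count(vhosts, vhosts // 10), "connections")
```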

Current design

However, when we tested the current design, we found that there are still too many connections. When we start a cluster with 10k vhosts, the connections that were created successfully at an early stage drop while additional connections are being created. This leads to the crash of the cluster.

A possible way to solve this is to discard the Mininet CLI. The functions that configure the normal path can be replaced by one-off-mode temporary SSH. The other functions (e.g., pingall, iperf) can be replaced by executing commands in docker manually. Thus, all of the SSH connections would be in one-off mode, which benefits scaling; a sketch of this idea is shown below.
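A hedged sketch of replacing pingall with one-off commands executed directly in the containers over SSH; the worker addresses, container names, and vhost IPs are placeholders, and the real Distrinet naming scheme may differ:

```python
import asyncio
import asyncssh

# Placeholder inventory: worker node -> {container name: vhost IP}.
HOSTS = {
    "worker-1": {"mn.h1": "10.0.0.1", "mn.h2": "10.0.0.2"},
    "worker-2": {"mn.h3": "10.0.0.3"},
}

async def run_once(host, command):
    """One-off SSH: connect, run a single command, disconnect."""
    async with asyncssh.connect(host, known_hosts=None) as conn:
        result = await conn.run(command, check=False)
        return result.exit_status

async def ping_all():
    """Replacement for Mininet's pingall: ping every vhost from every container."""
    all_ips = [ip for c in HOSTS.values() for ip in c.values()]
    for worker, containers in HOSTS.items():
        for name, own_ip in containers.items():
            for ip in all_ips:
                if ip == own_ip:
                    continue
                status = await run_once(worker, f"docker exec {name} ping -c1 -W1 {ip}")
                print(f"{name} -> {ip}: {'ok' if status == 0 else 'FAIL'}")

if __name__ == "__main__":
    asyncio.run(ping_all())
```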

cj-chung commented 2 years ago

@VanderChen A question about the persistent SSH from the Mininet CLI: besides executing Mininet commands, what other function needs this persistent SSH? If executing Mininet commands is the only purpose of this SSH link and we don't need to run any Mininet command in our environment, then we can discard that connection for large-scale deployment.

VanderChen commented 2 years ago

> @VanderChen A question about the persistent SSH from the Mininet CLI: besides executing Mininet commands, what other function needs this persistent SSH? If executing Mininet commands is the only purpose of this SSH link and we don't need to run any Mininet command in our environment, then we can discard that connection for large-scale deployment.

@cj-chung At the cluster start-up stage, after the temporary SSH connections create the admin path, the persistent SSH connections are responsible for communicating with the ONOS controller, starting the switches, and enabling the normal path. This part is integrated into Mininet and would take considerable effort to decouple.

After the normal path is created, the only remaining purpose is running Mininet commands (e.g., pingall).