VanderChen opened this issue 2 years ago
Weekly Updating:
`ulimit -n` was used to break the file descriptor limitation in last week's experiments. However, there is still an IO error.
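For reference, a minimal sketch (not Distrinet code) of checking and raising the per-process file descriptor limit from Python; it has the same effect as `ulimit -n` for the current process only:

```python
# Sketch only: inspect and raise the soft RLIMIT_NOFILE of this process.
# The hard limit itself can only be raised via `ulimit -n` / limits.conf
# before the process starts.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"current fd limits: soft={soft}, hard={hard}")

# Lift the soft limit up to the hard limit so more ssh connections can be opened.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```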
We continued to push the size of the Distrinet cluster this week and located the bottleneck. Creating too many SSH connections leads to three types of errors:

1. `Too many open files` (solved by `ulimit -n`)
2. `OSError: [Errno 5] Input/Output error`
3. `[Errno 113] Connect call failed` (from the `asyncssh` Python module)

We analyzed the source code and found that:
```
+--------+                     +-----------------------+                       +-----------+
| client | --- ssh via eth --> | master (jump bastion) | --- ssh via vxlan --> | container |
+--------+                     +-----------------------+                       +-----------+
```
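A minimal sketch of this two-hop path using the `asyncssh` module, with placeholder hostnames, addresses, and credentials (Distrinet's actual code differs); it mainly illustrates that every container needs its own connection tunnelled through the master:

```python
import asyncio
import asyncssh

async def connect_to_container(container_addr: str) -> asyncssh.SSHClientConnection:
    # First hop: ssh over the physical network to the master (jump bastion).
    bastion = await asyncssh.connect("master.example", username="root")
    # Second hop: ssh to the container's vxlan address, tunnelled through the bastion.
    return await asyncssh.connect(container_addr, username="root", tunnel=bastion)

async def main() -> None:
    conn = await connect_to_container("10.0.0.10")  # placeholder container address
    result = await conn.run("hostname", check=True)
    print(result.stdout)
    conn.close()

asyncio.run(main())
```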
To solve these problems, we are going to try two ways:

1. Let the client ssh into the containers directly via the admin bridge, instead of jumping through the master:

```
+--------+                          +-----------+
| client | --- ssh via admin br --> | container |
+--------+                          +-----------+
```

2. Set `kernel.pty.max = 10240` in `/etc/sysctl.conf` (see the sketch below).
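A small sketch of checking how many ptys are in use against the limit that the `kernel.pty.max` setting raises; the paths are the standard Linux procfs entries, not anything Distrinet-specific. Each ssh session that allocates a terminal consumes one pty on the worker:

```python
# Compare allocated ptys against the kernel limit (sysctl kernel.pty.nr / kernel.pty.max).
def read_int(path: str) -> int:
    with open(path) as f:
        return int(f.read().strip())

pty_max = read_int("/proc/sys/kernel/pty/max")  # limit raised to 10240 via sysctl.conf
pty_nr = read_int("/proc/sys/kernel/pty/nr")    # ptys currently allocated
print(f"ptys in use: {pty_nr} / {pty_max}")
```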
The ssh connections used in Distrinet can be divided into two types:
```
+------------------+                     +-----------------------+
| client           |    temporary ssh    | worker                |
|                  +-------------------->| +--------+ +--------+ |
|                  |                     | |docker 1| |docker 2| |
|  +---------------+    persistent ssh   | +--------+ +--------+ |
|  | Mininet CLI   +-------------------->|                       |
+--+---------------+                     +-----------------------+
```
Both temporary and persistent ssh connections are created for each container (i.e., number of connections == 2 * (number of vhosts + number of switches)). Moreover, in the original Distrinet, these connections are always kept alive. The persistent connections are used by the Mininet CLI to execute Mininet commands (e.g., `pingall`, `iperf`), and they stay alive until the cluster is shut down. However, the number of connections limits the size of the cluster.
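To make the formula above concrete, a small sketch with illustrative (not measured) topology sizes, showing how quickly the connection count passes the default `ulimit -n` of 1024 on the client:

```python
# connections = 2 * (vhosts + switches): one temporary + one persistent per node.
def ssh_connections(vhosts: int, switches: int) -> int:
    return 2 * (vhosts + switches)

# Illustrative numbers only.
print(ssh_connections(vhosts=500, switches=50))      # 1100, already above the default 1024
print(ssh_connections(vhosts=10_000, switches=100))  # 20200
```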
However, when we tested the current design, we found that there are still too many connections. When we start a cluster with 10k vhosts, the connections created successfully at an early stage are lost while more connections are being created, which leads to a crash of the cluster.
A possible way to solve this is to discard the Mininet CLI. The functions that configure the normal path can be replaced by one-off temporary ssh connections. The other part (e.g., `pingall`, `iperf`) can be replaced by executing commands in Docker manually (see the sketch below). Thus, all of the ssh connections will be in one-off mode, which benefits the scale.
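A minimal sketch of the proposed one-off mode, assuming placeholder worker and container names; the ssh connection exists only for the duration of a single `docker exec`, so no file descriptor is held open between commands:

```python
import asyncio
import asyncssh

async def run_once(worker: str, container: str, cmd: str) -> str:
    # Open the ssh connection, run one command inside the container, then close it.
    async with asyncssh.connect(worker, username="root") as conn:
        result = await conn.run(f"docker exec {container} {cmd}", check=True)
        return result.stdout

# Example: replace a Mininet CLI ping by a manual ping between two vhost containers.
print(asyncio.run(run_once("worker1.example", "vhost1", "ping -c 1 10.0.0.2")))
```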
@VanderChen Question about the persistent ssh from the Mininet CLI: besides executing Mininet commands, what other functions need this persistent ssh? If executing Mininet commands is the only purpose of this ssh link and we don't need to run any Mininet command in our environment, then we can discard the ssh connection for large-scale deployment.
@cj-chung At the cluster starting stage, after the temporary ssh connections create the admin path, the persistent ssh connections are responsible for communicating with the ONOS controller, starting the switches, and enabling the normal path. This part is integrated into Mininet and needs much effort to decouple.
After creating the normal path, the only remaining purpose is running Mininet commands (e.g., `pingall`).
To explore the upper bound of the Distrinet cluster size, the following experiments were completed.
Physical resource limitation

- one KVM virtual machine with 64 GB memory + 600 GB HDD + 32-thread CPU
- one bare-metal server with 600 GB memory + 600 GB SSD + 88-thread CPU

In the bare-metal experiment, resource utilization is below 10%.
Conclusion: physical resources are not the cluster size bottleneck.
SSH limitation
Due to the implementation of Distrinet, the client keeps ssh connections to each vhost. The number of ssh connections may constrain cluster scalability.
Conclusion: the ssh limitation is not the bottleneck, at least under the current cluster size.
SDN controller limitation
The SDN controller is currently considered to be the main bottleneck of the cluster. Some attempts were made to solve this problem. Results are shown in the resource part; in short, there is no significant improvement.

- RYU: 1 master with a RYU cluster + 2 workers reaches 800 vhosts. 10 replicas are set up in the k8s cluster, and we found that there is workload on each pod. However, since RYU does not support horizontal scaling, we are not sure that this really achieves load balancing, i.e., distributed computing.
- ONOS: the ONOS controller natively supports horizontal scaling, and the k8s cluster is set up as an ONOS K8S Cluster. We can set up a larger Distrinet cluster, but the ONOS cluster crashes when the client executes `pingall`.