ecklm / adaptive-network-slicing

Project holding the implementation and results of my thesis project at University of Trento, Italy

controller: QoS queue setting is rather slow #29

Closed: ecklm closed this issue 4 years ago

ecklm commented 4 years ago

Calls from QoSManager to the app that actually implements QoS (name to be looked up) become slow after, typically, three consecutive calls per endpoint, as shown below.

This makes the setup process, and even the later adaptations, destructively slow. The delay often means working with stale data -- e.g. setting queues on a switch that has just disconnected from the controller -- so exceptions may happen as well (see the sketch after the log).

127.0.0.1 - - [23/Apr/2020 08:00:15] "PUT /v1.0/conf/switches/0000000000000002/ovsdb_addr HTTP/1.1" 201 144 0.004762
(24939) accepted ('127.0.0.1', 46658)
127.0.0.1 - - [23/Apr/2020 08:00:15] "POST /qos/rules/0000000000000002 HTTP/1.1" 200 247 0.001050
(24939) accepted ('127.0.0.1', 46664)
127.0.0.1 - - [23/Apr/2020 08:00:16] "POST /qos/rules/0000000000000002 HTTP/1.1" 200 247 0.000786
(24939) accepted ('127.0.0.1', 46668)
127.0.0.1 - - [23/Apr/2020 08:00:16] "POST /qos/rules/0000000000000002 HTTP/1.1" 200 247 0.000779
(24939) accepted ('127.0.0.1', 46672)
127.0.0.1 - - [23/Apr/2020 08:00:18] "POST /qos/queue/0000000000000002 HTTP/1.1" 200 387 2.120427
(24939) accepted ('127.0.0.1', 46680)
127.0.0.1 - - [23/Apr/2020 08:00:20] "POST /qos/queue/0000000000000002 HTTP/1.1" 200 387 2.135312
(24939) accepted ('127.0.0.1', 46688)
127.0.0.1 - - [23/Apr/2020 08:00:20] "PUT /v1.0/conf/switches/0000000000000003/ovsdb_addr HTTP/1.1" 201 144 0.000313
(24939) accepted ('127.0.0.1', 46694)
127.0.0.1 - - [23/Apr/2020 08:00:20] "POST /qos/rules/0000000000000003 HTTP/1.1" 200 247 0.000935
(24939) accepted ('127.0.0.1', 46700)
127.0.0.1 - - [23/Apr/2020 08:00:20] "POST /qos/rules/0000000000000003 HTTP/1.1" 200 247 0.000673
(24939) accepted ('127.0.0.1', 46704)
127.0.0.1 - - [23/Apr/2020 08:00:20] "POST /qos/rules/0000000000000003 HTTP/1.1" 200 247 0.000914
(24939) accepted ('127.0.0.1', 46708)
127.0.0.1 - - [23/Apr/2020 08:00:22] "POST /qos/queue/0000000000000003 HTTP/1.1" 200 387 2.170584
(24939) accepted ('127.0.0.1', 46716)
127.0.0.1 - - [23/Apr/2020 08:00:24] "POST /qos/queue/0000000000000003 HTTP/1.1" 200 387 2.096152

The solution may be closely related to #3.
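
A defensive sketch for the stale-data case mentioned above. Both configure_switch_safely() and set_queues_for_switch() are hypothetical names; the point is only to skip a switch that has gone away instead of letting the whole setup fail:

    import logging
    import requests

    LOG = logging.getLogger(__name__)

    def configure_switch_safely(qos_manager, dpid):
        # If the switch disconnected between discovery and configuration,
        # log the failure and move on instead of aborting the whole setup.
        try:
            qos_manager.set_queues_for_switch(dpid)  # hypothetical per-switch call
        except requests.exceptions.RequestException as exc:
            LOG.warning("Skipping switch %s, request failed: %s", dpid, exc)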

ecklm commented 4 years ago

Although it does not resolve the underlying problem, reducing the number of API calls to the minimum could prevent this slowdown. #26 might offer a way to do this.

ecklm commented 4 years ago

It seems that it is not requests in general that are slow, only the queue-setting ones, and the slowness does not correlate with the number of HTTP requests either. The following API call, which collapses many separate calls into a single one, still lasts 20+ seconds, which means the problem must be on the controller side. Presumably the actual OFP communication and the OVS reaction are the bottleneck.

    def set_queues(self):
        """Set queues on switches so that limits can be set on them."""
        # self.__logger.debug("qosmanager: Switchports to be configured: {}".format(ports))
        queue_limits = [QoSManager.DEFAULT_MAX_RATE] + [self.flows_limits[k][0] for k in self.flows_limits]
        r = requests.post("%s/qos/queue/all" % QoSManager.CONTROLLER_BASEURL,
                          headers={'Content-Type': 'application/json'},
                          data=json.dumps({
                              # From doc: port_name is optional argument. If does not pass the port_name argument, all
                              # ports are target for configuration.
                              "type": "linux-htb", "max_rate": str(QoSManager.DEFAULT_MAX_RATE),
                              "queues":
                                  [{"max_rate": str(limit)} for limit in queue_limits]
                          }))
        self.log_http_response(r)

In the log:

127.0.0.1 - - [28/Apr/2020 08:40:43] "POST /qos/rules/all HTTP/1.1" 200 703 0.001785
Flow limit for flow 'FlowId(ipv4_dst='10.0.0.11', udp_dst=5001)' updated to 1250000.0bps
Flow limit for flow 'FlowId(ipv4_dst='10.0.0.12', udp_dst=5002)' updated to 6250000.0bps
Flow limit for flow 'FlowId(ipv4_dst='10.0.0.13', udp_dst=5003)' updated to 3750000.0bps
(69870) accepted ('127.0.0.1', 60166)
127.0.0.1 - - [28/Apr/2020 08:41:08] "POST /qos/queue/all HTTP/1.1" 200 1394 23.042202
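
For reference, with the three flow limits from the log above, the body sent by set_queues() is roughly the following. DEFAULT_MAX_RATE is not shown in this thread, so 10 Mbps is just an assumed value; queue 0 is the default queue and the remaining queues map to the entries of flows_limits:

    # Roughly the JSON body produced by set_queues() for the three flows above;
    # DEFAULT_MAX_RATE = 10_000_000 (10 Mbps) is an assumed value.
    example_body = {
        "type": "linux-htb",
        "max_rate": "10000000",
        "queues": [
            {"max_rate": "10000000"},   # queue 0: default queue
            {"max_rate": "1250000.0"},  # queue 1: 10.0.0.11, udp_dst=5001
            {"max_rate": "6250000.0"},  # queue 2: 10.0.0.12, udp_dst=5002
            {"max_rate": "3750000.0"},  # queue 3: 10.0.0.13, udp_dst=5003
        ],
    }
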
ecklm commented 4 years ago

This slowness seems to have nothing to do with the rest_qos application; it hands the request over to the controller quite early. The problem goes down below the VSCTL level, so it is likely a limitation of the specific OVS implementation used in the mininet VM.
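
A quick way to check this, bypassing Ryu entirely, is to time the same QoS/queue creation directly with ovs-vsctl inside the mininet VM. A sketch with a hypothetical port name (s1-eth1) and example rates:

    import subprocess
    import time

    # Create a linux-htb QoS with one queue directly via ovs-vsctl and time it;
    # "s1-eth1" and the rates are placeholders for whatever the topology uses.
    cmd = [
        "ovs-vsctl",
        "--", "set", "port", "s1-eth1", "qos=@newqos",
        "--", "--id=@newqos", "create", "qos", "type=linux-htb",
        "other-config:max-rate=10000000", "queues:0=@q0",
        "--", "--id=@q0", "create", "queue", "other-config:max-rate=1250000",
    ]
    start = time.monotonic()
    subprocess.run(cmd, check=True)
    print("ovs-vsctl took %.2f s" % (time.monotonic() - start))

If this is also slow, the bottleneck is indeed on the OVS/OVSDB side rather than in Ryu.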

ecklm commented 4 years ago

Two last options to try:

  1. Run the experiment against a newer Ubuntu + mininet combination.
  2. As a workaround, run the adaptation function in a new thread so that it does not block the controller (see the sketch below).

The non-blocking behavior would be extremely useful as the measurement data will also come from the logs.
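
A minimal sketch of option 2, assuming QoSManager runs inside the Ryu process and that the adaptation entry point is something like the set_queues() call shown earlier (the real adaptation function may differ); hub.spawn uses eventlet green threads under the hood:

    from ryu.lib import hub

    def adapt_async(qos_manager):
        # Run the slow queue-setting call in a separate green thread so the
        # controller's event loop keeps serving other events in the meantime.
        return hub.spawn(qos_manager.set_queues)

If QoSManager turns out to run outside the Ryu process, a plain threading.Thread would play the same role.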

ecklm commented 4 years ago

Observations

  1. I learned that green threads have even less to do with actual multi-threading than Python threads in general (see the sketch after this list).
  2. The occurrence of the exception -- presented at the end of this comment -- depends heavily on the load of the host machine: if 3 host threads are left unused (the VM uses 2 threads), the exception does not occur.
  3. Setting queues on all switches (POST /qos/queue/all) spins up the single core used by the Ryu manager for whatever reason. I would expect it to be mostly idle while waiting for the switches to respond to the queue-setting request. Maybe the problem is not in the old virtual switches at all but in some bottleneck in the controller.
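
Related to observation 1: if the blocking part is a call that never yields to the event loop, eventlet's tpool can push it onto a real OS worker thread while only the calling green thread waits. A sketch, again assuming set_queues() stands in for the adaptation work:

    from eventlet import tpool

    def run_adaptation_in_native_thread(qos_manager):
        # tpool.execute() runs the callable in a native worker thread and only
        # suspends the calling green thread, so the eventlet hub stays responsive.
        return tpool.execute(qos_manager.set_queues)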

Hypothesis

When the set_queue function is called, the underlying function(s) called by the REST API handler have a timer somewhere down the stack to prevent an operation from running for too long. I suspect this is already somewhere at the OpenFlow level. If I'm right, it is supposed to time out when an OpenFlow operation lasts too long. In this case, if I, say, watch a movie while experimenting, most of the cores are 90%+ used, so the operations take too long in the controller and in the VM => the timer expires. If I do not watch a movie and leave most of my cores idle, the operations are carried out fast enough => the timer does not expire.

This is very unfortunate and unstable, but it lies deep down in the controller and working around it would probably exceed the scope of the thesis. :(


127.0.0.1 - - [16/May/2020 22:22:08] "POST /qos/queue/all HTTP/1.1" 200 0 22.303261
Traceback (most recent call last):
  File "/home/ecklm/Projects/Diploma/implementation/controller/.venv/lib/python3.8/site-packages/eventlet/hubs/hub.py", line 461, in fire_timers
    timer()
  File "/home/ecklm/Projects/Diploma/implementation/controller/.venv/lib/python3.8/site-packages/eventlet/hubs/timer.py", line 59, in __call__
    cb(*args, **kw)
  File "/home/ecklm/Projects/Diploma/implementation/controller/.venv/lib/python3.8/site-packages/eventlet/greenthread.py", line 221, in main
    result = function(*args, **kwargs)
  File "/home/ecklm/Projects/Diploma/implementation/controller/.venv/lib/python3.8/site-packages/eventlet/wsgi.py", line 818, in process_request
    proto.__init__(conn_state, self)
  File "/home/ecklm/Projects/Diploma/implementation/controller/.venv/lib/python3.8/site-packages/eventlet/wsgi.py", line 357, in __init__
    self.handle()
  File "/home/ecklm/Projects/Diploma/implementation/controller/.venv/lib/python3.8/site-packages/eventlet/wsgi.py", line 390, in handle
    self.handle_one_request()
  File "/home/ecklm/Projects/Diploma/implementation/controller/.venv/lib/python3.8/site-packages/eventlet/wsgi.py", line 466, in handle_one_request
    self.handle_one_response()
  File "/home/ecklm/Projects/Diploma/implementation/controller/.venv/lib/python3.8/site-packages/eventlet/wsgi.py", line 566, in handle_one_response
    result = self.application(self.environ, start_response)
  File "/home/ecklm/Projects/Diploma/implementation/controller/.venv/lib/python3.8/site-packages/ryu/app/wsgi.py", line 236, in __call__
    return super(wsgify_hack, self).__call__(environ, start_response)
  File "/home/ecklm/Projects/Diploma/implementation/controller/.venv/lib/python3.8/site-packages/webob/dec.py", line 129, in __call__
    resp = self.call_func(req, *args, **kw)
  File "/home/ecklm/Projects/Diploma/implementation/controller/.venv/lib/python3.8/site-packages/webob/dec.py", line 193, in call_func
    return self.func(req, *args, **kwargs)
  File "/home/ecklm/Projects/Diploma/implementation/controller/.venv/lib/python3.8/site-packages/ryu/app/wsgi.py", line 290, in __call__
    return controller(req)
  File "/home/ecklm/Projects/Diploma/implementation/controller/.venv/lib/python3.8/site-packages/ryu/app/wsgi.py", line 160, in __call__
    return getattr(self, action)(req, **kwargs)
  File "/home/ecklm/Projects/Diploma/implementation/controller/.venv/lib/python3.8/site-packages/ryu/app/rest_qos.py", line 440, in set_queue
    return self._access_switch(req, switchid, VLANID_NONE,
  File "/home/ecklm/Projects/Diploma/implementation/controller/.venv/lib/python3.8/site-packages/ryu/app/rest_qos.py", line 529, in _access_switch
    msg = function(rest, vid)
  File "/home/ecklm/Projects/Diploma/implementation/controller/.venv/lib/python3.8/site-packages/ryu/app/rest_qos.py", line 650, in _rest_command
    key, value = func(*args, **kwargs)
  File "/home/ecklm/Projects/Diploma/implementation/controller/.venv/lib/python3.8/site-packages/ryu/app/rest_qos.py", line 713, in set_queue
    self.ovs_bridge.set_qos(port_name, type=queue_type,
  File "/home/ecklm/Projects/Diploma/implementation/controller/.venv/lib/python3.8/site-packages/ryu/lib/ovs/bridge.py", line 536, in set_qos
    self.run_command([command_qos, command_queue])
  File "/home/ecklm/Projects/Diploma/implementation/controller/.venv/lib/python3.8/site-packages/ryu/lib/ovs/bridge.py", line 137, in run_command
    self.vsctl.run_command(commands, self.timeout, self.exception)
  File "/home/ecklm/Projects/Diploma/implementation/controller/.venv/lib/python3.8/site-packages/ryu/lib/ovs/vsctl.py", line 1295, in run_command
    self._run_command(commands)
  File "/home/ecklm/Projects/Diploma/implementation/controller/.venv/lib/python3.8/site-packages/ryu/lib/ovs/vsctl.py", line 1275, in _run_command
    self._do_main(commands)
  File "/home/ecklm/Projects/Diploma/implementation/controller/.venv/lib/python3.8/site-packages/ryu/lib/ovs/vsctl.py", line 1181, in _do_main
    if self._do_vsctl(idl_, commands):
  File "/home/ecklm/Projects/Diploma/implementation/controller/.venv/lib/python3.8/site-packages/ryu/lib/ovs/vsctl.py", line 1124, in _do_vsctl
    status = self.txn.commit_block()
  File "/home/ecklm/Projects/Diploma/implementation/controller/.venv/lib/python3.8/site-packages/ovs/db/idl.py", line 1658, in commit_block
    poller.block()
  File "/home/ecklm/Projects/Diploma/implementation/controller/.venv/lib/python3.8/site-packages/ovs/poller.py", line 231, in block
    events = self.poll.poll(self.timeout)
  File "/home/ecklm/Projects/Diploma/implementation/controller/.venv/lib/python3.8/site-packages/ovs/poller.py", line 137, in poll
    rlist, wlist, xlist = select.select(self.rlist,
  File "/home/ecklm/Projects/Diploma/implementation/controller/.venv/lib/python3.8/site-packages/eventlet/green/select.py", line 80, in select
    return hub.switch()
  File "/home/ecklm/Projects/Diploma/implementation/controller/.venv/lib/python3.8/site-packages/eventlet/hubs/hub.py", line 298, in switch
    return self.greenlet.switch()
eventlet.timeout.Timeout: 5 seconds
qos_manager: Queue setting has failed. ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
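
For completeness, a purely client-side mitigation sketch (post_with_retry is a hypothetical helper, not part of the project): retry the queue-setting POST when the controller drops the connection, as in the RemoteDisconnected error above. This does not touch the underlying 5-second eventlet timeout:

    import time
    import requests

    def post_with_retry(url, retries=2, backoff=2.0, **kwargs):
        # Retry when the controller closes the connection without a response,
        # which is what the eventlet timeout produces on the client side.
        for attempt in range(retries + 1):
            try:
                return requests.post(url, **kwargs)
            except requests.exceptions.ConnectionError:
                if attempt == retries:
                    raise
                time.sleep(backoff * (attempt + 1))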