CCI-MOC / hil

Hardware Isolation Layer, formerly Hardware as a Service
Apache License 2.0

Some way of testing switch connectivity #909

Closed naved001 closed 6 years ago

naved001 commented 7 years ago

For our pexpect-based drivers, if there's no connectivity to the switch, it takes about 15 minutes to time out. This can be frustrating at times.

One solution that I have in mind is to have a method in our switch drivers, switch.switch_alive(), that simply pings the switch to see if there's connectivity and raises an error if there isn't. This method can be called by the node_{connect/disconnect}_network API before queuing stuff in the networking action queue.
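Something along these lines, purely as a sketch (the attribute names and where the check lives are hypothetical, not the actual driver interface):

import socket

def switch_alive(self, timeout=5):
    """Return True if we can open a TCP connection to the switch's management interface."""
    try:
        # Fail fast instead of waiting for the full pexpect timeout.
        sock = socket.create_connection((self.hostname, self.port), timeout=timeout)
        sock.close()
        return True
    except (socket.error, socket.timeout):
        return False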

I am open to other ideas too.

Would appreciate your thoughts on this @zenhack @Izhmash @knikolla
cc: @tpd001

zenhack commented 7 years ago

Can we just lower the timeout?

naved001 commented 7 years ago

Hmmm; the default timeout of pexpect seems to be pretty low:

>>> console = pexpect.spawn('telnet 123.123.123.123')
>>> console.expect('Username: ')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "build/bdist.linux-x86_64/egg/pexpect/__init__.py", line 1451, in expect
  File "build/bdist.linux-x86_64/egg/pexpect/__init__.py", line 1466, in expect_list
  File "build/bdist.linux-x86_64/egg/pexpect/__init__.py", line 1568, in expect_loop
pexpect.TIMEOUT: Timeout exceeded.

Here pexpect timed out pretty quickly (around 5-7 seconds).
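(For reference, both pexpect.spawn and expect accept an explicit timeout in seconds, so this could be tuned even lower if we wanted:)

>>> console = pexpect.spawn('telnet 123.123.123.123', timeout=10)
>>> console.expect('Username: ', timeout=10)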

These are the logs from @tpd001's email.

2017-10-26 16:20:25 DEBUG http://1xxx"POST /switch/powerconnect/port/gi1/0/10/revert HTTP/1.1" 200 0
2017-10-26 16:35:54 ERROR --------------------------------------------------------------
2017-10-26 16:35:54 ERROR HIL reservation failure: Unable to detach node slurm-compute1 from project slurm
2017-10-26 16:35:54 ERROR Exception: HTTPConnectionPool(host='1xxxxx', port=80): Read timed out. (read timeout=None)
2017-10-26 16:35:54 ERROR Traceback: [('/home/slurm/scripts/ve/lib/python2.7/site-packages/hil_slurm_client.py', 82, 'hil_reserve_nodes', 'hil_client.project.detach(project, node)'), ('/home/slurm/scripts/ve/lib/python2.7/site-packages/hil/client/project.py', 54, 'detach', 'self.httpClient.request("POST", url, data=self.payload)'), ('/home/slurm/scripts/ve/lib/python2.7/site-packages/hil/client/client.py', 87, 'request', 'resp = requests.Session.request(self, *args, **kwargs)'), ('/home/slurm/scripts/ve/lib/python2.7/site-packages/requests/sessions.py', 508, 'request', 'resp = self.send(prep, **send_kwargs)'), ('/home/slurm/scripts/ve/lib/python2.7/site-packages/requests/sessions.py', 618, 'send', 'r = adapter.send(request, **kwargs)'), ('/home/slurm/scripts/ve/lib/python2.7/site-packages/requests/adapters.py', 521, 'send', 'raise ReadTimeout(e, request=request)')]
2017-10-26 16:35:54 ERROR --------------------------------------------------------------
2017-10-26 16:35:54 ERROR HIL reservation failure: Unable to reserve nodes ['slurm-compute1']

From the logs, it looks like the API server successfully queued the networking operation at 16:20, but I don't know why it waited 15 minutes before reporting that it was unable to detach the node.

I think what we can do is check the return value of console.expect right after spawn:

>>> console = pexpect.spawn('telnet 192.xxxxxx47')
>>> console.expect('User Name:')
0
>>>

So in the connect method, we can check that the expected prompt actually matched (or catch the timeout) and raise an error right away.
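For illustration, a minimal sketch of what that could look like (the method and message names here are made up, not the real driver code):

import pexpect

def _connect(switch_ip, timeout=10):
    console = pexpect.spawn('telnet ' + switch_ip)
    try:
        # expect() returns the index of the matched pattern (0 for the prompt);
        # if the prompt never shows up it raises pexpect.TIMEOUT (or pexpect.EOF).
        console.expect('User Name:', timeout=timeout)
    except (pexpect.EOF, pexpect.TIMEOUT):
        raise Exception('Unable to reach switch at %s' % switch_ip)
    return console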

zenhack commented 7 years ago

The timeout on my laptop is around 30 seconds, which is still far too short to explain what's going on.

I don't want to just patch a feature into this without actually understanding why this is happening. We have some debugging to do.

naved001 commented 6 years ago

Was playing around with some other connection issues and then I looked closely at these logs:

The timeout is set to None, so in theory it should just keep waiting, but I don't know why it stops after 15 minutes.

Coming back to the original reason this issue was opened: I think we should still have a method in the switch driver that only checks for connectivity, so that we don't block the networking action queue (for operations on other switches).

zenhack commented 6 years ago

The current documentation claims that networking actions are executed in order; relaxing that constraint would have a significant impact on the use of the API. Right now we have to block in order to provide the correct semantics.

naved001 commented 6 years ago

Why do we have to block the queue if the actions are happening on different ports?

zenhack commented 6 years ago

Because a user may assume that if they do (for example) node_detach_network(node0, eth0, net0) then node_connect_network(node0, eth1, net0), they can rely on not ever having a situation where both of those NICs are on the same network. This could be important e.g. to avoid loops or some such.

We could just change the contract, and say to users if they want that constraint they have to wait until the action has gone through. But all of the documentation says that the ordering is something they can rely on, so we can't allow the second call to go through if the first one fails.

Relaxing that guarantee sounds reasonable to me, but we should probably make sure we actually have a way for a user to check the status of a networking action before we make that change.

If we do make that change, then we still don't need the method to check connectivity, because we can just have the networking daemon go ahead with other actions when unrelated ones fail.

Re: changing the default timeout in the client library, I don't like this. There's nothing HIL-specific about our use of HTTP, and besides, the client library is abstracted over the HTTP backend, so we can't actually change the underlying client without changing that interface. If a user wants a lower timeout, they can just give us an HTTPClient that's been customized to their needs.
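For example (a sketch only; the exact HTTPClient interface may differ from this), a user could hand us a requests session that applies its own timeout:

import requests

class TimeoutSession(requests.Session):
    """A requests Session that applies a default read timeout to every call."""

    def __init__(self, timeout=30):
        super(TimeoutSession, self).__init__()
        self._timeout = timeout

    def request(self, *args, **kwargs):
        # Only fill in a timeout if the caller didn't pass one explicitly.
        kwargs.setdefault('timeout', self._timeout)
        return super(TimeoutSession, self).request(*args, **kwargs)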

naved001 commented 6 years ago

I am more or less convinced, so closing this issue.