Can we just lower the timeout?
hmmm; the default timeout of pexpect seems to be pretty low.
>>> console = pexpect.spawn('telnet 123.123.123.123')
>>> console.expect('Username: ')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "build/bdist.linux-x86_64/egg/pexpect/__init__.py", line 1451, in expect
File "build/bdist.linux-x86_64/egg/pexpect/__init__.py", line 1466, in expect_list
File "build/bdist.linux-x86_64/egg/pexpect/__init__.py", line 1568, in expect_loop
pexpect.TIMEOUT: Timeout exceeded.
Here pexpect timed out pretty quickly (around 5-7 seconds).
These are the logs from @tpd001's email.
2017-10-26 16:20:25 DEBUG http://1xxx "POST /switch/powerconnect/port/gi1/0/10/revert HTTP/1.1" 200 0
2017-10-26 16:35:54 ERROR --------------------------------------------------------------
2017-10-26 16:35:54 ERROR HIL reservation failure: Unable to detach node slurm-compute1 from project slurm
2017-10-26 16:35:54 ERROR Exception: HTTPConnectionPool(host='1xxxxx', port=80): Read timed out. (read timeout=None)
2017-10-26 16:35:54 ERROR Traceback: [('/home/slurm/scripts/ve/lib/python2.7/site-packages/hil_slurm_client.py', 82, 'hil_reserve_nodes', 'hil_client.project.detach(project, node)'), ('/home/slurm/scripts/ve/lib/python2.7/site-packages/hil/client/project.py', 54, 'detach', 'self.httpClient.request("POST", url, data=self.payload)'), ('/home/slurm/scripts/ve/lib/python2.7/site-packages/hil/client/client.py', 87, 'request', 'resp = requests.Session.request(self, *args, **kwargs)'), ('/home/slurm/scripts/ve/lib/python2.7/site-packages/requests/sessions.py', 508, 'request', 'resp = self.send(prep, **send_kwargs)'), ('/home/slurm/scripts/ve/lib/python2.7/site-packages/requests/sessions.py', 618, 'send', 'r = adapter.send(request, **kwargs)'), ('/home/slurm/scripts/ve/lib/python2.7/site-packages/requests/adapters.py', 521, 'send', 'raise ReadTimeout(e, request=request)')]
2017-10-26 16:35:54 ERROR --------------------------------------------------------------
2017-10-26 16:35:54 ERROR HIL reservation failure: Unable to reserve nodes ['slurm-compute1']
From the logs, it looks like the API server successfully queued the networking operation at 16:20, but I don't know why it waited 15 minutes before reporting that it was unable to detach the node.
I think what we can do is check the return value of console.expect right after spawn:
>>> console = pexpect.spawn('telnet 192.xxxxxx47')
>>> console.expect('User Name:')
0
>>>
So in the connect method, we can check if it's not zero and raise an error.
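Roughly something like this, as a sketch (the connect function below is hypothetical, not the actual driver code; it just illustrates checking the index that expect returns):

import pexpect

def connect(hostname, prompt='User Name:', timeout=10):
    # spawn the telnet session with a short, explicit timeout
    console = pexpect.spawn('telnet ' + hostname, timeout=timeout)
    # expect() returns the index of whichever pattern matched first,
    # so anything other than 0 means we never saw the login prompt
    index = console.expect([prompt, pexpect.TIMEOUT, pexpect.EOF])
    if index != 0:
        console.close()
        raise Exception('No login prompt from switch %s' % hostname)
    return console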
The timeout on my laptop is around 30 seconds, which is still far too short to explain what's going on.
I don't want to just patch a feature into this without actually understanding why this is happening. We have some debugging to do.
I was playing around with some other connection issues, and then I looked closely at these logs:
The timeout is set to None; in theory it should just keep waiting, but I don't know why it stops after 15 minutes.
We should, in the client library, handle this case and maybe change the timeout to something reasonable.
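For context, this is standard requests behavior rather than anything HIL-specific: with timeout=None the call can block indefinitely on the read, while an explicit value makes it fail fast (illustrative URL, not a real endpoint):

import requests

try:
    # an explicit timeout raises ReadTimeout instead of hanging the caller
    requests.post('http://hil-server/example', timeout=30)
except requests.exceptions.ReadTimeout:
    print('HIL server did not respond within 30 seconds')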
And I should probably look into why my test HIL server wasn't responding within a reasonable time frame.
Coming back to the original reason this issue was opened: I think we should still have a method in the switch driver that only checks for connectivity, so that we don't block the networking action queue (for operations on other switches).
The current documentation promises that operations are applied in order; relaxing that constraint would have a significant impact on the use of the API. Right now we have to block in order to provide the correct semantics.
Why do we have to block the queue if the actions are happening on different ports?
Because a user may assume that if they do (for example) node_detach_network(node0, eth0, net0) and then node_connect_network(node0, eth1, net0), they can rely on never having a situation where both of those NICs are on the same network. This could be important, e.g. to avoid loops or some such.
We could just change the contract and say to users that if they want that constraint, they have to wait until the action has gone through. But all of the documentation says that the ordering is something they can rely on, so we can't allow the second call to go through if the first one fails.
Relaxing that guarantee sounds reasonable to me, but we should probably make sure we actually have a way for a user to check the status of a networking action before we make that change.
If we do make that change, then we still don't need the method to check connectivity, because we can just have the networking daemon go ahead with other actions when unrelated ones fail.
Re: changing the default timeout in the client library, I don't like this. There's nothing HIL specific about our use of HTTP, and besides, the client library is abstracted over the HTTP backend, so we can't actually change the underlying client without changing that interface. If a user wants a lower timeout, they can just give us an HTTPClient that's been customized to their needs.
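To illustrate what that user-side customization could look like (purely a sketch; the class name is mine, and I'm assuming the HTTP backend wraps requests.Session, as the traceback above suggests):

import requests

class TimeoutSession(requests.Session):
    """A requests Session that applies a default timeout to every request."""
    def __init__(self, timeout=30):
        super(TimeoutSession, self).__init__()
        self.timeout = timeout

    def request(self, *args, **kwargs):
        # fall back to the session-wide timeout unless the caller overrides it
        kwargs.setdefault('timeout', self.timeout)
        return super(TimeoutSession, self).request(*args, **kwargs)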
I am more or less convinced, so closing this issue.
For our pexpect-based drivers, if there's no connectivity to the switch, it takes about 15 minutes to time out. This can be frustrating at times.
One solution I have in mind is to have a method in our switch drivers,
switch.switch_alive()
that simply pings the switch to see if there's connectivity and raises an error if there isn't; a rough sketch is below. This method can be called by the node_{connect/disconnect}_network API before queuing stuff in the networking action queue. I am open to other ideas too.
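Something along these lines (attribute names like self.hostname are hypothetical, and this isn't tied to any particular driver):

import socket

def switch_alive(self, port=23, timeout=5):
    """Raise an error quickly if the switch isn't reachable over telnet."""
    try:
        sock = socket.create_connection((self.hostname, port), timeout=timeout)
        sock.close()
    except (socket.timeout, socket.error):
        raise Exception('Switch %s is not reachable' % self.hostname)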
Would appreciate your thoughts on this @zenhack @Izhmash @knikolla
cc: @tpd001