CCI-MOC / obmd

OBM management microservice for use with HIL
Apache License 2.0

Console connection ends abruptly #29

Open naved001 opened 5 years ago

naved001 commented 5 years ago

This was reported by one of the users.

The console show will end prematurely with "('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))"

And this is what the obmd service logs:

Oct 16 16:17:10 kzn-hil-server.infra.massopen.cloud obmd[26706]: 2018/10/16 16:17:10 Error reading from console: read /dev/ptmx: file already closed
Oct 16 16:48:31 kzn-hil-server.infra.massopen.cloud obmd[26706]: 2018/10/16 16:48:31 Error reading from console: read /dev/ptmx: file already closed
Oct 17 12:36:26 kzn-hil-server.infra.massopen.cloud obmd[26706]: 2018/10/17 12:36:26 Error reading from console: read /dev/ptmx: file already closed
Oct 17 12:39:23 kzn-hil-server.infra.massopen.cloud obmd[26706]: 2018/10/17 12:39:23 Error reading from console: read /dev/ptmx: file already closed
Oct 17 12:41:23 kzn-hil-server.infra.massopen.cloud obmd[26706]: 2018/10/17 12:41:23 Error reading from console: read /dev/ptmx: file already closed
Oct 17 12:43:09 kzn-hil-server.infra.massopen.cloud obmd[26706]: 2018/10/17 12:43:09 Error reading from console: read /dev/ptmx: file already closed
Oct 17 12:47:26 kzn-hil-server.infra.massopen.cloud obmd[26706]: 2018/10/17 12:47:26 Error reading from console: read /dev/ptmx: file already closed

@zenhack I can give you access to the new kzn-hil-server which is running hil and obmd (if you need to investigate this).

zenhack commented 5 years ago

Quoting Naved Ansari (2018-10-17 14:00:40)

> @zenhack I can give you access to the new kzn-hil-server which is running hil and obmd (if you need to investigate this).

This would be useful.

zenhack commented 5 years ago

I have a suspicion that ipmitool is losing its connection to the server for some reason. My guess is that the error users are seeing comes from ipmitool itself (just being copied naively by obmd like everything else), and the message in the obmd log comes from attempting to read from the pipe after ipmitool has exited. I'm not sure why the connection would be dying though. How frequently is this happening?
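
To make the hypothesis concrete, here is a minimal Python sketch of the relay pattern described above: a parent copies a child's output verbatim until it hits end-of-file, then checks how the child exited. This is not obmd's code (the log suggests obmd reads from a pty rather than a plain pipe); `relay_until_exit` and the stand-in child command are hypothetical names used only for illustration.

```python
import subprocess
import sys


def relay_until_exit(argv):
    """Copy a child's combined stdout/stderr until EOF; return (output, exit code)."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    chunks = []
    while True:
        chunk = proc.stdout.read1(4096)
        if not chunk:
            # An empty read means the child closed its end of the pipe,
            # which normally happens because it exited.
            break
        chunks.append(chunk)
    proc.stdout.close()
    return b"".join(chunks), proc.wait()


# Stand-in for a tool that prints an error message and then dies.
child = [sys.executable, "-c", "import sys; print('session lost'); sys.exit(1)"]
output, code = relay_until_exit(child)
```

If the relay keeps reading after this point instead of stopping, the next read fails, which would be consistent with the "Error reading from console" lines in the log.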

naved001 commented 5 years ago

> This would be useful.

Done.

> How frequently is this happening?

Well, only one user has reported that it happens to him sometimes. I have not seen it personally.

zenhack commented 5 years ago

I'm curious how often "sometimes" is; I'm not sure what we can expect in terms of reliability from ipmitool's connection.

Can you email me details about how to connect?

zenhack commented 5 years ago

I've narrowed this down a bit: the issue is actually somewhere in the hil client lib or cli tool; doing a curl on the console URL doesn't ever time out. I will keep digging.

zenhack commented 5 years ago

Somewhere inside urllib3, a read is returning a zero-length result, and the library is interpreting this as the connection being closed. I'm not entirely sure why curl never times out while the hil command line does.
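
For reference, the sketch below shows how a zero-length read in the middle of a chunked HTTP body can turn into the error users reported. It is a simplified decoder written for illustration, not the actual implementation in urllib3 or `http.client`; `read_chunked` is a hypothetical name, and the assumption that the console endpoint streams with chunked encoding is mine. `http.client.IncompleteRead` is the real standard-library exception whose repr shows up in the report.

```python
import io
from http.client import IncompleteRead


def read_chunked(stream):
    """Decode a chunked HTTP/1.1 body from a binary stream (simplified)."""
    body = bytearray()
    while True:
        size_line = stream.readline()
        if not size_line:
            # Zero-length read: the stream ended before the terminating
            # zero-size chunk, so the body is treated as incomplete.
            raise IncompleteRead(b"")
        size = int(size_line.split(b";")[0].strip(), 16)
        if size == 0:
            stream.readline()  # consume the blank line that ends the body
            return bytes(body)
        data = stream.read(size)
        if len(data) < size:
            raise IncompleteRead(data, size - len(data))
        body += data
        stream.read(2)  # consume the CRLF that follows each chunk


complete = io.BytesIO(b"5\r\nhello\r\n0\r\n\r\n")
truncated = io.BytesIO(b"5\r\nhello\r\n")  # no terminating chunk

decoded = read_chunked(complete)
try:
    read_chunked(truncated)
    error_text = None
except IncompleteRead as exc:
    error_text = repr(exc)
```

Here `error_text` matches the inner part of the message in the original report, which is why an idle or dropped stream can look like a broken connection to the client.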

naved001 commented 5 years ago

I experienced the same issue even when talking directly to the server with curl instead of the Python client. It disconnected in under a minute.

naved::~$curl -X GET $HIL_ENDPOINT/v0/node/neu-5-11/console --user $HIL_USERNAME:$HIL_PASSWORD -L
[SOL Session operational.  Use ~? for help]
curl: (18) transfer closed with outstanding read data remaining

I had an ssh session on the same machine, and from there I was sending some data to the serial port. (echo "hello" > /dev/ttyS0).

I did not see any output even when using ipmitool directly, but after I hit enter once in the SOL session opened by ipmitool, I could see the prompt and the serial output.

So I think inserting a newline or any character into the stdin of the ipmitool process might fix things.
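
A rough Python sketch of that idea, for illustration only: start a child with its stdin connected to a pipe and write a single newline to it right away. The helper name `start_and_nudge` and the stand-in child are hypothetical; a real fix would live in obmd and write to whatever it has attached to ipmitool's stdin.

```python
import subprocess
import sys


def start_and_nudge(argv, nudge=b"\n"):
    """Start a child process and immediately send one keystroke to its stdin."""
    proc = subprocess.Popen(argv, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    proc.stdin.write(nudge)
    proc.stdin.flush()  # make sure the byte reaches the child now
    return proc


# Stand-in child: reads one line from stdin and prints its repr.
child = [sys.executable, "-c", "import sys; print(repr(sys.stdin.readline()))"]
proc = start_and_nudge(child)
first_line = proc.stdout.readline().strip()

# Clean up the pipes and reap the child.
proc.stdin.close()
proc.stdout.close()
proc.wait()
```

Whatever is listening on the serial console also receives that newline, so this is only safe if an unexpected Enter at that moment is harmless.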

This was tested with Dell hardware (PowerEdge R620).

zenhack commented 5 years ago

Can you reproduce this reliably? When I was experimenting I tried it many times and curl was always reliable. Is this the same hardware you had me using or a different model of server?

Quoting Naved Ansari (2019-02-20 17:38:48)

> So I think inserting a newline or any character into the stdin of the ipmitool process might fix things.

Want to patch obmd and test this?

If that does fix the problem we should think about implications; I want to make sure the machine won't do anything unexpected if we're hitting enter at its prompts.

naved001 commented 5 years ago

> Can you reproduce this reliably?

I tested this with two machines. I'll double check with two other machines of the same kind, and some other type of hardware.

> Is this the same hardware you had me using or a different model of server?

These are all Dell PowerEdge R620s, but some are a generation ahead (a mix of Sandy Bridge and Ivy Bridge machines). I'll gather the iDRAC version and other information.

> Want to patch obmd and test this?

Yeah, I'll give that a try.

> I want to make sure the machine won't do anything unexpected if we're hitting enter at its prompts.

That's what I am worried about too, because while it's booting up, at various points it will ask you to press some key to get into the RAID controller settings, some NIC settings, etc.