LLNL / pyorick

Run yorick from python
BSD 2-Clause "Simplified" License

Cannot quit yorick session properly #4

Closed Clemovski closed 10 months ago

Clemovski commented 11 months ago

Good morning,

I'm using the latest version of pyorick and yorick 2.2.04x. My problem is that yorick cannot quit properly and throws an error every time I use pyorick. (Quitting from within a yorick session itself does not cause any problem, only quitting from Python.) Here is a sample from a Python session and its associated error. Is there something I can do?

Thank you.

Python 3.8.10 (default, Jun  2 2021, 10:49:15) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyorick import *
>>> p=Yorick()
>>> p.call.quit()

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/xxx/.local/lib/python3.8/site-packages/pyorick/pyorick.py", line 360, in __call__
    self.bare._reqrep(ID_SUBCALL, self.name, *args, **kwargs)
  File "/home/xxx/.local/lib/python3.8/site-packages/pyorick/pyorick.py", line 214, in _reqrep
    reply = reply.decode()
  File "/home/xxx/.local/lib/python3.8/site-packages/pyorick/pyorick.py", line 686, in decode
    return codec.idtable[self.packets[0][0]].decoder(self)
  File "/home/xxx/.local/lib/python3.8/site-packages/pyorick/pyorick.py", line 938, in narray
    return msg.packets[pos]
IndexError: list index out of range
>>>
dhmunro commented 11 months ago

I can't reproduce this (python 3.8.16, yorick 2.2.04x). You'll need to do some more debugging.

Since you got p=Yorick(), the first thing to try is p.debug(). However, this will call p.reqrep, which is broken for you, so it probably won't work. You can try

p.v.pydebug=1
p._debug=True

If these succeed, you should get a bit more output on what happens with other commands like p.c.quit(). If that doesn't work, you have to do a lot more work to figure out why the pipes between python and yorick are not working for you. This probably requires stepping through your python call with pdb.run("p.call.quit()") to see exactly what is being sent and received through the yorick pipes. The pyorick.PipeProcess class is where most of this happens, so you're going to have to understand at least this much of the python. On the yorick side, you may need to edit pyorick.i0 to turn on pydebug, possibly at the bottom of the pyorick function, which is the command p=Yorick() sends, and which is apparently the last thing that succeeded.

There are five file descriptors that potentially have traffic: yorick's stdin, stdout, and stderr (which should be connected to python's stdout so you just see it), plus the binary input and output pipes python creates before starting yorick. The protocol is hand-rolled, and I don't remember it well enough to have any suggestions. Hopefully it is simple enough that you can figure out what is broken for you. I'm running on a Linux Mint 21.2 machine (more or less Ubuntu 22.04) with Anaconda python and the latest yorick from github.com/LLNL and it just works for me.
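The extra pair of binary pipes described above can be illustrated with a minimal sketch. This is not pyorick's code: the child here is a small Python program standing in for yorick, and `round_trip` and `CHILD` are hypothetical names invented for the illustration. It assumes a POSIX system, since it relies on `pass_fds`.

```python
import os
import subprocess
import sys

# Stand-in for yorick (an assumption for this sketch): the child reads one
# request from an inherited descriptor and writes a reply to another.
CHILD = r"""
import os, sys
rfd, wfd = int(sys.argv[1]), int(sys.argv[2])
request = os.read(rfd, 1024)
os.write(wfd, b"reply:" + request)
"""

def round_trip(payload: bytes) -> bytes:
    """Send payload to the child over one pipe and read its reply from another."""
    # Both pipes are created before the child process is started.
    to_child_r, to_child_w = os.pipe()
    from_child_r, from_child_w = os.pipe()
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD, str(to_child_r), str(from_child_w)],
        pass_fds=(to_child_r, from_child_w),  # keep these open in the child
    )
    # The parent closes the pipe ends that only the child uses.
    os.close(to_child_r)
    os.close(from_child_w)
    os.write(to_child_w, payload)
    os.close(to_child_w)
    reply = os.read(from_child_r, 1024)
    os.close(from_child_r)
    proc.wait()
    return reply
```

Calling `round_trip(b"quit")` sends the bytes to the child and returns what the child wrote back. The child's stdin, stdout, and stderr are untouched by this exchange, which is why a real session has five descriptors to watch rather than three.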

dhmunro commented 11 months ago

I forgot: One thing to always try whenever you have any problem with yorick is to remove any startup code you may have in your ~/.yorick or ~/yorick directory. (In other words, if you have a custom.i file or an i-start/ subdirectory, move them somewhere that yorick does not run them when it starts.) It is easily possible to break any yorick package if you have any non-standard code that runs when it starts. Just move your whole startup directory out of the way if you have one. If not, you're back to the previous comment.
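One way to move the startup directories out of the way without deleting anything is sketched below. The function name `set_aside_startup_dirs` and the `.disabled` suffix are assumptions made for this example; only the directory names ~/.yorick and ~/yorick come from the comment above.

```python
import os
from pathlib import Path

def set_aside_startup_dirs(home, suffix=".disabled"):
    """Rename <home>/.yorick and <home>/yorick so no user startup code is found.

    Returns a list of (old_path, new_path) pairs for the directories moved.
    """
    moved = []
    for name in (".yorick", "yorick"):
        src = Path(home) / name
        dst = Path(home) / (name + suffix)
        # Skip names that are absent, and never overwrite an earlier backup.
        if src.exists() and not dst.exists():
            os.rename(src, dst)
            moved.append((src, dst))
    return moved
```

Running `set_aside_startup_dirs(Path.home())` before starting the Python session leaves the originals recoverable: renaming the `.disabled` directories back restores the previous setup.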

Clemovski commented 11 months ago

Thank you very much for your answer.

I removed every startup file that is not one of the default ones, so there is nothing left in ~/yorick or ~/.yorick.

I managed to capture some debug information when the quit command works (it sometimes does) and when it doesn't. See below ...

>>> p=Yorick()
>>> p.debug(1)
>>> p.call.quit()
P>send0: nolf=False text=pyorick;

P>reqrep: request=[37  4]
P>send: 16 bytes sent
P>send: 4 bytes sent
P>send: 16 bytes sent
P>reqrep: blocking for reply...
 Y>_pyorick_get: blocking...
 Y>_pyorick_get: got message [37,4]
 Y>_pyorick_get: blocking...
 Y>_pyorick_get: got message [21,0]
 Y>_pyorick_wait: got message [37,4]
 Y>pyorick request is:
"quit"
P>recv: 16 bytes
P>reqrep: reply=[21 -1]

So everything is fine here. Same session just after ...

>>> p=Yorick()
>>> p.debug(1)
>>> p.call.quit()
P>send0: nolf=False text=pyorick;

P>reqrep: request=[37  4]
P>send: 16 bytes sent
P>send: 4 bytes sent
P>send: 16 bytes sent
P>reqrep: blocking for reply...
 Y>_pyorick_get: blocking...
 Y>_pyorick_get: got message [37,4]
 Y>_pyorick_get: blocking...
 Y>_pyorick_get: got message [21,0]
 Y>_pyorick_wait: got message [37,4]
 Y>pyorick request is:
"quit"

P>echo_pty: prompt=PYORICK-QUIT> 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/xxx/.local/lib/python3.8/site-packages/pyorick/pyorick.py", line 360, in __call__
    self.bare._reqrep(ID_SUBCALL, self.name, *args, **kwargs)
  File "/home/xxx/.local/lib/python3.8/site-packages/pyorick/pyorick.py", line 214, in _reqrep
    reply = reply.decode()
  File "/home/xxx/.local/lib/python3.8/site-packages/pyorick/pyorick.py", line 686, in decode
    return codec.idtable[self.packets[0][0]].decoder(self)
  File "/home/xxx/.local/lib/python3.8/site-packages/pyorick/pyorick.py", line 938, in narray
    return msg.packets[pos]
IndexError: list index out of range

I retrieved the values of msg.packets : [array([0, 0])]

and pos : 1

I can print other python values from the errors if you think it could be useful.
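The reported values are enough to reproduce the exception outside pyorick. In the sketch below a plain nested list stands in for `[array([0, 0])]`, and `has_packet` is a hypothetical helper written only for this illustration. It shows that the reply holds a single packet while the decoder asks for a second one.

```python
def has_packet(packets, pos):
    """Return True if packets[pos] can be read without raising IndexError."""
    return 0 <= pos < len(packets)

# Values reported above; a nested list stands in for the numpy array.
packets = [[0, 0]]
pos = 1

try:
    packets[pos]          # same indexing as "return msg.packets[pos]"
    failed = False
except IndexError:
    failed = True         # only index 0 exists, so index 1 is out of range
```

Here `has_packet(packets, pos)` is False and `failed` is True, which matches the last frame of the traceback.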

dhmunro commented 11 months ago

Looks like it is probably some kind of race condition among all of the sockets - exactly what order the events on yorick's stdout and _pyorick_wfd file descriptors get handled. Since it's intermittent even for you, a race might not be possible to debug any more than you've done... I'll stare at the code to see if I can imagine how the two writes yorick does in one order might arrive in python in the opposite order.

Meanwhile, does this happen only with the quit command? If everything else works, then the stupid workaround is to simply write your own kill command - say p.kill() inside a python try block which catches this error and ignores it. I'm guessing the yorick process is actually gone, so that ignoring the problem probably does no harm. I'm not sure what state the file descriptors wind up in, but if python cleans them up when your p object is deleted, this dumb fix would let you move forward.
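A minimal sketch of that workaround follows, assuming the only failure to ignore is the IndexError from the traceback. `quit_ignoring_reply_error` is a hypothetical helper, and `FakeSession` is a stand-in that lets the example run without yorick; with pyorick the argument would be the `p` object from `p=Yorick()`.

```python
def quit_ignoring_reply_error(session):
    """Call session.call.quit() and swallow the IndexError from a short reply.

    Returns True if quit returned normally, False if the error was ignored.
    """
    try:
        session.call.quit()
    except IndexError:
        return False
    return True


class _FakeCall:
    """Stand-in for p.call (hypothetical, for demonstration only)."""

    def __init__(self, fail):
        self._fail = fail

    def quit(self):
        if self._fail:
            raise IndexError("list index out of range")


class FakeSession:
    """Stand-in for a Yorick() session object (hypothetical)."""

    def __init__(self, fail):
        self.call = _FakeCall(fail)
```

Catching only IndexError keeps any other failure visible. The sketch does not check whether the yorick process has actually exited or what state the file descriptors are left in.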

On the other hand, if pyorick is broken for commands other than quit, what you've found is a more serious bug with no obvious workaround. So is quit the only non-working command?

dhmunro commented 11 months ago

I just committed a change that should fix the problem you are having with the quit command. However, it only applies to the specific case in which yorick has terminated as a result of the command you sent. If anything else did not work, this won't fix it. Let me know if it works for you now.

Clemovski commented 11 months ago

The kill command works, but it doesn't let yorick shut down cleanly, which leads to further problems for me. I currently use quit inside a try/except block as you suggested, but it leaves pyorick processes running, which I have to kill manually later. (Actually that was a syntax error on my part.) Currently quit is the only command that shows this problem for me.

Thank you very much for your fix, I will try this and keep you informed ...

Clemovski commented 11 months ago

I just tried your update and it seems to do it! I can recognize the cases where it would have crashed because quit() prints an extra newline.

Thank you very much again!

dhmunro commented 10 months ago

Good. Closing this issue.