TheCrazyT / roboschool

Open-source software for robot simulation, integrated with OpenAI Gym.

Roboschool Windows Crashes Upon Exit #8

Open Akz47 opened 6 years ago

Akz47 commented 6 years ago

When I exit the simulations or press Ctrl-C at the command prompt, Windows always displays a crash alert and tries to initiate a Windows Error Reporting.

I checked the Windows error log, and the crash appears to be associated with the Qt5 module:

Faulting application name: python.exe, version: 3.6.1150.1013, time stamp: 0x5914acba
Faulting module name: Qt5Gui.dll, version: 5.10.0.0, time stamp: 0x5a2ae6f3
Exception code: 0xc0000005
Fault offset: 0x0000000000078626
Faulting process id: 0x2328
Faulting application start time: 0x01d41a82f4c8335f
Faulting application path: K:\Anaconda3\python.exe
Faulting module path: L:\WINDOWS\SYSTEM32\Qt5Gui.dll
Report Id: f6d73881-e892-47c0-9a5e-e2417c547b15
Faulting package full name:
Faulting package-relative application ID:

I am using the following versions:

I don't think it affects the operation of the simulation, but it's weird that it crashes each time.

Thanks!

TheCrazyT commented 6 years ago

OK, I must admit it has nothing to do with Anaconda or the Python version. For some reason I do not get the crash window anymore, but I can clearly see the same fault offsets in the Windows event log. (Maybe I disabled the window with some setting at some point; I'm not sure.) I will try to investigate it. All I know for now is that the cpp_household library seems to execute the _ZN14QOpenGLContext14currentContextEv function (which itself calls _ZNK14QOpenGLContext14extraFunctionsEv), which seems to access an invalid offset. My guess is a null pointer caused by freed resources or something like that.

Edit: Oh well, I'm quite sure it crashes at locations like this one: https://github.com/TheCrazyT/roboschool/blob/master/roboschool/cpp-household/render-simple.cpp#L340

There is probably more than one line that uses QOpenGLContext::currentContext()->extraFunctions(); I guess I will just add a null-pointer check at those locations.

Akz47 commented 6 years ago

Thank you for promptly locating the code segments that caused the null pointer errors. What are the ExtraFunctions for? Are they not needed for rendering the simulations?

Meanwhile, I've entirely disabled Windows Error Reporting and the crash dialog.

Below are the steps, in case anyone wishes to do so too:

So I will be oblivious to any crashes. :)

Please let me know if and when an updated version is available.

TheCrazyT commented 6 years ago

> What are the ExtraFunctions for? Are they not needed for rendering the simulations?

They are needed, but by the time some destructors are called, the QOpenGLContext is already gone. That's a problem because QOpenGLContext::currentContext() then returns a null pointer instead of an object.

TheCrazyT commented 6 years ago

I think I fixed it; you can download it here: https://dl.bintray.com/thecrazyt/roboschool/0.5/ (Basically, you only need to download cpp_household.pyd and replace the existing file.)

Akz47 commented 6 years ago

Thank you for your fantastically speedy fix; it works perfectly now without any crash!

I've re-enabled Windows Error Reporting and double checked in the Event Log to confirm that no application errors are reported.

On a separate note, I'm having a little trouble executing the multiplayer samples demo_pong.py and demo_race1.py, which use os.mkfifo, a function that is not available on Windows:

Traceback (most recent call last):
  File "demo_pong.py", line 23, in <module>
    gameserver = roboschool.multiplayer.SharedMemoryServer(game, "pongdemo", want_test_window=True)
  File "roboschool\multiplayer.py", line 257, in __init__
    player_n=n)
  File "roboschool\multiplayer.py", line 147, in __init__
    os.mkfifo(self.sh_pipe_actready_filename)
AttributeError: module 'os' has no attribute 'mkfifo'

I tried searching around for a Windows version of this multiplayer.py but couldn't seem to find any. I saw some general (non-Roboschool) recommendations to replace os.mkfifo with os.pipe / pywin32 / ctypes, but I'm not too sure how exactly to go about doing that.
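A minimal sketch of a portability check, not the fix that was later committed to this fork: it only tests whether os.mkfifo exists before calling it. The helper name create_fifo_if_supported is hypothetical; on Windows a named-pipe replacement (for example through pywin32) would still have to be written separately.

```python
import os


def create_fifo_if_supported(path):
    """Create a FIFO at `path` when the platform offers os.mkfifo.

    Returns True if the FIFO was created and False if os.mkfifo is
    missing (as it is on Windows), so the caller can choose another
    transport instead of hitting an AttributeError.
    """
    if hasattr(os, "mkfifo"):
        os.mkfifo(path)
        return True
    return False
```

A caller could branch on the return value and fall back to a Windows-specific pipe implementation when it is False.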

Line 147 in multiplayer.py is:

os.mkfifo(self.sh_pipe_actready_filename)

Could you please shed some light on how I can fix this for Windows execution?

Thank you once again for your kind assistance.

TheCrazyT commented 6 years ago

I tried os.pipe and failed. After that I tried win32pipe to use named pipes, and failed again. I guess I can't fix it that easily; it would take more time.

Akz47 commented 6 years ago

No worries, thank you very much for trying. I'll experiment with the other examples that do not require the pipes first.

TheCrazyT commented 6 years ago

Alright, it seems to work now with this commit: https://github.com/TheCrazyT/roboschool/commit/ecf52791f022443a110938931b02f04a5a17a824 Although I'm trying to make the project work on both operating systems, I have probably broken the ability to use the same project on Linux. I do not have enough time to test on both machines, and since a new Roboschool version is planned, I guess it is not worth the effort to make it work on both systems.

Well, my solution is working, but it can probably be improved. I'm not sure which demos are still failing; I tested my solution with demo_pong.py.

Edit: I'm not sure if it is an error, but the animation seems to stop at frame 999. I currently can't figure out whether this happens intentionally or because of a bug. The strange thing is that I can't find any number in the source that limits it to that frame. All I know is that you get an error about a closed pipe, which is weird because the source currently never closes the pipe (except if the "server"/Python process stops).

Akz47 commented 6 years ago

Thank you for updating the Roboschool to address the pipe issue.

I've updated my Roboschool copy with your 3 new files: winfifo.py, multiplayer.py and demo_pong.py, and also configured the temp directory paths.

When I run demo_pong.py, it exits with the following error without showing any animations:

Waiting tmp/multiplayer_pongdemo_player00
Waiting tmp/multiplayer_pongdemo_player01
Player 0 connected, wants to operate RoboschoolPong-v1 in this scene
Player 1 connected, wants to operate RoboschoolPong-v1 in this scene
Traceback (most recent call last):
  File "demo_pong.py", line 26, in <module>
    gameserver.serve_forever()
  File "roboschool\multiplayer.py", line 285, in serve_forever
    p.read_and_apply_action()
  File "roboschool\multiplayer.py", line 192, in read_and_apply_action
    check = self.sh_pipe_actready.readline()[:-1]
  File "roboschool\winfifo.py", line 62, in readline
    res = str(super().readline(),"UTF-8")
  File "roboschool\winfifo.py", line 54, in read
    result,data = win32file.ReadFile(self.handle,size,self.overlapped)
pywintypes.error: (109, 'ReadFile', 'The pipe has been ended.')

Is this the same error that you are seeing?

TheCrazyT commented 6 years ago

Yes, it is the same error that I can't figure out. But for some reason that error does not happen during the first 999 frames, so I do indeed see a Pong animation. At first I thought it could be Python's internal garbage collector, but since the pipe variables are in the global namespace ("PIPE_HANDLES"), this should not be the case.

Minimizing the window also stops the animation for some reason (I can't remember which error occurs when you do it).

Akz47 commented 6 years ago

I found this post about os.path.exists() interfering with the pipes and generating the same error.

Could this issue be related?

https://stackoverflow.com/questions/51255352/why-does-os-path-exists-stop-windows-named-pipes-from-connecting

TheCrazyT commented 6 years ago

No, I figured out why it failed for me:

register(
    id='RoboschoolPong-v1',
    entry_point='roboschool:RoboschoolPong',
    max_episode_steps=1000,
    tags={ "pg_complexity": 20*1000000 },
    ) 

That code is inside the __init__.py of the roboschool folder. max_episode_steps=1000 explains why it stops at frame 999 for me with that error. The client Python script exits after it finishes its episode. As a result, the pipe is lost (which is OK because the subprocess has stopped) and the server throws that "pipe has been ended" error. This could normally be silently ignored, although I still don't get why the serve_forever function is written to handle more than one episode when the play function in demo_pong.py finishes after one episode.

Edit: Now that I think about it, I guess they just forgot a while True: above the call to the play function.
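A small sketch of the two points above, under stated assumptions: fake_play is a hypothetical stand-in for the demo's play function (the real one steps a Gym environment), and play_episodes stands in for the suggested while True: loop, bounded here only so that the example terminates. It shows why a 1000-step limit with frames counted from 0 ends at frame 999, and how repeating the call keeps the client alive across episodes.

```python
def fake_play(max_episode_steps=1000):
    """Hypothetical stand-in for play(): step until the episode limit.

    Frames are counted from 0, so the last frame index is
    max_episode_steps - 1.
    """
    frame = 0
    for frame in range(max_episode_steps):
        pass  # a real client would read an observation and send an action
    return frame


def play_episodes(play_once, episodes):
    """Call play_once repeatedly instead of exiting after one episode.

    A real client would use `while True:`; the count is bounded here
    only so that the sketch finishes.
    """
    return [play_once() for _ in range(episodes)]
```

With the loop in place the client process keeps its end of the pipe open between episodes, which is the situation serve_forever appears to expect.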

TheCrazyT commented 6 years ago

Oh, and I almost forgot:

> also configured the temp directory paths.

This is not necessary because the paths are only virtual; I'm using named pipes and no real files on Windows.

Akz47 commented 6 years ago

Thanks for the update. I updated demo_pong.py with the "while True" line and changed max_episode_steps in __init__.py to 5000.

Below is what I got:

Waiting tmp/multiplayer_pongdemo_player00
Waiting tmp/multiplayer_pongdemo_player01
Player 0 connected, wants to operate RoboschoolPong-v1 in this scene
Player 1 connected, wants to operate RoboschoolPong-v1 in this scene
Traceback (most recent call last):
  File "demo_pong.py", line 26, in <module>
    gameserver.serve_forever()
  File "roboschool\multiplayer.py", line 285, in serve_forever
    p.read_and_apply_action()
  File "roboschool\multiplayer.py", line 192, in read_and_apply_action
    check = self.sh_pipe_actready.readline()[:-1]
  File "roboschool\winfifo.py", line 62, in readline
    res = str(super().readline(),"UTF-8")
  File "roboschool\winfifo.py", line 57, in read
    raise Exception("ret_code: %d" % ret_code);
Exception: ret_code: 258

The script showed the error and returned to the command line, but it continued to output results like this:

40:-38 50:-46 52:-50 67:-62 53:-51 48:-44 58:-56 46:-43 58:-54 51:-47 ...

It seemed to continue indefinitely at about 1 result every 2-3 seconds for hours. Are these the expected results? However, no visual output / window is displayed.

I also tried enabling "video=True" in demo_pong, but the script will then crash with the earlier "pywintypes.error: (109, 'ReadFile', 'The pipe has been ended.')" error.

p/s: The temp directory I was referring to earlier was actually configured in multiplayer.py, which generates actual files in the system.

TheCrazyT commented 6 years ago

Alright, I know what you mean; the paths were no trouble for me (maybe because I have MSYS and Cygwin installed?). I see that you mean the multiplayer_pongdemo_player00 and multiplayer_pongdemo_player01 files. Sadly, the MULTIPLAYER_FILES_DIR was used for the pipe paths as well, and the ":" creates trouble (because "\\.\pipe\roboschoolC:\tmp" is not a valid pipe path, for example). I modified winfifo.py to replace that character.
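A minimal sketch of that character replacement, assuming the pipe name is formed by prepending the prefix seen in the example path to the file name. The names PIPE_PREFIX and to_pipe_name are hypothetical, and the real winfifo.py may build the name differently.

```python
# Assumed prefix, taken from the example pipe path quoted above.
PIPE_PREFIX = r"\\.\pipe\roboschool"


def to_pipe_name(file_name):
    """Build a pipe name from a file path, replacing ':' with '_'.

    Only the colon is replaced, so a drive letter such as "C:" no
    longer appears in the resulting name.
    """
    return PIPE_PREFIX + file_name.replace(":", "_")
```

For instance, to_pipe_name(r"C:\tmp") yields a name with "C_" in place of "C:".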

Akz47 commented 6 years ago

Thank you for your reply. Actually, I directly edited the MULTIPLAYER_FILES_DIR variable too, setting it to just "tmp", a relative path within my execution directory (which is agent_zoo). I see the "multiplayer_pongdemo_player00*" files inside, so it seems to write correctly.

However, there is no video screen generated when I run this demo_pong. For others like RoboschoolWalker etc, an animation window is displayed.

I only keep seeing the output results like "0:-38 50:-46 52:-50 67:-62 53:-51 48:-44 58:-56 46:-43 58:-54 51:-47 ..." that seems to run indefinitely.

Is there something preventing the animation window from launching or rendering?

TheCrazyT commented 6 years ago

Do you use the current version? (winfifo.py should contain the line fileName = fileName.replace(":","_").) Do you get any stack trace? The numbers in the output are normally the scores of the left and the right "pong". What is strange is that it shows large or negative values for you. Currently I have no clue why the window is not shown; it's hard to debug without having the same problem. Maybe you could change the FIFO_DEBUG constant to True (inside roboschool/winfifo.py), post the output of the application on http://pastebin.com/ and link it here. This could help me find the problem.

Akz47 commented 6 years ago

Thanks for your pointers. Yes, I'm already using the latest winfifo.py.

Once I run it, I get the following error, but the output numbers continue to be generated in the background:

Traceback (most recent call last):
  File "demo_pong.py", line 27, in <module>
    gameserver.serve_forever()
  File "roboschool\roboschool\multiplayer.py", line 286, in serve_forever
    p.read_and_apply_action()
  File "roboschool\roboschool\multiplayer.py", line 193, in read_and_apply_action
    check = self.sh_pipe_actready.readline()[:-1]
  File "roboschool\roboschool\winfifo.py", line 62, in readline
    res = str(super().readline(),"UTF-8")
  File "roboschool\roboschool\winfifo.py", line 57, in read
    raise Exception("ret_code: %d" % ret_code);
Exception: ret_code: 258

What does this return code 258 mean?

Below is the debug information after enabling FIFO_DEBUG: https://pastebin.com/mQZgXwyB

Could the animation problem be a separate issue unrelated to the winfifo, or is the rendering disabled somewhere? If I run RoboschoolPong_v0_2017may1.py, the animation shows properly.

TheCrazyT commented 6 years ago

The code 258 means that a timeout occurred. I set the timeout to 10 seconds, which should be more than enough for the subprocesses to respond. At least at the beginning, the communication between the subprocesses and the main process (the one that calls gameserver.serve_forever()) seems to work.

But for some reason the subprocesses do not seem to write a second time after sending their model information ("RoboschoolPong-v1"). To explain the log a little, the first number on each line is a process ID: 12316 is the main process, 11296 is one of the "pongs", and 10248 is the other "pong". It crashes when the main process waits for a response from one of the subprocesses. (For some reason one of the subprocesses writes to "multiplayer_pongdemo_player00_actready" only once.)
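For reference, the two numeric codes seen in this thread correspond to standard Win32 values: 109 is ERROR_BROKEN_PIPE and 258 is WAIT_TIMEOUT. The lookup below is only an illustrative sketch; the function name describe_pipe_code is hypothetical and is not part of winfifo.py.

```python
ERROR_BROKEN_PIPE = 109  # Win32: "The pipe has been ended."
WAIT_TIMEOUT = 258       # Win32: the wait operation timed out

_DESCRIPTIONS = {
    ERROR_BROKEN_PIPE: "pipe ended: the other process closed its end or exited",
    WAIT_TIMEOUT: "timeout: no data arrived within the allowed time",
}


def describe_pipe_code(code):
    """Return a short description for the pipe codes discussed here."""
    return _DESCRIPTIONS.get(code, "unknown code %d" % code)
```

A server loop could use such a mapping to log a readable message, for example treating the broken-pipe case as a normal client exit rather than a fatal error.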

Akz47 commented 6 years ago

Thank you for your detailed analysis.

Since there is a timeout and a crash, it seems weird that the games are still being played. Results like "114:-110 112:-107 131:-123 ......" continue to be generated even after the return code 258 is displayed.

Does that mean that only a specific subprocess timed out or crashed, without affecting the main loop?

Do any of these error or log messages help diagnose the missing animation rendering?