Unable to restart train.py after crash due to thread not being closed properly

apprenticelearner / AL_Train

A repository for the CTAT HTML based training harness for Apprentice Learner agents.

MIT License

Unable to restart train.py after crash due to thread not being closed properly #1

Open cmaclell opened 5 years ago

cmaclell commented 5 years ago

(migrated this issue over from the AL repo as it should be here instead)

Whenever an error causes AL to crash, I am unable to restart it. I have to manually kill all the Python processes running on my machine. Even then, I still get the error shown in the attached screenshot for 3-5 minutes afterwards.

Is this because we're doing some kind of special threading? Is this only an issue on macOS?

Flagging this here as a bug.

The main issue here appears to be that when AL crashes because an exception is raised, whatever threads were spawned are not properly closed before exiting.

As a short-term workaround, it seems possible to manually kill all related Python processes AND close the browser window that was automatically opened for AL, then wait maybe 30 seconds before running train.py again. Kind of annoying, but the workaround is usable until a proper fix for closing the Python processes is implemented.
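Instead of guessing at the wait time, a small helper could poll until the port can be bound again. This is only a sketch: the names `port_is_free` and `wait_for_port` are made up for illustration and are not part of AL_Train, and the port number in the usage note is a placeholder for whatever port train.py is configured to use.

```python
import socket
import time


def port_is_free(port, host="127.0.0.1"):
    """Return True if host:port can be bound right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use".
            return False
    return True


def wait_for_port(port, host="127.0.0.1", timeout=60.0, interval=1.0):
    """Poll until the port can be bound; return False if the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if port_is_free(port, host):
            return True
        time.sleep(interval)
    return False
```

Calling something like `wait_for_port(8000, timeout=300)` before relaunching would block until the port is usable or five minutes have passed. It does not fix the leftover processes; it only replaces the manual waiting.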

cmaclell commented 5 years ago

Currently, processes are killed using the atexit module, which runs registered functions when the Python program terminates. However, the atexit documentation notes the following edge case:

Note: The functions registered via this module are not called when the program is killed by a signal not handled by Python, when a Python fatal internal error is detected, or when os._exit() is called.
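To make that edge case concrete, here is a small self-contained demo (a sketch that assumes a POSIX system; it is not taken from train.py). It starts a child Python process that registers an atexit hook and then sends itself SIGTERM. The helper name `run_child` and the `"handled"` / `"unhandled"` mode strings are invented for this example.

```python
import subprocess
import sys
import textwrap

# Script run in a child process. It registers an atexit hook, optionally
# installs a SIGTERM handler, and then sends SIGTERM to itself.
CHILD = textwrap.dedent("""
    import atexit, os, signal, sys, time
    atexit.register(lambda: print("cleanup ran", flush=True))
    if sys.argv[1] == "handled":
        # sys.exit raises SystemExit, so the interpreter shuts down
        # normally and the atexit hook gets a chance to run.
        signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(1))
    os.kill(os.getpid(), signal.SIGTERM)
    time.sleep(1)  # give the signal time to arrive
""")


def run_child(mode):
    """Run the child script in the given mode and return what it printed."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD, mode],
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

On Linux or macOS, `run_child("unhandled")` should return an empty string because the default SIGTERM action kills the process before atexit runs, while `run_child("handled")` should return `"cleanup ran"`. Note that SIGKILL can never be handled this way, so cleanup that must survive `kill -9` has to live in a separate supervising process.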

cmaclell commented 5 years ago

An alternative suggested on Stack Overflow (https://stackoverflow.com/questions/930519/how-to-run-one-last-function-before-getting-killed-in-python) is to use the signal module.

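import signal
import sys
import time

def clean(signum, frame):
    # Put cleanup here; sys.exit() also lets atexit handlers run.
    print("clean me")
    sys.exit(0)

# SIGBREAK only exists on Windows, so look each signal up by name.
for name in ("SIGABRT", "SIGBREAK", "SIGILL", "SIGINT", "SIGSEGV", "SIGTERM"):
    sig = getattr(signal, name, None)
    if sig is not None:
        signal.signal(sig, clean)

time.sleep(10)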

cmaclell commented 5 years ago

So I tried replacing atexit with signal as specified above and it did not seem to fix the problem.

I'm not really sure what is causing the processes to stay up and the port to be blocked.

DannyWeitekamp commented 5 years ago

My solution to this is to leave the Port id blank in net.conf, which forces it to search for an open port. Unfortunately, this leaves a bunch of running agents that still need to be killed. The issue is that Django doesn’t want to die. For the host server I have an explicit QUIT request, but there is no such thing in Django, because obviously the client shouldn’t be able to kill the server. It also doesn’t seem to respect SIGTERM, which is really annoying. I’ve been trying to figure out how to do this right for a while.
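For reference, both ideas can be sketched with the standard library alone. This is not how AL_Train currently does it: `find_free_port` and `stop_process` are hypothetical helpers, and the sketch assumes the Django server was started as a child via `subprocess.Popen` so that the parent holds a handle to it.

```python
import socket
import subprocess


def find_free_port(host="127.0.0.1"):
    """Ask the OS for an unused port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        # Another process could grab this port before we reuse it,
        # so treat the result as a good guess, not a reservation.
        return sock.getsockname()[1]


def stop_process(proc, grace=5.0):
    """Ask a child process to exit, then force-kill it if it does not."""
    if proc.poll() is not None:
        return proc.returncode  # already exited
    proc.terminate()  # SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX; cannot be ignored
        return proc.wait()
```

Escalating to `kill()` after a grace period gets around a server that ignores SIGTERM, at the cost of skipping whatever shutdown logic the server would have run. It only helps if the parent process itself stays alive long enough to call it.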

eharpste commented 4 years ago

@cmaclell is this still an issue?