Zeta36 / chess-alpha-zero

Chess reinforcement learning by AlphaGo Zero methods.
MIT License

Crash when running self_play #41

Open stef277 opened 6 years ago

stef277 commented 6 years ago

Hi,

I cloned the repository a few hours ago and ran into a crash while trying to run self-play with the best model that ships with the source code. It crashes after a few minutes; it looks like the different processes or threads have problems communicating with each other over pipes. I will look at it tomorrow, once I start exploring the code a bit more. You will find my config and the stack trace below.

My config:

(venv) 2015sys0736:chess-alpha-zero stephane$ python src/chess_zero/run.py self
2017-12-24 23:55:56,014@chess_zero.manager INFO # config type: {config_type}
Using TensorFlow backend.
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: compiletime version 3.5 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.6
  return f(*args, **kwds)
2017-12-24 23:55:57,195@chess_zero.agent.model_chess DEBUG # loading model from /Users/stephane/Documents/Dev/chess/chess-alpha-zero/data/model/model_best_config.json
2017-12-24 23:55:57.237053: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2017-12-24 23:56:00,410@chess_zero.agent.model_chess DEBUG # loaded model digest = 0c379712fcb4204eccea535e5ff099cde78f87037e9805c85d4738bc350adb12
Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: compiletime version 3.5 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.6
  return f(*args, **kwds)
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: compiletime version 3.5 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.6
  return f(*args, **kwds)
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: compiletime version 3.5 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.6
  return f(*args, **kwds)
Exception in thread prediction_worker:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "src/chess_zero/agent/api_chess.py", line 33, in predict_batch_worker
    data.append(pipe.recv())
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError

concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/concurrent/futures/process.py", line 175, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "src/chess_zero/worker/self_play.py", line 87, in self_play_buffer
    pipes = cur.pop() # borrow
  File "<string>", line 2, in pop
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/managers.py", line 757, in _callmethod
    kind, result = conn.recv()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/connection.py", line 251, in recv
    return _ForkingPickler.loads(buf.getbuffer())
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/connection.py", line 951, in rebuild_connection
    fd = df.detach()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/connection.py", line 487, in Client
    c = SocketClient(address)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/connection.py", line 614, in SocketClient
    s.connect(address)
ConnectionRefusedError: [Errno 61] Connection refused
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "src/chess_zero/run.py", line 16, in <module>
    manager.start()
  File "src/chess_zero/manager.py", line 46, in start
    return self_play.start(config)
  File "src/chess_zero/worker/self_play.py", line 22, in start
    return SelfPlayWorker(config).start()
  File "src/chess_zero/worker/self_play.py", line 47, in start
    env, data = futures.popleft().result()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
ConnectionRefusedError: [Errno 61] Connection refused
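
For readers following the first traceback: `multiprocessing` connections raise `EOFError` from `recv()` once the other end has been closed and nothing is left to read, which is what the `prediction_worker` thread hit. The sketch below is not the repository's code; `drain_pipe` and `demo` are hypothetical names, and it only illustrates one way a reader thread could treat a closed pipe as a normal shutdown signal instead of crashing.

```python
import multiprocessing as mp
import threading


def drain_pipe(conn, received):
    """Collect messages from conn until the peer closes its end (hypothetical helper)."""
    while True:
        try:
            received.append(conn.recv())
        except EOFError:
            # The writer closed the pipe and there is nothing left to read.
            break
    conn.close()


def demo():
    """Send two moves through a one-way pipe, then close the writing end."""
    reader, writer = mp.Pipe(duplex=False)
    received = []
    worker = threading.Thread(
        target=drain_pipe, args=(reader, received), name="prediction_worker"
    )
    worker.start()
    for move in ("e2e4", "e7e5"):
        writer.send(move)
    writer.close()  # the reader sees EOFError after draining both messages
    worker.join(timeout=5)
    return received
```

Whether swallowing the `EOFError` is appropriate depends on why the pipe was closed; here the second traceback suggests the worker process died first, so the closed pipe is likely a symptom, not the cause.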

Zeta36 commented 6 years ago

@Akababa, could you please review this? Could it be caused by running on macOS?

stef277 commented 6 years ago

I've set max_process to 1 in configs/mini.py, and it's now running. It takes about 280-620 s per game (depending on the number of moves). I guess it's going to run for 50 games. Time to go to bed. Maybe you are right and it's because of the way my Mac is configured. (Also, I'm using virtualenv, but I don't think it matters.)

Akababa commented 6 years ago

Does it help to change start method to spawn?

stef277 commented 6 years ago

It is already set to 'spawn' in main in run.py. I've tried 'fork', with the same result. I will look at it eventually; it seems to crash when a process finishes a game and sends the result to the main process. Not a big deal, I guess having only one process is sufficient for now.
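
For anyone wanting to repeat the 'spawn' versus 'fork' comparison, here is a minimal sketch using only the standard library. It does not reproduce run.py; `pick_start_method`, `square`, and `run_demo` are hypothetical names. It uses a per-context start method via `multiprocessing.get_context`, so the global default is left untouched.

```python
import multiprocessing as mp


def pick_start_method(preferred="spawn"):
    """Return `preferred` if this platform supports it, else the platform default."""
    available = mp.get_all_start_methods()  # the first entry is the default
    return preferred if preferred in available else available[0]


def square(x):
    # Must live at module level so child processes can import it under 'spawn'.
    return x * x


def run_demo(method="spawn"):
    """Run a tiny pool with an explicit start method (assumed workload, for illustration)."""
    ctx = mp.get_context(pick_start_method(method))
    with ctx.Pool(processes=2) as pool:
        return pool.map(square, [1, 2, 3])
```

With 'spawn', child processes re-import the main module, so `run_demo()` should be called from under an `if __name__ == "__main__":` guard in a script file.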

Akababa commented 6 years ago

Sorry, I couldn't test it; I only have a Windows laptop. Please let us know if you figure it out!

On another note: without tensorflow-gpu, how much CPU (in %) does the one process use?

stef277 commented 6 years ago

I have 8 cores. So 4 cores are at 100%, the other 4 are around 80%. And my fan on my laptop is screaming! How long does it take you to complete 1 game?

Akababa commented 6 years ago

I have a modest GPU (GT 750M) so a total of about one game per minute on 3 processes. Interestingly my CPU usage is only around 30-40%.