Unreal Stacked LSTM Notebook Fails As-Is

nathanmargaglio commented 5 years ago


Running environment:

Jupyter Notebook running Python 3.6 (Anaconda) in Ubuntu

Files or part of package has been run:

/notebooks/forex_ml/btgym/examples/unreal_stacked_lstm_strat_4_11.ipynb

Upon running the code (with little to no modification), this is the result:

[2018-08-23 20:09:48.151904] NOTICE: LauncherShell: </home/nathan/tmp/test_4_11_1> created.
[2018-08-23 20:09:50.697029] NOTICE: UNREAL_0: learn_rate: 0.000100, entropy_beta: 0.050000

Press Ctrl-C or jupyter:[Kernel]->[Interrupt] to stop training and close launcher.

[2018-08-23 20:09:54.586589] NOTICE: UNREAL_1: learn_rate: 0.000100, entropy_beta: 0.050000
WARNING:tensorflow:From /home/nathan/forex_ml/btgym/btgym/algorithms/worker.py:267: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
WARNING:tensorflow:From /home/nathan/forex_ml/btgym/btgym/algorithms/worker.py:267: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
INFO:tensorflow:Starting standard services.
INFO:tensorflow:Saving checkpoint to path /home/nathan/tmp/test_4_11_1/train/model.ckpt
INFO:tensorflow:Starting queue runners.
INFO:tensorflow:global/global_step/sec: 0
[2018-08-23 20:10:03.596361] NOTICE: Worker_0: started training at step: 0
[2018-08-23 20:10:03.594680] NOTICE: BTgymDataServer_0: Initial global_time set to: 2017-01-01 21:01:00 / stamp: 1483322460.0
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting queue runners.
[2018-08-23 20:10:04.774957] NOTICE: Worker_1: started training at step: 0
[2018-08-23 20:11:48.018308] ERROR: ThreadRunner_0: RunTime exception occurred.

Press Ctrl-C or jupyter:[Kernel]->[Interrupt] for clean exit.

Traceback (most recent call last):
  File "/home/nathan/forex_ml/btgym/btgym/algorithms/runner/threadrunner.py", line 92, in run
    self._run()
  File "/home/nathan/forex_ml/btgym/btgym/algorithms/runner/threadrunner.py", line 119, in _run
    self.queue.put(next(rollout_provider), timeout=600.0)
  File "/home/nathan/forex_ml/btgym/btgym/algorithms/runner/base.py", line 169, in BaseEnvRunnerFn
    episode_stat = env.get_stat() # get episode statistic
  File "/home/nathan/forex_ml/btgym/btgym/envs/base.py", line 833, in get_stat
    if self._force_control_mode():
  File "/home/nathan/forex_ml/btgym/btgym/envs/base.py", line 592, in _force_control_mode
    self.server_response = self.socket.recv_pyobj()
  File "/home/nathan/anaconda3/lib/python3.6/site-packages/zmq/sugar/socket.py", line 622, in recv_pyobj
    msg = self.recv(flags)
  File "zmq/backend/cython/socket.pyx", line 790, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 826, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 193, in zmq.backend.cython.socket._recv_copy
  File "zmq/backend/cython/socket.pyx", line 188, in zmq.backend.cython.socket._recv_copy
  File "zmq/backend/cython/checkrc.pxd", line 19, in zmq.backend.cython.checkrc._check_rc
zmq.error.Again: Resource temporarily unavailable
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/home/nathan/forex_ml/btgym/btgym/algorithms/runner/threadrunner.py", line 92, in run
    self._run()
  File "/home/nathan/forex_ml/btgym/btgym/algorithms/runner/threadrunner.py", line 119, in _run
    self.queue.put(next(rollout_provider), timeout=600.0)
  File "/home/nathan/forex_ml/btgym/btgym/algorithms/runner/base.py", line 169, in BaseEnvRunnerFn
    episode_stat = env.get_stat() # get episode statistic
  File "/home/nathan/forex_ml/btgym/btgym/envs/base.py", line 833, in get_stat
    if self._force_control_mode():
  File "/home/nathan/forex_ml/btgym/btgym/envs/base.py", line 592, in _force_control_mode
    self.server_response = self.socket.recv_pyobj()
  File "/home/nathan/anaconda3/lib/python3.6/site-packages/zmq/sugar/socket.py", line 622, in recv_pyobj
    msg = self.recv(flags)
  File "zmq/backend/cython/socket.pyx", line 790, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 826, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 193, in zmq.backend.cython.socket._recv_copy
  File "zmq/backend/cython/socket.pyx", line 188, in zmq.backend.cython.socket._recv_copy
  File "zmq/backend/cython/checkrc.pxd", line 19, in zmq.backend.cython.checkrc._check_rc
zmq.error.Again: Resource temporarily unavailable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/nathan/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/nathan/forex_ml/btgym/btgym/algorithms/runner/threadrunner.py", line 97, in run
    raise RuntimeError
RuntimeError

INFO:tensorflow:global/global_step/sec: 71.6154
INFO:tensorflow:global/global_step/sec: 73.4852
INFO:tensorflow:Saving checkpoint to path /home/nathan/tmp/test_4_11_1/train/model.ckpt
INFO:tensorflow:global/global_step/sec: 81.1107
INFO:tensorflow:global/global_step/sec: 85.6166
INFO:tensorflow:Saving checkpoint to path /home/nathan/tmp/test_4_11_1/train/model.ckpt
INFO:tensorflow:global/global_step/sec: 79.3727
[2018-08-23 20:20:17.716531] ERROR: UNREAL_0: process() exception occurred

Press Ctrl-C or jupyter:[Kernel]->[Interrupt] for clean exit.

Traceback (most recent call last):
  File "/home/nathan/forex_ml/btgym/btgym/algorithms/aac.py", line 1349, in _process
    data = self.get_data()
  File "/home/nathan/forex_ml/btgym/btgym/algorithms/aac.py", line 913, in get_data
    data_streams = [get_it(kwargs) for get_it in self.data_getter]
  File "/home/nathan/forex_ml/btgym/btgym/algorithms/aac.py", line 913, in <listcomp>
    data_streams = [get_it(kwargs) for get_it in self.data_getter]
  File "/home/nathan/forex_ml/btgym/btgym/algorithms/rollout.py", line 33, in pull_rollout_from_queue
    return queue.get(timeout=600.0)
  File "/home/nathan/anaconda3/lib/python3.6/queue.py", line 172, in get
    raise Empty
queue.Empty
INFO:tensorflow:Error reported to Coordinator: <class 'RuntimeError'>, process() exception occurred

Press Ctrl-C or jupyter:[Kernel]->[Interrupt] for clean exit.

I've tried to make various configuration changes (trying different training sets, reducing memory usage, etc.), but they all end up getting stuck at around the same point. You'll notice there are two (seemingly) separate issues that occur, one being zmq.error.Again: Resource temporarily unavailable (which doesn't seem to affect the running of the script), and queue.Empty, which puts an end to the script.

I'm not exactly sure where to start with this bug, but I'd be glad to dig a little deeper with some direction.

nathanmargaglio commented 5 years ago

Upon digging a little deeper, it would seem the two are related. I'm just not exactly sure how zmq works and why it is getting the Resource temporarily unavailable error.
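One plausible reading of the tracebacks above is that the runner thread dies when its socket receive times out, so nothing is put on the rollout queue any more and the trainer's `queue.get(timeout=600.0)` eventually raises `queue.Empty`. Below is a minimal stdlib-only sketch of that pattern. The names `flaky_producer` and `consume`, the `TimeoutError` standing in for `zmq.error.Again`, and the short timeout are illustrative assumptions, not btgym code.

```python
import queue
import threading


def flaky_producer(q, fail_after):
    """Put a few items on the queue, then stop the way a runner thread would
    if its socket receive timed out (TimeoutError stands in for zmq.error.Again)."""
    try:
        for i in range(fail_after):
            q.put(i)
        raise TimeoutError("Resource temporarily unavailable")
    except TimeoutError:
        # The thread ends here; nothing refills the queue afterwards.
        return


def consume(q, timeout):
    """Drain the queue until a get() times out.

    Returns the items received and a flag telling whether queue.Empty was hit.
    """
    items = []
    while True:
        try:
            items.append(q.get(timeout=timeout))
        except queue.Empty:
            return items, True


q = queue.Queue()
t = threading.Thread(target=flaky_producer, args=(q, 3))
t.start()
t.join()

items, starved = consume(q, timeout=0.1)
print(items, starved)
```

If this reading is right, `queue.Empty` is only a downstream symptom, and the zmq timeout is the error worth chasing.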

Kismuz commented 5 years ago

@nathanmargaglio, the zmq error Resource temporarily unavailable usually means the port zmq is trying to access is blocked, possibly by some orphaned process. In the btgym case it may be related to aborting previous instances of the environment: this usually terminates all related processes, but not always. The best way to check is to run lsof -i:<port number> in a terminal to get the PID of any process using the port, then kill it manually. For this notebook the ports of interest are 4999, 5000 and 12230.
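As a rough complement to lsof, the same check can be approximated from Python by trying to bind each port. This is only a sketch: `port_in_use` is a hypothetical helper, it assumes the services listen on 127.0.0.1, it does not report the owning PID (lsof is still needed for that), and a port whose old connections are lingering in TIME_WAIT may also show up as busy.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if binding to (host, port) fails, i.e. something already holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return True
    return False


# Ports named in this thread for the notebook in question.
for port in (4999, 5000, 12230):
    print(port, "in use" if port_in_use(port) else "free")
```

Running this before starting the notebook should show all three ports as free; any port reported as in use at that point is a candidate for an orphaned process.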

nathanmargaglio commented 5 years ago

I tried running the notebook on my work machine, and it seemed to run fine. My work machine is running a similar setup to the machine having the issues, so that seems to confirm it is machine-specific.

However, I've tried looking at those ports, and nothing seems to be out of place (the only processes running on them are from ZMQ, and they all disappear when the notebook stops/reappear when it starts). To add to the confusion (or possibly take it away?), I just upgraded to a fresh install of Ubuntu 18, and I'm still seeing the same issue.

After typing ^ that out, I realized that, although I had Anaconda on my work computer, my Jupyter Notebooks used the standard Python 3.6 package. So I switched to that on the problematic machine, and it seems to be running without issue.

I'm not sure what Anaconda is doing to cause these issues (or if it's something wrong with my installation), but things seem to be cleared up. I'm curious if it's a combination of Jupyter and Anaconda, but I suppose I should have been more wary of it from the start.

Do you @Kismuz have any thoughts on this? What Python version are you using to run your library and have you seen any issues with Anaconda? In any case, I'm satisfied with calling this issue closed on my end, but this issue should be noted in case someone in the future comes across something similar.

Kismuz commented 5 years ago

@nathanmargaglio, I run my notebooks from designated conda virtual environments on both MacOS and Ubuntu and have never experienced any issues from the ZMQ side. It could be related to the specific Python kernels used by your conda environment.
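To compare kernels between a working and a failing machine, it may help to run something like the following in a notebook cell on each. It is a sketch only: `interpreter_info` is a hypothetical helper, and the conda check relies on conda environments having a `conda-meta` directory under their prefix.

```python
import os
import sys


def interpreter_info():
    """Collect details that distinguish a conda kernel from a system Python kernel."""
    info = {
        "executable": sys.executable,
        "version": sys.version.split()[0],
        "prefix": sys.prefix,
        # conda environments contain a conda-meta directory under their prefix
        "looks_like_conda": os.path.isdir(os.path.join(sys.prefix, "conda-meta")),
    }
    try:
        import zmq  # pyzmq; may be absent in some kernels

        info["pyzmq"] = zmq.pyzmq_version()
        info["libzmq"] = zmq.zmq_version()
    except ImportError:
        info["pyzmq"] = info["libzmq"] = None
    return info


for key, value in interpreter_info().items():
    print(f"{key}: {value}")
```

Differences in the executable path or in the pyzmq/libzmq versions between the two machines would be the first things to look at.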
