Closed: @knn940506 closed this issue 6 years ago.
@knn940506, an empty queue usually means the thread runner process either didn't start or died quietly. Since some updates have been made since your fork (2018-01-08), I recommend updating the btgym package first. If the error persists, please provide some details:
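For what it's worth, this failure mode can be reproduced in isolation; a minimal sketch (plain stdlib, not btgym code) of a consumer timing out because its producer thread died without producing anything:

```python
import queue
import threading

q = queue.Queue()

def dead_runner():
    # A runner that exits before putting anything on the queue,
    # mimicking a thread runner that never started or died quietly.
    return

t = threading.Thread(target=dead_runner)
t.start()
t.join()

got_empty = False
try:
    # Trainer side: wait briefly for a rollout that will never arrive.
    q.get(timeout=0.1)
except queue.Empty:
    got_empty = True
    print("queue.Empty: runner put nothing on the queue")
```

The trainer only sees `queue.Empty` on its side; the real cause is whatever silenced the producer.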
I updated btgym using the commands below and ran it again, but the error still occurs:
cd btgym
git pull
pip install --upgrade -e .
The error pattern is: many reset warnings -> global_step info -> error
[2018-01-12 02:01:25.982257] WARNING: BTgymServer_0: _reset
Thanks so much !
I tested other examples; it looks like my workers lose the Backtrader server connection.
If the program runs longer, the message below always appears:
~/바탕화면/git/btgym/btgym/envs/backtrader.py in _step(self, action)
    748     msg = '.step(): server unreachable with status: <{}>.'.format(env_response['status'])
    749     self.log.error(msg)
--> 750     raise ConnectionError(msg)
    751
    752     self.env_response = env_response['message']
ConnectionError: .step(): server unreachable with status:
@knn940506, well, that is a different error. Do the following:
At line 47 of the notebook, set: connect_timeout=120,
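For reference, a sketch of where that kwarg sits (key names follow the example notebooks; everything else is omitted for brevity, so this is only a fragment, not a runnable environment config):

```python
# Sketch only: the relevant keys of the notebook's environment config.
env_config = dict(
    kwargs=dict(
        connect_timeout=120,  # seconds the API shell waits for the server
        verbose=2,
    )
)
```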
Pay attention to how you interrupt/restart the notebook kernel (taken from #17): every BTgym instance launches at least two separate processes, not counting the jupyter kernel itself. Check for leftovers with:
lsof -i:5000
lsof -i:4999
...and kill them manually.
Note that when running the A3C examples, there are also ports 12230 and 12231 to watch for.
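The port cleanup above can be scripted. A sketch, assuming lsof is available on the machine; kill_port is a hypothetical helper, not part of btgym:

```python
import os
import signal
import subprocess

def kill_port(port):
    """Send SIGTERM to every PID that lsof reports bound to `port`.

    Returns the list of PIDs signalled; empty if nothing was listening
    or lsof is unavailable.
    """
    try:
        # lsof -t prints bare PIDs; -i:PORT filters by TCP/UDP port.
        out = subprocess.check_output(['lsof', '-ti:{}'.format(port)])
    except (subprocess.CalledProcessError, OSError):
        return []
    pids = [int(p) for p in out.split()]
    for pid in pids:
        os.kill(pid, signal.SIGTERM)
    return pids

# Usage (commented out here so nothing unrelated gets killed):
# for port in (5000, 4999, 12230, 12231):
#     kill_port(port)
```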
Usually it throws errors like:
Decrease the number of workers to 6. It still gives full load to the CPU and can eliminate inter-thread contention slowdowns.
If nothing helps, set the Launcher kwarg verbose=3
and paste the last ~50 lines of log output.
The error hasn't changed... Here is some terminal log:
2018-01-15 11:19:33.222086: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-01-15 11:19:33.224407: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
E0115 11:19:33.225525382 2448 ev_epoll1_linux.c:1051] grpc epoll fd: 52
E0115 11:19:33.225550836 2439 ev_epoll1_linux.c:1051] grpc epoll fd: 51
2018-01-15 11:19:33.230664: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:12230}
2018-01-15 11:19:33.230663: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:12230}
2018-01-15 11:19:33.230714: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:12231, 1 -> 127.0.0.1:12232, 2 -> 127.0.0.1:12233, 3 -> 127.0.0.1:12234}
2018-01-15 11:19:33.230717: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:12231, 1 -> 127.0.0.1:12232, 2 -> 127.0.0.1:12233, 3 -> 127.0.0.1:12234}
2018-01-15 11:19:33.231020: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12231
2018-01-15 11:19:33.231497: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12230
2018-01-15 11:19:37.685478: I tensorflow/core/distributed_runtime/master_session.cc:1004] Start master session b6839cbeeb119750 with config: intra_op_parallelism_threads: 1 device_filters: "/job:ps" device_filters: "/job:worker/task:0/cpu:0" inter_op_parallelism_threads: 2
2018-01-15 11:19:38.231957: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
E0115 11:19:38.232226783 2503 ev_epoll1_linux.c:1051] grpc epoll fd: 53
2018-01-15 11:19:38.236208: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-01-15 11:19:38.236407: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:12230}
2018-01-15 11:19:38.236446: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:12231, 1 -> localhost:12232, 2 -> 127.0.0.1:12233, 3 -> 127.0.0.1:12234}
E0115 11:19:38.236568040 2507 ev_epoll1_linux.c:1051] grpc epoll fd: 54
2018-01-15 11:19:38.236800: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12232
2018-01-15 11:19:38.240948: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:12230}
2018-01-15 11:19:38.240997: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:12231, 1 -> 127.0.0.1:12232, 2 -> localhost:12233, 3 -> 127.0.0.1:12234}
2018-01-15 11:19:38.241403: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12233
2018-01-15 11:19:38.242178: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
E0115 11:19:38.242392991 2516 ev_epoll1_linux.c:1051] grpc epoll fd: 55
2018-01-15 11:19:38.247020: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:12230}
2018-01-15 11:19:38.247056: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:12231, 1 -> 127.0.0.1:12232, 2 -> 127.0.0.1:12233, 3 -> localhost:12234}
2018-01-15 11:19:38.247372: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12234
2018-01-15 11:19:41.789174: I tensorflow/core/distributed_runtime/master_session.cc:1004] Start master session 1e5bfb978931a13a with config: intra_op_parallelism_threads: 1 device_filters: "/job:ps" device_filters: "/job:worker/task:3/cpu:0" inter_op_parallelism_threads: 2
2018-01-15 11:19:41.807742: I tensorflow/core/distributed_runtime/master_session.cc:1004] Start master session 94c6cd7bd0b0fa12 with config: intra_op_parallelism_threads: 1 device_filters: "/job:ps" device_filters: "/job:worker/task:1/cpu:0" inter_op_parallelism_threads: 2
2018-01-15 11:19:42.002243: I tensorflow/core/distributed_runtime/master_session.cc:1004] Start master session 23214af6a52fc7cf with config: intra_op_parallelism_threads: 1 device_filters: "/job:ps" device_filters: "/job:worker/task:2/cpu:0" inter_op_parallelism_threads: 2
There is one weird thing: I set num_workers=4, but it looks like Worker-5 is running.
INFO:tensorflow:Saving checkpoint to path /home/joowonkim/tmp/test_gym_a3c/train/model.ckpt
INFO:tensorflow:global/global_step/sec: 100.832
INFO:tensorflow:Error reported to Coordinator: <class 'queue.Empty'>,
Process Worker-5:
Traceback (most recent call last):
File "/home/joowonkim/anaconda3/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/worker.py", line 241, in run
trainer.process(sess)
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 747, in process
data = self.get_data()
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 594, in get_data
data_streams = [get_it() for get_it in self.data_getter]
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 594, in
Is that expected?
@knn940506, the terminal log you provided is ok, no errors there; refer to #23 for details.
No, it is not expected; I see that sub-process error reporting should be improved somehow. I'll take time to see how it should be fixed.
@knn940506, I have updated error reporting for child processes. It does not solve the error, but it can give a hint about what's going wrong. Please update the package, run the example, and post the traceback here.
@knn940506 - forgot to remove an exception test case, sorry about that. Corrected; please update.
Traceback (most recent call last):
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 90, in run
self._run()
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 117, in _run
self.queue.put(next(rollout_provider), timeout=600.0)
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 222, in env_runner
state, reward, terminal, info = env.step(action.argmax())
File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/gym/core.py", line 96, in step
return self._step(action)
File "/home/joowonkim/바탕화면/git/btgym/btgym/envs/backtrader.py", line 750, in _step
raise ConnectionError(msg)
ConnectionError: .step(): server unreachable with status:
Exception in thread Thread-4:
Traceback (most recent call last):
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 90, in run
self._run()
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 117, in _run
self.queue.put(next(rollout_provider), timeout=600.0)
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 222, in env_runner
state, reward, terminal, info = env.step(action.argmax())
File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/gym/core.py", line 96, in step
return self._step(action)
File "/home/joowonkim/바탕화면/git/btgym/btgym/envs/backtrader.py", line 750, in _step
raise ConnectionError(msg)
ConnectionError: .step(): server unreachable with status:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/joowonkim/anaconda3/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 95, in run
    raise RuntimeError
RuntimeError
INFO:tensorflow:global/global_step/sec: 40.9994
INFO:tensorflow:global/global_step/sec: 0
INFO:tensorflow:Saving checkpoint to path /home/joowonkim/tmp/test_gym_a3c/train/model.ckpt
INFO:tensorflow:global/global_step/sec: 0
INFO:tensorflow:global/global_step/sec: 0
INFO:tensorflow:Saving checkpoint to path /home/joowonkim/tmp/test_gym_a3c/train/model.ckpt
INFO:tensorflow:global/global_step/sec: 0
[2018-01-17 04:18:22.827845] ERROR: A3C_1: process() exception occurred
Press Ctrl-C
or jupyter:[Kernel]->[Interrupt] for clean exit.
Traceback (most recent call last):
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 1076, in process
data = self._get_data()
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 634, in _get_data
data_streams = [get_it() for get_it in self.data_getter]
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 634, in
Press Ctrl-C
or jupyter:[Kernel]->[Interrupt] for clean exit.
Process Worker-17:
Traceback (most recent call last):
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 1076, in process
data = self._get_data()
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 634, in _get_data
data_streams = [get_it() for get_it in self.data_getter]
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 634, in
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/joowonkim/anaconda3/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/worker.py", line 257, in run
    sv.stop()
  File "/home/joowonkim/anaconda3/lib/python3.5/contextlib.py", line 77, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/training/supervisor.py", line 964, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/training/supervisor.py", line 792, in stop
    stop_grace_period_secs=self._stop_grace_secs)
  File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/training/supervisor.py", line 954, in managed_session
    yield sess
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/worker.py", line 257, in run
    sv.stop()
  File "/home/joowonkim/anaconda3/lib/python3.5/contextlib.py", line 77, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 4339, in get_controller
    yield default
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/worker.py", line 250, in run
    trainer.process(sess)
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 1145, in process
    raise RuntimeError(msg)
RuntimeError: process() exception occurred
Press Ctrl-C
or jupyter:[Kernel]->[Interrupt] for clean exit.
[2018-01-17 04:18:32.567306] ERROR: A3C_2: process() exception occurred
[2018-01-17 04:18:53.225776] ERROR: A3C_0: process() exception occurred
Press Ctrl-C
or jupyter:[Kernel]->[Interrupt] for clean exit.
Traceback (most recent call last):
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 1076, in process
data = self._get_data()
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 634, in _get_data
data_streams = [get_it() for get_it in self.data_getter]
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 634, in
Press Ctrl-C
or jupyter:[Kernel]->[Interrupt] for clean exit.
Similar errors keep occurring. Do you need more logs? I set env.verbose=1 and num_workers=4. Thanks!!
Ok, the base exception occurred here:
File "/home/joowonkim/바탕화면/git/btgym/btgym/envs/backtrader.py", line 750, in _step
    raise ConnectionError(msg)
ConnectionError: .step(): server unreachable with status:
...for some reason the BTGym server did not respond to the API shell in time; everything else is consecutive errors. This is rather strange, but we can track it. In a3c_random_on_synth_or_real_data..., set:
env_config = dict(
    ...
    kwargs=dict(
        ...
        connect_timeout=180,
        verbose=2,
    )
)
...
cluster_config = dict(
    ...
    num_workers=1,
    num_ps=1,
    num_envs=1,
    ...
)
...
launcher = Launcher(
    ...
    verbose=2,
)
and paste the log output up to the error mentioned above.
It works well at step 1.
In the Jupyter notebook:
[2018-01-18 08:22:00.041878] DEBUG: BTgymServer_0: Episode countdown started at: 1393, END OF DATA, r:-0.2578244975861855
[2018-01-18 08:22:00.044134] DEBUG: BTgymServer_0: Episode countdown contd. at: 1394, CLOSE, END OF DATA, r:-0.2578244975861855
[2018-01-18 08:22:00.045461] DEBUG: BTgymServer_0: Episode countdown contd. at: 1395, CLOSE, END OF DATA, r:-0.2578244975861855
[2018-01-18 08:22:00.046319] DEBUG: BTgymServer_0: COMM recieved: {'action': 'hold'}
[2018-01-18 08:22:00.046877] DEBUG: BTgymServer_0: RunStop() invoked with CLOSE, END OF DATA
[2018-01-18 08:22:00.975725] DEBUG: BTgymServer_0: Episode elapsed time: 0:00:01.763553.
[2018-01-18 08:23:00.106587] ERROR: ThreadRunner_0: RunTime exception occurred.
Press Ctrl-C
or jupyter:[Kernel]->[Interrupt] for clean exit.
Traceback (most recent call last):
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 90, in run
    self._run()
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 117, in _run
    self.queue.put(next(rollout_provider), timeout=600.0)
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 263, in env_runner
    episode_stat = env.get_stat()  # get episode statistic
  File "/home/joowonkim/바탕화면/git/btgym/btgym/envs/backtrader.py", line 772, in get_stat
    if self._force_control_mode():
  File "/home/joowonkim/바탕화면/git/btgym/btgym/envs/backtrader.py", line 545, in _force_control_mode
    self.server_response = self.socket.recv_pyobj()
  File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/zmq/sugar/socket.py", line 491, in recv_pyobj
    msg = self.recv(flags)
  File "zmq/backend/cython/socket.pyx", line 693, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 727, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 150, in zmq.backend.cython.socket._recv_copy
  File "zmq/backend/cython/socket.pyx", line 145, in zmq.backend.cython.socket._recv_copy
  File "zmq/backend/cython/checkrc.pxd", line 19, in zmq.backend.cython.checkrc._check_rc
zmq.error.Again: Resource temporarily unavailable
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 90, in run
    self._run()
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 117, in _run
    self.queue.put(next(rollout_provider), timeout=600.0)
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 263, in env_runner
    episode_stat = env.get_stat()  # get episode statistic
  File "/home/joowonkim/바탕화면/git/btgym/btgym/envs/backtrader.py", line 772, in get_stat
    if self._force_control_mode():
  File "/home/joowonkim/바탕화면/git/btgym/btgym/envs/backtrader.py", line 545, in _force_control_mode
    self.server_response = self.socket.recv_pyobj()
  File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/zmq/sugar/socket.py", line 491, in recv_pyobj
    msg = self.recv(flags)
  File "zmq/backend/cython/socket.pyx", line 693, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 727, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 150, in zmq.backend.cython.socket._recv_copy
  File "zmq/backend/cython/socket.pyx", line 145, in zmq.backend.cython.socket._recv_copy
  File "zmq/backend/cython/checkrc.pxd", line 19, in zmq.backend.cython.checkrc._check_rc
zmq.error.Again: Resource temporarily unavailable
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/joowonkim/anaconda3/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 95, in run
    raise RuntimeError
RuntimeError
INFO:tensorflow:global/global_step/sec: 5.66658
INFO:tensorflow:global/global_step/sec: 0
INFO:tensorflow:Saving checkpoint to path /home/joowonkim/tmp/test_gym_a3c/train/model.ckpt
INFO:tensorflow:global/global_step/sec: 0
INFO:tensorflow:global/global_step/sec: 0
INFO:tensorflow:Saving checkpoint to path /home/joowonkim/tmp/test_gym_a3c/train/model.ckpt
INFO:tensorflow:global/global_step/sec: 0
[2018-01-18 08:31:59.980364] ERROR: A3C_0: process() exception occurred
Press Ctrl-C
or jupyter:[Kernel]->[Interrupt] for clean exit.
Traceback (most recent call last):
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 1076, in process
data = self._get_data()
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 634, in _get_data
data_streams = [get_it() for get_it in self.data_getter]
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 634, in
At the terminal:
E0118 17:21:45.308309060 19328 ev_epoll1_linux.c:1051] grpc epoll fd: 52
2018-01-18 17:21:45.312629: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:12230}
2018-01-18 17:21:45.312629: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:12230}
2018-01-18 17:21:45.312664: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:12231}
2018-01-18 17:21:45.312664: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:12231}
2018-01-18 17:21:45.312991: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12231
2018-01-18 17:21:45.313294: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12230
2018-01-18 17:21:49.566307: I tensorflow/core/distributed_runtime/master_session.cc:1004] Start master session ad2b7177ea7201bf with config: intra_op_parallelism_threads: 1 device_filters: "/job:ps" device_filters: "/job:worker/task:0/cpu:0" inter_op_parallelism_threads: 2
Thanks for your works :+1: :+1:
@knn940506, I have corrected some unsafe code which could potentially lead to such an exception. The problem is that I can't verify it locally, as no such error appears with my work setup (macOS).
Update btgym and run again. If the error remains, create a notebook in the /examples directory and run the following code in it:
import os
import backtrader as bt
from btgym import BTgymEnv, BTgymDataset
from btgym.strategy.observers import Reward, Position, NormPnL
from btgym.research import DevStrat_4_6
MyCerebro = bt.Cerebro()
MyCerebro.addstrategy(
    DevStrat_4_6,
    drawdown_call=5,  # max % to lose, in percent of initial cash
    target_call=10,  # max % to win, same
    skip_frame=10,
)
# Set leveraged account:
MyCerebro.broker.setcash(2000)
MyCerebro.broker.setcommission(commission=0.0001, leverage=10.0)  # commission to imitate spread
MyCerebro.addsizer(bt.sizers.SizerFix, stake=5000,)
# Visualisations for reward, position and PnL dynamics:
MyCerebro.addobserver(Reward)
MyCerebro.addobserver(Position)
MyCerebro.addobserver(NormPnL)
MyDataset = BTgymDataset(
    #filename='./data/DAT_ASCII_EURUSD_M1_201703.csv',
    #filename='./data/DAT_ASCII_EURUSD_M1_201704.csv',
    filename='./data/test_sine_1min_period256_delta0002.csv',
    start_weekdays={0, 1, 2, 3},
    episode_duration={'days': 0, 'hours': 23, 'minutes': 55},
    start_00=False,
    time_gap={'hours': 6},
)
env_config = dict(
    class_ref=BTgymEnv,
    kwargs=dict(
        dataset=MyDataset,
        engine=MyCerebro,
        render_modes=['episode', 'human', 'external'],
        render_state_as_image=True,
        render_ylabel='OHL_diff.',
        render_size_episode=(12, 8),
        render_size_human=(9, 4),
        render_size_state=(11, 3),
        render_dpi=75,
        port=5000,
        data_port=4999,
        verbose=1,
    )
)
# Make environment:
env = env_config['class_ref'](**env_config['kwargs'])
# Run several episodes with statistic fetches:
for episode in range(4):
    o = env.reset()
    done = False
    while not done:
        obs, reward, done, info = env.step(env.action_space.sample())
    episode_stat = env.get_stat()
    for k, v in episode_stat.items():
        print('{}: {}'.format(k, v))
env.close()
Is any exception raised? If yes, please provide feedback.
Updated btgym, but aac.py has an error!
Traceback (most recent call last):
File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/tensor_util.py", line 468, in make_tensor_proto
str_values = [compat.as_bytes(x) for x in proto_values]
File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/tensor_util.py", line 468, in
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 492, in __init__
    self.inc_step = self.global_step.assign_add(tf.shape(pi.on_state_in[list(pi.on_state_in.keys())[0]])[0])
  File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/ops/array_ops.py", line 271, in shape
    return shape_internal(input, name, optimize=True, out_type=out_type)
  File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/ops/array_ops.py", line 295, in shape_internal
    input_tensor = ops.convert_to_tensor(input)
  File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 836, in convert_to_tensor
    as_ref=False)
  File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 926, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/constant_op.py", line 229, in _constant_tensor_conversion_function
    return constant(v, dtype=dtype, name=name)
  File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/constant_op.py", line 208, in constant
    value, dtype=dtype, shape=shape, verify_shape=verify_shape))
  File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/tensor_util.py", line 472, in make_tensor_proto
    "supported type." % (type(values), values))
TypeError: Failed to convert object of type <class 'dict'> to Tensor. Contents: {'trial_num': <tf.Tensor 'local/on_policy_state_in_metadata_trial_num_pl:0' shape=(?,) dtype=float32>, 'type': <tf.Tensor 'local/on_policy_state_in_metadata_type_pl:0' shape=(?,) dtype=float32>, 'first_row': <tf.Tensor 'local/on_policy_state_in_metadata_first_row_pl:0' shape=(?,) dtype=float32>, 'sample_num': <tf.Tensor 'local/on_policy_state_in_metadata_sample_num_pl:0' shape=(?,) dtype=float32>}. Consider casting elements to a supported type.
No exception raised in your new example code. Here's the result:
[2018-01-19 01:55:44.229338] INFO: BTgymAPIshell_0: ...done.
[2018-01-19 01:55:44.230378] INFO: BTgymAPIshell_0: Custom Cerebro class used.
[2018-01-19 01:55:44.318731] INFO: BTgymServer_0: PID: 28047
[2018-01-19 01:55:45.318373] INFO: BTgymAPIshell_0: Server started, pinging tcp://127.0.0.1:5000 ...
[2018-01-19 01:55:45.321071] INFO: BTgymAPIshell_0: Server seems ready with response: <{'ctrl': 'send control keys: <_reset>, <_getstat>, <_render>, <_stop>.'}>
[2018-01-19 01:55:45.322550] INFO: BTgymAPIshell_0: Environment is ready.
[2018-01-19 01:55:45.327601] INFO: BTgymAPIshell_0: Data domain reset() called prior to reset_data() with [possibly inconsistent] defaults.
[2018-01-19 01:55:45.332980] INFO: SimpleDataSet_0: New sample id: <train_trial_w_0_num_0_at_2017-01-03 12:47:00>.
[2018-01-19 01:55:45.337404] INFO: SimpleDataSet_0: New sample id: <train_trial_w_0_num_1_at_2017-01-05 02:48:00>.
[2018-01-19 01:55:45.357896] INFO: Trial_0: New sample id: <train_episode_w_0_num_0_at_2017-01-03 12:47:00>.
[2018-01-19 01:55:47.013175] INFO: SimpleDataSet_0: New sample id: <train_trial_w_0_num_2_at_2017-01-03 09:38:00>.
[2018-01-19 01:55:47.025657] INFO: Trial_0: New sample id: <train_episode_w_0_num_0_at_2017-01-05 02:48:00>.
episode: 0
length: 1380
runtime: 0:00:01.593744
[2018-01-19 01:55:48.638609] INFO: SimpleDataSet_0: New sample id: <train_trial_w_0_num_3_at_2017-01-03 21:30:00>.
[2018-01-19 01:55:48.653948] INFO: Trial_0: New sample id: <train_episode_w_0_num_0_at_2017-01-03 09:38:00>.
episode: 1
length: 1424
runtime: 0:00:01.553601
[2018-01-19 01:55:50.253536] INFO: SimpleDataSet_0: New sample id: <train_trial_w_0_num_4_at_2017-01-04 11:51:00>.
[2018-01-19 01:55:50.264350] INFO: Trial_0: New sample id: <train_episode_w_0_num_0_at_2017-01-03 21:30:00>.
episode: 2
length: 1424
runtime: 0:00:01.539417
[2018-01-19 01:55:51.793564] INFO: BTgymServer_0: Exiting.
episode: 3
length: 1424
runtime: 0:00:01.394918
[2018-01-19 01:55:51.795087] INFO: BTgymAPIshell_0: Exiting. Exit code: None
[2018-01-19 01:55:51.796303] INFO: BTgymDataServer_0: {'ctrl': 'Exiting.'}
[2018-01-19 01:55:51.797510] INFO: BTgymAPIshell_0: {'ctrl': 'Exiting.'} Exit code: None
[2018-01-19 01:55:51.798299] INFO: BTgymAPIshell_0: Environment closed.
That one was tricky, but it's good it popped out. Corrected; please update and try again. I also installed Python 3.5 (same as yours, in case the error is version dependent) and ran the tests, but it still works on my machine.
Sadly, it doesn't work. Maybe the error comes from something else. I'll give you feedback soon. Thanks a lot :)
@knn940506, I have recently implemented another type of runner that doesn't rely on a queue;
it can be found at btgym.algorithms.runner.synchro.BaseSynchroRunner.
Usage can be found in the MLDG implementation: https://github.com/Kismuz/btgym/tree/develop_meta_learning_gradient
Thanks for the great work :)
I have an issue while running the example a3c_random_on_synth_or_real_data...
I get several <INFO:tensorflow:Error reported to Coordinator: <class 'queue.Empty'>> messages and then it stops.
Is there any way I can fix it? Thank you so much. Kim.
[2018-01-11 20:50:20,439] Error reported to Coordinator: <class 'queue.Empty'>,
Process Worker-6:
Traceback (most recent call last):
  File "/home/joowonkim/anaconda3/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/worker.py", line 241, in run
    trainer.process(sess)
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 747, in process
    data = self.get_data()
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 594, in get_data
    data_streams = [get_it() for get_it in self.data_getter]
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 594, in
data_streams = [get_it() for get_it in self.data_getter]
File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/rollout.py", line 33, in pull_rollout_from_queue
return queue.get(timeout=600.0)
File "/home/joowonkim/anaconda3/lib/python3.5/queue.py", line 172, in get
raise Empty
queue.Empty
INFO:tensorflow:global/global_step/sec: 0
[2018-01-11 20:51:38,860] global/global_step/sec: 0
INFO:tensorflow:Error reported to Coordinator: <class 'queue.Empty'>,
[2018-01-11 20:51:48,678] Error reported to Coordinator: <class 'queue.Empty'>,
and then it stopped.