Kismuz / btgym

Scalable, event-driven, deep-learning-friendly backtesting library
https://kismuz.github.io/btgym/
GNU Lesser General Public License v3.0

error about queue.Empty #30

Closed. knn940506 closed this issue 6 years ago.

knn940506 commented 6 years ago

Thanks for great work :)

I have an issue while running the example a3c_random_on_synth_or_real_data...

I got several <INFO:tensorflow:Error reported to Coordinator: <class 'queue.Empty'>> messages, and then it stopped.

Is there any way I can fix it? Thank you so much. Kim.


[2018-01-11 20:50:20,439] Error reported to Coordinator: <class 'queue.Empty'>,
Process Worker-6:
Traceback (most recent call last):
  File "/home/joowonkim/anaconda3/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/worker.py", line 241, in run
    trainer.process(sess)
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 747, in process
    data = self.get_data()
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 594, in get_data
    data_streams = [get_it() for get_it in self.data_getter]
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 594, in <listcomp>
    data_streams = [get_it() for get_it in self.data_getter]
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/rollout.py", line 33, in pull_rollout_from_queue
    return queue.get(timeout=600.0)
  File "/home/joowonkim/anaconda3/lib/python3.5/queue.py", line 172, in get
    raise Empty
queue.Empty

INFO:tensorflow:global/global_step/sec: 0

[2018-01-11 20:51:38,860] global/global_step/sec: 0

INFO:tensorflow:Error reported to Coordinator: <class 'queue.Empty'>,

[2018-01-11 20:51:48,678] Error reported to Coordinator: <class 'queue.Empty'>,


and then it stopped.

Kismuz commented 6 years ago

@knn940506, an empty queue usually means that the thread runner process either didn't start or quietly died. Since some updates have been made since your fork (08.01.18), I recommend updating the btgym package first. If the error persists, please provide some details:
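
For context: the trainer pulls rollouts from a bounded queue with a timeout, so if the runner thread never produces anything, queue.get() eventually raises queue.Empty. A minimal, standalone sketch of that failure mode (illustrative only, not btgym code):

import queue
import threading

rollout_queue = queue.Queue(maxsize=10)

def runner():
    # The runner thread exits without ever producing a rollout, e.g. because
    # the environment server never answered its reset request.
    return

producer = threading.Thread(target=runner, daemon=True)
producer.start()

try:
    # Trainer side: block up to the timeout, then give up.
    rollout = rollout_queue.get(timeout=2.0)
except queue.Empty:
    print("queue.Empty: the producer thread never delivered a rollout")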

knn940506 commented 6 years ago

I updated btgym using the commands below and ran the example again, but the error still occurs.


cd btgym
git pull
pip install --upgrade -e .


The error pattern is: many reset warnings -> global_step info -> error.

[2018-01-12 02:01:25.982257] WARNING: BTgymServer_0: _reset kwarg not found, using default values: {'b_beta': 1, 'sample_type': 0, 'b_alpha': 1, 'get_new': True}
<INFO:tensorflow:global/global_step/sec: 261.664>
[2018-01-12 02:02:26.494250] ERROR: BTgymAPIshell_0: .step(): server unreachable with status: .

Thanks so much !

knn940506 commented 6 years ago

I tested other examples; it looks like my workers lose the backtrader server connection.

If the program runs longer, the message below always appears:


~/바탕화면/git/btgym/btgym/envs/backtrader.py in _step(self, action)
    748             msg = '.step(): server unreachable with status: <{}>.'.format(env_response['status'])
    749             self.log.error(msg)
--> 750             raise ConnectionError(msg)
    751 
    752         self.env_response = env_response['message']

ConnectionError: .step(): server unreachable with status: .
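
For reference, the API shell talks to the backtrader server over ZeroMQ and raises ConnectionError when no reply arrives in time. A rough sketch of that request/reply pattern, assuming pyzmq (the port and timeout values are illustrative, not btgym's actual defaults):

import zmq

context = zmq.Context()
socket = context.socket(zmq.REQ)
socket.setsockopt(zmq.RCVTIMEO, 60000)  # stop waiting for a reply after 60 s
socket.connect('tcp://127.0.0.1:5000')  # illustrative btgym_server address

def step(action):
    # Send the action and wait for the server's response; if nothing arrives
    # within the timeout, treat the server as unreachable.
    socket.send_pyobj({'action': action})
    try:
        return socket.recv_pyobj()
    except zmq.error.Again:
        raise ConnectionError('.step(): server unreachable.')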


Kismuz commented 6 years ago

@knn940506, well, that is a different error. Do the following:

  1. At line 47 of the notebook, set: connect_timeout=120;

  2. Pay attention to how you interrupt/restart the notebook kernel (taken from #17): every BTgym instance launches at least two separate processes, not counting the Jupyter kernel itself:

    • btgym_server as the backend for the environment API, default port 5000, incremented by 1 for every additional env. instance: 5001, 5002, ...;
    • data_server as the data-providing backend for one or more btgym_server(s), default port 4999, the same for all env. instances;

Note that when running A3C examples there are also TensorFlow ports 12230 and 12231 to watch for (see the port-check sketch after this list).

Usually it throws errors like:

  1. Decrease the number of workers to 6. This still gives a full load to the CPU and can eliminate inter-thread concurrency slowdowns.

  2. If nothing helps, set the Launcher kwarg verbose=3 and paste the last ~50 lines of log output.
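
As referenced above: if a previous run was interrupted without a clean exit, the btgym_server/data_server (and TensorFlow ps/worker) processes may keep their ports bound, and the next launch can hang or time out. A small sketch to check for that before relaunching (the port list is an assumption based on the defaults described above; adjust it to your cluster_config):

import socket

# Assumed defaults: data_server 4999, btgym_server 5000, TF ps/worker 12230+.
PORTS_TO_CHECK = [4999, 5000, 12230, 12231]

def port_in_use(port, host='127.0.0.1'):
    # Try to bind; failure means some (possibly orphaned) process holds the port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return False
        except OSError:
            return True

for port in PORTS_TO_CHECK:
    if port_in_use(port):
        print('port {} is busy - a previous btgym/TF process may still be running'.format(port))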

knn940506 commented 6 years ago

The error hasn't changed... Here is some terminal log output:


2018-01-15 11:19:33.222086: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-01-15 11:19:33.224407: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
E0115 11:19:33.225525382 2448 ev_epoll1_linux.c:1051] grpc epoll fd: 52
E0115 11:19:33.225550836 2439 ev_epoll1_linux.c:1051] grpc epoll fd: 51
2018-01-15 11:19:33.230664: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:12230}
2018-01-15 11:19:33.230663: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:12230}
2018-01-15 11:19:33.230714: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:12231, 1 -> 127.0.0.1:12232, 2 -> 127.0.0.1:12233, 3 -> 127.0.0.1:12234}
2018-01-15 11:19:33.230717: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:12231, 1 -> 127.0.0.1:12232, 2 -> 127.0.0.1:12233, 3 -> 127.0.0.1:12234}
2018-01-15 11:19:33.231020: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12231
2018-01-15 11:19:33.231497: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12230
2018-01-15 11:19:37.685478: I tensorflow/core/distributed_runtime/master_session.cc:1004] Start master session b6839cbeeb119750 with config: intra_op_parallelism_threads: 1 device_filters: "/job:ps" device_filters: "/job:worker/task:0/cpu:0" inter_op_parallelism_threads: 2
2018-01-15 11:19:38.231957: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
E0115 11:19:38.232226783 2503 ev_epoll1_linux.c:1051] grpc epoll fd: 53
2018-01-15 11:19:38.236208: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-01-15 11:19:38.236407: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:12230}
2018-01-15 11:19:38.236446: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:12231, 1 -> localhost:12232, 2 -> 127.0.0.1:12233, 3 -> 127.0.0.1:12234}
E0115 11:19:38.236568040 2507 ev_epoll1_linux.c:1051] grpc epoll fd: 54
2018-01-15 11:19:38.236800: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12232
2018-01-15 11:19:38.240948: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:12230}
2018-01-15 11:19:38.240997: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:12231, 1 -> 127.0.0.1:12232, 2 -> localhost:12233, 3 -> 127.0.0.1:12234}
2018-01-15 11:19:38.241403: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12233
2018-01-15 11:19:38.242178: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
E0115 11:19:38.242392991 2516 ev_epoll1_linux.c:1051] grpc epoll fd: 55
2018-01-15 11:19:38.247020: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:12230}
2018-01-15 11:19:38.247056: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:12231, 1 -> 127.0.0.1:12232, 2 -> 127.0.0.1:12233, 3 -> localhost:12234}
2018-01-15 11:19:38.247372: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12234
2018-01-15 11:19:41.789174: I tensorflow/core/distributed_runtime/master_session.cc:1004] Start master session 1e5bfb978931a13a with config: intra_op_parallelism_threads: 1 device_filters: "/job:ps" device_filters: "/job:worker/task:3/cpu:0" inter_op_parallelism_threads: 2
2018-01-15 11:19:41.807742: I tensorflow/core/distributed_runtime/master_session.cc:1004] Start master session 94c6cd7bd0b0fa12 with config: intra_op_parallelism_threads: 1 device_filters: "/job:ps" device_filters: "/job:worker/task:1/cpu:0" inter_op_parallelism_threads: 2
2018-01-15 11:19:42.002243: I tensorflow/core/distributed_runtime/master_session.cc:1004] Start master session 23214af6a52fc7cf with config: intra_op_parallelism_threads: 1 device_filters: "/job:ps" device_filters: "/job:worker/task:2/cpu:0" inter_op_parallelism_threads: 2


knn940506 commented 6 years ago

One thing is weird: I set num_workers=4, but it looks like Worker-5 is running.

INFO:tensorflow:Saving checkpoint to path /home/joowonkim/tmp/test_gym_a3c/train/model.ckpt
INFO:tensorflow:global/global_step/sec: 100.832
INFO:tensorflow:Error reported to Coordinator: <class 'queue.Empty'>,

Process Worker-5:
Traceback (most recent call last):
  File "/home/joowonkim/anaconda3/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/worker.py", line 241, in run
    trainer.process(sess)
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 747, in process
    data = self.get_data()
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 594, in get_data
    data_streams = [get_it() for get_it in self.data_getter]
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 594, in <listcomp>
    data_streams = [get_it() for get_it in self.data_getter]
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/rollout.py", line 33, in pull_rollout_from_queue
    return queue.get(timeout=600.0)
  File "/home/joowonkim/anaconda3/lib/python3.5/queue.py", line 172, in get
    raise Empty
queue.Empty

Is that normal?

Kismuz commented 6 years ago

@knn940506, the terminal log you provided is OK, no errors there; refer to #23 for details.

No, it is not normal; I see that sub-process error reporting should be improved somehow. I'll take time to see how it should be fixed.
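
One generic way to make multiprocessing workers report their failures (a sketch of the general pattern, not the actual btgym change) is to catch exceptions inside the child's run() and log the full traceback before re-raising:

import logging
import multiprocessing
import traceback

class Worker(multiprocessing.Process):
    def run(self):
        try:
            self.work()
        except Exception:
            # Without this, a child process can die with little more than a
            # terse "Error reported to Coordinator" message on the parent side.
            logging.error('Worker %s crashed:\n%s', self.name, traceback.format_exc())
            raise

    def work(self):
        raise RuntimeError('simulated failure inside the worker')

if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    w = Worker()
    w.start()
    w.join()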

Kismuz commented 6 years ago

@knn940506, I have updated error reporting for child processes. It does not solve the error, but it can give a hint about what's going wrong. Please update the package, run the example and post the traceback here.

Kismuz commented 6 years ago

@knn940506 - I forgot to remove an exception test case, sorry for that. Corrected; please update.

knn940506 commented 6 years ago

Traceback (most recent call last):
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 90, in run
    self._run()
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 117, in _run
    self.queue.put(next(rollout_provider), timeout=600.0)
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 222, in env_runner
    state, reward, terminal, info = env.step(action.argmax())
  File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/gym/core.py", line 96, in step
    return self._step(action)
  File "/home/joowonkim/바탕화면/git/btgym/btgym/envs/backtrader.py", line 750, in _step
    raise ConnectionError(msg)
ConnectionError: .step(): server unreachable with status: .

Exception in thread Thread-4:
Traceback (most recent call last):
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 90, in run
    self._run()
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 117, in _run
    self.queue.put(next(rollout_provider), timeout=600.0)
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 222, in env_runner
    state, reward, terminal, info = env.step(action.argmax())
  File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/gym/core.py", line 96, in step
    return self._step(action)
  File "/home/joowonkim/바탕화면/git/btgym/btgym/envs/backtrader.py", line 750, in _step
    raise ConnectionError(msg)
ConnectionError: .step(): server unreachable with status: .

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/joowonkim/anaconda3/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 95, in run
    raise RuntimeError
RuntimeError

INFO:tensorflow:global/global_step/sec: 40.9994
INFO:tensorflow:global/global_step/sec: 0
INFO:tensorflow:Saving checkpoint to path /home/joowonkim/tmp/test_gym_a3c/train/model.ckpt
INFO:tensorflow:global/global_step/sec: 0
INFO:tensorflow:global/global_step/sec: 0
INFO:tensorflow:Saving checkpoint to path /home/joowonkim/tmp/test_gym_a3c/train/model.ckpt
INFO:tensorflow:global/global_step/sec: 0
[2018-01-17 04:18:22.827845] ERROR: A3C_1: process() exception occurred

Press Ctrl-C or jupyter:[Kernel]->[Interrupt] for clean exit.

Traceback (most recent call last):
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 1076, in process
    data = self._get_data()
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 634, in _get_data
    data_streams = [get_it() for get_it in self.data_getter]
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 634, in <listcomp>
    data_streams = [get_it() for get_it in self.data_getter]
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/rollout.py", line 33, in pull_rollout_from_queue
    return queue.get(timeout=600.0)
  File "/home/joowonkim/anaconda3/lib/python3.5/queue.py", line 172, in get
    raise Empty
queue.Empty
INFO:tensorflow:Error reported to Coordinator: <class 'RuntimeError'>, process() exception occurred

Press Ctrl-C or jupyter:[Kernel]->[Interrupt] for clean exit.

Process Worker-17:
Traceback (most recent call last):
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 1076, in process
    data = self._get_data()
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 634, in _get_data
    data_streams = [get_it() for get_it in self.data_getter]
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 634, in <listcomp>
    data_streams = [get_it() for get_it in self.data_getter]
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/rollout.py", line 33, in pull_rollout_from_queue
    return queue.get(timeout=600.0)
  File "/home/joowonkim/anaconda3/lib/python3.5/queue.py", line 172, in get
    raise Empty
queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/joowonkim/anaconda3/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/worker.py", line 257, in run
    sv.stop()
  File "/home/joowonkim/anaconda3/lib/python3.5/contextlib.py", line 77, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/training/supervisor.py", line 964, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/training/supervisor.py", line 792, in stop
    stop_grace_period_secs=self._stop_grace_secs)
  File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/training/supervisor.py", line 954, in managed_session
    yield sess
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/worker.py", line 257, in run
    sv.stop()
  File "/home/joowonkim/anaconda3/lib/python3.5/contextlib.py", line 77, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 4339, in get_controller
    yield default
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/worker.py", line 250, in run
    trainer.process(sess)
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 1145, in process
    raise RuntimeError(msg)
RuntimeError: process() exception occurred

Press Ctrl-C or jupyter:[Kernel]->[Interrupt] for clean exit.

[2018-01-17 04:18:32.567306] ERROR: A3C_2: process() exception occurred

knn940506 commented 6 years ago

[2018-01-17 04:18:53.225776] ERROR: A3C_0: process() exception occurred

Press Ctrl-C or jupyter:[Kernel]->[Interrupt] for clean exit.

Traceback (most recent call last):
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 1076, in process
    data = self._get_data()
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 634, in _get_data
    data_streams = [get_it() for get_it in self.data_getter]
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 634, in <listcomp>
    data_streams = [get_it() for get_it in self.data_getter]
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/rollout.py", line 33, in pull_rollout_from_queue
    return queue.get(timeout=600.0)
  File "/home/joowonkim/anaconda3/lib/python3.5/queue.py", line 172, in get
    raise Empty
queue.Empty
INFO:tensorflow:Error reported to Coordinator: <class 'RuntimeError'>, process() exception occurred

Press Ctrl-C or jupyter:[Kernel]->[Interrupt] for clean exit.

knn940506 commented 6 years ago

Similar errors keep occurring. Do you need more logs? I set env verbose=1 and num_workers=4. Thanks!!

Kismuz commented 6 years ago

OK, the base exception occurred here:

File "/home/joowonkim/바탕화면/git/btgym/btgym/envs/backtrader.py", line 750, in _step raise ConnectionError(msg) ConnectionError: .step(): server unreachable with status: .

... for some reason the BTgym server did not respond to the API shell in proper time; everything else is a consequence of that. This is rather strange, but we can track it down:

  1. Run the basic notebook example to ensure a bare environment run is OK: https://github.com/Kismuz/btgym/blob/master/examples/very_basic_env_setup.ipynb If it runs without exceptions (it should just print a lot of info messages), then:
  2. Change the following in a3c_random_on_synth_or_real_data... :
    env_config = dict(
        ...
        kwargs=dict(
            ...
            connect_timeout=180,
            verbose=2,
        )
    )
    ...
    cluster_config = dict(
        ...
        num_workers=1,
        num_ps=1,
        num_envs=1,
        ...
    )
    ...
    launcher = Launcher(
        ...
        verbose=2,
    )

and paste the log output up to the error mentioned above.

knn940506 commented 6 years ago

Step 1 works well.

In the Jupyter notebook:


[2018-01-18 08:22:00.041878] DEBUG: BTgymServer_0: Episode countdown started at: 1393, END OF DATA, r:-0.2578244975861855
[2018-01-18 08:22:00.044134] DEBUG: BTgymServer_0: Episode countdown contd. at: 1394, CLOSE, END OF DATA, r:-0.2578244975861855
[2018-01-18 08:22:00.045461] DEBUG: BTgymServer_0: Episode countdown contd. at: 1395, CLOSE, END OF DATA, r:-0.2578244975861855
[2018-01-18 08:22:00.046319] DEBUG: BTgymServer_0: COMM recieved: {'action': 'hold'}
[2018-01-18 08:22:00.046877] DEBUG: BTgymServer_0: RunStop() invoked with CLOSE, END OF DATA
[2018-01-18 08:22:00.975725] DEBUG: BTgymServer_0: Episode elapsed time: 0:00:01.763553.
[2018-01-18 08:23:00.106587] ERROR: ThreadRunner_0: RunTime exception occurred.

Press Ctrl-C or jupyter:[Kernel]->[Interrupt] for clean exit.

Traceback (most recent call last):
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 90, in run
    self._run()
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 117, in _run
    self.queue.put(next(rollout_provider), timeout=600.0)
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 263, in env_runner
    episode_stat = env.get_stat()  # get episode statistic
  File "/home/joowonkim/바탕화면/git/btgym/btgym/envs/backtrader.py", line 772, in get_stat
    if self._force_control_mode():
  File "/home/joowonkim/바탕화면/git/btgym/btgym/envs/backtrader.py", line 545, in _force_control_mode
    self.server_response = self.socket.recv_pyobj()
  File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/zmq/sugar/socket.py", line 491, in recv_pyobj
    msg = self.recv(flags)
  File "zmq/backend/cython/socket.pyx", line 693, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 727, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 150, in zmq.backend.cython.socket._recv_copy
  File "zmq/backend/cython/socket.pyx", line 145, in zmq.backend.cython.socket._recv_copy
  File "zmq/backend/cython/checkrc.pxd", line 19, in zmq.backend.cython.checkrc._check_rc
zmq.error.Again: Resource temporarily unavailable

Exception in thread Thread-4:
Traceback (most recent call last):
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 90, in run
    self._run()
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 117, in _run
    self.queue.put(next(rollout_provider), timeout=600.0)
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 263, in env_runner
    episode_stat = env.get_stat()  # get episode statistic
  File "/home/joowonkim/바탕화면/git/btgym/btgym/envs/backtrader.py", line 772, in get_stat
    if self._force_control_mode():
  File "/home/joowonkim/바탕화면/git/btgym/btgym/envs/backtrader.py", line 545, in _force_control_mode
    self.server_response = self.socket.recv_pyobj()
  File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/zmq/sugar/socket.py", line 491, in recv_pyobj
    msg = self.recv(flags)
  File "zmq/backend/cython/socket.pyx", line 693, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 727, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 150, in zmq.backend.cython.socket._recv_copy
  File "zmq/backend/cython/socket.pyx", line 145, in zmq.backend.cython.socket._recv_copy
  File "zmq/backend/cython/checkrc.pxd", line 19, in zmq.backend.cython.checkrc._check_rc
zmq.error.Again: Resource temporarily unavailable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/joowonkim/anaconda3/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/runner.py", line 95, in run
    raise RuntimeError
RuntimeError

INFO:tensorflow:global/global_step/sec: 5.66658
INFO:tensorflow:global/global_step/sec: 0
INFO:tensorflow:Saving checkpoint to path /home/joowonkim/tmp/test_gym_a3c/train/model.ckpt
INFO:tensorflow:global/global_step/sec: 0
INFO:tensorflow:global/global_step/sec: 0
INFO:tensorflow:Saving checkpoint to path /home/joowonkim/tmp/test_gym_a3c/train/model.ckpt
INFO:tensorflow:global/global_step/sec: 0
[2018-01-18 08:31:59.980364] ERROR: A3C_0: process() exception occurred

Press Ctrl-C or jupyter:[Kernel]->[Interrupt] for clean exit.

Traceback (most recent call last):
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 1076, in process
    data = self._get_data()
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 634, in _get_data
    data_streams = [get_it() for get_it in self.data_getter]
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 634, in <listcomp>
    data_streams = [get_it() for get_it in self.data_getter]
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/rollout.py", line 33, in pull_rollout_from_queue
    return queue.get(timeout=600.0)
  File "/home/joowonkim/anaconda3/lib/python3.5/queue.py", line 172, in get
    raise Empty
queue.Empty
INFO:tensorflow:Error reported to Coordinator: <class 'RuntimeError'>, process() exception occurred


knn940506 commented 6 years ago

In the terminal:


E0118 17:21:45.308309060 19328 ev_epoll1_linux.c:1051] grpc epoll fd: 52
2018-01-18 17:21:45.312629: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:12230}
2018-01-18 17:21:45.312629: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:12230}
2018-01-18 17:21:45.312664: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:12231}
2018-01-18 17:21:45.312664: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:12231}
2018-01-18 17:21:45.312991: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12231
2018-01-18 17:21:45.313294: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12230
2018-01-18 17:21:49.566307: I tensorflow/core/distributed_runtime/master_session.cc:1004] Start master session ad2b7177ea7201bf with config: intra_op_parallelism_threads: 1 device_filters: "/job:ps" device_filters: "/job:worker/task:0/cpu:0" inter_op_parallelism_threads: 2


Thanks for your work :+1: :+1:

Kismuz commented 6 years ago

@knn940506, I have corrected some unsafe code which could potentially lead to such an exception. The problem is I can't verify it locally, as no such error appears with my work setup (macOS).

Update btgym and run again. If the error remains, create a notebook in the /examples directory and run the following code in it:

import os
import backtrader as bt
from btgym import BTgymEnv, BTgymDataset
from btgym.strategy.observers import Reward, Position, NormPnL
from btgym.research import DevStrat_4_6

MyCerebro = bt.Cerebro()
MyCerebro.addstrategy(
    DevStrat_4_6,
    drawdown_call=5, # max % to lose, in percent of initial cash
    target_call=10,  # max % to win, same
    skip_frame=10,
)
# Set leveraged account:
MyCerebro.broker.setcash(2000)
MyCerebro.broker.setcommission(commission=0.0001, leverage=10.0) # commission to imitate spread
MyCerebro.addsizer(bt.sizers.SizerFix, stake=5000,)  

# Visualisations for reward, position and PnL dynamics:
MyCerebro.addobserver(Reward)
MyCerebro.addobserver(Position)
MyCerebro.addobserver(NormPnL)

MyDataset = BTgymDataset(
    #filename='./data/DAT_ASCII_EURUSD_M1_201703.csv',
    #filename='./data/DAT_ASCII_EURUSD_M1_201704.csv',
    filename='./data/test_sine_1min_period256_delta0002.csv',
    start_weekdays={0, 1, 2, 3},
    episode_duration={'days': 0, 'hours': 23, 'minutes': 55},
    start_00=False,
    time_gap={'hours': 6},
)

env_config = dict(
    class_ref=BTgymEnv,
    kwargs=dict(
        dataset=MyDataset,
        engine=MyCerebro,
        render_modes=['episode', 'human','external'],
        render_state_as_image=True,
        render_ylabel='OHL_diff.',
        render_size_episode=(12,8),
        render_size_human=(9, 4),
        render_size_state=(11, 3),
        render_dpi=75,
        port=5000,
        data_port=4999,
        verbose=1,
    )
)

# Make environment:
env = env_config['class_ref'](**env_config['kwargs'])

# Run several episodes with statistic fetches:
for episode in range(4):
    o = env.reset()
    done = False
    while not done:
        obs, reward, done, info = env.step(env.action_space.sample())
    episode_stat = env.get_stat() 
    for k, v in episode_stat.items():
        print('{}: {}'.format(k, v))

env.close()

Is any exception raised? If yes, please provide feedback.

knn940506 commented 6 years ago

I updated btgym, but aac.py has an error!

Traceback (most recent call last):
  File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/tensor_util.py", line 468, in make_tensor_proto
    str_values = [compat.as_bytes(x) for x in proto_values]
  File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/tensor_util.py", line 468, in <listcomp>
    str_values = [compat.as_bytes(x) for x in proto_values]
  File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/util/compat.py", line 65, in as_bytes
    (bytes_or_text,))
TypeError: Expected binary or unicode string, got {'trial_num': <tf.Tensor 'local/on_policy_state_in_metadata_trial_num_pl:0' shape=(?,) dtype=float32>, 'type': <tf.Tensor 'local/on_policy_state_in_metadata_type_pl:0' shape=(?,) dtype=float32>, 'first_row': <tf.Tensor 'local/on_policy_state_in_metadata_first_row_pl:0' shape=(?,) dtype=float32>, 'sample_num': <tf.Tensor 'local/on_policy_state_in_metadata_sample_num_pl:0' shape=(?,) dtype=float32>}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/joowonkim/바탕화면/git/btgym/btgym/algorithms/aac.py", line 492, in __init__
    self.inc_step = self.global_step.assign_add(tf.shape(pi.on_state_in[list(pi.on_state_in.keys())[0]])[0])
  File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/ops/array_ops.py", line 271, in shape
    return shape_internal(input, name, optimize=True, out_type=out_type)
  File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/ops/array_ops.py", line 295, in shape_internal
    input_tensor = ops.convert_to_tensor(input)
  File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 836, in convert_to_tensor
    as_ref=False)
  File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 926, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/constant_op.py", line 229, in _constant_tensor_conversion_function
    return constant(v, dtype=dtype, name=name)
  File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/constant_op.py", line 208, in constant
    value, dtype=dtype, shape=shape, verify_shape=verify_shape))
  File "/home/joowonkim/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/tensor_util.py", line 472, in make_tensor_proto
    "supported type." % (type(values), values))
TypeError: Failed to convert object of type <class 'dict'> to Tensor. Contents: {'trial_num': <tf.Tensor 'local/on_policy_state_in_metadata_trial_num_pl:0' shape=(?,) dtype=float32>, 'type': <tf.Tensor 'local/on_policy_state_in_metadata_type_pl:0' shape=(?,) dtype=float32>, 'first_row': <tf.Tensor 'local/on_policy_state_in_metadata_first_row_pl:0' shape=(?,) dtype=float32>, 'sample_num': <tf.Tensor 'local/on_policy_state_in_metadata_sample_num_pl:0' shape=(?,) dtype=float32>}. Consider casting elements to a supported type.
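
For what it's worth, the underlying TensorFlow complaint is reproducible in isolation: tf.shape() tries to convert its argument to a tensor, and a dict of placeholders cannot be converted. A minimal TF 1.x sketch of the same failure (illustrative only, not btgym code):

import tensorflow as tf  # TF 1.x API assumed, matching the traceback above

state_in = {
    'trial_num': tf.placeholder(tf.float32, [None]),
    'type': tf.placeholder(tf.float32, [None]),
}

try:
    # Passing the whole dict instead of one of its tensors triggers
    # "Failed to convert object of type <class 'dict'> to Tensor".
    batch_size = tf.shape(state_in)[0]
except TypeError as e:
    print('TypeError:', e)

# Selecting an actual leaf tensor works:
batch_size = tf.shape(state_in['trial_num'])[0]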

knn940506 commented 6 years ago

No exception is raised by your new example code. Here's the result:

[2018-01-19 01:55:44.229338] INFO: BTgymAPIshell_0: ...done.
[2018-01-19 01:55:44.230378] INFO: BTgymAPIshell_0: Custom Cerebro class used.
[2018-01-19 01:55:44.318731] INFO: BTgymServer_0: PID: 28047
[2018-01-19 01:55:45.318373] INFO: BTgymAPIshell_0: Server started, pinging tcp://127.0.0.1:5000 ...
[2018-01-19 01:55:45.321071] INFO: BTgymAPIshell_0: Server seems ready with response: <{'ctrl': 'send control keys: <_reset>, <_getstat>, <_render>, <_stop>.'}>
[2018-01-19 01:55:45.322550] INFO: BTgymAPIshell_0: Environment is ready.
[2018-01-19 01:55:45.327601] INFO: BTgymAPIshell_0: Data domain reset() called prior to reset_data() with [possibly inconsistent] defaults.
[2018-01-19 01:55:45.332980] INFO: SimpleDataSet_0: New sample id: <train_trial_w_0_num_0_at_2017-01-03 12:47:00>.
[2018-01-19 01:55:45.337404] INFO: SimpleDataSet_0: New sample id: <train_trial_w_0_num_1_at_2017-01-05 02:48:00>.
[2018-01-19 01:55:45.357896] INFO: Trial_0: New sample id: <train_episode_w_0_num_0_at_2017-01-03 12:47:00>.
[2018-01-19 01:55:47.013175] INFO: SimpleDataSet_0: New sample id: <train_trial_w_0_num_2_at_2017-01-03 09:38:00>.
[2018-01-19 01:55:47.025657] INFO: Trial_0: New sample id: <train_episode_w_0_num_0_at_2017-01-05 02:48:00>.
episode: 0
length: 1380
runtime: 0:00:01.593744
[2018-01-19 01:55:48.638609] INFO: SimpleDataSet_0: New sample id: <train_trial_w_0_num_3_at_2017-01-03 21:30:00>.
[2018-01-19 01:55:48.653948] INFO: Trial_0: New sample id: <train_episode_w_0_num_0_at_2017-01-03 09:38:00>.
episode: 1
length: 1424
runtime: 0:00:01.553601
[2018-01-19 01:55:50.253536] INFO: SimpleDataSet_0: New sample id: <train_trial_w_0_num_4_at_2017-01-04 11:51:00>.
[2018-01-19 01:55:50.264350] INFO: Trial_0: New sample id: <train_episode_w_0_num_0_at_2017-01-03 21:30:00>.
episode: 2
length: 1424
runtime: 0:00:01.539417
[2018-01-19 01:55:51.793564] INFO: BTgymServer_0: Exiting.
episode: 3
length: 1424
runtime: 0:00:01.394918
[2018-01-19 01:55:51.795087] INFO: BTgymAPIshell_0: Exiting.
Exit code: None
[2018-01-19 01:55:51.796303] INFO: BTgymDataServer_0: {'ctrl': 'Exiting.'}
[2018-01-19 01:55:51.797510] INFO: BTgymAPIshell_0: {'ctrl': 'Exiting.'}
Exit code: None
[2018-01-19 01:55:51.798299] INFO: BTgymAPIshell_0: Environment closed.

Kismuz commented 6 years ago

That one was tricky, but it's good that it popped out. Corrected; please update and try again. I also installed Python 3.5 (same as yours, in case the error is version dependent) and ran the tests, but it still works on my machine.

knn940506 commented 6 years ago

Sadly, it doesn't work. Maybe the error comes from something else. I'll give you feedback soon. Thanks a lot :)

Kismuz commented 6 years ago

@knn940506, I have recently implemented another type of runner that doesn't rely on a queue; it can be found at btgym.algorithms.runner.synchro.BaseSynchroRunner. Example usage can be found in the MLDG implementation: https://github.com/Kismuz/btgym/tree/develop_meta_learning_gradient
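
The idea behind a synchronous runner is that the trainer collects each rollout in its own thread of control, so there is no background producer and no queue.get() timeout to hit. A rough sketch of the concept (assumed interface for illustration, not the actual BaseSynchroRunner API):

def get_rollout(env, policy, rollout_length=20):
    # Collect one rollout synchronously: no background thread, no queue,
    # so a dead environment fails loudly here instead of via queue.Empty.
    rollout = []
    state = env.reset()
    for _ in range(rollout_length):
        action = policy.act(state)
        next_state, reward, done, info = env.step(action)
        rollout.append((state, action, reward, done))
        state = env.reset() if done else next_state
    return rollout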