Closed: signalprime closed this issue 4 years ago.
Today I tested this with a TF 1.14.0 docker image (tensorflow/tensorflow:1.14.0-gpu-py3-jupyter). The results were the same as in Colab: the workers exit and the cell never finishes.
I'm stuck; has anyone else seen and solved this already?
It happens with CPU-only software configurations as well as on machines with GPUs.
@greg234234,
This is what my current env is running with:
tensorflow : 1.12.0
opencv-python : 3.4.4.19
gym[atari] : 0.1.7
backtrader : 1.9.69.122
pyzmq : 17.1.2
matplotlib : 2.0.2
pillow : 3.1.2
numpy : 1.16.4
scipy : 1.3.0
pandas : 0.23.4
ipython : 7.2.0
psutil : 5.4.8
logbook : 1.4.1
This works for me; you can use it as a reference. You can certainly update some of them, but it won't work if you upgrade to TensorFlow 2.0 :)
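If it helps, here is a small generic snippet (not BTGym-specific; the package names are simply the ones listed above) to print the versions actually installed in an environment so they can be compared against this list:

# Quick environment check: print installed versions of the packages above.
import pkg_resources

packages = ['tensorflow', 'opencv-python', 'gym', 'backtrader', 'pyzmq',
            'matplotlib', 'pillow', 'numpy', 'scipy', 'pandas', 'ipython',
            'psutil', 'logbook']

for name in packages:
    try:
        print(name, pkg_resources.get_distribution(name).version)
    except pkg_resources.DistributionNotFound:
        print(name, 'not installed')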
So I tried this on two machines: one a TensorFlow docker container, the other a plain CPU system. I installed the requirements and did a fresh reboot, and it still hangs on both machines. Small note: opencv-python and backtrader had issues at those exact versions, and I had to upgrade gym because it was missing some referenced members. Here's the log:
</root/tmp/gps> already exists. Override[y/n]? y
[2019-11-11 13:19:55.770059] NOTICE: LauncherShell: files in: /root/tmp/gps purged.
[2019-11-11 13:20:00.712999] NOTICE: GuidedA3C_0: learn_rate: 0.000100, entropy_beta: 0.010000
********************************************************************************************
** Press `Ctrl-C` or jupyter:[Kernel]->[Interrupt] to stop training and close launcher. **
********************************************************************************************
[2019-11-11 13:20:03.835307] NOTICE: GuidedA3C_0: guided_lambda: 1.000000, guided_decay_steps: 10000000
[2019-11-11 13:20:04.483018] NOTICE: GuidedA3C_3: learn_rate: 0.000100, entropy_beta: 0.010000
[2019-11-11 13:20:04.511662] NOTICE: GuidedA3C_2: learn_rate: 0.000100, entropy_beta: 0.010000
[2019-11-11 13:20:04.518251] NOTICE: GuidedA3C_1: learn_rate: 0.000100, entropy_beta: 0.010000
[2019-11-11 13:20:07.205794] NOTICE: Worker_0: initializing all parameters...
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
[2019-11-11 13:20:07.869133] NOTICE: GuidedA3C_3: guided_lambda: 1.000000, guided_decay_steps: 10000000
[2019-11-11 13:20:07.887962] NOTICE: GuidedA3C_2: guided_lambda: 1.000000, guided_decay_steps: 10000000
[2019-11-11 13:20:07.905571] NOTICE: GuidedA3C_1: guided_lambda: 1.000000, guided_decay_steps: 10000000
[2019-11-11 13:20:07.908683] NOTICE: Worker_0: no saved model parameters found in:
/root/tmp/gps/current_train_checkpoint
[2019-11-11 13:20:07.909364] NOTICE: Worker_0: training from scratch...
[2019-11-11 13:20:08.989870] NOTICE: BTgymDataServer_0: Initial global_time set to: 2016-12-31 18:00:00 / stamp: 1483228800.0
[2019-11-11 13:20:09.407626] NOTICE: synchro_Runner_0: started collecting data.
[2019-11-11 13:20:09.471785] NOTICE: Worker_0: started training at step: 0
[2019-11-11 13:20:11.357331] NOTICE: Worker_2: initializing all parameters...
[2019-11-11 13:20:11.526986] NOTICE: Worker_3: initializing all parameters...
[2019-11-11 13:20:11.588964] NOTICE: Worker_1: initializing all parameters...
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Running local_init_op.
[2019-11-11 13:20:12.171765] NOTICE: Worker_2: no saved model parameters found in:
/root/tmp/gps/current_train_checkpoint
[2019-11-11 13:20:12.172810] NOTICE: Worker_2: training from scratch...
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Done running local_init_op.
[2019-11-11 13:20:12.291240] NOTICE: Worker_1: no saved model parameters found in:
/root/tmp/gps/current_train_checkpoint
[2019-11-11 13:20:12.292176] NOTICE: Worker_1: training from scratch...
[2019-11-11 13:20:12.329718] NOTICE: Worker_3: no saved model parameters found in:
/root/tmp/gps/current_train_checkpoint
[2019-11-11 13:20:12.330467] NOTICE: Worker_3: training from scratch...
[2019-11-11 13:20:13.680856] NOTICE: synchro_Runner_2: started collecting data.
[2019-11-11 13:20:13.743361] NOTICE: Worker_2: started training at step: 80
[2019-11-11 13:20:13.882107] NOTICE: synchro_Runner_1: started collecting data.
[2019-11-11 13:20:13.885610] NOTICE: synchro_Runner_3: started collecting data.
[2019-11-11 13:20:13.943844] NOTICE: Worker_3: started training at step: 100
[2019-11-11 13:20:13.947857] NOTICE: Worker_1: started training at step: 100
[2019-11-11 13:20:49.442763] NOTICE: Worker_1: reached 10013 steps, exiting.
[2019-11-11 13:20:49.443457] NOTICE: Worker_3: reached 10048 steps, exiting.
[2019-11-11 13:20:49.459466] NOTICE: Worker_2: reached 10033 steps, exiting.
[2019-11-11 13:20:50.495601] NOTICE: Worker_0: reached 10057 steps, exiting.
It does not end the process and display the graphs; it simply stays at this point.
The model seems to be working fine. To see the results and graphs you need to open TensorBoard.
Yes, TensorBoard works fine. But have you seen the matplotlib charts from backtrader that appear when the workers exit? That suddenly stopped happening and I'm struggling to isolate why.
In other words, the backtest that shows the training results never runs.
You don't need to wait until the training loop ends to see how well the model performs. Open TensorBoard to see your model's results while it trains.
Here's a copy of the chart I'm looking for. It appears on one of my machines, and the workers join after I close the chart. Perhaps it's related to python3-tk.
The interesting thing is that this machine is running tensorflow-gpu 1.13.2 and is confirmed to access the GPUs properly. However, with this configuration the backtrader chart appears without a moment of training, even with the step count set as high as 100 million.
In a docker environment on the same machine, however, training can continue for a long time without the backtrader chart ever appearing. Still searching for clarity.
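One thing that may be worth checking (just a generic diagnostic, nothing BTGym-specific): which matplotlib backend is active in each environment, since an interactive TkAgg backend (python3-tk) can block on the chart window, while a headless docker image typically falls back to a non-interactive backend and never shows it:

# Print the active matplotlib backend in each environment.
import matplotlib
print(matplotlib.get_backend())   # e.g. 'TkAgg' on a desktop, 'agg' in a headless docker

# To rule the GUI backend out, force the non-interactive Agg backend
# before anything imports pyplot:
# matplotlib.use('Agg')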
From my docker, running guided_a3c.ipynb, this cell never finishes:
launcher.run()
[2019-12-15 16:00:11.392438] NOTICE: LauncherShell: </root/tmp/gps> created.
./data/test_sine_1min_period256_delta0002.csv
[2019-12-15 16:00:16.409990] NOTICE: GuidedA3C_0: learn_rate: 0.000100, entropy_beta: 0.010000
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /workspace/btgym/btgym/algorithms/nn/networks.py:58: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
WARNING:tensorflow:From /workspace/btgym/btgym/algorithms/policy/stacked_lstm.py:189: flatten (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.flatten instead.
WARNING:tensorflow:From /workspace/btgym/btgym/algorithms/nn/layers.py:34: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.random.categorical instead.
WARNING:tensorflow:From /workspace/btgym/btgym/algorithms/nn/networks.py:144: MultiRNNCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This class is equivalent as tf.keras.layers.StackedRNNCells, and will be replaced by that in Tensorflow 2.0.
WARNING:tensorflow:From /workspace/btgym/btgym/algorithms/nn/networks.py:159: static_rnn (from tensorflow.python.ops.rnn) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `keras.layers.RNN(cell, unroll=True)`, which is equivalent to this API
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/rnn.py:1370: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
********************************************************************************************
** Press `Ctrl-C` or jupyter:[Kernel]->[Interrupt] to stop training and close launcher. **
********************************************************************************************
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/learning_rate_decay_v2.py:321: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/losses/losses_impl.py:667: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
[2019-12-15 16:00:19.656170] NOTICE: GuidedA3C_0: guided_lambda: 1.000000, guided_decay_steps: 10000000
[2019-12-15 16:00:20.161365] NOTICE: GuidedA3C_1: learn_rate: 0.000100, entropy_beta: 0.010000
[2019-12-15 16:00:20.186929] NOTICE: GuidedA3C_3: learn_rate: 0.000100, entropy_beta: 0.010000
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
[2019-12-15 16:00:20.197379] NOTICE: GuidedA3C_2: learn_rate: 0.000100, entropy_beta: 0.010000
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /workspace/btgym/btgym/algorithms/nn/networks.py:58: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /workspace/btgym/btgym/algorithms/nn/networks.py:58: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
WARNING:tensorflow:From /workspace/btgym/btgym/algorithms/nn/networks.py:58: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
WARNING:tensorflow:From /workspace/btgym/btgym/algorithms/policy/stacked_lstm.py:189: flatten (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.flatten instead.
WARNING:tensorflow:From /workspace/btgym/btgym/algorithms/policy/stacked_lstm.py:189: flatten (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.flatten instead.
WARNING:tensorflow:From /workspace/btgym/btgym/algorithms/policy/stacked_lstm.py:189: flatten (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.flatten instead.
WARNING:tensorflow:From /workspace/btgym/btgym/algorithms/nn/layers.py:34: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.random.categorical instead.
WARNING:tensorflow:From /workspace/btgym/btgym/algorithms/nn/layers.py:34: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.random.categorical instead.
WARNING:tensorflow:From /workspace/btgym/btgym/algorithms/nn/networks.py:144: MultiRNNCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This class is equivalent as tf.keras.layers.StackedRNNCells, and will be replaced by that in Tensorflow 2.0.
WARNING:tensorflow:From /workspace/btgym/btgym/algorithms/nn/networks.py:144: MultiRNNCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This class is equivalent as tf.keras.layers.StackedRNNCells, and will be replaced by that in Tensorflow 2.0.
WARNING:tensorflow:From /workspace/btgym/btgym/algorithms/nn/networks.py:159: static_rnn (from tensorflow.python.ops.rnn) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `keras.layers.RNN(cell, unroll=True)`, which is equivalent to this API
WARNING:tensorflow:From /workspace/btgym/btgym/algorithms/nn/layers.py:34: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.random.categorical instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/rnn.py:1370: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /workspace/btgym/btgym/algorithms/nn/networks.py:159: static_rnn (from tensorflow.python.ops.rnn) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `keras.layers.RNN(cell, unroll=True)`, which is equivalent to this API
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/rnn.py:1370: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /workspace/btgym/btgym/algorithms/nn/networks.py:144: MultiRNNCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This class is equivalent as tf.keras.layers.StackedRNNCells, and will be replaced by that in Tensorflow 2.0.
WARNING:tensorflow:From /workspace/btgym/btgym/algorithms/nn/networks.py:159: static_rnn (from tensorflow.python.ops.rnn) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `keras.layers.RNN(cell, unroll=True)`, which is equivalent to this API
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/rnn.py:1370: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
[2019-12-15 16:00:22.997544] NOTICE: Worker_0: initializing all parameters...
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/learning_rate_decay_v2.py:321: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/losses/losses_impl.py:667: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/learning_rate_decay_v2.py:321: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/learning_rate_decay_v2.py:321: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
INFO:tensorflow:Running local_init_op.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/losses/losses_impl.py:667: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/losses/losses_impl.py:667: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
[2019-12-15 16:00:23.732355] NOTICE: GuidedA3C_1: guided_lambda: 1.000000, guided_decay_steps: 10000000
INFO:tensorflow:Done running local_init_op.
[2019-12-15 16:00:23.813876] NOTICE: GuidedA3C_2: guided_lambda: 1.000000, guided_decay_steps: 10000000
[2019-12-15 16:00:23.827208] NOTICE: GuidedA3C_3: guided_lambda: 1.000000, guided_decay_steps: 10000000
[2019-12-15 16:00:23.880390] NOTICE: Worker_0: no saved model parameters found in:
/root/tmp/gps/current_train_checkpoint
[2019-12-15 16:00:23.881161] NOTICE: Worker_0: training from scratch...
[2019-12-15 16:00:25.417811] NOTICE: BTgymDataServer_0: Initial global_time set to: 2017-01-01 00:00:00 / stamp: 1483228800.0
[2019-12-15 16:00:25.929226] NOTICE: synchro_Runner_0: started collecting data.
[2019-12-15 16:00:26.003848] NOTICE: Worker_0: started training at step: 0
[2019-12-15 16:00:27.147585] NOTICE: Worker_1: initializing all parameters...
[2019-12-15 16:00:27.217875] NOTICE: Worker_3: initializing all parameters...
[2019-12-15 16:00:27.344270] NOTICE: Worker_2: initializing all parameters...
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
[2019-12-15 16:00:28.006344] NOTICE: Worker_1: no saved model parameters found in:
/root/tmp/gps/current_train_checkpoint
[2019-12-15 16:00:28.007138] NOTICE: Worker_1: training from scratch...
[2019-12-15 16:00:28.076117] NOTICE: Worker_3: no saved model parameters found in:
/root/tmp/gps/current_train_checkpoint
[2019-12-15 16:00:28.077098] NOTICE: Worker_3: training from scratch...
INFO:tensorflow:Done running local_init_op.
[2019-12-15 16:00:28.205010] NOTICE: Worker_2: no saved model parameters found in:
/root/tmp/gps/current_train_checkpoint
[2019-12-15 16:00:28.205715] NOTICE: Worker_2: training from scratch...
[2019-12-15 16:00:29.675000] NOTICE: synchro_Runner_1: started collecting data.
[2019-12-15 16:00:29.733863] NOTICE: synchro_Runner_3: started collecting data.
[2019-12-15 16:00:29.766420] NOTICE: Worker_1: started training at step: 0
[2019-12-15 16:00:29.808007] NOTICE: synchro_Runner_2: started collecting data.
[2019-12-15 16:00:29.820500] NOTICE: Worker_3: started training at step: 0
[2019-12-15 16:00:29.885976] NOTICE: Worker_2: started training at step: 0
[2019-12-15 16:00:32.328945] NOTICE: Worker_3: reached 140 steps, exiting.
[2019-12-15 16:00:32.401396] NOTICE: Worker_1: reached 160 steps, exiting.
[2019-12-15 16:00:32.454336] NOTICE: Worker_2: reached 160 steps, exiting.
[2019-12-15 16:00:32.716426] NOTICE: Worker_0: reached 100 steps, exiting.
I'm starting to think the project may no longer be active. If it is, I'd love to help solve this and contribute the fixed docker image.
@greg234234 , this project is definitely in 'on hold' mode (from my side), since I'm entirely involved in a commercial crypto-arbitrage project. Meanwhile, since people keep using btgym, any contribution is valuable and will be greatly appreciated; that is especially true for docker, since I have little hands-on experience with containers.
Thanks for responding @Kismuz, that's awesome you're working on a commercial project.
Can I ask about the standard expected behavior? In this thread it seems to revolve around signal.pause() in algorithms/launcher/base.py: a signal is never received even though the workers exit.
Should I be looking deeper? I only know the basics of threading, but I don't think I can simply comment this out, because a pause for the user would surely need to be handled differently.
Signal docs: https://docs.python.org/3/library/signal.html
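For reference, here is a minimal sketch of the general pattern (my own illustration, not the actual BTGym launcher code), assuming the launcher blocks on signal.pause() after starting its worker processes; it reproduces the behavior where the workers finish on their own but the main process keeps waiting until it receives a signal such as Ctrl-C:

import signal
import time
import multiprocessing as mp

def worker(seconds):
    # stand-in for a training worker that exits on its own
    time.sleep(seconds)

if __name__ == '__main__':
    workers = [mp.Process(target=worker, args=(1,)) for _ in range(4)]
    for w in workers:
        w.start()
    # Even after every worker has exited, signal.pause() keeps blocking here
    # until the process itself receives a signal (e.g. SIGINT from Ctrl-C),
    # which matches the observed "workers exit but the cell never finishes".
    try:
        signal.pause()
    except KeyboardInterrupt:
        pass
    for w in workers:
        w.join()
    print('all workers joined')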
Here's the detailed log; I hit Ctrl-C to break out and finish.
[2019-12-20 22:33:49.656496] DEBUG: BTgymAPIshell_0: got action: OrderedDict([('default_asset', 3)]) as <class 'collections.OrderedDict'>
[2019-12-20 22:33:49.656796] DEBUG: BTgymServer_0: COMM received: {'action': {'default_asset': 'close'}}
[2019-12-20 22:33:49.665137] DEBUG: BTgymAPIshell_0: got action: OrderedDict([('default_asset', 2)]) as <class 'collections.OrderedDict'>
[2019-12-20 22:33:49.665436] DEBUG: BTgymServer_0: COMM received: {'action': {'default_asset': 'sell'}}
[2019-12-20 22:33:49.676527] DEBUG: BTgymAPIshell_0: got action: OrderedDict([('default_asset', 2)]) as <class 'collections.OrderedDict'>
[2019-12-20 22:33:49.676806] DEBUG: BTgymServer_0: COMM received: {'action': {'default_asset': 'sell'}}
[2019-12-20 22:33:49.685793] DEBUG: BTgymAPIshell_0: got action: OrderedDict([('default_asset', 2)]) as <class 'collections.OrderedDict'>
[2019-12-20 22:33:49.686130] DEBUG: BTgymServer_0: COMM received: {'action': {'default_asset': 'sell'}}
[2019-12-20 22:33:49.695647] DEBUG: BTgymAPIshell_0: got action: OrderedDict([('default_asset', 3)]) as <class 'collections.OrderedDict'>
[2019-12-20 22:33:49.695941] DEBUG: BTgymServer_0: COMM received: {'action': {'default_asset': 'close'}}
[2019-12-20 22:33:49.705886] DEBUG: BTgymAPIshell_0: got action: OrderedDict([('default_asset', 2)]) as <class 'collections.OrderedDict'>
[2019-12-20 22:33:49.706155] DEBUG: BTgymServer_0: COMM received: {'action': {'default_asset': 'sell'}}
[2019-12-20 22:33:49.714727] DEBUG: BTgymAPIshell_0: got action: OrderedDict([('default_asset', 3)]) as <class 'collections.OrderedDict'>
[2019-12-20 22:33:49.715016] DEBUG: BTgymServer_0: COMM received: {'action': {'default_asset': 'close'}}
[2019-12-20 22:33:49.726477] DEBUG: BTgymAPIshell_0: got action: OrderedDict([('default_asset', 2)]) as <class 'collections.OrderedDict'>
[2019-12-20 22:33:49.726760] DEBUG: BTgymServer_0: COMM received: {'action': {'default_asset': 'sell'}}
[2019-12-20 22:33:49.737754] DEBUG: BTgymAPIshell_0: got action: OrderedDict([('default_asset', 3)]) as <class 'collections.OrderedDict'>
[2019-12-20 22:33:49.738067] DEBUG: BTgymServer_0: COMM received: {'action': {'default_asset': 'close'}}
[2019-12-20 22:33:49.754157] DEBUG: BTgymAPIshell_0: got action: OrderedDict([('default_asset', 3)]) as <class 'collections.OrderedDict'>
[2019-12-20 22:33:49.754467] DEBUG: BTgymServer_0: COMM received: {'action': {'default_asset': 'close'}}
[2019-12-20 22:33:49.767724] DEBUG: GuidedA3C_0: Got rollout episode. type: False, trial_type: False, is_train: True
[2019-12-20 22:33:49.885278] DEBUG: BTgymAPIshell_0: close.call()
[2019-12-20 22:33:49.885655] DEBUG: BTgymServer_0: COMM received: {'ctrl': '_done'}
[2019-12-20 22:33:49.885776] DEBUG: BTgymServer_0: RunStop() invoked with -
[2019-12-20 22:33:49.889593] DEBUG: BTgymAPIshell_0: FORCE CONTROL MODE attempt: 1.
Response: _DONE SIGNAL RECEIVED
[2019-12-20 22:33:50.415831] DEBUG: BTgymServer_0: Episode run finished.
[2019-12-20 22:33:50.867414] DEBUG: BTgymAPIshell_1: close.call()
[2019-12-20 22:33:50.867780] DEBUG: BTgymServer_1: COMM received: {'ctrl': '_done'}
[2019-12-20 22:33:50.868088] DEBUG: BTgymServer_1: RunStop() invoked with -
[2019-12-20 22:33:50.868217] DEBUG: BTgymAPIshell_1: FORCE CONTROL MODE attempt: 1.
Response: _DONE SIGNAL RECEIVED
[2019-12-20 22:33:50.868934] DEBUG: BTgymServer_1: Episode run finished.
[2019-12-20 22:33:50.869005] DEBUG: BTgymServer_1: Episode elapsed time: 0:00:03.705595.
[2019-12-20 22:33:50.941080] DEBUG: BTgymAPIshell_2: close.call()
[2019-12-20 22:33:50.941466] DEBUG: BTgymServer_2: COMM received: {'ctrl': '_done'}
[2019-12-20 22:33:50.941595] DEBUG: BTgymServer_2: RunStop() invoked with -
[2019-12-20 22:33:50.942231] DEBUG: BTgymAPIshell_2: FORCE CONTROL MODE attempt: 1.
Response: _DONE SIGNAL RECEIVED
[2019-12-20 22:33:50.942627] DEBUG: BTgymServer_2: Episode run finished.
[2019-12-20 22:33:50.942720] DEBUG: BTgymServer_2: Episode elapsed time: 0:00:03.632974.
[2019-12-20 22:33:50.980549] DEBUG: BTgymServer_1: Control mode: received <{'ctrl': '_done'}>
[2019-12-20 22:33:50.980656] DEBUG: BTgymServer_1: Control mode: sent: {'ctrl': 'send control keys: <_reset>, <_getstat>, <_render>, <_stop>.'}
[2019-12-20 22:33:50.980874] DEBUG: BTgymAPIshell_1: FORCE CONTROL MODE attempt: 2.
Response: {'ctrl': 'send control keys: <_reset>, <_getstat>, <_render>, <_stop>.'}
[2019-12-20 22:33:50.981050] DEBUG: BTgymServer_1: Control mode: received <{'ctrl': '_stop'}>
[2019-12-20 22:33:50.981110] INFO: BTgymServer_1: Exiting.
[2019-12-20 22:33:50.981245] INFO: BTgymAPIshell_1: Exiting. Exit code: None
[2019-12-20 22:33:50.981845] INFO: BTgymAPIshell_1: Environment closed.
[2019-12-20 22:33:50.982056] NOTICE: Worker_1: reached 160 steps, exiting.
[2019-12-20 22:33:51.001019] DEBUG: BTgymAPIshell_3: close.call()
[2019-12-20 22:33:51.001378] DEBUG: BTgymServer_3: COMM received: {'ctrl': '_done'}
[2019-12-20 22:33:51.001492] DEBUG: BTgymServer_3: RunStop() invoked with -
[2019-12-20 22:33:51.001543] DEBUG: BTgymAPIshell_3: FORCE CONTROL MODE attempt: 1.
Response: _DONE SIGNAL RECEIVED
[2019-12-20 22:33:51.002355] DEBUG: BTgymServer_3: Episode run finished.
[2019-12-20 22:33:51.002429] DEBUG: BTgymServer_3: Episode elapsed time: 0:00:03.539661.
[2019-12-20 22:33:51.068820] DEBUG: BTgymServer_2: Control mode: received <{'ctrl': '_done'}>
[2019-12-20 22:33:51.068918] DEBUG: BTgymServer_2: Control mode: sent: {'ctrl': 'send control keys: <_reset>, <_getstat>, <_render>, <_stop>.'}
[2019-12-20 22:33:51.069191] DEBUG: BTgymAPIshell_2: FORCE CONTROL MODE attempt: 2.
Response: {'ctrl': 'send control keys: <_reset>, <_getstat>, <_render>, <_stop>.'}
[2019-12-20 22:33:51.069571] DEBUG: BTgymServer_2: Control mode: received <{'ctrl': '_stop'}>
[2019-12-20 22:33:51.069626] INFO: BTgymServer_2: Exiting.
[2019-12-20 22:33:51.070094] INFO: BTgymAPIshell_2: Exiting. Exit code: None
[2019-12-20 22:33:51.071513] INFO: BTgymAPIshell_2: Environment closed.
[2019-12-20 22:33:51.071612] NOTICE: Worker_2: reached 160 steps, exiting.
[2019-12-20 22:33:51.108952] DEBUG: BTgymServer_3: Control mode: received <{'ctrl': '_done'}>
[2019-12-20 22:33:51.109047] DEBUG: BTgymServer_3: Control mode: sent: {'ctrl': 'send control keys: <_reset>, <_getstat>, <_render>, <_stop>.'}
[2019-12-20 22:33:51.109335] DEBUG: BTgymAPIshell_3: FORCE CONTROL MODE attempt: 2.
Response: {'ctrl': 'send control keys: <_reset>, <_getstat>, <_render>, <_stop>.'}
[2019-12-20 22:33:51.109515] DEBUG: BTgymServer_3: Control mode: received <{'ctrl': '_stop'}>
[2019-12-20 22:33:51.109590] INFO: BTgymServer_3: Exiting.
[2019-12-20 22:33:51.109747] INFO: BTgymAPIshell_3: Exiting. Exit code: None
[2019-12-20 22:33:51.110205] INFO: BTgymAPIshell_3: Environment closed.
[2019-12-20 22:33:51.110279] NOTICE: Worker_3: reached 160 steps, exiting.
[2019-12-20 22:33:51.360642] DEBUG: BTgymAPIshell_0: Episode rendering done.
[2019-12-20 22:33:51.361357] DEBUG: BTgymServer_0: Episode elapsed time: 0:00:07.744226.
[2019-12-20 22:33:51.464019] DEBUG: BTgymServer_0: Control mode: received <{'ctrl': '_done'}>
[2019-12-20 22:33:51.464121] DEBUG: BTgymServer_0: Control mode: sent: {'ctrl': 'send control keys: <_reset>, <_getstat>, <_render>, <_stop>.'}
[2019-12-20 22:33:51.464346] DEBUG: BTgymAPIshell_0: FORCE CONTROL MODE attempt: 2.
Response: {'ctrl': 'send control keys: <_reset>, <_getstat>, <_render>, <_stop>.'}
[2019-12-20 22:33:51.464511] DEBUG: BTgymServer_0: Control mode: received <{'ctrl': '_stop'}>
[2019-12-20 22:33:51.464564] INFO: BTgymServer_0: Exiting.
[2019-12-20 22:33:51.464722] INFO: BTgymAPIshell_0: Exiting. Exit code: None
[2019-12-20 22:33:51.465057] DEBUG: BTgymDataServer_0: Received <{'ctrl': '_stop'}>
[2019-12-20 22:33:51.465131] INFO: BTgymDataServer_0: {'ctrl': 'Exiting.'}
[2019-12-20 22:33:51.465306] INFO: BTgymAPIshell_0: {'ctrl': 'Exiting.'} Exit code: None
[2019-12-20 22:33:51.465495] INFO: BTgymAPIshell_0: Environment closed.
[2019-12-20 22:33:51.465560] NOTICE: Worker_0: reached 100 steps, exiting.
^C[2019-12-20 22:34:09.757120] NOTICE: LauncherShell: [greg] Waiting for each worker to finish.
[2019-12-20 22:34:09.757466] NOTICE: LauncherShell: worker_1 has joined.
[2019-12-20 22:34:09.757660] NOTICE: LauncherShell: worker_2 has joined.
[2019-12-20 22:34:09.757820] NOTICE: LauncherShell: worker_3 has joined.
[2019-12-20 22:34:09.757991] NOTICE: LauncherShell: chief_worker_0 has joined.
[2019-12-20 22:34:09.901986] NOTICE: LauncherShell: parameter_server_0 has joined.
[2019-12-20 22:34:09.902071] NOTICE: LauncherShell: Launcher closed.
@JacobHanouna sorry to bother you, friend; do you have any comment on the question above regarding the standard expected behavior?
@greg234234 , the intended graceful exit routine is to explicitly interrupt the jupyter kernel, as mentioned:
Press Ctrl-C
or jupyter:[Kernel]->[Interrupt] to stop training and close launcher
This should kill all spawned processes for sure. Any other way of stopping training usually leads to orphaned processes (usually the btgym data_server). Besides, this routine kills the tf-distributed parameter server process, which is in fact an infinite loop. As @JacobHanouna mentioned, you are supposed to evaluate model performance periodically while training by supplying workers with evaluation data (or you can run a dedicated evaluator process). The proper routine for using a trained model outside the training loop is to save a checkpoint and restore the model later (in jupyter, in console Python mode, or even by making a SavedModel export and running the model inside the C++ runtime) for prediction.
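As a rough illustration of that save/restore workflow (a minimal TF 1.x sketch; the graph, variable names, and checkpoint path below are placeholders, not BTGym's actual API):

import os
import tensorflow as tf  # TF 1.x

# toy graph standing in for the trained policy network
x = tf.placeholder(tf.float32, [None, 4], name='x')
w = tf.get_variable('w', shape=[4, 1])
y = tf.matmul(x, w, name='y')

ckpt_dir = '/tmp/my_model'            # placeholder path
os.makedirs(ckpt_dir, exist_ok=True)
saver = tf.train.Saver()

# training side: save a checkpoint
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, os.path.join(ckpt_dir, 'model.ckpt'))

# later (new notebook / console session): restore and run predictions
with tf.Session() as sess:
    saver.restore(sess, os.path.join(ckpt_dir, 'model.ckpt'))
    # sess.run(y, feed_dict={x: ...})  # inference without the training launcher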
Orphaned processes checklist (found by port number; see the sketch after this list):
all btgym_env port numbers (btgym_shell - btgym_server communication), incremented by one when running distributed training, see here: https://github.com/Kismuz/btgym/blob/master/btgym/algorithms/launcher/base.py#L208
btgym data_server port number (btgym_servers - data_server communication), single number, (see above)
all tf-cluster workers ports, set here: https://github.com/Kismuz/btgym/blob/master/btgym/algorithms/launcher/base.py#L271
tf-cluster parameter_server port (see above)
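A small helper along these lines (my own sketch using psutil, not part of BTGym; the port numbers in the example are placeholders you would replace with the ones from your launcher config) can locate leftover processes by the ports above:

import psutil

def find_processes_by_port(ports):
    # map each port of interest to the PIDs currently bound to it
    hits = {}
    for conn in psutil.net_connections(kind='inet'):
        if conn.laddr and conn.laddr.port in ports and conn.pid:
            hits.setdefault(conn.laddr.port, []).append(conn.pid)
    return hits

print(find_processes_by_port({5000, 5001, 4999, 12222}))  # placeholder ports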
Can we update README.md to define which versions of the software are needed?
The root of the issue is that with some configurations the backtrader graphs show after training, while with other configurations the worker threads exit but everything stops and hangs there.
Thanks for any input!