EndingCredits / nn_q_learning_tensorflow

Neural network based Q-learning in tensorflow

How to run Atari environments? #1

Open AjayTalati opened 7 years ago

AjayTalati commented 7 years ago

Hi @EndingCredits,

this is really cool that you got the NEC working :+1:

Have you tried to run your code on the Atari environments, in Open AI gym?

I tried to train on Pong, but I got this error,

(tf_py2_tf_1_0) ajay@ajay-h8-1170uk:~/PythonProjects/nn_q_learning_tensorflow-master$ python main.py --env PongDeterministic-v3
Namespace(EWC=0.0, EWC_decay=0.999, batch_size=4, beta=0, chk_dir=None, chk_name='model', discount=0.9, display_step=2500, double_q=1, env='PongDeterministic-v3', epsilon=0.1, epsilon_anneal=500000, epsilon_final=0.1, layer_sizes=[20], learning_rate=0.001, memory_size=1000, play_from=None, reg=0, resume_from=None, target_step=1000, training_iters=500000, use_target=True)
2017-04-27 10:54:06.210986: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-04-27 10:54:06.211009: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-04-27 10:54:06.211018: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
[2017-04-27 10:54:06,211] Making new env: PongDeterministic-v3
WARNING:tensorflow:From main.py:314: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use `tf.global_variables_initializer` instead.
[2017-04-27 10:54:06,571] From main.py:314: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use `tf.global_variables_initializer` instead.
  0%|                                      | 0/500000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "main.py", line 470, in <module>
    tf.app.run()
  File "/home/ajay/anaconda3/envs/tf_py2_tf_1_0/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "main.py", line 326, in main
    act, q_ = agent.predict(state)
  File "main.py", line 101, in predict
    q = self.session.run(self.pred_q, feed_dict={self.state: [state]})
  File "/home/ajay/anaconda3/envs/tf_py2_tf_1_0/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 778, in run
    run_metadata_ptr)
  File "/home/ajay/anaconda3/envs/tf_py2_tf_1_0/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 961, in _run
    % (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (1, 210, 160, 3) for Tensor u'Placeholder:0', which has shape '(?, 210)'

I guess it might be related to TF v1.0; does this repo use an earlier version?

Thanks a lot for your help,

Aj

EndingCredits commented 7 years ago

Hi Aj, Unfortunately this code is only set up for environments where the input is a 1-dimensional vector. It isn't too hard to adapt it to image observations, although getting the input to be the last 4 frames is a bit of a pain. (Yes, I use an earlier version of TF, but that isn't the issue here, although it may cause problems elsewhere.)
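For what it's worth, a rough sketch of what an image-input version of the placeholder and network head might look like (TF 1.x, illustrative layer sizes and names, not this repo's actual code):

```python
import tensorflow as tf

num_actions = 6  # e.g. env.action_space.n for Pong

# The repo's placeholder is built for 1-D observations (hence the '(?, 210)'
# shape in the error above). For images you'd want something like 4 stacked
# 84x84 greyscale frames, as in the DQN paper:
state = tf.placeholder(tf.float32, [None, 84, 84, 4], name='state')

# DQN-style conv stack (sizes are illustrative, not the repo's):
conv1 = tf.layers.conv2d(state, 32, 8, strides=4, activation=tf.nn.relu)
conv2 = tf.layers.conv2d(conv1, 64, 4, strides=2, activation=tf.nn.relu)
conv3 = tf.layers.conv2d(conv2, 64, 3, strides=1, activation=tf.nn.relu)
flat = tf.reshape(conv3, [-1, 7 * 7 * 64])   # 84x84 input -> 7x7x64 features
hidden = tf.layers.dense(flat, 512, activation=tf.nn.relu)
pred_q = tf.layers.dense(hidden, num_actions)  # one Q-value per action
```

The predict call would then feed a (1, 84, 84, 4) array instead of a flat vector.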

Do the implementations run for you on CartPole-v0? It would be nice to know that this works on other machines.

Also, main.py runs DQN (with some extras), a2c.py runs an advantage actor-critic algorithm (using a replay memory rather than distributed rollouts), and NEC.py runs the NEC agent.

-Will

EDIT: Just remembered that a2c.py should be set up to work with Atari envs. Have a look at that if you want to adapt the others.
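In case it's useful for adapting main.py or NEC.py, here is a minimal sketch of the 4-frame stacking idea (an illustrative helper with made-up names, not what a2c.py actually does):

```python
import numpy as np
from collections import deque

class FrameStack(object):
    """Keeps the last k preprocessed frames and returns them stacked along
    the channel axis, roughly as in the DQN paper. Illustrative only; the
    repo's a2c.py may do this differently."""

    def __init__(self, k=4, shape=(84, 84)):
        self.k = k
        self.shape = shape
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        frame = self._preprocess(first_frame)
        for _ in range(self.k):
            self.frames.append(frame)
        return self._stacked()

    def step(self, frame):
        self.frames.append(self._preprocess(frame))
        return self._stacked()

    def _preprocess(self, rgb_frame):
        # Crude greyscale + subsample from (210, 160, 3) to self.shape;
        # a real implementation would resize properly (e.g. with cv2).
        grey = np.asarray(rgb_frame).mean(axis=2)
        h, w = self.shape
        return grey[::grey.shape[0] // h, ::grey.shape[1] // w][:h, :w] / 255.0

    def _stacked(self):
        # shape (84, 84, k), matching a [None, 84, 84, k] placeholder
        return np.stack(self.frames, axis=-1)
```

Usage would be roughly `state = stacker.reset(obs)` after `env.reset()` and `state = stacker.step(obs)` after each `env.step()`.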

AjayTalati commented 7 years ago

Hi Will,

thanks a lot for the help :+1: - I'm just working through your code and the paper now.

CartPole seems to work well - I haven't checked against my A3C implementations, but from memory I think it looks better:

11:28:21,  450000/500000it | avg_r: 1.000, avg_q: 9.169, avr_ep_r: 113.7, max_ep_r: 127.0, num_eps: 22, epsilon: 0.100, ewc:  0.0
11:28:28,  452500/500000it | avg_r: 1.000, avg_q: 9.258, avr_ep_r: 104.6, max_ep_r: 137.0, num_eps: 24, epsilon: 0.100, ewc:  0.0
11:28:34,  455000/500000it | avg_r: 1.000, avg_q: 9.130, avr_ep_r: 118.7, max_ep_r: 200.0, num_eps: 21, epsilon: 0.100, ewc:  0.0
11:28:40,  457500/500000it | avg_r: 1.000, avg_q: 9.561, avr_ep_r: 110.0, max_ep_r: 200.0, num_eps: 23, epsilon: 0.100, ewc:  0.0
11:28:46,  460000/500000it | avg_r: 1.000, avg_q: 9.550, avr_ep_r: 89.7, max_ep_r: 200.0, num_eps: 28, epsilon: 0.100, ewc:  0.0
11:28:53,  462500/500000it | avg_r: 1.000, avg_q: 9.625, avr_ep_r: 133.0, max_ep_r: 200.0, num_eps: 19, epsilon: 0.100, ewc:  0.0
11:28:59,  465000/500000it | avg_r: 1.000, avg_q: 9.554, avr_ep_r: 113.6, max_ep_r: 149.0, num_eps: 22, epsilon: 0.100, ewc:  0.0
11:29:05,  467500/500000it | avg_r: 1.000, avg_q: 9.576, avr_ep_r: 130.9, max_ep_r: 200.0, num_eps: 19, epsilon: 0.100, ewc:  0.0
11:29:12,  470000/500000it | avg_r: 1.000, avg_q: 9.381, avr_ep_r: 126.8, max_ep_r: 169.0, num_eps: 19, epsilon: 0.100, ewc:  0.0
11:29:18,  472500/500000it | avg_r: 1.000, avg_q: 9.605, avr_ep_r: 137.9, max_ep_r: 200.0, num_eps: 18, epsilon: 0.100, ewc:  0.0
11:29:24,  475000/500000it | avg_r: 1.000, avg_q: 9.462, avr_ep_r: 136.7, max_ep_r: 200.0, num_eps: 19, epsilon: 0.100, ewc:  0.0
11:29:31,  477500/500000it | avg_r: 1.000, avg_q: 9.304, avr_ep_r: 118.1, max_ep_r: 142.0, num_eps: 21, epsilon: 0.100, ewc:  0.0
11:29:37,  480000/500000it | avg_r: 1.000, avg_q: 9.319, avr_ep_r: 98.4, max_ep_r: 121.0, num_eps: 26, epsilon: 0.100, ewc:  0.0
11:29:43,  482500/500000it | avg_r: 1.000, avg_q: 9.044, avr_ep_r: 119.0, max_ep_r: 200.0, num_eps: 21, epsilon: 0.100, ewc:  0.0
11:29:50,  485000/500000it | avg_r: 1.000, avg_q: 9.231, avr_ep_r: 109.2, max_ep_r: 158.0, num_eps: 22, epsilon: 0.100, ewc:  0.0
11:29:56,  487500/500000it | avg_r: 1.000, avg_q: 9.153, avr_ep_r: 113.5, max_ep_r: 188.0, num_eps: 22, epsilon: 0.100, ewc:  0.0
11:30:02,  490000/500000it | avg_r: 1.000, avg_q: 9.372, avr_ep_r: 124.8, max_ep_r: 200.0, num_eps: 20, epsilon: 0.100, ewc:  0.0
11:30:08,  492500/500000it | avg_r: 1.000, avg_q: 9.031, avr_ep_r: 144.8, max_ep_r: 200.0, num_eps: 17, epsilon: 0.100, ewc:  0.0
11:30:15,  495000/500000it | avg_r: 1.000, avg_q: 9.136, avr_ep_r: 98.4, max_ep_r: 200.0, num_eps: 26, epsilon: 0.100, ewc:  0.0
11:30:21,  497500/500000it | avg_r: 1.000, avg_q: 9.100, avr_ep_r: 149.2, max_ep_r: 200.0, num_eps: 17, epsilon: 0.100, ewc:  0.0
100%|████████████████████████| 500000/500000 [20:46<00:00, 401.28it/s]

I'll try to get it working for the Atari envs too :) If you're interested, there's a fairly clean implementation in PyTorch.

Looks like a fun project :+1:

All the best - Aj

PS - I've only read the paper quickly, but it seems there's no need for the actor-critic stuff in a2c?

AjayTalati commented 7 years ago

Hi Will,

I was wondering whether you got this working for 2D pixel inputs, i.e. Atari.

If so, did you manage to get anywhere close to DM's published results? (I guess they do a lot of model searching/hyper-parameter tuning.)

All the best, Aj

EndingCredits commented 7 years ago

Hi Aj, I did get it working for an Atari setting (ALE), but I haven't managed to get any good results yet.

The code is a bit of a mess, so I will probably tidy it up before sharing. -Will

EndingCredits commented 7 years ago

Update: You can find the repo here: https://github.com/EndingCredits/Neural-Episodic-Control. The only extra thing you'll need to install is ALE, I think.
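In case anyone hits the same setup step, a quick smoke test of the standalone ALE Python bindings from that era might look like the sketch below (the ROM path is hypothetical; method names are the standard ale_python_interface ones):

```python
# Minimal smoke test for the Arcade Learning Environment Python bindings.
from ale_python_interface import ALEInterface

ale = ALEInterface()
ale.loadROM('roms/pong.bin')          # hypothetical path to an Atari ROM

actions = ale.getMinimalActionSet()   # legal actions for this game
print('Minimal action set:', actions)

ale.reset_game()
reward = ale.act(actions[0])          # take one action, get the reward
print('Screen shape:', ale.getScreenRGB().shape, 'game over:', ale.game_over())
```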

AjayTalati commented 7 years ago

Great, thanks very much for your work on it :)

I guess if it doesn't perform at SOTA on Atari (or you can't tune it as well as DM), you'll find some environments where it is strong - you know the Wolpert and Macready NFL theorem:

> We have dubbed the associated results NFL theorems because they demonstrate that if an algorithm performs well on a certain class of problems then it necessarily pays for that with degraded performance on the set of all remaining problems.[1]