NVlabs / GA3C

Hybrid CPU/GPU implementation of the A3C algorithm for deep reinforcement learning.
BSD 3-Clause "New" or "Revised" License

Trying to compare this to universe-starter-agent (A3C) #22

Closed nczempin closed 7 years ago

nczempin commented 7 years ago

Setting up openai/universe, I used the "universe starter agent" as a smoke test.

After adjusting the number of workers to better utilize my CPU, I saw the default PongDeterministic-v3 start winning after about 45 minutes.

Then I wanted to try GA3C on the same machine; given that you quote results of 6x or better, I expected it to perform at least as well as that.

However, it turns out that with GA3C the agent only starts winning after roughly 90 minutes.

I'm assuming that either my first (few) run(s) on the starter agent were just lucky, or that my runs on GA3C were unlucky. Also I assume that the starter agent has other changes from the A3C that you compared GA3C against, at least in parameters, possibly in algorithm.

So, what can I (an experienced software engineer but with no background in ML), do to make the two methods more comparable on my machine? Is it just a matter of tweaking a few parameters? Is Pong not a good choice to make the comparison?

I have an i7-3930k, a GTX 1060 (6 GB) and 32 GB of RAM.

ashern commented 7 years ago

The OpenAI agent uses an LSTM policy network & GAE for the loss function.

This repo has a far simpler implementation of A3C, using a vanilla feed-forward network for the policy, and I'm pretty sure a less recent loss function (though I haven't confirmed that last point recently).

While I personally had high hopes that this implementation would be useful for speeding things up, I've recently gone back to working with the OpenAI framework for my testing. I think some people have been working to get the LSTM policy working w/ GPU based A3C, but I haven't seen any working code that improves on the OpenAI type model....

I'd love to be corrected if I'm incorrect on any of the above.

nczempin commented 7 years ago

ok, that explains it.

Is "get LSTM policy working with GA3C" an open research problem or merely a matter of implementation details?

nczempin commented 7 years ago

And does Pong happen to be particularly sensitive to LSTM or would it be no different in the other Atari games?

swtyree commented 7 years ago

I did a few tests with the universe starter agent when it was just released. Based on that limited experience, it seemed that the setup was a bit overfit to Pong--performance was reasonable for other games, but exceptionally fast for Pong. But as the previous commenter mentioned, it also uses an LSTM and GAE, which are helpful in some cases. If you run more extensive tests, I'd be curious to know how it performs on a wider suite of games.

nczempin commented 7 years ago

Okay; any ones in particular I should try?


nczempin commented 7 years ago

I did run it on CoasterRacer, and that "never" (for an impatient layperson) seemed to get anywhere; the difference there, compared to another racing game I briefly tried (Dusk Racer), is that it takes a significant amount of effort to ever even get a single reward.


ashern commented 7 years ago

The appendix of the original A3C paper has a ton of comparisons across different games & models, which should help you avoid some testing.

LSTM A3C is widely implemented open-source - a quick search should turn up a few options. The Universe & Miyosuda implementations seem to be the most commonly used.

nczempin commented 7 years ago

The appendix of the original A3C paper has a ton of comparisons across different games & models, which should help you avoid some testing.

Not sure what this refers to; are you saying I could have avoided wasting time on CoasterRacer by being more aware of the comparisons? My goal was just to "play around with openai universe" rather than get deep into testing. If anything, I'd be interested in adding an environment such as MAME or one of the other emulators, which is more obviously an engineering task.

LSTM A3C is widely implemented open-source - a quick search should turn up a few options. The Universe & Miyosuda implementations seem to be the most commonly used.

Is this a response to my question about GA3C with LSTM? If so, the implicit assumption is that there are no fundamental issues that would complicate such an endeavour, which could be confirmed by looking at those A3C implementations. Is that what you're saying? My understanding from the GA3C paper is that they consider GA3C a general approach, and that A3C just happened to perform best, so adding LSTM should not be a big deal.

nczempin commented 7 years ago

also, what would be a better venue to have discussions such as this one? Don't really want to clutter up the project issues.

ashern commented 7 years ago

I simply meant - there exists a readily available corpus of tests conducted by professional researchers. Use it as you wish.

Implementing LSTM policy is simply an engineering issue, albeit a moderately difficult one in this case. Have at it - and please publish if you get good results! There are other issues open in this repo, I believe, where there are already discussions around LSTM/GPU.

4SkyNet commented 7 years ago

@nczempin you should add GAE, because it's the most crucial part and easy to implement. LSTM doesn't really affect your results that much (it can help a bit with the dynamics of the game, but it's more policy oriented). PS> see some results from the vanilla article (last page)

nczempin commented 7 years ago

I simply meant - there exists a readily available corpus of tests conducted by professional researchers. Use it as you wish.

Well, the "which ones should I try" was really offering my "services" to @swtyree: in case I make some more comparisons with my setup anyway, it doesn't make a big difference to me which other ROMs I try, so if someone does have a preference, I might as well support that.

Implementing LSTM policy is simply an engineering issue, albeit a moderately difficult one in this case. Have at it - and please publish if you get good results!

"Publish" sounds intimidating to me, but if I do get anything off the ground, I promise to put the source up on github; perhaps fork and PR here. I probably have to brush up my background in this area a little first (and I definitely have some things I'd like to do first, as mentioned before), so don't hold your breath.

There are other issues open in this repo, I believe, where there are already discussions around LSTM/GPU.

I saw an issue on the universe starter agent, asking about GPU. It doesn't seem to have gone anywhere.

mbz commented 7 years ago

Please check out the pull requests section. GAE has already been implemented by @etienne87 in this pull request. He also implemented a specific pre-processing and configuration which provides a better comparison with the starter agent.

nczempin commented 7 years ago

@nczempin you should add GAE, because it's the most crucial part and easy to implement. LSTM doesn't really affect your results that much (it can help a bit with the dynamics of the game). PS> see some results from the vanilla article (last page)

GAE being? All I get is Google App Engine, and I don't find a reference to the term in the A3C paper.

Edit: Generalized Advantage Estimation.

Please check out the pull requests section. GAE has been already implemented by @etienne87 in this pull request. He also implemented an specific pre-processing and configuration which provides a better comparison with starter agent.

I'll have a look at that. Should I use a different game from the purportedly overfitted Pong, or would it be fine? I guess we'd know the answer when/if I try...

mbz commented 7 years ago

GAE stands for Generalized Advantage Estimation. It's always a good idea to start with Pong (since it's usually the fastest to converge); as long as you avoid Pong-specific logic, things should generalize to other games as well.
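
For readers new to the term: GAE computes the advantage as an exponentially weighted sum of one-step TD residuals, controlled by lambda (called tau later in this thread). A minimal NumPy sketch, not GA3C's actual code; the function name and defaults are illustrative:

```python
import numpy as np

def gae(rewards, values, bootstrap, gamma=0.99, lam=0.97):
    """Generalized Advantage Estimation (Schulman et al., 2015):
    an exponentially weighted sum of one-step TD residuals."""
    v = np.append(values, bootstrap)            # V(s_0..s_T); bootstrap = V(s_T)
    deltas = rewards + gamma * v[1:] - v[:-1]   # TD residuals
    adv = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(rewards))):     # accumulate backwards in time
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

# toy 3-step rollout
r = np.array([0.0, 0.0, 1.0])
v = np.array([0.5, 0.6, 0.7])
print(gae(r, v, bootstrap=0.0))
```

With lam=1 this reduces to the plain discounted-return advantage; smaller lam trades bias for variance.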

nczempin commented 7 years ago

Okay, I checked out the PR, but it breaks the dependencies on the vanilla openai-universe.

I'm willing to give it a whirl more or less as-is once the PR is in a usable state.

4SkyNet commented 7 years ago

If you look at some results from the original paper (https://arxiv.org/pdf/1602.01783.pdf#19), there are some good environments, such as Amidar, Berzerk and Krull, that converge faster. But DeepMind trained all of these games with the same parameters; the gamma (discount factor) could instead be chosen for each environment individually to get better results.

nczempin commented 7 years ago

So am I right in assessing that my issue #22 essentially boils down to issue #3?

nczempin commented 7 years ago

or should I rename it to something that specifically references GAE?

nczempin commented 7 years ago

Ah, I see what you mean: with those games LSTM didn't actually help that much (although the same holds for Pong). GAE doesn't seem to be isolated in the table; I guess I'll have to read the paper a bit more.

In the meantime I'll give Amidar a whirl. You seem to have picked the bold ones in the "A3C FF, 1 day" column, would it also make sense to try Seaquest, if FF 1 day vs. LSTM is what we're looking at?

On 24 March 2017 at 17:04, Dennis Korotyaev notifications@github.com wrote:


4SkyNet commented 7 years ago

@nczempin you can try Seaquest, Boxing or other games where the approach is more policy oriented than value oriented (vs. Breakout). PS> I prefer Boxing because it's simple enough, but it takes a lot longer to see some distinction from random play (8mil steps for me, for an almost vanilla A3C) than Breakout, for example

nczempin commented 7 years ago

So after around 24000 seconds (400 minutes, ~6.67 hours), here's what I get with GA3C on my 3930K, 32 GB and GTX 1060 (6 GB):

[Time: 23999] [Episode: 20379 Score: 350.0000] [RScore: 268.7030 RPPS: 966] [PPS: 1258 TPS: 210] [NT: 16 NP: 2 NA: 9]

[Time: 23999] [Episode: 20380 Score: 317.0000] [RScore: 268.7230 RPPS: 966] [PPS: 1258 TPS: 210] [NT: 16 NP: 2 NA: 9]

[Time: 24001] [Episode: 20381 Score: 355.0000] [RScore: 268.8350 RPPS: 966] [PPS: 1258 TPS: 210] [NT: 16 NP: 2 NA: 9]

[Time: 24004] [Episode: 20382 Score: 295.0000] [RScore: 268.8870 RPPS: 966] [PPS: 1258 TPS: 210] [NT: 16 NP: 2 NA: 9]

[Time: 24008] [Episode: 20383 Score: 247.0000] [RScore: 268.8910 RPPS: 965] [PPS: 1258 TPS: 210] [NT: 16 NP: 3 NA: 8]

It seemed to make progress right from the start, unlike with Pong, where both algorithms seemed clueless for a while and then "suddenly got it" and no longer lost, followed by a very long period of very slow growth in the average score (the points it conceded always seemed to be the very first few; once it had won a single point it seemed to go into very similar states).

GA3C on Amidar seems to be stuck just under 270; I will now see what I get on the same machine with universe-starter-agent.

ifrosio commented 7 years ago

Based on the latest version of our paper, we get more stable and faster convergence for TRAINING_MIN_BATCH_SIZE = 20 ... 40 in Config.py. If you haven't done it yet, you can try this.
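
For context, GA3C's trainer threads (roughly speaking) pull experiences off a shared queue before each GPU update, so a larger minimum batch amortizes each update over more samples. A hypothetical simplification: TRAINING_MIN_BATCH_SIZE is the real Config.py setting, but the queue layout and function name here are illustrative, not GA3C's actual code:

```python
import queue

TRAINING_MIN_BATCH_SIZE = 20  # the paper's suggested range is 20..40

def next_training_batch(training_q):
    """Accumulate experiences from several agents until the minimum batch
    size is reached, so each GPU update is amortized over more samples."""
    batch = []
    while len(batch) < TRAINING_MIN_BATCH_SIZE:
        batch.extend(training_q.get())  # one item = one agent's partial rollout
    return batch

# toy usage: three 8-step rollouts are needed to clear the threshold
q = queue.Queue()
for _ in range(5):
    q.put([("state", 0, 0.0)] * 8)
print(len(next_training_batch(q)))  # 24
```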

nczempin commented 7 years ago

Based on the latest version of our paper, we get more stable and faster convergence for TRAINING_MIN_BATCH_SIZE = 20 ... 40 in Config.py. If you haven't done it yet, you can try this.

On Pong again or on any of the other ones I'll try?

4SkyNet commented 7 years ago

@nczempin DeepMind reaches almost 284 within 1 day (80 million frames). Your result isn't so bad, given that DeepMind selects the 5 best runs out of 50 and averages them. You may also run into saturation or exploration problems after some time. If you use RMSProp as the optimizer, you can anneal the learning rate a bit more slowly. PS> and, as you can see, DeepMind also has some instability in training. It seems that Hogwild can cause such issues, but they also occur with more synchronous approaches.

ifrosio commented 7 years ago

The improvement with TRAINING_MIN_BATCH_SIZE should be observed for all games (although we tested only a few of them).

nczempin commented 7 years ago

Here's the situation with (the universe starter agent) python3 train.py --num-workers 6 --env-id Amidar-v0 --log-dir /tmp/amidar after roughly 8 hours:

[image: training progress chart]

nczempin commented 7 years ago

I picked 6 workers because that's how many cores my CPU has, but perhaps up to 12 could have helped, given Hyperthreading etc. But naive "analysis" suggests that GA3C still wins in this particular case, because it gets more than double the score.

It would be interesting to know how much the speedup is due to using the CPU cores more efficiently because of the dynamic load balancing vs. including the GPU.

Even just getting a dynamic number of threads, without any specific GPU improvements, is a big convenience over having to pick them yourself statically.
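
On picking the worker count: a common starting heuristic (not specific to either repo) is the logical core count. A small sketch, purely illustrative:

```python
import multiprocessing

# A3C workers are mostly CPU-bound processes; the logical core count
# (12 on a hyperthreaded 6-core i7-3930K) is a reasonable starting point
# for --num-workers, tuned down to leave headroom for the driver process.
logical_cores = multiprocessing.cpu_count()
num_workers = max(1, logical_cores - 1)
print(num_workers)
```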

nczempin commented 7 years ago

GA3C PongDeterministic-v3 again with TRAINING_MIN_BATCH_SIZE=20:

[Time: 2702] [Episode: 3682 Score: -21.0000] [RScore: -20.2860 RPPS: 1513] [PPS: 1513 TPS: 51] [NT: 5 NP: 4 NA: 26]
...
[Time: 8993] [Episode: 7191 Score: -7.0000] [RScore: -14.3570 RPPS: 1514] [PPS: 1286 TPS: 44] [NT: 6 NP: 3 NA: 43]

nczempin commented 7 years ago

GA3C PongDeterministic-v3 again with TRAINING_MIN_BATCH_SIZE=40:

[Time: 2701] [Episode: 3988 Score: -21.0000] [RScore: -20.1950 RPPS: 1663] [PPS: 1637 TPS: 31] [NT: 7 NP: 3 NA: 44]

[Time: 5402] [Episode: 6053 Score: -13.0000] [RScore: -17.0820 RPPS: 1628] [PPS: 1512 TPS: 29] [NT: 11 NP: 2 NA: 35]

[Time: 8996] [Episode: 7551 Score: -10.0000] [RScore: -13.0080 RPPS: 1609] [PPS: 1494 TPS: 28] [NT: 15 NP: 4 NA: 32]

nczempin commented 7 years ago

GA3C PongDeterministic-v3 again with TRAINING_MIN_BATCH_SIZE=40, with GAE changes from https://github.com/NVlabs/GA3C/pull/18:

[Time: 2701] [Episode: 3118 Score: -12.0000] [RScore: -16.1090 RPPS: 1939] [PPS: 1915 TPS: 36] [NT: 6 NP: 1 NA: 41]

Still not reaching that "starting to win after 45 minutes" I get with universe-starter-agent.

[Time: 4759] [Episode: 3968 Score: 8.0000] [RScore: -8.3390 RPPS: 1966] [PPS: 1939 TPS: 37] [NT: 9 NP: 1 NA: 45]
...
[Time: 5192] [Episode: 4251 Score: 19.0000] [RScore: 0.0220 RPPS: 2009] [PPS: 1950 TPS: 37] [NT: 7 NP: 3 NA: 46]
...
[Time: 5401] [Episode: 4405 Score: 17.0000] [RScore: 4.4630 RPPS: 2018] [PPS: 1955 TPS: 37] [NT: 9 NP: 3 NA: 46]

nczempin commented 7 years ago

GA3C Amidar-v0 with TRAINING_MIN_BATCH_SIZE=40, with GAE changes from https://github.com/NVlabs/GA3C/pull/18:

[Time: 10374] [Episode: 13975 Score: 296.0000] [RScore: 219.0260 RPPS: 1919] [PPS: 1882 TPS: 35] [NT: 16 NP: 8 NA: 55]
...
[Time: 10801] [Episode: 14456 Score: 2.0000] [RScore: 154.3180 RPPS: 1713] [PPS: 1867 TPS: 35] [NT: 14 NP: 5 NA: 52] (trying something new? there was a string of 2.00 scores)

nczempin commented 7 years ago

BTW I should probably compile my own Tensorflow, not sure how much effect it will have though:

W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
nczempin commented 7 years ago

Hm, there seems to be something wrong with _play.sh. I decided I wanted to pause the training and have a look at an agent playing, perhaps seeing why it only got 2 points all of a sudden.

Naively, I thought I could just let one agent run in parallel to the training; it should not affect the big picture overall.

But I got an error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [256,6] rhs shape= [256,10]
     [[Node: save/Assign_22 = Assign[T=DT_FLOAT, _class=["loc:@logits_p/w"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](logits_p/w/RMSProp_1, save/RestoreV2_22/_11)]]

Caused by op u'save/Assign_22', defined at:
  File "GA3C.py", line 59, in <module>
    Server().main()
  File "/home/nczempin/git/ml/GA3C/ga3c/Server.py", line 48, in __init__
    self.model = NetworkVP(Config.DEVICE, Config.NETWORK_NAME, Environment().get_num_actions())
  File "/home/nczempin/git/ml/GA3C/ga3c/NetworkVP.py", line 65, in __init__
    self.saver = tf.train.Saver({var.name: var for var in vars}, max_to_keep=0)
  File "/home/nczempin/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1040, in __init__
    self.build()
  File "/home/nczempin/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1070, in build
    restore_sequentially=self._restore_sequentially)
  File "/home/nczempin/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 675, in build
    restore_sequentially, reshape)
  File "/home/nczempin/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 414, in _AddRestoreOps
    assign_ops.append(saveable.restore(tensors, shapes))
  File "/home/nczempin/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 155, in restore
    self.op.get_shape().is_fully_defined())
  File "/home/nczempin/.local/lib/python2.7/site-packages/tensorflow/python/ops/gen_state_ops.py", line 47, in assign
    use_locking=use_locking, name=name)
  File "/home/nczempin/.local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
    op_def=op_def)
  File "/home/nczempin/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2327, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/nczempin/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1226, in __init__
    self._traceback = _extract_stack()
InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [256,6] rhs shape= [256,10]
     [[Node: save/Assign_22 = Assign[T=DT_FLOAT, _class=["loc:@logits_p/w"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](logits_p/w/RMSProp_1, save/RestoreV2_22/_11)]]

So I thought that perhaps you're not meant to run _train.sh and _play.sh concurrently; thinking that _train.sh would just pick up from a checkpoint, I stopped it and then tried running _play.sh.

Turns out I got the same error.

So I thought, perhaps it's because of the GAE changes, so I checked out the master branch and tried again. Same result.

So right now I'm not sure what's going on; there may be something in the GAE changes that modifies the written data so that there is a problem when reading it back?

Then I tried continuing the _train.sh process, and was slightly surprised that the time started back at 0, not at the point where I had left it.

Right now it's a little hard for me to tell if there was an inadvertent _clean.sh thrown in somewhere, or if this is expected behaviour and I just need to add the time at which I stopped to the new time value, or if this is an error caused by the GAE changes.

Edit: I notice that python 2.7 is mentioned in the stack trace. I try to run everything with python3.5, but my default Ubuntu setup seems to link /usr/bin/python to python2.7; when I change it, some other programs no longer work.
Edit 2: Scratch that; making sure to run it with 3.5, I get the equivalent error message, just with references to 3.5.
Edit 3: And the time value is nothing to worry about; ProcessStats.py doesn't measure time relative to the start of the training, but relative to the start of the "session".
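
A note on the error itself: the shapes [256,6] vs [256,10] in logits_p/w suggest an action-count mismatch, i.e. the checkpoint came from a 10-action game (Amidar) while the freshly built network expects 6 actions (Pong, presumably the default ATARI_GAME), which would happen if _play.sh was run with a different game setting than training. A toy NumPy reproduction of the shape check; this is a stand-in for TensorFlow's restore, not actual TF code:

```python
import numpy as np

# Checkpoint written while training a 10-action game (e.g. Amidar)...
saved_logits_w = np.zeros((256, 10))
# ...restored into a freshly built network for a 6-action game (e.g. Pong).
model_logits_w = np.zeros((256, 6))

def assign(dst, src):
    """Stand-in for TensorFlow's Assign op: shapes must match exactly."""
    if dst.shape != src.shape:
        raise ValueError("Assign requires shapes of both tensors to match. "
                         "lhs shape= %s rhs shape= %s"
                         % (list(dst.shape), list(src.shape)))
    dst[...] = src

try:
    assign(model_logits_w, saved_logits_w)
except ValueError as e:
    print(e)  # the same complaint as the restore error above
```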

nczempin commented 7 years ago

Okay, I'll stop that Amidar run for now, because I need the machine for other things. There was some wonky stuff going on; I'll save the results.txt (what's left of it after the restart) and the checkpoints/ directory, just in case anyone wants to have a look.

[Time: 1327] [Episode: 18465 Score: 2.0000] [RScore: 10.8540 RPPS: 1866] [PPS: 1857 TPS: 35] [NT: 8 NP: 5 NA: 33]

4SkyNet commented 7 years ago

If you use gym for your experiments, try to run everything in Deterministic mode: Amidar-v0 should be AmidarDeterministic-v3, or wrap v0 manually to make it deterministic with constant frame skipping. PS> DeepMind's 1 day equals 80 million frames, and they use 16 parallel workers.

nczempin commented 7 years ago

If you use gym for your experiments, try to run everything in Deterministic mode: Amidar-v0 should be AmidarDeterministic-v3, or wrap v0 manually to make it deterministic with constant frame skipping. PS> DeepMind's 1 day equals 80 million frames, and they use 16 parallel workers.

Oh, okay, I'll do that. Would that explain the sudden meltdown?

4SkyNet commented 7 years ago

@nczempin no, mostly not --> but it can affect the learning process in time and quality >> gym's stochastic v0 environments can take your agent from the same state to different ones, skipping between 2 and 5 frames at random: https://github.com/openai/gym/blob/master/gym/envs/atari/atari_env.py#L80 You can also control it manually, something like this (or use v3):

import gym
from gym.wrappers.frame_skipping import SkipWrapper

frame_skip = 4  # repeat each action for a constant number of frames
self.gym = gym.make(env)
if frame_skip is not None:
    skip_wrapper = SkipWrapper(frame_skip)
    self.gym = skip_wrapper(self.gym)
nczempin commented 7 years ago

Hm.

with ./_clean.sh; ./_train.sh ATARI_GAME='BoxingDeterministic-v3' TRAINING_MIN_BATCH_SIZE=40 with GAE I seem to get similarly strange behaviour; I'm guessing it's a bug in the GAE code (I will let the non-GAE version run overnight for comparison):

[Time: 11415] [Episode: 12890 Score: 4.0000] [RScore: 3.2790 RPPS: 1912] [PPS: 1909 TPS: 36] [NT: 25 NP: 2 NA: 31]

Here's the "high water mark" of RScore (average score over the default 1000 episodes): [Time: 7652] [Episode: 9265 Score: 68.0000] [RScore: 60.2050 RPPS: 1905] [PPS: 1910 TPS: 36] [NT: 22 NP: 4 NA: 34]

A wild guess would be that an integer overflows somewhere.

nczempin commented 7 years ago

./_clean.sh;./_train.sh ATARI_GAME='BoxingDeterministic-v3' TRAINING_MIN_BATCH_SIZE=40 without the changes from the GAE PR

[Time: 3601] [Episode: 2634 Score: 5.0000] [RScore: 1.6630 RPPS: 1560] [PPS: 1565 TPS: 30] [NT: 6 NP: 2 NA: 35]
[Time: 5885] [Episode: 5200 Score: 81.0000] [RScore: 68.2340 RPPS: 1583] [PPS: 1579 TPS: 30] [NT: 8 NP: 3 NA: 36]
[Time: 7200] [Episode: 8165 Score: 88.0000] [RScore: 88.1870 RPPS: 1575] [PPS: 1579 TPS: 30] [NT: 7 NP: 3 NA: 34]
[Time: 8045] [Episode: 10298 Score: 100.0000] [RScore: 92.0040 RPPS: 1564] [PPS: 1577 TPS: 30] [NT: 7 NP: 3 NA: 34]
[Time: 24041] [Episode: 55034 Score: 96.0000] [RScore: 99.0730 RPPS: 649] [PPS: 1505 TPS: 28] [NT: 8 NP: 5 NA: 20]

4SkyNet commented 7 years ago

Thx for the experiments! I also noticed some meltdowns through the training process with Atari Boxing, for example:

[image: A3C-FF without GAE (boxing-8th-35mil)]
[image: A3C-LSTM1 without GAE (da3c_cur-lstm_8ag_gym_boxing)]
[image: A3C-LSTM2 without GAE (da3c_tf-lstm_8ag_gym_boxing)]

nczempin commented 7 years ago

Thx for the experiments! I also notice some meltdowns through the training process with Atari Boxing

Hm. That would indicate that I should leave it running for longer and not look at the score.

nczempin commented 7 years ago

And perhaps that I need to clarify my intuition about what the score represents.

So just using your (@4SkyNet) bottom diagram "A3C-LSTM2_without-GAE", my understanding was that at around 20 M, we would have had a network that would have given us 80 points on average, within some variation (but none that would ever, within probabilistic confidence, cause us to go below 40 or so points).

And between 20 and 22.5, we are exploring and not finding improvement, but we could always go back to what we had at 20 M.

And it does not mean that we have found a configuration that turns out after more exploration to only be worth 20 points on average (which is sorta what it feels like when the reported scores start going down).

Is my intuition reasonable?

[and I think by network and configuration above I mean policy]

4SkyNet commented 7 years ago

@nczempin I think these meltdowns can be caused by some generalization issues in the agent wrt exploration, and to my mind they are largely environment dependent --> I don't see such behavior for Breakout compared to Boxing, since the former has more stable environment dynamics.

For example (as I see from my visual output): the Boxing agent starts to beat the opponent and put some pressure on it. It has to apply more pressure over time to get more rewards. And (suddenly) if the agent falls a little behind the opponent, the game unfolds in the opposite direction: the agent trained to box from left to right, but now it has to box from right to left.

PS> wrt the original results, DeepMind also has somewhat worse results from 4-day training compared to 1-day (FF).

nczempin commented 7 years ago

PS> wrt the original results, DeepMind also has somewhat worse results from 4-day training compared to 1-day (FF).

That is confusing me; how can the results be worse after 4 days than after 1 day? Or are you saying this is with different algorithms?

Presumably you can always go back to a previous policy that was better, or is it only that we "thought" it was better and now it turns out that it wasn't?

nczempin commented 7 years ago

Universe starter agent with python3 train.py --num-workers 6 --env-id BoxingDeterministic-v3 --log-dir /tmp/boxingd3

[image: training progress chart]

etienne87 commented 7 years ago

@nczempin @4SkyNet have you implemented the version with LSTM in TF? I am currently trying in pytorch, but I had to add the c, h states to the queues of experiences. Also note that for GAE, I left the parameter self.tau at 1, which is perhaps not the best choice and in theory should not change the performance.
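
For anyone attempting the same, carrying the recurrent state through the experience queue might look like this; a hedged sketch with made-up field names, not etienne87's actual PyTorch code:

```python
from collections import namedtuple
from queue import Queue

# Each experience carries the LSTM state (c, h) that was current when the
# observation was seen, so the trainer can rerun the forward pass from the
# correct recurrent state instead of from zeros.
Experience = namedtuple("Experience", "state action reward lstm_c lstm_h")

q = Queue()
c = h = [0.0] * 256  # toy-sized initial recurrent state
for t in range(3):
    q.put(Experience(state="obs%d" % t, action=0, reward=0.0, lstm_c=c, lstm_h=h))
    # ...a real agent would step its LSTM here and update (c, h)...

first = q.get()
print(first.state)  # obs0
```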

nczempin commented 7 years ago

@nczempin have you implemented the version with LSTM in TF? I am currently trying in pytorch, but I had to add the c, h states to the queues of experiences. Also note that for GAE, I left the parameter self.tau at 1, which is perhaps not the best choice and in theory should not change the performance.

@etienne87 I haven't implemented anything; I just used the GA3C from the head here, from your PR plus the 2 changes, and I'm comparing them to https://github.com/openai/universe-starter-agent (which supposedly has GAE, LSTM, but no GA3C) with the same environments on the same machine (and wondering whether there are any other parameters I should tweak to get the comparisons to be fairer).

It also turns out that the fact that I did not see the improvement of GA3C is most likely due to the universe starter agent being tuned towards Pong (which they state in the Readme).

Maybe I misunderstood the question?

I was scared for a moment that there may be a bug in your GAE, but @4SkyNet clarified that the observations are more likely to be independent of your GAE changes.

4SkyNet commented 7 years ago

That is confusing me; how can the results be worse after 4 days than after 1 day? Or are you saying this is with different algorithms?

No, the algorithms are the same (A3C-FF 1-day & A3C-FF 4-day), but the results can be worse after 4 days:

[image: table with results]

It seems like we reach the highest result (perhaps the algorithm's limit; see the bold entries) and then diverge a bit. For Pong we see the following (wrt this table): A3C-FF 1-day: 11.4, A3C-FF 4-day: 5.6

Presumably you can always go back to a previous policy that was better, or is it only that we "thought" it was better and now it turns out that it wasn't?

It's hard to be strict on this point. Sometimes the new policy really is better than the old one (the results mentioned above); sometimes we just "thought" it was better (the Boxing meltdown example).

@4SkyNet have you implemented the version with LSTM in TF? I am currently trying in pytorch but I had to add the c, h states in queues of experiences.

@etienne87 unfortunately not. The LSTMs I show are from another version of A3C (similar to vanilla, but a bit more synchronous and distributed). I haven't done anything with GA3C yet, but it would be good to have more versions with LSTM / GAE / Exploration Bonus / etc. (maybe I'll turn to it later, I don't really know...)

Also note that for GAE, i let the parameter self.tau to 1

Thx for pointing it out! @nczempin what tau (lambda) do you use in the starter agent? PS> I usually set it to 0.97, for TRPO for example
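
On the tau question: with tau (GAE's lambda) equal to 1, GAE reduces exactly to the plain discounted-return advantage used in vanilla A3C, which is presumably why it "should not change the performance" in expectation (the variance does change). A self-contained numerical check; function names are illustrative:

```python
import numpy as np

def gae_advantages(rewards, values, bootstrap, gamma, tau):
    """GAE: discounted, tau-weighted sum of TD residuals."""
    v = np.append(values, bootstrap)
    deltas = rewards + gamma * v[1:] - v[:-1]
    adv, running = np.zeros_like(deltas), 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * tau * running
        adv[t] = running
    return adv

def plain_advantages(rewards, values, bootstrap, gamma):
    """Discounted return minus baseline, as in vanilla A3C."""
    g, returns = bootstrap, np.zeros_like(rewards)
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns - values

r, v = np.array([0.0, 1.0, 0.5]), np.array([0.2, 0.4, 0.1])
print(np.allclose(gae_advantages(r, v, 0.3, 0.99, tau=1.0),
                  plain_advantages(r, v, 0.3, 0.99)))  # True: tau=1 is the vanilla advantage
```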

nczempin commented 7 years ago

@nczempin what tau do you use in starter agent? PS> I usually set it to 0.97 for TRPO for example

I use the defaults for anything I don't specifically mention.