NVlabs / GA3C

Hybrid CPU/GPU implementation of the A3C algorithm for deep reinforcement learning.
BSD 3-Clause "New" or "Revised" License

test with Pong-v0 : not converging ? #7

Closed etienne87 closed 7 years ago

etienne87 commented 7 years ago

I am running a training right now. I replaced PongDeterministic-v0 with Pong-v0 (the former does not seem to exist in my install); other than that, everything is the same in Config.py and the other files.

After 2 hours :

[Time: 7795] [Episode: 7923 Score: -20.0000] [RScore: -20.2960 RPPS: 1488] [PPS: 1499 TPS: 251] [NT: 5 NP: 4 NA: 28]

[training-curve screenshot: pong-v0]

Am I missing something here? Do I need to modify Config.py?


EDIT: I needed to update gym; then I retried with PongDeterministic-v0.

Here it is with learning rate = 1e-3 and the game PongDeterministic-v0:

[Time: 4177] [Episode: 4397 Score: -9.0000] [RScore: -10.5960 RPPS: 1645] [PPS: 1646 TPS: 278] [NT: 4 NP: 4 NA: 33]

Any idea what the difference between the two games is?

[training-curve screenshot: pongdeterministic-v0]

etienne87 commented 7 years ago

I was looking into why #7 is not converging and stumbled upon NetworkVP.py.

EDIT: This whole post is wrong; see the next one for clarification. The code below is not the source of the performance difference for Pong-v0.

self.softmax_p = (tf.nn.softmax(self.logits_p) + Config.MIN_POLICY) / (1.0 + Config.MIN_POLICY * self.num_actions)
self.selected_action_prob = tf.reduce_sum(self.softmax_p * self.action_index, reduction_indices=1)

This is smart: it takes a weighted average over the indices. But that seems really deterministic. Why not use a [Boltzmann approach](https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-7-action-selection-strategies-for-exploration-d3a97b7cceaf#.6ccavhg8l) instead?

I tried it for myself in a numpy script to compare Boltzmann sampling and soft-argmax:

import numpy as np

def softmax(x):
    # subtract the row max for numerical stability
    x = x - np.max(x, axis=1, keepdims=True)
    elogits = np.exp(x)
    return elogits / elogits.sum(axis=1, keepdims=True)

def softmax_minpolicy(x, min_policy=0.2):
    # same clipping as NetworkVP.py: mix the softmax with a uniform floor
    nout = x.shape[1]
    return (softmax(x) + min_policy) / (1.0 + min_policy * nout)

def softargmax(x):
    # deterministic: expected action index under the softmax distribution
    nout = x.shape[1]
    return np.sum(softmax(x) * np.arange(nout)[np.newaxis, :], axis=1)[0]

def boltzmann_sampling(x, tmp=2.0):
    # stochastic: sample an action index from the temperature-scaled softmax
    sm = softmax(x * tmp)
    return np.random.choice(sm.shape[1], p=sm[0])

nout = 10
distrib = np.random.rand(1, nout)

# two clearly dominant actions
distrib[0, int(0.3 * nout)] = 10
distrib[0, int(0.7 * nout)] = 10

softmax1 = softmax(distrib)
s1 = softargmax(distrib)             # always the same value for a given distrib
b1 = boltzmann_sampling(distrib, 1)  # varies from call to call
print(softmax1, s1, b1)

Could it be related to this? With the soft-argmax function the sampled action will always be the same, which seems problematic for exploration. [image: action selection strategies]

mbz commented 7 years ago

Thanks for the analysis. There are two points in your post that shouldn't be conflated:

  1. Pong vs. PongDeterministic: By default, OpenAI Gym repeats the same action 2, 3, or 4 times. The Deterministic variant changes this to a constant 4, which is consistent with the original DQN, A3C, and GA3C papers. Almost all of our experiments are on the Deterministic version, but I don't see any reason why GA3C should not work on the default version. It should fluctuate more, which may result in divergence, but at the end of the day it should work (see the frameskip sketch after this list). A side note: it's hard to compare performance based on only one run. Sadly, initialization plays a big role and every run is different.

  2. Boltzmann vs. Argmax: Note that we use Boltzmann sampling while training and argmax while playing. The code you are referring to provides a placeholder for the selected action, which gets fed into TF. Look here for more details.
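
A minimal sketch of how to check point 1 yourself, assuming Gym's Atari envs expose a frameskip attribute on env.unwrapped (true for the atari_py-based envs; verify against your Gym version):

import gym

for env_id in ['Pong-v0', 'PongDeterministic-v0']:
    env = gym.make(env_id)
    # Pong-v0 typically reports a (2, 5) range, i.e. each action is repeated
    # for a random 2-4 frames; PongDeterministic-v0 reports a fixed skip of 4.
    print(env_id, env.unwrapped.frameskip)
    env.close()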

etienne87 commented 7 years ago

Thanks a lot @mbz for the clarification! So self.action_index, the sampled action, is actually in one-hot format; no soft-argmax is involved. Sorry for the confusion.
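
To make that concrete, here is a tiny numpy check (with made-up numbers) of what the two quoted NetworkVP.py lines compute when action_index is one-hot, plus a sketch of the train/play distinction @mbz describes; this is illustrative, not the exact GA3C code:

import numpy as np

softmax_p = np.array([[0.1, 0.7, 0.2]])      # policy over 3 actions
action_index = np.array([[0.0, 1.0, 0.0]])   # one-hot for action 1

# reduce_sum(softmax_p * action_index) just picks out the probability of the
# selected action; no weighted average over indices is involved.
selected_action_prob = np.sum(softmax_p * action_index, axis=1)
print(selected_action_prob)  # [0.7]

def select_action(policy, play_mode=False):
    # sample from the policy while training, take the argmax while playing
    if play_mode:
        return int(np.argmax(policy))
    return int(np.random.choice(len(policy), p=policy))

print(select_action(softmax_p[0]))                  # stochastic (training)
print(select_action(softmax_p[0], play_mode=True))  # deterministic (playing)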

Anyway, my post was not rigorous; I haven't tried Pong-v0 with learning_rate = 1e-3 yet. I will retry, modify, and close the issue. That is probably the only source of the difference!

etienne87 commented 7 years ago

Retrying with LR = 1e-3 did work (average reward of 19).
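
For reference, a rough sketch of the Config.py change described above; the variable names below (ATARI_GAME, LEARNING_RATE_START/END) are assumed from GA3C's Config.py, so double-check them against your checkout:

ATARI_GAME = 'Pong-v0'        # instead of 'PongDeterministic-v0'
LEARNING_RATE_START = 1e-3    # learning rate used in the successful retry
LEARNING_RATE_END = 1e-3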