Define a softmax LSTM architecture

Hi,

I'm trying to combine the A3CLSTMGaussian and A3CFFSoftmax examples into an A3CLSTMSoftmax architecture. Is the following the right way to go? Would you change anything?

BTW, if I have managed to use A3CFFSoftmax successfully, should I change anything in the observations? Namely, should each observation contain a history of previous observations, or is everything handled for me by Chainer / ChainerRL? One more question: what is the argument t-max used for?
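For context on the t-max question, here is a minimal sketch of how the parameter is usually described for A3C: the agent collects at most t-max steps, then computes discounted returns backwards from a bootstrap value and performs one update. This is plain Python rather than ChainerRL code; the helper names `n_step_returns` and `split_into_segments` are hypothetical, and the sketch assumes the bootstrap value would come from the value function (or be 0 at a terminal state).

```python
def n_step_returns(rewards, bootstrap_value, gamma):
    """Discounted returns for one rollout segment, computed backwards.

    rewards: rewards collected during the segment (at most t_max of them).
    bootstrap_value: assumed estimate V(s) of the state after the segment,
        or 0.0 if the episode ended there.
    """
    returns = []
    R = bootstrap_value
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    returns.reverse()
    return returns


def split_into_segments(rewards, t_max):
    """Cut a reward sequence into chunks of at most t_max steps.

    Each chunk corresponds to one update in this simplified picture.
    """
    return [rewards[i:i + t_max] for i in range(0, len(rewards), t_max)]


if __name__ == "__main__":
    episode_rewards = [1.0, 0.0, 0.0, 1.0, 1.0]
    for segment in split_into_segments(episode_rewards, t_max=2):
        print(segment, n_step_returns(segment, bootstrap_value=0.0, gamma=0.99))
```

Under this reading, a larger t-max means fewer but longer-horizon updates. For a recurrent model it would also bound how many steps gradients flow back through, though that should be checked against the ChainerRL source.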

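import chainer
import chainer.functions as F
import chainer.links as L
from chainerrl import policies
from chainerrl import v_function
from chainerrl.agents import a3c


class A3CLSTMSoftmax(chainer.ChainList, a3c.A3CModel):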
    def __init__(self, obs_size, action_size, hidden_size=200, lstm_size=128):
        self.pi_head = L.Linear(obs_size, hidden_size)
        self.v_head = L.Linear(obs_size, hidden_size)
        self.pi_lstm = L.LSTM(hidden_size, lstm_size)
        self.v_lstm = L.LSTM(hidden_size, lstm_size)
        self.pi = policies.SoftmaxPolicy(lstm_size, action_size, hidden_sizes=(hidden_size,))
        self.v = v_function.FCVFunction(lstm_size)
        super().__init__(self.pi_head, self.v_head,
                         self.pi_lstm, self.v_lstm, self.pi, self.v)

    def pi_and_v(self, state):

        def forward(head, lstm, tail):
            h = F.relu(head(state))
            h = lstm(h)
            return tail(h)

        pout = forward(self.pi_head, self.pi_lstm, self.pi)
        vout = forward(self.v_head, self.v_lstm, self.v)

        return pout, vout
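On the question of observation history, the toy sketch below shows why a stateful recurrent layer can make stacking past observations unnecessary: the hidden state carries information from earlier calls. `TinyRecurrentCell` is a hypothetical stand-in written in plain Python, not Chainer's `L.LSTM`; it only mimics the two properties that matter here, namely that state persists between calls and that it can be cleared with `reset_state()`.

```python
import math


class TinyRecurrentCell:
    """Hypothetical stand-in for a stateful recurrent link.

    The weights are fixed scalars purely for illustration.
    """

    def __init__(self, w_in=0.5, w_rec=0.9):
        self.w_in = w_in
        self.w_rec = w_rec
        self.h = 0.0  # hidden state kept between calls

    def reset_state(self):
        # Forget everything seen so far, e.g. at an episode boundary.
        self.h = 0.0

    def __call__(self, x):
        # The output depends on the current input and the previous state.
        self.h = math.tanh(self.w_in * x + self.w_rec * self.h)
        return self.h


if __name__ == "__main__":
    cell = TinyRecurrentCell()
    first = cell(1.0)
    second = cell(1.0)   # same input, different output: history matters
    cell.reset_state()
    again = cell(1.0)    # after a reset the first output is reproduced
    print(first, second, again)
```

If this picture carries over, the model above would need its LSTM state reset between episodes. As far as I can tell, the A3CLSTMGaussian example gets that by also inheriting from a recurrent mixin class, so it is worth checking whether A3CLSTMSoftmax needs the same base class.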

chainer / chainerrl

Define a softmax LSTM architecture #512