alsora / deep-briscola

Tensorflow Deep Reinforcement Learning agents playing Briscola card game

Fix drqn #16

Closed alsora closed 5 years ago

alsora commented 5 years ago

DRQN CHANGES

BEFORE:

The get_q_table method (used for selecting the next action) was using only the last state as input, even though this is a recurrent neural network. On the other hand, the learn method was using batches of elements for training, where each element was a sequence of episodes. The recurrent layers were working in this way:

Only the last q_table, i.e. the one related to the most recent state, was used to choose the next action.

Similarly, rewards and actions had shape [batch_dim x events_length x 1], and we were using the whole array to compute the loss.

This means that if you provided 5 consecutive states (events_length = 5, and assume batch_dim = 1 for simplicity), you were computing 5 q_tables and comparing all of them with the 5 rewards to get the loss.
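
As a rough illustration of that old behaviour, here is a minimal sketch with dummy tensors in TF 2 eager style; the names and shapes are illustrative, not the actual graph code in drqn.py:

import tensorflow as tf

batch_dim, events_length, n_actions = 1, 5, 3

# Dummy tensors standing in for the network outputs and the stored history.
q_values = tf.random.uniform([batch_dim, events_length, n_actions])   # one q_table per timestep
actions = tf.random.uniform([batch_dim, events_length], maxval=n_actions, dtype=tf.int32)
rewards = tf.random.uniform([batch_dim, events_length, 1])

# Every timestep's q_table was compared against that timestep's reward,
# so with events_length = 5 the loss mixed 5 predictions and 5 targets.
q_taken = tf.reduce_sum(tf.one_hot(actions, n_actions) * q_values, axis=-1)    # [batch, events]
old_loss = tf.reduce_mean(tf.square(tf.squeeze(rewards, axis=-1) - q_taken))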

NOW:

The get_q_table method uses all of the most recent available states to compute the table; the number of states used is at most self.events_length.

Similarly, the loss computation is updated. If I provide 5 consecutive states to the network, it's only because I want to give the LSTM some context and get a more accurate q_table. What I want to predict is always the reward from the last step, so I can drop all the intermediate ones:

# Keep only the reward of the last step in each sequence
rewards_history = tf.reshape(self.r, [-1, self.events_length])
current_rewards = rewards_history[:, -1]
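
For completeness, here is a similarly hedged sketch of the new idea (dummy tensors, illustrative names and sizes, TF 2 eager style rather than the actual graph code in drqn.py): the recurrent layer still sees the whole sequence for context, but only the last timestep's q_table and reward enter the loss.

import tensorflow as tf

batch_dim, events_length, state_dim, n_actions = 1, 5, 10, 3

# Dummy sequence of consecutive game states (state_dim is just an illustrative size).
states = tf.random.uniform([batch_dim, events_length, state_dim])
actions = tf.random.uniform([batch_dim, events_length], maxval=n_actions, dtype=tf.int32)
rewards = tf.random.uniform([batch_dim, events_length, 1])

# The recurrent layer still sees the whole sequence, so the last output
# carries the context of the previous steps...
lstm_out = tf.keras.layers.LSTM(64, return_sequences=True)(states)
q_values = tf.keras.layers.Dense(n_actions)(lstm_out)               # [batch, events, n_actions]

# ...but only the last timestep enters the loss, matching the snippet above.
current_rewards = tf.reshape(rewards, [-1, events_length])[:, -1]   # [batch]
last_q = q_values[:, -1, :]                                         # [batch, n_actions]
last_action = actions[:, -1]                                        # [batch]
q_taken = tf.reduce_sum(tf.one_hot(last_action, n_actions) * last_q, axis=-1)
new_loss = tf.reduce_mean(tf.square(current_rewards - q_taken))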

GENERAL SMALL CHANGES

alsora commented 5 years ago

Let me know what you think of this reasoning.

However, I don't see any changes in the results... but the principle looks more correct.

In any case, wait before merging: writing this up made me notice that I can clean it.

MichelangeloConserva commented 5 years ago

I agree that it looks more correct now. What about the training time? Is it slower or faster with this new way of calculating the q_table? How long have you trained the network?

I've completed the implementation of the self-play arena. After merging this branch into master, I'll train the agents in the arena and post some results.

alsora commented 5 years ago

Training time is almost the same; inference is slightly faster.

I also added two variables inside drqn.py:

update_after and update_every.

With the current values, the network starts learning after 5000 steps (i.e. 250 epochs) and then performs a learning update once every 8 steps.

This should help avoid overfitting.
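
As a minimal sketch of the gating this implies (the helper below is hypothetical; only update_after and update_every come from drqn.py):

# Hypothetical helper, not the actual drqn.py code: gate learning so that it
# starts only after update_after steps and then runs once every update_every steps.
def should_learn(step, update_after=5000, update_every=8):
    return step >= update_after and step % update_every == 0

# With 20 steps per epoch this means learning starts around epoch 250 and
# then roughly 2-3 learning updates happen per epoch.
learning_steps = [s for s in range(1, 10001) if should_learn(s)]
print(learning_steps[0], learning_steps[1])   # 5000 5008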

I was also thinking of adding dropout layers, but then I noticed the thing I told you about in the email.