alsora / deep-briscola

Tensorflow Deep Reinforcement Learning agents playing Briscola card game

Fix drqn #16

Closed alsora closed 5 years ago

alsora commented 5 years ago

DRQN CHANGES

BEFORE:

The get_q_table method (used for selecting the next action) was using only the last state as input, even though this is a recurrent neural network. On the other hand, the learn method was using batches of elements for training, where each element was a sequence of episodes. The recurrent layers were working in this way:

Only the last q_table, i.e. the one related to the most recent state, was used to choose the next action.

Similarly, rewards and actions had shape [batch_dim x events_length x 1], and we were using the whole array to compute the loss.

This means that if you provided 5 consecutive states (events_length = 5, and assume batch_dim = 1 for simplicity), you were computing 5 q_tables and comparing all of them with the 5 rewards to get the loss.
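
As a rough illustration of that old behaviour, here is a minimal sketch with dummy tensors in TF 2 eager style; the names and shapes are illustrative, not the actual graph code in drqn.py:

import tensorflow as tf

batch_dim, events_length, n_actions = 1, 5, 3

# Dummy tensors standing in for the network outputs and the stored history.
q_values = tf.random.uniform([batch_dim, events_length, n_actions])   # one q_table per timestep
actions = tf.random.uniform([batch_dim, events_length], maxval=n_actions, dtype=tf.int32)
rewards = tf.random.uniform([batch_dim, events_length, 1])

# Every timestep's q_table was compared against that timestep's reward,
# so with events_length = 5 the loss mixed 5 predictions and 5 targets.
q_taken = tf.reduce_sum(tf.one_hot(actions, n_actions) * q_values, axis=-1)    # [batch, events]
old_loss = tf.reduce_mean(tf.square(tf.squeeze(rewards, axis=-1) - q_taken))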

NOW:

The get_q_table method uses all of the most recent available states to compute the table; the number of states used is at most self.events_length.

Similarly, the loss computation is updated. If I provide 5 consecutive states to the network, it's only because I want to give the LSTM some context and get a more accurate q_table. What I want to predict is always the reward from the last step, so I can drop all the intermediate ones:

# Keep only the reward of the last step in each sequence
rewards_history = tf.reshape(self.r, [-1, self.events_length])
current_rewards = rewards_history[:, -1]
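
For completeness, here is a similarly hedged sketch of the new idea (dummy tensors, illustrative names and sizes, TF 2 eager style rather than the actual graph code in drqn.py): the recurrent layer still sees the whole sequence for context, but only the last timestep's q_table and reward enter the loss.

import tensorflow as tf

batch_dim, events_length, state_dim, n_actions = 1, 5, 10, 3

# Dummy sequence of consecutive game states (state_dim is just an illustrative size).
states = tf.random.uniform([batch_dim, events_length, state_dim])
actions = tf.random.uniform([batch_dim, events_length], maxval=n_actions, dtype=tf.int32)
rewards = tf.random.uniform([batch_dim, events_length, 1])

# The recurrent layer still sees the whole sequence, so the last output
# carries the context of the previous steps...
lstm_out = tf.keras.layers.LSTM(64, return_sequences=True)(states)
q_values = tf.keras.layers.Dense(n_actions)(lstm_out)               # [batch, events, n_actions]

# ...but only the last timestep enters the loss, matching the snippet above.
current_rewards = tf.reshape(rewards, [-1, events_length])[:, -1]   # [batch]
last_q = q_values[:, -1, :]                                         # [batch, n_actions]
last_action = actions[:, -1]                                        # [batch]
q_taken = tf.reduce_sum(tf.one_hot(last_action, n_actions) * last_q, axis=-1)
new_loss = tf.reduce_mean(tf.square(current_rewards - q_taken))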

GENERAL SMALL CHANGES

alsora commented 5 years ago

Let me know what you think of this reasoning.

However, I don't see any changes in the results... but the principle looks more correct.

In any case, wait before merging: writing this up made me notice that I can clean it.

MichelangeloConserva commented 5 years ago

I agree that it looks more correct now. What about the training time? Is it slower or faster with this new way of calculating the q_table? How long have you trained the network?

I've completed the implementation of the self-play arena. After merging this branch into master, I'll train the agents in the arena and post some results.

alsora commented 5 years ago

Training time is almost the same; inference is slightly faster.

I also added two variables inside drqn.py:

update_after and update_every.

With the current values, the network starts learning after 5000 steps (i.e. 250 epochs) and then performs a learning update once every 8 steps.

This should help avoid overfitting.
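
As a minimal sketch of the gating this implies (the helper below is hypothetical; only update_after and update_every come from drqn.py):

# Hypothetical helper, not the actual drqn.py code: gate learning so that it
# starts only after update_after steps and then runs once every update_every steps.
def should_learn(step, update_after=5000, update_every=8):
    return step >= update_after and step % update_every == 0

# With 20 steps per epoch this means learning starts around epoch 250 and
# then roughly 2-3 learning updates happen per epoch.
learning_steps = [s for s in range(1, 10001) if should_learn(s)]
print(learning_steps[0], learning_steps[1])   # 5000 5008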

I was also thinking of adding dropout layers, but then I noticed the thing I told you about in the email.