HuangJiaLian / AIRL_MountainCar

Adversarial Inverse Reinforcement Learning implementation for Mountain Car

Weird reward behavior #1

Open · ran-weii opened this issue 4 years ago

ran-weii commented 4 years ago

Hi Jia Lian,

I replicated your code in PyTorch, using the same number of hidden units and the position-only input. My reward is also a function of state only, not state-action, and the reward I use to update the generator is likewise the discriminator's logistic ratio.
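Concretely, the generator reward I compute is roughly the following (a minimal PyTorch sketch; `reward_net`, `states`, and `log_pi` are just placeholders for the corresponding pieces of my implementation):

```python
import torch
import torch.nn as nn

# Placeholder state-only reward network f(s); the hidden size here is arbitrary.
reward_net = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))

def generator_reward(states, log_pi):
    """Logistic ratio of the AIRL discriminator, used as the policy's reward.

    D = exp(f) / (exp(f) + pi)  =>  log D - log(1 - D) = f(s) - log pi(a|s)

    `states` is a (batch, 1) tensor of x positions and `log_pi` holds the
    policy's log-probabilities of the taken actions (both placeholders).
    """
    f = reward_net(states).squeeze(-1)  # f(s), state-only
    return f - log_pi
```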

The only differences between us might be the number of trajectories I collect each iteration, the learning rate, and the regularization. For the discriminator I use your random-batch update, but for the generator I sweep through all the (state, action) pairs a few times. There could be minor bugs in my code, but I trust my implementation.

But I got quite weird behavior from the reward function.

[Learned-reward plots at iterations 25, 75, 175, 275, 475, and 775]

You can see the trend here. What could be causing this?

Thanks, Ran

HuangJiaLian commented 4 years ago

@rw422scarlet

Hi Ran, your result is actually quite reasonable: all the reward functions you obtained (after 75 iterations) have higher values around x_position = -1 and 0.5, and lower values around -0.5. It makes sense, because there are many reward functions that can guide the agent to the optimal policy.

Actually, the true reward function is fairly simple: it depends only on x_position. So you should probably keep the neural network (NN) for the reward function as simple as possible. One possible reason for the weird reward behavior is that your reward NN is more complex than it needs to be.
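For example, something as small as the following should be enough (just a sketch; the layer sizes are only a suggestion, not what the repo uses):

```python
import torch.nn as nn

# A deliberately small reward network that takes only x_position as input.
# One hidden layer is probably enough for Mountain Car.
simple_reward_net = nn.Sequential(
    nn.Linear(1, 16),   # input: x_position only
    nn.ReLU(),
    nn.Linear(16, 1),   # output: scalar reward f(x)
)
```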

ran-weii commented 4 years ago

@HuangJiaLian

Actually, now that I think about it, you are right. The x position around -0.25 is important in that it gives the agent a motivation to revisit it.

I have another question for you, related to the convergence of this algorithm. I am currently applying it to CartPole, with the input being the 4 state variables: x, x', theta, theta'.

I did some searching online and found that the best way to monitor GAN training is to track the losses of all networks and the discriminator's classification accuracy, like the figures here: https://machinelearningmastery.com/practical-guide-to-gan-failure-modes/. For us, the equivalent classifier prediction would be exp(f) / (exp(f) + pi), and we can use this to compute an accuracy. However, this is not the real probability, since the normalization term has been dropped for optimization purposes (at least the way I understood it), which makes convergence monitoring difficult.
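What I am computing for this "accuracy" is roughly the following (a sketch; the tensor names are placeholders for values produced by my implementation):

```python
import torch

def discriminator_accuracy(f_expert, log_pi_expert, f_gen, log_pi_gen):
    """Treat D = exp(f) / (exp(f) + pi) as a classifier score and measure how
    often it labels expert samples as expert (D > 0.5) and generator samples
    as generator (D < 0.5). Inputs are tensors of f(s) and log pi(a|s) values
    for expert and generator batches (placeholders for my own code)."""
    d_expert = torch.sigmoid(f_expert - log_pi_expert)  # = exp(f)/(exp(f)+pi)
    d_gen = torch.sigmoid(f_gen - log_pi_gen)
    acc_expert = (d_expert > 0.5).float().mean()
    acc_gen = (d_gen < 0.5).float().mean()
    return 0.5 * (acc_expert + acc_gen)
```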

For example, this is my training history:

[Plot of CartPole AIRL training: losses and normalized discriminator scores]

You can see that the generator quickly achieved stable performance, but the discriminator accuracy did not decrease the way I expected, and the quantity I am monitoring isn't really an accuracy at all.

When I plotted the learned policy, it did not resemble the policy I actually used to generate the data.

What do you think is happening here? And what do you think is the best way to monitor training?

Ran