Open zhixuan-lin opened 2 years ago

Hello,

Thanks for this great work! I noticed that you clip the reward to [-1, 1] for Atari. I'm wondering what the purpose of applying the value transformation (i.e. `scalar_transform`) is if the reward is already clipped?

---

When data is limited, the reward function is easier to train on clipped rewards; that is the reason for reward clipping.

As for the value transformation: we train the reward prediction with a cross-entropy loss over a reward distribution, rather than an MSE loss between scalars, and the transformation maps the scalar target onto the support of that distribution. Moreover, since we predict the value prefix in place of single-step rewards, the output is not in the range [-1, 1] but [-5, 5].

Hope this can help you :)
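For concreteness, here is a minimal sketch of what such a transform and categorical target can look like. The epsilon value, support size of 5, and function names are assumptions for illustration, not the repository's exact code; the transform `h` and its closed-form inverse follow the form given in the MuZero appendix, which this line of work builds on:

```python
import math

EPS = 0.001  # epsilon from the MuZero appendix (assumed value)

def _sign(x: float) -> float:
    return 1.0 if x >= 0 else -1.0

def scalar_transform(x: float) -> float:
    """Invertible transform h(x) = sign(x) * (sqrt(|x| + 1) - 1) + eps * x.
    It compresses large scalars before they are projected onto the
    categorical support used by the cross-entropy loss."""
    return _sign(x) * (math.sqrt(abs(x) + 1) - 1) + EPS * x

def inverse_scalar_transform(y: float) -> float:
    """Closed-form inverse of h, used to recover a scalar from the
    expectation of the predicted distribution."""
    return _sign(y) * (
        ((math.sqrt(1 + 4 * EPS * (abs(y) + 1 + EPS)) - 1) / (2 * EPS)) ** 2 - 1
    )

def scalar_to_two_hot(x: float, support_size: int = 5) -> list:
    """Project a transformed scalar onto integer bins
    [-support_size, support_size] as a 'two-hot' distribution:
    the mass is split between the two nearest bins. This vector is
    the target for the cross-entropy loss."""
    x = max(-float(support_size), min(float(support_size), x))
    low = math.floor(x)
    frac = x - low  # weight assigned to the upper neighbouring bin
    probs = [0.0] * (2 * support_size + 1)
    probs[low + support_size] = 1.0 - frac
    if frac > 0:
        probs[low + support_size + 1] = frac
    return probs
```

With a wider support such as [-5, 5], a clipped reward of magnitude at most 1 still fits, but so does a value prefix that sums several clipped rewards, which is why the clipping alone does not make the transformation redundant.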