Open zhixuan-lin opened 2 years ago

Hello,

Thanks for this great work! I noticed that you clip the reward to [-1, 1] for Atari. I'm wondering what the purpose of applying the value transformation (i.e. `scalar_transform`) is if the reward is already clipped?

---

When data is limited, the reward function is easier to train on clipped rewards; that is the reason for reward clipping.

As for the value transformation: we train the reward prediction with a cross-entropy loss over a reward distribution, rather than an MSE loss between scalars, and the transformation maps the scalar target onto the support of that distribution. Moreover, since we predict the value prefix in place of single-step rewards, the output is not in the range [-1, 1] but [-5, 5].

Hope this can help you :)
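For concreteness, here is a minimal sketch of what such a transform and categorical target can look like. The epsilon value, support size of 5, and function names are assumptions for illustration, not the repository's exact code; the transform `h` and its closed-form inverse follow the form given in the MuZero appendix, which this line of work builds on:

```python
import math

EPS = 0.001  # epsilon from the MuZero appendix (assumed value)

def _sign(x: float) -> float:
    return 1.0 if x >= 0 else -1.0

def scalar_transform(x: float) -> float:
    """Invertible transform h(x) = sign(x) * (sqrt(|x| + 1) - 1) + eps * x.
    It compresses large scalars before they are projected onto the
    categorical support used by the cross-entropy loss."""
    return _sign(x) * (math.sqrt(abs(x) + 1) - 1) + EPS * x

def inverse_scalar_transform(y: float) -> float:
    """Closed-form inverse of h, used to recover a scalar from the
    expectation of the predicted distribution."""
    return _sign(y) * (
        ((math.sqrt(1 + 4 * EPS * (abs(y) + 1 + EPS)) - 1) / (2 * EPS)) ** 2 - 1
    )

def scalar_to_two_hot(x: float, support_size: int = 5) -> list:
    """Project a transformed scalar onto integer bins
    [-support_size, support_size] as a 'two-hot' distribution:
    the mass is split between the two nearest bins. This vector is
    the target for the cross-entropy loss."""
    x = max(-float(support_size), min(float(support_size), x))
    low = math.floor(x)
    frac = x - low  # weight assigned to the upper neighbouring bin
    probs = [0.0] * (2 * support_size + 1)
    probs[low + support_size] = 1.0 - frac
    if frac > 0:
        probs[low + support_size + 1] = frac
    return probs
```

With a wider support such as [-5, 5], a clipped reward of magnitude at most 1 still fits, but so does a value prefix that sums several clipped rewards, which is why the clipping alone does not make the transformation redundant.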