eloialonso / iris

Transformers are Sample-Efficient World Models. ICLR 2023, notable top 5%.
https://openreview.net/forum?id=vhFu1Acb0xb
GNU General Public License v3.0
791 stars · 77 forks

MSE for reward predictor #5

Closed · MikeTkachuk closed 2 years ago

MikeTkachuk commented 2 years ago

Hi, the reference paper (https://arxiv.org/pdf/2209.00588.pdf, Section 2.2) states the following: "We train G in a self-supervised manner on segments of L time steps, sampled from past experience. We use a cross-entropy loss for the transition and termination predictors, and a mean-squared error loss for the reward predictor."

However, in iris.src.models.world_model.py:111 you use F.cross_entropy for the reward loss as well. Could you please comment on this choice? Thank you.

vmicheli commented 2 years ago

Hey,

Thanks for pointing that out!

In Section 2.2, we wanted to give a generic description of our method, and an MSE loss is applicable to any environment. However, when the reward function is discrete, one can use a cross-entropy loss instead. Since it is common to clip rewards in Atari environments (after clipping they take only the values -1, 0, and 1, so reward prediction becomes a three-class classification problem), we decided to employ the latter for our experiments.

We will update Section 2.2 to make it clearer for the reader.