Closed MikeTkachuk closed 2 years ago
Hey,
Thanks for pointing that out!
In Section 2.2, we wanted to give a generic description of our method, and an MSE loss is applicable to any environment. However, when the reward function is discrete, one can use a cross-entropy loss instead. Since it is common in Atari environments to clip the reward to {-1, 0, +1}, the reward targets are discrete, and we decided to employ the latter for our experiments.
We will update Section 2.2 to make it clearer for the reader.
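For readers coming from the code, here is a minimal sketch of the two options discussed above. The tensor shapes, the `logits_rewards` / `pred_rewards` names, and the reward head itself are illustrative assumptions, not the actual IRIS implementation.

```python
import torch
import torch.nn.functional as F

# Hypothetical reward head output: logits over the three clipped reward
# values {-1, 0, +1}, shape (batch, timesteps, 3).
logits_rewards = torch.randn(8, 20, 3)

# Raw environment rewards, shape (batch, timesteps).
rewards = torch.randn(8, 20)

# Sign-clipping maps rewards to {-1, 0, +1}; shifting by +1 gives class
# indices {0, 1, 2} suitable for cross-entropy.
labels_rewards = (rewards.sign() + 1).long()

# Discrete option: cross-entropy over the three reward classes.
loss_rewards_ce = F.cross_entropy(
    logits_rewards.flatten(0, 1), labels_rewards.flatten(0, 1)
)

# Generic option from Section 2.2: regress the clipped reward directly
# with an MSE loss (here the head would output a scalar instead).
pred_rewards = torch.randn(8, 20)
loss_rewards_mse = F.mse_loss(pred_rewards, rewards.sign())
```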
Hi, the reference paper (https://arxiv.org/pdf/2209.00588.pdf, Section 2.2) states the following: "We train G in a self-supervised manner on segments of L time steps, sampled from past experience. We use a cross-entropy loss for the transition and termination predictors, and a mean-squared error loss for the reward predictor."
However, in iris.src.models.world_model.py:111 you use F.cross_entropy for the reward predictor. Could you please comment on this choice? Thank you.