Eclectic-Sheep / sheeprl

Distributed Reinforcement Learning accelerated by Lightning Fabric
https://eclecticsheep.ai
Apache License 2.0

Potential Misalignment in p2e_dv2 and p2e_dv3 Implementations with Original Paper #322

Open tallance opened 1 month ago

tallance commented 1 month ago

I've noticed a potential misalignment in the p2e_dv2 and p2e_dv3 implementations regarding what the ensemble predicts. According to the Plan2Explore paper, the ensemble should predict the image embedding, not the posterior state. The implementation in p2e_dv1 appears to be aligned with this:

loss -= next_obs_embedding_dist.log_prob(embedded_obs.detach()[1:]).mean()

However, in p2e_dv2 and p2e_dv3, the ensemble appears to be trained to predict the next (randomized) posterior state instead:

loss -= next_obs_embedding_dist.log_prob(posteriors.view(sequence_length, batch_size, -1).detach()[1:]).mean()
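To make the difference concrete, here is a minimal, self-contained sketch of the two target choices. It is not sheeprl's code: the shapes (T, B, embed_dim, stoch, discrete), the helper ensemble_nll, the zero-valued predictions, and the unit-variance Gaussian head are all assumptions chosen only for illustration. Only the way the targets are sliced follows the two lines quoted above.

```python
import torch
from torch.distributions import Independent, Normal

# Hypothetical sizes, chosen only for illustration.
T, B = 5, 2                # sequence length, batch size
embed_dim = 8              # size of the encoder's observation embedding
stoch, discrete = 4, 3     # posterior laid out as [stoch, discrete]

torch.manual_seed(0)
embedded_obs = torch.randn(T, B, embed_dim)        # stand-in for encoder output
posteriors = torch.randn(T, B, stoch, discrete)    # stand-in for sampled posteriors


def ensemble_nll(pred_mean: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of `target` under a unit-variance Gaussian
    centred at `pred_mean` (assumed head, not necessarily sheeprl's)."""
    dist = Independent(Normal(pred_mean, 1.0), 1)
    # The target is detached so gradients reach only the ensemble member.
    return -dist.log_prob(target.detach()).mean()


# p2e_dv1-style target: the next observation embedding.
target_embed = embedded_obs[1:]                    # [T-1, B, embed_dim]

# p2e_dv2 / p2e_dv3-style target: the next flattened posterior state.
target_post = posteriors.view(T, B, -1)[1:]        # [T-1, B, stoch * discrete]

# Dummy ensemble outputs; a real member would predict these from state and action.
pred_embed = torch.zeros(T - 1, B, embed_dim)
pred_post = torch.zeros(T - 1, B, stoch * discrete)

loss_embed = ensemble_nll(pred_embed, target_embed)
loss_post = ensemble_nll(pred_post, target_post)
```

The two losses differ in both the target and the output size the ensemble head must have (embed_dim versus stoch * discrete), so switching between them also means changing the ensemble's output layer.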

Could this be an intentional modification, or am I missing something about how these predictions should be handled?