avisingh599 / reward-learning-rl

[RSS 2019] End-to-End Robotic Reinforcement Learning without Reward Engineering
https://sites.google.com/view/reward-learning-rl/

VICE vs. SACClassifier #14

Closed jgkim2020 closed 5 years ago

jgkim2020 commented 5 years ago

This is not an issue with the code implementation per se, but rather a question about the difference between the two algorithms.

It seems that the VICE class implementation follows the equation from the original VICE paper, as well as the RSS paper, when training the "logit" f(s) via the softmax discriminator D(s, a) with a cross-entropy loss.
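
For reference, here is a minimal sketch of that loss, assuming the standard VICE discriminator D(s, a) = exp(f(s)) / (exp(f(s)) + pi(a|s)); the function and argument names are hypothetical and this is not the repo's TensorFlow code:

```python
# Minimal NumPy sketch (not the repo's TensorFlow code) of a VICE-style
# discriminator loss; names here are assumptions, not the repo's API.
import numpy as np

def vice_discriminator_loss(f_s, log_pi, is_goal_example):
    """Cross-entropy loss for D(s, a) = exp(f(s)) / (exp(f(s)) + pi(a|s)).

    f_s             : classifier logit f(s) for one state
    log_pi          : log pi(a|s) of the current policy at that state
    is_goal_example : True if the sample is a goal example,
                      False if it was generated by the policy
    """
    # D(s, a) is the first entry of a two-way softmax over [f(s), log pi(a|s)],
    # so the cross-entropy reduces to a log-softmax lookup.
    logits = np.array([f_s, log_pi])
    log_probs = logits - np.log(np.sum(np.exp(logits)))
    target = 0 if is_goal_example else 1
    return -log_probs[target]
```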

However, the SACClassifier class implementation does not use log_pi(a|s); instead, it trains the "logit" via the sigmoid discriminator D(s) with a cross-entropy loss. Since SACClassifier uses negative samples (drawn from the replay buffer) when training the "logit" (or, equivalently, the event probability), it doesn't seem to be the "Naive Classifier" case mentioned in the RSS paper.
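
For contrast, a sketch of the plain sigmoid-classifier loss described above, again with hypothetical names rather than the repo's exact code:

```python
# Minimal NumPy sketch of a sigmoid classifier loss D(s) = sigmoid(f(s)),
# with no log pi(a|s) term; names are assumptions, not the repo's API.
import numpy as np

def sigmoid_classifier_loss(f_s, is_goal_example):
    """Binary cross-entropy with logits: positives are goal examples,
    negatives are states sampled from the replay buffer."""
    label = 1.0 if is_goal_example else 0.0
    # Numerically stable form of -[label * log D(s) + (1 - label) * log(1 - D(s))].
    return np.maximum(f_s, 0.0) - f_s * label + np.log1p(np.exp(-np.abs(f_s)))
```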

What is the reasoning/theory behind SACClassifier? Any references (relevant paper, etc.) would be much appreciated :)

jgkim2020 commented 5 years ago

Never mind, I realized that SACClassifier only trains the classifier during the first epoch (self._epoch == 0), so it is indeed the "Naive Classifier" case from the paper.

avisingh599 commented 5 years ago

Glad you figured it out! The reason SACClassifier was implemented in this non-intuitive way is that it made it extremely simple to implement VICE and VICE-RAQ on top of it.