Never mind, I realized that SACClassifier only trains the classifier during the first epoch (self._epoch == 0) and is indeed the "Naive Classifier" case from the paper.
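For anyone reading this later, the gating I'm referring to amounts to roughly the following sketch. This is not the actual softlearning code and the names are illustrative; it just captures the "train once on the first epoch, then freeze" behavior:

```python
class NaiveClassifierSketch:
    """Illustrative sketch (not the softlearning implementation) of training
    the reward classifier only during the first epoch and freezing it after."""

    def __init__(self):
        self._epoch = 0
        self._classifier_trained = False

    def _epoch_after_hook(self):
        if self._epoch == 0:
            # The classifier is fit once, on goal examples vs. initial data,
            # and never updated again -- the "Naive Classifier" setup.
            self._classifier_trained = True
        self._epoch += 1
```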
Glad you figured it out! The reason SACClassifier was implemented in this non-intuitive way is that it makes it very simple to implement VICE and VICE-RAQ on top of it.
This is not an issue about the code implementation per se, but rather a question about the difference between the two algorithms.
It seems that the VICE class implementation follows the equations from the original paper as well as the RSS paper, training the "logit" f(s) via the softmax discriminator D(s,a) with a cross-entropy loss.
However, the SACClassifier class implementation does not use log_pi(a|s) and instead trains the "logit" via the sigmoid discriminator D(s) with a cross-entropy loss. Since SACClassifier uses negative samples (drawn from the replay buffer) when training the "logit" (or equivalently, the event probability), it doesn't seem to be the "Naive Classifier" case mentioned in the RSS paper.
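For concreteness, here is a minimal numerical sketch of the two discriminator forms I'm contrasting. The function names are mine, this is not the softlearning code, and I'm assuming the standard VICE/AIRL-style form exp(f(s)) / (exp(f(s)) + pi(a|s)) for D(s,a):

```python
import numpy as np

def vice_discriminator(f_s, log_pi_a_s):
    # VICE-style "softmax" discriminator over the logit f(s) and the policy:
    #   D(s, a) = exp(f(s)) / (exp(f(s)) + pi(a|s))
    return np.exp(f_s) / (np.exp(f_s) + np.exp(log_pi_a_s))

def state_only_discriminator(f_s):
    # Plain sigmoid classifier on states, which is what SACClassifier
    # appears to train: D(s) = sigmoid(f(s))
    return 1.0 / (1.0 + np.exp(-f_s))

def binary_cross_entropy(d, labels):
    # Cross-entropy loss with labels = 1 for goal examples (positives)
    # and labels = 0 for replay-buffer samples (negatives).
    eps = 1e-8
    return -np.mean(labels * np.log(d + eps) + (1.0 - labels) * np.log(1.0 - d + eps))
```

In both cases the classifier is fit with a cross-entropy loss on positives vs. replay-buffer negatives; the difference is whether the policy's log_pi(a|s) enters the discriminator.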
What is the reasoning/theory behind SACClassifier? Any references (relevant paper, etc.) would be much appreciated :)