avisingh599 / reward-learning-rl

[RSS 2019] End-to-End Robotic Reinforcement Learning without Reward Engineering
https://sites.google.com/view/reward-learning-rl/

VICEClassifier loss definition confusion #21

Closed Xingtao closed 4 years ago

Xingtao commented 4 years ago

Hi,

In vice.py, the classifier loss is defined as:

    def _init_classifier_update(self):
        log_p = self._classifier([self._observations_ph])
        sampled_actions = self._policy.actions([self._observations_ph])
        log_pi = self._policy.log_pis([self._observations_ph], sampled_actions)
        log_pi_log_p_concat = tf.concat([log_pi, log_p], axis=1)

        self._classifier_loss_t = tf.reduce_mean(
            tf.losses.softmax_cross_entropy(
                self._label_ph,
                log_pi_log_p_concat,
            )
        )
        self._classifier_training_op = self._get_classifier_training_op()

When training the classifier, it seems the gradient will also be applied to the policy network. Is this the case?

In the paper, it says the policy network is trained with the classifier's output as the reward, but is not updated during classifier training. What am I missing?

[Screenshot of the relevant passage from the paper]

Thanks

avisingh599 commented 4 years ago

"When do classifier training, it seems the gradient will be applied to policy network? Is this the case?" This is not the case. The ._get_classifier_training_op function (inherited from SACClassifier) ensures that the classifier training op only updates the classifier variables. See this line: https://github.com/avisingh599/reward-learning-rl/blob/8070d93e9379204f153e9044e03079bd9a354183/softlearning/algorithms/sac_classifier.py#L91

Xingtao commented 4 years ago

Ok, thanks