avisingh599 / reward-learning-rl

[RSS 2019] End-to-End Robotic Reinforcement Learning without Reward Engineering
https://sites.google.com/view/reward-learning-rl/

VICEClassifier loss definition confusion #21

Closed: Xingtao closed this issue 4 years ago

Xingtao commented 4 years ago

Hi,

In vice.py, the classifier loss is defined as:

    def _init_classifier_update(self):
        # Classifier logit on the observation batch.
        log_p = self._classifier([self._observations_ph])
        # Policy log-probabilities of its own sampled actions, used as the
        # second logit of the discriminator.
        sampled_actions = self._policy.actions([self._observations_ph])
        log_pi = self._policy.log_pis([self._observations_ph], sampled_actions)
        # Two-class logits: [log_pi, log_p].
        log_pi_log_p_concat = tf.concat([log_pi, log_p], axis=1)

        self._classifier_loss_t = tf.reduce_mean(
            tf.losses.softmax_cross_entropy(
                self._label_ph,
                log_pi_log_p_concat,
            )
        )
        self._classifier_training_op = self._get_classifier_training_op()
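
(For reference, a small standalone version of the same loss construction; the shapes and label convention below are illustrative only and are not taken from the repo.)

    import tensorflow as tf  # TF1-style graph API, matching the snippet above

    batch = 4
    log_pi = tf.random_normal([batch, 1])   # stand-in for policy log-probs
    log_p = tf.random_normal([batch, 1])    # stand-in for classifier logits
    logits = tf.concat([log_pi, log_p], axis=1)   # shape [batch, 2]

    # One-hot labels: here column 0 is assumed to mark policy samples and
    # column 1 goal examples; the actual convention is set via _label_ph.
    labels = tf.one_hot([0, 0, 1, 1], depth=2)
    loss = tf.reduce_mean(tf.losses.softmax_cross_entropy(labels, logits))

    with tf.Session() as sess:
        print(sess.run(loss))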

When training the classifier, it seems the gradients will also flow to the policy network. Is this the case?

In the paper, it says the policy network is trained using the classifier's output as the reward, but is not updated during classifier training. What am I missing?

(Screenshot attached.)

Thanks

avisingh599 commented 4 years ago

"When do classifier training, it seems the gradient will be applied to policy network? Is this the case?" This is not the case. The ._get_classifier_training_op function (inherited from SACClassifier) ensures that the classifier training op only updates the classifier variables. See this line: https://github.com/avisingh599/reward-learning-rl/blob/8070d93e9379204f153e9044e03079bd9a354183/softlearning/algorithms/sac_classifier.py#L91

Xingtao commented 4 years ago

Ok, thanks