avisingh599 / reward-learning-rl

[RSS 2019] End-to-End Robotic Reinforcement Learning without Reward Engineering
https://sites.google.com/view/reward-learning-rl/

VICEClassifier training labels confusion #22

Closed Xingtao closed 4 years ago

Xingtao commented 4 years ago

Hi, I am confused about the labels used in VICE classifier training.

The observation batch is built with

    np.concatenate([negatives, positives])

and the labels are assigned as

        labels_batch = np.zeros((2*self._classifier_batch_size,2), dtype=np.int32)
        labels_batch[:self._classifier_batch_size, 0] = 1
        labels_batch[self._classifier_batch_size:, 1] = 1
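For concreteness, here is a small standalone sketch of the same label construction, assuming a hypothetical batch size of 2:

    import numpy as np

    classifier_batch_size = 2  # hypothetical small value for illustration

    # Same construction as above: first half negatives, second half positives.
    labels_batch = np.zeros((2 * classifier_batch_size, 2), dtype=np.int32)
    labels_batch[:classifier_batch_size, 0] = 1   # negatives get one-hot [1, 0]
    labels_batch[classifier_batch_size:, 1] = 1   # positives get one-hot [0, 1]

    print(labels_batch)
    # [[1 0]
    #  [1 0]
    #  [0 1]
    #  [0 1]]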

and the arguments passed to 'softmax_cross_entropy' are

        log_pi_log_p_concat = tf.concat([log_pi, log_p], axis=1)  # column 0: log_pi, column 1: log_p
        self._classifier_loss_t = tf.reduce_mean(
            tf.losses.softmax_cross_entropy(
                self._label_ph,
                log_pi_log_p_concat,
            )
        )

So in the concatenated logits, column 0 corresponds to 'log_pi' and column 1 to the classifier output 'log_p'. A negative sample therefore has label 1 in the 'log_pi' column and 0 in the 'log_p' column, while a positive sample has label 0 in the 'log_pi' column and 1 in the 'log_p' column.

Why is 'log_pi' labelled 1 for the negative samples?

avisingh599 commented 4 years ago

The VICE discriminator is given by \frac{p}{p + \pi}. So, if I pass \log p and \log \pi to the softmax with labels 1 and 0 respectively, I get \frac{p}{p + \pi} as the probability of the datapoint belonging to label 1, since softmax(x_1, x_2)[x_1] = \frac{\exp(x_1)}{\exp(x_1) + \exp(x_2)}. Please let me know if this makes sense.
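To make this concrete, here is a quick numerical check (not from the repo; the densities p and pi below are made-up values for a single datapoint) that the softmax over [log pi, log p] assigns exactly p / (p + pi) to label 1:

    import numpy as np

    def softmax(x):
        e = np.exp(x - np.max(x))  # numerically stable softmax
        return e / e.sum()

    # Hypothetical densities for one datapoint.
    p = 0.3    # density under the goal (positive) distribution
    pi = 0.1   # density under the policy distribution

    # Column order matches tf.concat([log_pi, log_p], axis=1):
    # column 0 is log_pi, column 1 is log_p.
    logits = np.array([np.log(pi), np.log(p)])

    print(softmax(logits)[1])   # 0.75, probability assigned to label 1
    print(p / (p + pi))         # 0.75, the VICE discriminator value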

Xingtao commented 4 years ago

Thanks for your explanation