SaminYeasar / Off_Policy_Adversarial_Inverse_Reinforcement_Learning

Implementation of Off Policy Adversarial Inverse Reinforcement Learning
MIT License

confused about the Discriminator design #1

Closed: ZRZ-Unknow closed this issue 3 years ago

ZRZ-Unknow commented 3 years ago

In the paper: "discriminator is 2 layer MLP of 100 hidden units with tanh activation. Our generator consists of separate Actor and Critic neural network and follows the architecture used in [5, 8], where both of these networks have 2 layer MLP of 400 and 300 hidden units with ReLU activation". But in your implementation the hidden units and activations are not what the paper describes. Why is that? (I put a sketch of what I expected at the end of this comment.) Also, when computing the discriminator's loss, you use:

log_p = reward + gamma * V_ns - V_s    # f(s, a, s'): log of the unnormalized expert term
log_q = lprobs                         # log pi(a|s) under the current policy
log_pq_concat = torch.cat([log_p, log_q], 1)
log_pq = torch.logsumexp(torch.cat([log_p, log_q], 1).view(len(state), 2), dim=1).view(-1, 1)    # log(exp(f) + pi(a|s))

loss2 = F.binary_cross_entropy_with_logits(log_pq_concat, torch.ones(log_pq_concat.size()).to(self.device), reduction='sum')
log_D = log_p - log_pq                 # log D = f - log(exp(f) + pi(a|s))
D = torch.exp(log_D)                   # D = exp(f) / (exp(f) + pi(a|s))
return D, loss2
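
If I am reading it right, the D part itself does match the paper's formula D = exp(f) / (exp(f) + pi(a|s)); here is a quick standalone check I did (toy shapes and values assumed, not your actual code):

import torch

# assumed toy values: f stands for reward + gamma * V_ns - V_s, lprobs for log pi(a|s)
f = torch.randn(4, 1)
lprobs = torch.log(torch.rand(4, 1))

# denominator log(exp(f) + pi(a|s)) via logsumexp, as in the code above
log_pq = torch.logsumexp(torch.cat([f, lprobs], 1), dim=1, keepdim=True)
D_logsumexp = torch.exp(f - log_pq)

# direct evaluation of the paper's formula
D_direct = torch.exp(f) / (torch.exp(f) + torch.exp(lprobs))

print(torch.allclose(D_logsumexp, D_direct))  # prints True

So the returned D looks consistent with the paper; it is mainly the loss2 line that I do not follow.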

Why does this work? D is the output of the discriminator, and according to the formula in the paper I think the following should make sense:

log_D = log_p - log_pq
D = torch.exp(log_D)
loss2 = F.binary_cross_entropy_with_logits(D, torch.ones(D.size()).to(self.device), reduction='sum')
return D, loss2

But this does not seem to work well.
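
For reference, this is roughly the architecture I expected from the paper's description (just a sketch; I am reading "2 layer MLP" as two hidden layers, and the state/action dimensions and the tanh on the actor output are my own assumptions):

import torch.nn as nn

# discriminator as described in the paper: 2-layer MLP, 100 hidden units, tanh
def make_discriminator(input_dim):
    return nn.Sequential(
        nn.Linear(input_dim, 100), nn.Tanh(),
        nn.Linear(100, 100), nn.Tanh(),
        nn.Linear(100, 1),
    )

# actor / critic as described in the paper: 2-layer MLP, 400 and 300 hidden units, ReLU
def make_actor(state_dim, action_dim):
    return nn.Sequential(
        nn.Linear(state_dim, 400), nn.ReLU(),
        nn.Linear(400, 300), nn.ReLU(),
        nn.Linear(300, action_dim), nn.Tanh(),  # assuming bounded continuous actions
    )

def make_critic(state_dim, action_dim):
    return nn.Sequential(
        nn.Linear(state_dim + action_dim, 400), nn.ReLU(),
        nn.Linear(400, 300), nn.ReLU(),
        nn.Linear(300, 1),
    )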

SaminYeasar commented 3 years ago

About the network architecture: you can use the code base as it is and it should work fine. I may have updated the architecture to make it work for retraining (I will check and update the paper), but both architectures should work for imitation. About the discriminator formula: the output of torch.exp is often not numerically stable to use in the gradient update, so the latter equation you mention did not work for me either, and I decided to avoid it.
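
If you want to avoid the exp entirely, one option (just a sketch, reusing the names from your snippet, where log_p = reward + gamma * V_ns - V_s and log_q = lprobs): since log D - log(1 - D) = log_p - log_q, you can pass that difference to binary_cross_entropy_with_logits as the logit of D, and the sigmoid inside the loss recovers D in a numerically stable way:

import torch
import torch.nn.functional as F

def discriminator_loss(log_p, log_q, is_expert):
    # logit of D: log D - log(1 - D) = f - log pi(a|s)
    logits = log_p - log_q
    # expert transitions labelled 1, policy transitions labelled 0
    targets = torch.ones_like(logits) if is_expert else torch.zeros_like(logits)
    # with target 1 this is -log D, with target 0 it is -log(1 - D)
    return F.binary_cross_entropy_with_logits(logits, targets, reduction='sum')

With target 1 this equals -log D and with target 0 it equals -log(1 - D), so it gives the cross-entropy you were after without ever calling torch.exp.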