PKU-Alignment / Safe-Policy-Optimization

NeurIPS 2023: Safe Policy Optimization: A benchmark repository for safe reinforcement learning algorithms
https://safe-policy-optimization.readthedocs.io/en/latest/index.html
Apache License 2.0

Question about the torch.Size of loss_pi in the FOCOPS implementation #68

Closed lijie9527 closed 1 year ago

lijie9527 commented 1 year ago

I found the following in focops.py:

```python
ratio = torch.exp(log_prob - log_prob_b)
temp_kl = torch.distributions.kl_divergence(
    distribution, old_distribution_b
).sum(-1, keepdim=True)
loss_pi = (temp_kl - (1 / FOCOPS_LAM) * ratio * adv_b) * (
    temp_kl.detach() <= dict_args['target_kl']
).type(torch.float32)
loss_pi = loss_pi.mean()
```

Assuming a minibatch size of 64, `temp_kl.shape` is `(64, 1)` because of the `keepdim=True` used when computing `temp_kl`, while the other tensors in the `loss_pi` expression have shape `(64,)`. Broadcasting therefore makes `loss_pi.shape` come out as `(64, 64)` instead of `(64, 1)` or `(64,)`. Why not keep the dimensions consistent?
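For concreteness, here is a minimal standalone repro of the broadcast I am describing (the batch size, `lam`, and `target_kl` values are placeholders, not the repository's defaults):

```python
import torch

batch = 64
temp_kl = torch.rand(batch, 1)   # (64, 1), as produced by .sum(-1, keepdim=True)
ratio = torch.rand(batch)        # (64,)
adv_b = torch.randn(batch)       # (64,)
lam, target_kl = 1.5, 0.02       # placeholders for FOCOPS_LAM and dict_args['target_kl']

# (64, 1) combined with (64,) tensors broadcasts to (64, 64)
loss_pi = (temp_kl - (1 / lam) * ratio * adv_b) * (
    temp_kl.detach() <= target_kl
).type(torch.float32)
print(loss_pi.shape)  # torch.Size([64, 64])
```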

Gaiejj commented 1 year ago

We follow the original FOCOPS implementation for the loss-function computation; see here.
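For readers following the thread, a self-contained sketch (with placeholder Gaussian policies and values, not the repository's or the original FOCOPS code) of how the per-sample loss stays at shape `(batch,)` if the KL is summed without `keepdim=True`:

```python
import torch

batch, act_dim = 64, 4
# Placeholder Gaussian policies standing in for `distribution` and `old_distribution_b`
distribution = torch.distributions.Normal(
    torch.zeros(batch, act_dim), torch.ones(batch, act_dim)
)
old_distribution_b = torch.distributions.Normal(
    0.1 * torch.ones(batch, act_dim), torch.ones(batch, act_dim)
)
log_prob, log_prob_b = torch.randn(batch), torch.randn(batch)
adv_b = torch.randn(batch)
lam, target_kl = 1.5, 0.02  # placeholders for FOCOPS_LAM and dict_args['target_kl']

ratio = torch.exp(log_prob - log_prob_b)          # (64,)
temp_kl = torch.distributions.kl_divergence(
    distribution, old_distribution_b
).sum(-1)                                         # (64,) -- keepdim=True dropped
loss_pi = (temp_kl - (1 / lam) * ratio * adv_b) * (
    temp_kl.detach() <= target_kl
).type(torch.float32)
print(loss_pi.shape)  # torch.Size([64]); .mean() then averages one loss per sample
```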

lijie9527 commented 1 year ago

Thanks for your reply.