PKU-Alignment / Safe-Policy-Optimization

NeurIPS 2023: Safe Policy Optimization: A benchmark repository for safe reinforcement learning algorithms
https://safe-policy-optimization.readthedocs.io/en/latest/index.html
Apache License 2.0

Doubt about the updating method of Lagrange Multipliers #69

Closed · lijie9527 closed this issue 8 months ago

lijie9527 commented 8 months ago

```python
from safepo.common.lagrange import Lagrange

# Manual (direct-assignment) update for comparison.
nu = 1.0
nu_lr = 0.1
ep_cost = 35

# SafePO's Lagrange multiplier, updated through an Adam optimizer.
lagrange = Lagrange(
    cost_limit=25.0,
    lagrangian_multiplier_init=1.0,
    lagrangian_multiplier_lr=0.1,
)

print("Before update:")
print(f"Learning Rate: {lagrange.lambda_optimizer.param_groups[0]['lr']}")

lagrange.update_lagrange_multiplier(ep_cost)
learn_lag = lagrange.lagrangian_multiplier

# Direct assignment with the same learning rate.
nu += nu_lr * (ep_cost - 25.0)

print(f"Lagrange multiplier: {learn_lag}")
print(f"Nu: {nu}")
```

There are two ways to update the Lagrange multiplier: one treats it as a learnable parameter and updates it with the Adam optimizer, while the other assigns the new value directly. Because of Adam's adaptive step size, the two methods do not produce the same multiplier, and under the same learning rate the directly assigned multiplier changes much faster.

As I understand it, the original PPO-Lagrangian code uses the former method, while the original FOCOPS and CUP code seems to use the latter. Should the two be distinguished, or can the former be used uniformly?
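For concreteness, here is a minimal, self-contained sketch of the two update rules being compared; the Adam version only mirrors the idea behind `Lagrange`, not its exact code, and the direct-assignment version follows the `nu` update in the script above:

```python
import torch

cost_limit, ep_cost, lr = 25.0, 35.0, 0.1

# Method 1: the multiplier as a learnable parameter updated by Adam.
lagrangian_param = torch.nn.Parameter(torch.tensor(1.0))
optimizer = torch.optim.Adam([lagrangian_param], lr=lr)

optimizer.zero_grad()
# Minimizing -lambda * (ep_cost - cost_limit) pushes lambda up whenever the
# constraint is violated (ep_cost > cost_limit).
loss = -lagrangian_param * (ep_cost - cost_limit)
loss.backward()
optimizer.step()  # Adam's first step moves lambda by roughly lr: 1.0 -> ~1.1
adam_multiplier = lagrangian_param.clamp(min=0.0).item()

# Method 2: direct assignment, i.e. a plain gradient-ascent step on lambda.
nu = 1.0
nu = max(0.0, nu + lr * (ep_cost - cost_limit))  # 1.0 -> 2.0 in a single step

print(f"Adam-updated multiplier: {adam_multiplier:.4f}")
print(f"Directly assigned multiplier: {nu:.4f}")
```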

Gaiejj commented 8 months ago

Why do FOCOPS and CUP also use the Adam optimizer? Since CUP and FOCOPS, as first-order optimization algorithms, also depend heavily on their hyperparameters, we believe that using the Adam optimizer instead of the original SGD-style update yields smoother multiplier updates and therefore better performance.

As for supporting the original implementation: in the future we are considering offering it, i.e., the SGD optimizer, as an option in our code, and we will share the ablation results with the community when we update the code accordingly.
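If such an option were exposed, it could look roughly like the sketch below; the class and argument names are hypothetical, not SafePO's current API:

```python
import torch

class LagrangeSketch:
    """Hypothetical Lagrange multiplier with a selectable optimizer."""

    def __init__(self, cost_limit, init=1.0, lr=0.1, optimizer="adam"):
        self.cost_limit = cost_limit
        self.multiplier = torch.nn.Parameter(torch.tensor(init))
        opt_cls = {"adam": torch.optim.Adam, "sgd": torch.optim.SGD}[optimizer]
        self.optimizer = opt_cls([self.multiplier], lr=lr)

    def update(self, ep_cost):
        # Gradient ascent on lambda * (ep_cost - cost_limit), written as a loss.
        self.optimizer.zero_grad()
        loss = -self.multiplier * (ep_cost - self.cost_limit)
        loss.backward()
        self.optimizer.step()
        return self.multiplier.clamp(min=0.0).item()
```

With `optimizer="sgd"`, a single step reduces exactly to the direct-assignment rule `nu += lr * (ep_cost - cost_limit)` from the question above.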

lijie9527 commented 8 months ago

In that case, can I take it that FOCOPS and CUP handle the cost constraint in the same way as Lagrangian methods such as PPO-Lagrangian, and that their main difference lies in how the actor is updated?

Gaiejj commented 8 months ago

Sure. In terms of code implementation, these three algorithms are strikingly similar; their difference indeed lies solely in the actor-update step.

lijie9527 commented 8 months ago

My last question concerns TRPO-class algorithms such as TRPO, TRPO-Lagrangian, and CPO: should they update the critic networks with multiple epochs of full-batch updates or multiple epochs of mini-batch updates? Most TRPO-class implementations I have found online update the critic with multiple epochs over the full batch, while most PPO-class implementations use mini-batches. In your implementation you uniformly use multiple epochs of mini-batch updates for the critic; is that because it is more effective, and to keep the comparison with first-order PPO-class methods fair?

Also, I found that updating the critic of TRPO-like algorithms with multiple epochs over the full batch trains much faster than using mini-batches, because the number of gradient steps is much lower. Would it be reasonable to adopt full-batch critic updates in the TRPO-based algorithms?

Gaiejj commented 8 months ago

In our implementation we referred to Tianshou and Stable-Baselines and use multiple epochs of mini-batch critic updates. We experimented with full-batch critic updates in earlier tests, but their performance did not quite measure up to the mini-batch approach.
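For illustration, here is a minimal sketch contrasting the two critic-update schemes discussed above; `critic`, `optimizer`, `obs`, and `target_values` are placeholders rather than the repository's actual code:

```python
import torch

def update_critic_full_batch(critic, optimizer, obs, target_values, epochs=40):
    """One gradient step per epoch over the whole rollout buffer (fewer, larger steps)."""
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(critic(obs).squeeze(-1), target_values)
        loss.backward()
        optimizer.step()

def update_critic_mini_batch(critic, optimizer, obs, target_values,
                             epochs=40, batch_size=64):
    """Many gradient steps per epoch, one per shuffled mini-batch (the scheme described above)."""
    n = obs.shape[0]
    for _ in range(epochs):
        for idx in torch.randperm(n).split(batch_size):
            optimizer.zero_grad()
            loss = torch.nn.functional.mse_loss(
                critic(obs[idx]).squeeze(-1), target_values[idx]
            )
            loss.backward()
            optimizer.step()
```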

lijie9527 commented 8 months ago

I will further verify the effectiveness of full-batch critic updates myself. Thank you very much for your patient answers, which have resolved my long-standing questions.