Closed guanjiayi closed 1 year ago
We express our delight in your proactive implementation of the novel algorithm and extend our gratitude for your contributions to the advancement of safe reinforcement learning. Your implementation is commendably aligned with our stipulated criteria for the off-policy version of the CRPO algorithm. However, there are areas that warrant refinement as follows:
Given that the navigation tasks in the Safety-Gymnasium framework require adherence to the prescribed constraint-violation limit over an entire episode, it is recommended that the condition

```python
if (loss_c.max().item() > self._cfgs.algo_cfgs.cost_limit + self._cfgs.algo_cfgs.tolerance
        and loss_c.mean().item() > loss_r.mean().item()):
```

should be:

```python
if ep_cost > self._cfgs.algo_cfgs.cost_limit + self._cfgs.algo_cfgs.tolerance:
```

where `ep_cost` is the episodic cost value obtained from the logger.
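To illustrate the intent of the suggested change, here is a minimal, self-contained sketch of the CRPO-style update switch. The function name `choose_crpo_objective` and the bare `ep_cost`, `cost_limit`, and `tolerance` parameters are illustrative stand-ins, not the actual OmniSafe API:

```python
def choose_crpo_objective(ep_cost: float, cost_limit: float, tolerance: float) -> str:
    """Return which objective a CRPO update should optimize.

    If the episodic cost (as logged over the whole episode) exceeds the
    cost limit plus the tolerance, CRPO takes a constraint-minimization
    step; otherwise it takes a reward-maximization step.
    """
    if ep_cost > cost_limit + tolerance:
        return "minimize_cost"
    return "maximize_reward"


# Illustrative usage with hypothetical values:
print(choose_crpo_objective(30.0, 25.0, 2.0))  # constraint violated
print(choose_crpo_objective(20.0, 25.0, 2.0))  # within the limit
```

The key difference from the original condition is that the decision is driven by the episode-level cost rather than per-batch loss statistics, which matches how the constraint is defined for the navigation tasks.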
Furthermore, it is advised that greater emphasis be placed on the provision of more comprehensive documentation, and the inclusion of performance curves pertinent to the relevant algorithm.
Prior to submission, we encourage running `make pre-commit` and `make test` from the root directory to ensure the codebase adheres to the established standards of OmniSafe.
These suggestions are expected to significantly enhance the quality of your CRPO implementation. Should any queries or uncertainties arise, please feel free to engage in a discourse with us.
Thank you for your reply, and we also extend our sincere gratitude for your valuable suggestions.