ammarhydr / SAC-Lagrangian

PyTorch implementation of constrained reinforcement learning for the Soft Actor-Critic algorithm

Is it necessary to apply extra critic networks for evaluating the 'safety Q value'? #1

Open ZhihanLee opened 2 years ago

ZhihanLee commented 2 years ago

Hello, Dr. Haydari. I am an undergraduate student working on safe RL, and I have also tried to implement CSAC/SAC-Lagrangian in PyTorch. I was wondering: ① Is it necessary to apply extra critic networks for the 'safety Q value', and does that perform better than constructing the actor loss from the costs in the off-policy data? ② Have you plotted the lambda training curve? I observed a monotonic curve that only rises (positive loss) or descends (negative loss), and I have noticed that some papers adjust the gradient ascent with max(0, lambda). I would appreciate it if you could help me.

ammarhydr commented 2 years ago

Thank you for the questions. I am also a safe-RL learner. Here are my comments on your questions. 1-) There are two constrained RL methods in general. The first is peak-constraint RL, which deals with constraints on the reward function itself; the other is average-constraint RL, which tries to minimize the cost with an extra value function while trying to maximize the reward. So for the average-constraint formulation, yes, it is required. I did not understand what you mean by "actor loss by the cost from off-policy data".

2-) I have not inspected the change of lambda, to be honest, but with a little modification to my code you can also inspect the lambda value. The reason for doing max(0, lambda) in the Lagrangian optimization is to keep lambda on a positive scale, but again I have to work on it to give you a proper answer. These days I am busy with other stuff.
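For anyone reading along, here is a minimal sketch of the dual-variable update being discussed, assuming a scalar multiplier trained by gradient ascent on the constraint violation and projected with max(0, lambda). The threshold `cost_limit`, the learning rate, and the name `update_lambda` are illustrative and not taken from this repository.

```python
import torch

# Minimal sketch of the Lagrange-multiplier (dual) update for SAC-Lagrangian.
# All names here (cost_limit, lam, update_lambda) are illustrative, not from this repo.
cost_limit = 25.0                                # constraint threshold d in E[C] <= d
lam = torch.tensor(0.0, requires_grad=True)      # Lagrange multiplier lambda
lam_optimizer = torch.optim.Adam([lam], lr=3e-4)

def update_lambda(mean_episode_cost: float) -> float:
    # Dual gradient ascent: lambda moves up when the measured cost exceeds the
    # limit and down otherwise. Minimizing the negative product with Adam is
    # equivalent to ascending the dual objective.
    lam_loss = -lam * (mean_episode_cost - cost_limit)
    lam_optimizer.zero_grad()
    lam_loss.backward()
    lam_optimizer.step()
    # The max(0, lambda) projection: a negative multiplier would reward
    # constraint violation, so clamp it back to zero after each step.
    with torch.no_grad():
        lam.clamp_(min=0.0)
    return lam.item()
```

Some implementations instead keep an unconstrained parameter and use softplus or exp of it as lambda, which keeps the multiplier non-negative without an explicit projection.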

ZhihanLee commented 2 years ago

Thank you so much for your reply. Maybe "constructing the critic loss by the cost from off-policy data" is the proper phrase. 'The cost' is the c_i at each step (i is the constraint index). Let me reorganize my words.

The reason I say that is that we currently adopt an extra critic network to get the safety value, so the actor loss is alpha * log(pi) - Q_critic + Q_safety, and the critic loss has two parts (one for Q_critic and one for Q_safety, each being the distance between the network's Q prediction and the target Q value computed from sampled data).

However, I think Q_safety could perhaps be replaced by the cost we have already collected, which means there would be no Q_safety term in the actor loss. Instead, safety is added to the critic loss: the critic loss becomes the distance between the network prediction and (the target Q value minus lambda * cost), where the latter depends only on the sampled data. This is just like SAC with automatic temperature adjustment, which adjusts alpha without an extra network.

I am new to safe RL and hope to receive your suggestions.
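To make the two formulations being contrasted here concrete, below is a hedged sketch, not code from this repository. In Lagrangian-style SAC the safety-critic term in the actor loss is usually weighted by the multiplier lambda, so that weighting is included; `policy.sample`, `q_reward`, `q_cost`, and all argument names are assumptions for illustration.

```python
import torch

def actor_loss_with_safety_critic(policy, q_reward, q_cost, alpha, lam, states):
    """Variant 1: a separate safety critic Q_c is trained, and the actor loss is
    E[ alpha * log pi(a|s) - Q_r(s, a) + lambda * Q_c(s, a) ].
    policy.sample is assumed to return reparameterized actions and log-probs."""
    actions, log_probs = policy.sample(states)
    q_r = q_reward(states, actions)   # reward critic
    q_c = q_cost(states, actions)     # safety (cost) critic
    return (alpha * log_probs - q_r + lam * q_c).mean()

def penalized_td_target(rewards, costs, next_value, lam, gamma=0.99, dones=None):
    """Variant 2: no safety critic. The sampled per-step cost c is folded into the
    single critic's TD target as r - lambda * c, so only the reward critic remains."""
    not_done = 1.0 - dones if dones is not None else 1.0
    return (rewards - lam * costs) + gamma * not_done * next_value
```

One trade-off to keep in mind: with the folded-in target the penalty is baked into the learned Q-values at whatever value lambda had during training, while a separate cost critic gives an explicit estimate of the expected cost against which the multiplier (and the constraint) can be checked.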

ammarhydr commented 1 year ago

Hello ZhihanLee,


I hope this helps you. Ammar
