DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

Best practice of introducing constraints to a gym environment #1433

Closed AminDar closed 1 year ago

AminDar commented 1 year ago

❓ Question

I'm aware of MaskedPPO, but I'm not sure whether there is a cleaner workaround with the current gym environment (not Safety Gym) for implementing a custom gym environment on SB3.

I'm considering a minimization problem in which the reward can be defined as -(c_1 + c_2), where c_1 and c_2 are production costs.

The agent takes actions in a box/discrete action space and calculates c_1 and c_2. Generally, with this setup the agent would try to reach reward = 0, since that is the utopia point, but this leads to zero production (when the agent takes actions that don't produce anything, the costs c_1 and c_2 are zero). Obviously this shouldn't be the case.

My question is how we can add such constraints to gym in a way that is compatible with SB3. So far I have added the above-mentioned constraint to the reward like this: (Total_production - Target_production)**2 - (C1 + C2)

This works perfectly fine, but defining the reward function gets much more complicated as more constraints are added (such as minimum and maximum allowed production, etc.). If this were a mixed-integer optimization problem, you would have a minimization objective (the reward) subject to a couple of constraints, for example:

[image: example mixed-integer optimization formulation]
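Roughly, the kind of formulation I have in mind looks like this (a sketch using the quantities above, not the exact formulation from the image; $P_{\min}$ and $P_{\max}$ stand for the minimum and maximum allowed production):

$$
\begin{aligned}
\min_{x} \quad & C_1(x) + C_2(x) \\
\text{s.t.} \quad & \text{Total\_production}(x) = \text{Target\_production} \\
& P_{\min} \le \text{Total\_production}(x) \le P_{\max}
\end{aligned}
$$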

I wonder what the best practice would be for implementing such constraints with SB3 and gym, to tell the agent that some observations are fine (mathematically possible) but don't make sense, or are off limits.

-- An example of using RL to solve such problems can be found in this paper (download link).


Pythoniasm commented 1 year ago

I don't exactly understand your problem with the "Lagrangian" approach of adding your constraints as weighted inequality-term reformulations to the objective function you optimize. This is common practice, even if there are more sophisticated approaches for nonlinear reward functions that cannot easily be solved by pure gradient-based optimization. RL can do better here, but so can other algorithms such as evolutionary or particle-swarm methods.
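To be explicit, by "adding constraints as weighted terms" I mean the standard penalty reformulation (my notation; the weights $\lambda_i \ge 0$ are chosen by hand):

$$
\min_x \; f(x) + \sum_i \lambda_i \, \max\bigl(0, g_i(x)\bigr),
$$

where each original constraint has been rewritten as $g_i(x) \le 0$.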

However, I'll assume you meant something like "what alternatives are there" and "how do I implement them"?

You can give sparse rewards:

if x + 1 <= 0:   # reward staying inside an inequality constraint
    reward += 1000

if y - 1 > 0:  # penalize breaking an inequality constraint
    reward -= 1000

if abs(z) - epsilon / 2 <= 0:   # do similar for equality constraints within a "threshold" corridor of width epsilon
    ...

Alternatively, you can treat constraints as barrier-like functions that push or pull with respect to the constraint definition. That means you get a continuous reward based on the distance by which a constraint is violated or obeyed.

d = x - c  # signed distance to the inequality constraint x <= c
if d > 0:  # penalize being beyond the constraint
    reward -= d

If you meant, instead, how to efficiently implement a long list of constraints, I'd suggest first deciding how to reward/penalize, then creating inequality-constraint (lambda) functions that you can parametrize according to your specific needs:


sparse_inequality_constraint_reward = lambda x, c, r: r if x - c <= 0 else 0

reward = ...
reward += sparse_inequality_constraint_reward(x, 10, 1000)
...  # repeat according to your needs and extend with sparse/dense inequality/equality constraint functions, as sketched below.
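For instance, an extension of that idea could look like this (a rough sketch; the helper names, weights, and bounds are placeholders, and the production/cost variables are the ones from your question):

def constrained_reward(c_1, c_2, total_production,
                       min_production, max_production, target_production,
                       eps=1.0, k_sparse=1000.0, k_dense=10.0, k_eq=5.0):
    # Helpers assume constraints are written as f(x) <= 0 or g(x) = 0.
    sparse_ineq = lambda f_x, r: r if f_x <= 0 else 0.0          # bonus while f(x) <= 0 holds
    dense_ineq = lambda f_x, w: -w * max(0.0, f_x)               # penalty grows with violation of f(x) <= 0
    dense_eq = lambda g_x, w: -w * max(0.0, abs(g_x) - eps / 2)  # penalty outside the eps corridor around g(x) = 0

    reward = -(c_1 + c_2)                                        # base objective from the question
    reward += sparse_ineq(total_production - max_production, k_sparse)  # total_production <= max_production
    reward += dense_ineq(min_production - total_production, k_dense)    # total_production >= min_production
    reward += dense_eq(total_production - target_production, k_eq)      # total_production ~= target_production
    return reward

Each constraint then only adds one line, which keeps things manageable even with 10-15 of them.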

Remember that, as in traditional optimization, best practice is to reformulate all your constraints into inequality and equality constraints of the form f(x, ...) <= 0 and g(x, ...) = 0. That means switching sides of "greater than" formulations and splitting double-sided constraints into two single-sided "less than or equal" constraints.
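For example (standard rewriting, nothing problem-specific):

$$
x \ge c \;\Longrightarrow\; c - x \le 0,
\qquad
a \le x \le b \;\Longrightarrow\; a - x \le 0 \ \text{and}\ x - b \le 0.
$$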

Beyond this very basic stuff, you should definitely check out robotics papers; they introduce crazy nonlinear constraints with dense/sparse rewards to make MuJoCo humanoids do backflips and so on.

AminDar commented 1 year ago

Thank you for the reply. I believe that, at least in my case, the first approach with sparse rewards doesn't do the trick. From my perspective, the problem with extremely high/low rewards (penalties) is that the agent lands in a local optimum just to avoid such a punishment.

The second suggestion is what I'm using and mentioned above as an inequality constraint. But I believe my life and the agent's life would be hard with 10-15 constraints if we implement it that way.

Recently I defined a couple of functions like:

def check_constrains_1(self, x, y, z):
    if 0.8 * x < y < 1.3 * z:
        penalty = 0
    else:
        penalty = 1
    return penalty

and then tried to use the number of violated constraints as a penalty, redefining the reward:

reward = -(C1 + C2 + k * n_constraints), with k a positive number. Would you say this way of defining constraints is practical?
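In code that would look roughly like this (a sketch; `count_violations`, the constraint bounds, and the value of `k` are placeholders I'd adapt to my environment):

def count_violations(x, y, z, total_production, min_production, max_production):
    # Each term contributes 1 if the corresponding constraint is violated, 0 otherwise.
    violations = 0
    violations += 0 if 0.8 * x < y < 1.3 * z else 1
    violations += 0 if min_production <= total_production <= max_production else 1
    return violations

k = 10.0  # positive weight on the violation count
n_constraints = count_violations(x, y, z, total_production, min_production, max_production)
reward = -(C1 + C2 + k * n_constraints)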