DLR-RM / rl-baselines3-zoo

A training framework for Stable Baselines3 reinforcement learning agents, with hyperparameter optimization and pre-trained agents included.
https://rl-baselines3-zoo.readthedocs.io
MIT License

[Question] About the parameter update of CNN #332

Closed liruiluo closed 1 year ago

liruiluo commented 1 year ago

❓ Question

This is great work! However, I have a small doubt about the custom policy. After reading the "custom policy" documentation, it seems that the CNN and the MLP are not connected end to end as a single network, but are two separate networks joined together. If that's the case, it looks like the CNN cannot update its parameters, because it cannot be optimized by the reinforcement learning algorithm. Or am I misunderstanding?


araffin commented 1 year ago

> If that's the case, it looks like the CNN cannot update its parameters, because it cannot be optimized by the reinforcement learning algorithm.

Could you elaborate? I'm not sure I get your point.

(and please use a meaningful title for the issue)

liruiluo commented 1 year ago

> > If that's the case, it looks like the CNN cannot update its parameters, because it cannot be optimized by the reinforcement learning algorithm.
>
> Could you elaborate? I'm not sure I get your point.
>
> (and please use a meaningful title for the issue)

Now that we have a meaningful title, let's take the REINFORCE algorithm as an example. What I mean is that if the CNN and the MLP are two independent networks rather than one connected network, only the MLP can be updated: it outputs the actions, so a reinforcement learning algorithm such as REINFORCE can easily compute the loss (`-log_prob * Gt`) and, from it, the gradient.

The CNN, however, only outputs a visual representation, and there is no target value for that representation (no loss of the form `loss = f(predicted, target)`). So no loss, and therefore no gradient, can be computed for the CNN, and its parameters cannot be updated.

On the contrary, when the CNN and the MLP are connected into one network, both can be updated together (the parameters of the CNN and the MLP are updated jointly through the REINFORCE gradient), and this problem does not exist.
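The connected case can be sketched in PyTorch (a minimal, illustrative setup: the network shapes, the fake observation, and the return `Gt` are all made up for the example):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A small CNN feature extractor followed by an MLP policy head,
# composed into one computation graph so the REINFORCE loss
# backpropagates through both.
cnn = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=3, stride=2),  # (1, 8, 8) -> (4, 3, 3)
    nn.ReLU(),
    nn.Flatten(),
)
mlp = nn.Sequential(nn.Linear(4 * 3 * 3, 16), nn.ReLU(), nn.Linear(16, 2))

obs = torch.randn(1, 1, 8, 8)          # fake image observation
logits = mlp(cnn(obs))                 # CNN output feeds the MLP directly
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()

Gt = 1.0                               # fake return, for illustration only
loss = -(dist.log_prob(action) * Gt).mean()  # REINFORCE loss
loss.backward()

# The CNN receives gradients through the chain rule, even though the
# loss is defined only on the distribution produced by the MLP.
print(cnn[0].weight.grad is not None)  # True
```

The key point is that as long as the CNN's output tensor is fed into the MLP without breaking the computation graph (no `.detach()`), the chain rule delivers gradients to the CNN's parameters automatically.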

araffin commented 1 year ago

> On the contrary, when the CNN and the MLP are connected into one network, both can be updated together (the parameters of the CNN and the MLP are updated jointly through the REINFORCE gradient), and this problem does not exist.

Yes, and that's the case in SB3. "CNN policy" vs. "MLP policy" in SB3 mostly refers to the features extractor, which can be shared or not between the actor and the critic. As shown in the doc, each network is decomposed into two parts: a features extractor followed by an MLP (for the CNN policy, just a linear layer by default, which can be adjusted), and both parts are learned.
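This decomposition can be mirrored in a rough PyTorch sketch (class names here are illustrative, not SB3's actual classes; in SB3 the corresponding pieces are e.g. `NatureCNN` and the policy's `net_arch`):

```python
import torch
import torch.nn as nn

class FeaturesExtractor(nn.Module):
    """CNN part: maps image observations to a flat feature vector."""
    def __init__(self, features_dim=32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, stride=2),  # (1, 8, 8) -> (8, 3, 3)
            nn.ReLU(),
            nn.Flatten(),
        )
        self.linear = nn.Linear(8 * 3 * 3, features_dim)

    def forward(self, obs):
        return self.linear(self.cnn(obs))

class Policy(nn.Module):
    """Features extractor + action head, treated as one set of parameters."""
    def __init__(self, n_actions=2):
        super().__init__()
        self.features_extractor = FeaturesExtractor()
        self.action_net = nn.Linear(32, n_actions)

    def forward(self, obs):
        return self.action_net(self.features_extractor(obs))

policy = Policy()
# A single optimizer over policy.parameters() covers BOTH parts, so the
# features extractor is trained along with the action head.
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
n_total = sum(p.numel() for p in policy.parameters())
n_cnn = sum(p.numel() for p in policy.features_extractor.parameters())
print(n_cnn, n_total)  # the CNN's parameters are a subset of the optimized set
```

Because the optimizer is built over the whole policy, the features extractor's parameters are in the same parameter group as the MLP head's, which is why both are learned end to end in SB3.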