Closed: liruiluo closed this issue 1 year ago
If that's the case, it looks like the CNN can't update its parameters, because it can't be optimized by the reinforcement learning algorithm

Could you elaborate? I'm not sure I get your point.
(And please use a meaningful title for the issue.)
Now that we have a meaningful title, let's take the REINFORCE algorithm as an example. What I mean is that if the CNN and the MLP are two independent networks rather than one connected network, the MLP can still be updated: it outputs the actions, so a reinforcement learning algorithm such as REINFORCE can compute the loss (-log_prob * Gt) and, from it, the gradient.
However, the CNN would only output a visual representation, and that representation has no ground-truth target (no loss of the form loss = (predict, real)), so no loss can be computed for it. Without a loss there is no gradient, and the CNN's parameters could not be updated.
On the contrary, when the CNN and the MLP are connected into a single network, they can be updated together (the parameters of both the CNN and the MLP are updated directly through the REINFORCE gradient), and this problem does not exist.
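For concreteness, here is a minimal PyTorch sketch of that last point (not SB3 code; the layer sizes, the 4-action space, and the dummy return `Gt` are illustrative assumptions): because the CNN and the MLP are chained in one forward pass, the REINFORCE loss back-propagates through both modules.

```python
import torch
import torch.nn as nn

# Visual feature extractor: has no target of its own, only produces features.
cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=8, stride=4),
    nn.ReLU(),
    nn.Flatten(),  # 16 * 20 * 20 = 6400 features for an 84x84 input
)
# Policy head: maps features to action logits.
mlp = nn.Sequential(
    nn.Linear(6400, 64),
    nn.ReLU(),
    nn.Linear(64, 4),  # 4 discrete actions (arbitrary choice for this sketch)
)
optimizer = torch.optim.Adam(list(cnn.parameters()) + list(mlp.parameters()), lr=1e-3)

obs = torch.randn(1, 3, 84, 84)             # dummy image observation
logits = mlp(cnn(obs))                      # one computation graph: CNN -> MLP
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()
Gt = 1.0                                    # dummy return, for illustration only

loss = -(dist.log_prob(action) * Gt).sum()  # REINFORCE loss: -log_prob * Gt
optimizer.zero_grad()
loss.backward()                             # gradients flow into both mlp and cnn
optimizer.step()                            # both networks are updated together
```

If `cnn.parameters()` were left out of the optimizer (or the CNN were trained completely separately), its weights would indeed never change, which is the situation described above.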
Yes, and that's the case in SB3. CNN vs MLP policy in SB3 mostly refers to the "features extractor", which can be shared or not between the actor and the critic. As shown in the doc, each network is decomposed into two parts: a features extractor + an MLP (for the CNN policy, just a linear layer by default, which can be adjusted); both parts are learned.
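To illustrate, here is a sketch of a custom features extractor along the lines of the example in the custom policy documentation (the `CustomCNN` class name, layer sizes, `features_dim=128`, `net_arch=[64]`, and the Atari env id are illustrative choices; the `spaces` import depends on whether your SB3 version uses gym or gymnasium). Both the features extractor and the `net_arch` MLP that sits on top of it belong to the same policy, so both receive gradients from the algorithm's loss.

```python
import torch as th
import torch.nn as nn
from gymnasium import spaces  # use `from gym import spaces` on older SB3 versions

from stable_baselines3 import PPO
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor


class CustomCNN(BaseFeaturesExtractor):
    """CNN features extractor: image observation -> feature vector of size features_dim."""

    def __init__(self, observation_space: spaces.Box, features_dim: int = 128):
        super().__init__(observation_space, features_dim)
        n_input_channels = observation_space.shape[0]
        self.cnn = nn.Sequential(
            nn.Conv2d(n_input_channels, 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Flatten(),
        )
        # Infer the flattened size with one forward pass on a sample observation.
        with th.no_grad():
            n_flatten = self.cnn(
                th.as_tensor(observation_space.sample()[None]).float()
            ).shape[1]
        self.linear = nn.Sequential(nn.Linear(n_flatten, features_dim), nn.ReLU())

    def forward(self, observations: th.Tensor) -> th.Tensor:
        return self.linear(self.cnn(observations))


policy_kwargs = dict(
    features_extractor_class=CustomCNN,
    features_extractor_kwargs=dict(features_dim=128),
    net_arch=[64],  # the MLP part on top of the extracted features
)
# The features extractor and the MLP head form one computation graph,
# so the algorithm's loss updates the parameters of both.
model = PPO("CnnPolicy", "BreakoutNoFrameskip-v4", policy_kwargs=policy_kwargs, verbose=1)
model.learn(1_000)
```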
❓ Question
This is a great job! However, I have a small doubt about the custom policy. After reading the documentation of the "custom policy" module, it seems that the CNN and the MLP are not directly connected together as one network, but rather two networks spliced together. If that's the case, it looks like the CNN can't update its parameters, because it can't be optimized by the reinforcement learning algorithm. Or is it not what I thought?
Checklist