Hi!
For my Master's thesis, I'm looking to use the MP-DQN algorithm to solve a problem, and I'd like to add a dueling network to speed up convergence. With the regular P-DQN algorithm this was easy to do, but it's less intuitive with MP-DQN. For the advantage stream, taking the diagonal of the multi-pass output works as expected, since each diagonal entry corresponds to one discrete action. For the value stream it doesn't make sense, though: the state value shouldn't depend on the action parameters at all; it's just a single scalar describing the state.
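To make concrete what I mean by taking the diagonal, here's a minimal sketch of the multi-pass forward pass as I understand it (PyTorch; the class name, layer sizes, and the assumption that every action has the same number of parameters are just placeholders, not your actual implementation):

```python
import torch
import torch.nn as nn

class MultiPassQ(nn.Module):
    # Sketch only: I'm assuming every discrete action has the same number of
    # continuous parameters (params_per_action), which isn't required in general.
    def __init__(self, state_dim, num_actions, params_per_action, hidden=128):
        super().__init__()
        self.K = num_actions
        self.p = params_per_action
        self.q_net = nn.Sequential(
            nn.Linear(state_dim + num_actions * params_per_action, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state, all_params):
        # state: (B, state_dim); all_params: (B, K*p), parameters of every action
        B = state.size(0)
        # one mask per pass: pass i keeps only action i's parameter slice
        masks = torch.eye(self.K, device=state.device).repeat_interleave(self.p, dim=1)
        masked = all_params.unsqueeze(1) * masks.unsqueeze(0)              # (B, K, K*p)
        x = torch.cat((state.unsqueeze(1).expand(-1, self.K, -1), masked), dim=2)
        q_all = self.q_net(x.reshape(B * self.K, -1)).reshape(B, self.K, self.K)
        # diagonal entry i comes from pass i and is Q(s, a_i, x_i)
        return q_all.diagonal(dim1=1, dim2=2)                              # (B, K)
```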
There are a few things I could do:
1. Do as in P-DQN and let the value stream depend on all action parameters.
2. Make the value stream depend on no action parameters at all. This requires splitting the dueling network further, since the first fully connected layer would then have different input dimensions for the advantage and value streams (a rough sketch of this option is below).
3. Still perform the multi-pass for the value stream and take the mean of its outputs, or something similar.
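Here is a rough sketch of option 2, just to illustrate what I mean. Again, the names and layer sizes are placeholders and I'm assuming equal-sized parameter vectors per action:

```python
import torch
import torch.nn as nn

class DuelingMultiPassQ(nn.Module):
    # Sketch of option 2: the value stream sees only the state (no multi-pass),
    # while the advantage stream keeps the multi-pass + diagonal.
    def __init__(self, state_dim, num_actions, params_per_action, hidden=128):
        super().__init__()
        self.K = num_actions
        self.p = params_per_action
        self.adv_net = nn.Sequential(
            nn.Linear(state_dim + num_actions * params_per_action, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )
        self.val_net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, all_params):
        B = state.size(0)
        # advantage stream: same multi-pass as before; diagonal gives A(s, a_i, x_i)
        masks = torch.eye(self.K, device=state.device).repeat_interleave(self.p, dim=1)
        masked = all_params.unsqueeze(1) * masks.unsqueeze(0)
        x = torch.cat((state.unsqueeze(1).expand(-1, self.K, -1), masked), dim=2)
        adv = self.adv_net(x.reshape(B * self.K, -1)).reshape(B, self.K, self.K)
        adv = adv.diagonal(dim1=1, dim2=2)                                 # (B, K)
        # value stream: a single pass on the state alone, one scalar V(s)
        val = self.val_net(state)                                          # (B, 1)
        # standard dueling aggregation
        return val + adv - adv.mean(dim=1, keepdim=True)
```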
I'm not sure what the smartest choice is here. Intuitively, it feels like the value function should have nothing to do with the action parameters. Given your experience with MP-DQN, do you have any thoughts on what a sensible solution might be?
Thank you! (And thank you very much for open-sourcing your code!)