jasontlam opened 6 years ago
I guess part of the reason is the insert strategy. The arg_max function chooses the next vertex to grow, and then the selected vertex is inserted into the sequence at the position that leads to the minimal increase in total tour length.
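Roughly, I imagine the insert step looks something like the sketch below (my own illustration of cheapest insertion, not the repo's actual code; all names here are made up):

```cpp
// Sketch of cheapest insertion: given the current partial tour and a newly
// selected vertex v, find the edge (tour[i], tour[i+1]) whose replacement by
// (tour[i], v, tour[i+1]) increases the tour length the least, and insert v there.
#include <vector>
#include <cmath>
#include <limits>

struct Point { double x, y; };

double dist(const Point& a, const Point& b) {
    return std::hypot(a.x - b.x, a.y - b.y);
}

void cheapest_insert(std::vector<int>& tour, int v, const std::vector<Point>& coord) {
    if (tour.size() < 2) { tour.push_back(v); return; }
    double best_inc = std::numeric_limits<double>::max();
    std::size_t best_pos = 0;
    for (std::size_t i = 0; i < tour.size(); ++i) {
        int a = tour[i];
        int b = tour[(i + 1) % tour.size()];  // the tour is treated as a cycle
        double inc = dist(coord[a], coord[v]) + dist(coord[v], coord[b])
                   - dist(coord[a], coord[b]);
        if (inc < best_inc) { best_inc = inc; best_pos = i + 1; }
    }
    tour.insert(tour.begin() + best_pos, v);
}
```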
But since we get the same results by flipping the sign of the reward, does that not mean we are not really learning the policy? When the nodes are always inserted at the place that minimizes the distance, it seems like the node we pick does not really matter.
The arriving order of the next vertex matters according to my experiments. Say you have k nodes; at each step one node arrives and you insert it. Different arriving orders will lead to different tours after the insert operation. The Q-network here just learns which vertex to pick next; the inner order after picking a vertex is maintained by the insert function.
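To make the division of labor concrete, here is a rough rollout sketch (again my own illustration, not the repo's code; `q_network_argmax` is a made-up stand-in for the trained model, and `cheapest_insert` is the insertion helper from the sketch above):

```cpp
// Sketch of the decomposition: the Q-network only decides WHICH vertex comes
// next; the insert helper decides WHERE it goes in the tour.
#include <set>
#include <vector>

struct Point { double x, y; };  // same coordinate struct as in the earlier sketch

void cheapest_insert(std::vector<int>& tour, int v, const std::vector<Point>& coord);
int q_network_argmax(const std::vector<int>& tour, const std::set<int>& remaining);

std::vector<int> greedy_rollout(int n, const std::vector<Point>& coord) {
    std::set<int> remaining;
    for (int i = 0; i < n; ++i) remaining.insert(i);

    std::vector<int> tour;
    while (!remaining.empty()) {
        int v = q_network_argmax(tour, remaining);  // learned: which vertex next
        cheapest_insert(tour, v, coord);            // helper: where to place it
        remaining.erase(v);
    }
    return tour;  // different arrival orders give different tours here
}
```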
Thanks for @stweiyy's reply. Yes, we have an external maintainer for constructing the tour. Some heuristics like random insertion and closest insertion also use this maintainer, so we try to see whether we can learn better than those heuristics.
But with this maintainer, the RL becomes a bit weird, since it puts more weight on the long-term reward -- and that's why we find flipping the reward also helps (since RL is typically biased towards short-term rewards).
But in the sense of minimizing the expected sum of rewards in the future (which is what the Q-value means), it should be negative, right?
No, you are maximizing the expected reward. So flipping the sign is a trick which doesn't quite make sense, but it just works well in this particular case...
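To make the sign convention concrete (this is how I read the paper's reward for TSP; the notation $L(S_t)$ for the tour length after step $t$ is mine): the per-step reward is the negative increase in tour length, $r_t = -\big(L(S_{t+1}) - L(S_t)\big)$, so over an episode the rewards telescope:

$$\sum_{t=0}^{T-1} r_t = -\big(L(S_T) - L(S_0)\big) = -L(S_T).$$

Maximizing the expected sum of rewards (i.e., taking arg_max over Q-values) is therefore the same as minimizing the expected final tour length; the minimization is carried by the minus sign in the reward, not by the action selection.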
By the way, is it possible to learn a mechanism which predicts the position of the newly added node?
Then it is equivalent to choosing the node and appending it to the end of the tour (i.e., without the post-processing helper). That also learns to some degree, but it is not as good as the version with the helper.
Hello,
I have a question: where is the helper function implemented in the code?
Thanks.
What do you mean by 'helper function'?
For example, for tsp2d, all the C++ implementations can be found under: https://github.com/Hanjun-Dai/graph_comb_opt/tree/master/code/s2v_tsp2d/tsp2d_lib
And the python code is just a wrapper for it.
In the paper (page 6, Table 1), there is a helper function for the tsp2d problem that performs the insertion operation, is that right? I want to know how you implemented it. Is it an explicit C++ function, or is it folded into other functions?
It is implemented in the environment. The RL agent is not aware of it.
Hello,
I was going through the code for s2v_tsp2d and was wondering if you could clarify something. When you call the predict function, you use `arg_max` to select the best node even though it is a minimization problem. I also tried setting `int sign = -1;` in tsp2d_env and still got minimum tour lengths. Could you explain where the minimization is occurring? Thank you for your time!