Hanjun-Dai / graph_comb_opt

Implementation of "Learning Combinatorial Optimization Algorithms over Graphs"
https://arxiv.org/abs/1704.01665
MIT License

Understanding TSP2D #8

Open jasontlam opened 6 years ago

jasontlam commented 6 years ago

Hello,

I was going through the code for s2v_tsp2d and was wondering if you could clarify something. When you call the predict function, you use arg_max to select the best node even though this is a minimization problem. I also tried setting int sign = -1; in tsp2d_env and still got minimum tour lengths. Could you explain where the minimization is occurring?

Thank you for your time!

yywe commented 6 years ago

I guess part of the reason is the insert strategy. The arg_max function chooses the next vertex to add, and then the selected vertex is inserted into the sequence at whichever position leads to the minimal increase in total tour length.
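Roughly, that insert strategy looks like the sketch below. This is illustrative only; the names such as insert_cheapest are made up and this is not the actual code in tsp2d_lib. Given the current partial tour and a newly selected node, try every edge of the cycle and splice the node in where the tour length grows the least.

```cpp
#include <cmath>
#include <utility>
#include <vector>

// Hypothetical cheapest-insertion helper (not the repo's actual code):
// insert new_node into the cyclic tour at the position that increases
// the total tour length the least.
double dist(const std::pair<double, double>& a,
            const std::pair<double, double>& b) {
    return std::hypot(a.first - b.first, a.second - b.second);
}

void insert_cheapest(std::vector<int>& tour,
                     const std::vector<std::pair<double, double>>& coords,
                     int new_node) {
    if (tour.size() < 2) {              // trivial tours: just append
        tour.push_back(new_node);
        return;
    }
    size_t best_pos = 0;
    double best_delta = 1e100;
    for (size_t i = 0; i < tour.size(); ++i) {
        int u = tour[i];
        int v = tour[(i + 1) % tour.size()];    // edge (u, v) of the cycle
        // cost increase if new_node is spliced between u and v
        double delta = dist(coords[u], coords[new_node])
                     + dist(coords[new_node], coords[v])
                     - dist(coords[u], coords[v]);
        if (delta < best_delta) {
            best_delta = delta;
            best_pos = i + 1;
        }
    }
    tour.insert(tour.begin() + best_pos, new_node);
}
```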

jasontlam commented 6 years ago

But since we get the same results by flipping the sign of the reward, doesn't that mean we are not really learning the policy? If each node is inserted at the position that minimizes the added distance, it seems like the node we pick does not really matter.

yywe commented 6 years ago

According to my experiments, the arrival order of the vertices matters. Say you have k nodes; at each step one node arrives and you insert it. Different arrival orders lead to different tours after the insert operations (see the small example below). The Q network here only learns which vertex to pick next; the ordering inside the tour after a vertex is picked is maintained by the insert function.
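To make that concrete, here is a small driver continuing the hypothetical sketch from my previous comment (it reuses the made-up insert_cheapest and dist, so it is illustrative only): the same five points fed in two different arrival orders end up with different tours, even though the insertion rule is identical.

```cpp
#include <cstdio>

// Total length of a cyclic tour (uses dist() from the sketch above).
double tour_length(const std::vector<int>& tour,
                   const std::vector<std::pair<double, double>>& coords) {
    double total = 0.0;
    for (size_t i = 0; i < tour.size(); ++i)
        total += dist(coords[tour[i]], coords[tour[(i + 1) % tour.size()]]);
    return total;
}

int main() {
    // Four corners of a square plus one interior point.
    std::vector<std::pair<double, double>> coords = {
        {0, 0}, {10, 0}, {10, 10}, {0, 10}, {5, 4}};
    std::vector<int> order_a = {0, 1, 2, 3, 4};  // interior point arrives last
    std::vector<int> order_b = {0, 1, 4, 2, 3};  // interior point arrives third

    std::vector<int> tour_a, tour_b;
    for (int v : order_a) insert_cheapest(tour_a, coords, v);
    for (int v : order_b) insert_cheapest(tour_b, coords, v);

    // For these coordinates the two orders give different tours
    // (about 42.81 vs 44.21), so the pick-next policy still matters.
    std::printf("order A: %.2f   order B: %.2f\n",
                tour_length(tour_a, coords), tour_length(tour_b, coords));
    return 0;
}
```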

Hanjun-Dai commented 6 years ago

Thanks for @stweiyy's reply. Yes, we have an external maintainer for constructing the tour. Heuristics like random insertion and closest insertion also use this maintainer, so we are trying to see whether we can learn something better than those heuristics.

But with this maintainer, the RL becomes a bit weird, since it depends more on the long-term reward -- and that's why we find flipping the reward also helps (since RL is typically biased towards short-term rewards).
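Roughly, the per-step reward is the change in tour length caused by inserting the chosen node (negated, since we maximize), and the flipping trick just negates it again. Below is a sketch only, reusing the hypothetical helpers from the earlier comments; the actual reward computation in tsp2d_env.cpp is organized differently.

```cpp
// Hypothetical per-step reward: negative tour-length increase caused by
// inserting the chosen node. sign = +1 is the normal reward; sign = -1
// is the flipping trick discussed in this thread.
double step_reward(std::vector<int>& tour,
                   const std::vector<std::pair<double, double>>& coords,
                   int new_node, int sign) {
    double before = tour_length(tour, coords);   // current partial tour length
    insert_cheapest(tour, coords, new_node);     // helper maintains the tour
    double after = tour_length(tour, coords);
    return sign * (before - after);  // smaller increase => larger reward (sign = +1)
}
```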

fantasy-fish commented 6 years ago

But in the sense of minimizing the expected sum of future rewards (which is what the Q-value means), it should be negative, right?

Hanjun-Dai commented 6 years ago

No, you are maximizing the expected reward. So flipping the sign is a trick which doesn't quite make sense, but it just works well in this particular case...

fantasy-fish commented 6 years ago

By the way, is it possible to learn a mechanism which predicts the position of the newly added node?

Hanjun-Dai commented 6 years ago

Then it is equivalent to choosing the node and appending it to the end of the tour (i.e., without the post-processing helper). That variant also learns to some degree, but it is not as good as the one with the helper.

Amirinezhad commented 5 years ago

Hello,

I have a question: where is the helper function implemented in the code?

Thanks.

Hanjun-Dai commented 5 years ago

What do you mean by 'helper function'?

For example, for tsp2d, all the C++ implementations can be found under: https://github.com/Hanjun-Dai/graph_comb_opt/tree/master/code/s2v_tsp2d/tsp2d_lib

The Python code is just a wrapper around it.

Amirinezhad commented 5 years ago

In the paper (page 6, Table 1), there is a helper function for the tsp2d problem that performs the insertion operation, is that right? I want to know how you implemented it. Is it an explicit C++ function, or is it folded into other functions?

Hanjun-Dai commented 5 years ago

It is implemented in the environment. The RL agent is not aware of it.

https://github.com/Hanjun-Dai/graph_comb_opt/blob/master/code/s2v_tsp2d/tsp2d_lib/src/lib/tsp2d_env.cpp#L65