facebookresearch / neural-rewriter

Learning to Perform Local Rewriting for Combinatorial Optimization

Some questions about this repo #4

Open ldy8665 opened 4 years ago

ldy8665 commented 4 years ago

I spent a few days reading through the code and attempting a reproduction, and I have some questions:

1. First, a complaint about the code: it reads as if written by someone unfamiliar with PyTorch. To get batch-parallel processing it uses many for loops, and many parts are written quite naively...
2. Is this code really your final version? Shouldn't many of the paths and files be created automatically? You only find out after running it that you have to create some files and directories by hand. Seriously?
3. About training time: the code uses many nested while and for loops. It looks like an attempt at batch parallelism, but in essence it is not, and it only adds complexity. So I am curious how long you trained to get the results in the paper. I tested the VRP task on a server with five 2080 Ti GPUs; even with a batch size of 64, training is really slow, and there is large fluctuation early on. This does not match the typical early-stage behavior of deep-learning methods on this kind of problem (the repos below contrast sharply with yours). With your default settings of 100k data points and 200 epochs, would training really take several months?
4. Take a look at the code and experimental results of last year's NIPS paper "Reinforcement Learning for Solving the Vehicle Routing Problem" and the ICLR 2019 paper "ATTENTION, LEARN TO SOLVE ROUTING PROBLEMS!". Their reproduced results are both very good, although the methods differ.
5. So finally I still want to ask: is this really the code that produced your paper's results?

LucasBoTang commented 4 years ago

Hello @ldy8665, sorry for replying in English; I am using a library computer, so I can only type in English. Did you manage to reproduce any result for VRP or JSP? Training for VRP is also very slow on my workstation. I am now trying to train the JSP task, but it does not converge within one epoch with a batch size of 64 and a learning rate of 5e-4. How many iterations should the rewriter run to improve the result significantly?

LucasBoTang commented 4 years ago

@ldy8665 By the way, I modified the code to create the directory when it does not exist, so you no longer need to create folders by hand before running the code :)
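(For reference, a minimal sketch of that kind of change; the path below is illustrative, not the repo's actual layout:)

```python
import os

def ensure_dir(path):
    # Create the directory (and any parents) if it does not exist yet.
    os.makedirs(path, exist_ok=True)

# e.g. before the trainer writes checkpoints or logs:
ensure_dir("checkpoints/vrp")  # illustrative path
```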

ldy8665 commented 4 years ago

Hi Bo, I only checked the VRP task of your repo, because I am doing research on the VRP problem with RL/DRL. Since the code was too slow when I ran it to train the VRP task, I stopped it early after several epochs. Now I am trying to rewrite the code based on your main idea but with an improved framework. You can try `DataLoader`, `torch.gather`, or similar functions in PyTorch; I think they can replace the nested for loops, as in the sketch below.
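(A minimal sketch of the kind of replacement being suggested; the tensor names are illustrative, not from this repo:)

```python
import torch

batch_size, seq_len, hidden = 32, 50, 128
states = torch.randn(batch_size, seq_len, hidden)
indices = torch.randint(seq_len, (batch_size,))  # one node index per sample

# Loop version: select one state per sample, one at a time.
picked_loop = torch.stack([states[b, indices[b]] for b in range(batch_size)])

# Batched version with torch.gather, no Python loop.
idx = indices.view(batch_size, 1, 1).expand(-1, 1, hidden)
picked = torch.gather(states, 1, idx).squeeze(1)

assert torch.allclose(picked, picked_loop)
```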

LucasBoTang commented 4 years ago

Hi, I am not the author of this paper; in fact, I am an open-source contributor who modified some parts of the code to make it better. Like you, I have also tried to dive into this code and get some results, but ran into trouble. Maybe we can talk about this by email (lucastang1994@gmail.com). By the way, I also noticed that some operations in the code are not very efficient. If you improve the framework, it would also be great to open a pull request to contribute to this repo.

ldy8665 commented 4 years ago

OK Bo, when I finish my code and have some results, I'll share them with you.

yuandong-tian commented 4 years ago

Hi all, thanks for the interest and sorry for all the inconvenience.

For vehicle routing, we use a different version of the sampling code in PyTorch, which might cause some slowness. Xinyun mentioned that the training code takes ~10h to achieve good performance with 8 GeForce GTX 1080 GPUs, after training for slightly more than one epoch (note that you don't need to train many epochs to get good performance).

zlw21gxy commented 4 years ago

Any progress here? I hope we can add some visualization code and make the training more efficient.

LucasBoTang commented 4 years ago

Hi all, I succeeded in reproducing the result. In my experience, everything goes well. The only problem is that the loss and reward reported in the training outputs and logs are not instructive, for some reason. So even if you cannot see convergence or good performance during training, it is fine to stop after slightly more than one epoch and just run the evaluation.

About visualization, I believe it is easy to use TensorBoard to get some plots. I plan to do it in my branch when I have time.
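(A minimal sketch of what that could look like with `torch.utils.tensorboard`; the metric names and dummy loop below are placeholders, not from this repo:)

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/neural-rewriter")  # illustrative log dir

# Inside the training loop, log whatever the trainer already computes;
# the loss/reward values here are placeholders.
for step in range(100):
    loss, avg_reward = 1.0 / (step + 1), step * 0.01
    writer.add_scalar("train/loss", loss, step)
    writer.add_scalar("train/avg_reward", avg_reward, step)

writer.close()
# View the curves with: tensorboard --logdir runs
```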

ldy8665 commented 4 years ago

I think there are some mismatches between the code and the paper in the loss function.

LucasBoTang commented 4 years ago

> I think there are some mismatches between the code and the paper in the loss function.

Yes, I agree. For example, the value-approximation loss in the paper is L2, but in the code it is smooth L1.
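(For concreteness, this is the kind of discrepancy being described; a small sketch using PyTorch's built-in losses, with made-up tensors:)

```python
import torch
import torch.nn as nn

pred_value = torch.randn(64)    # critic's value estimates (illustrative)
target_value = torch.randn(64)  # observed returns (illustrative)

# What the paper describes: an L2 (squared-error) value loss.
l2_loss = nn.MSELoss()(pred_value, target_value)

# What the code reportedly uses: smooth L1 (Huber) loss, which is
# quadratic near zero but linear for large errors, so it is less
# sensitive to outliers than plain L2.
smooth_l1_loss = nn.SmoothL1Loss()(pred_value, target_value)
```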

DingShizhe commented 3 years ago

May I ask how this code can be run on multiple GPUs?