VROOM-Project / vroom

Vehicle Routing Open-source Optimization Machine
http://vroom-project.org/
BSD 2-Clause "Simplified" License

Using GPU? #530

Closed dooley closed 3 years ago

dooley commented 3 years ago

We are currently testing the limits with ~20,000 jobs, ~150 vehicles, time windows and skills. VROOM needs about 9-12 hours to produce a solution.

I wonder if VROOM could benefit from a GPU? I'm unfortunately not familiar with this area (C++ development); I just found this blog post: https://developer.nvidia.com/blog/accelerating-standard-c-with-gpus-using-stdpar/. If VROOM could use a GPU (for example an Nvidia GeForce), it would bring quite a speed advantage, right? How much effort would that be?

jcoupey commented 3 years ago

My short answer would be: I don't really have a clue on potential speedups.

The longer version is that while it always sounds like a great idea to throw more power in, the combinatorial nature of the problem we solve is such that there is always a point where simply providing more computing power is "useless" (or at least not the most efficient approach).

I love the example of the work of William Cook and his team that computed (and proved optimality for) the shortest possible tour to 49,687 pubs in the UK back in 2018. That took 14 months between data gathering and end of computation, with a total of 250 years of CPU time. Of course the hardest part here is to provide a guarantee of optimality but you get the picture.

There is always a point where scaling will require more tuning and work on the solving side. For an example closer to VROOM, there is currently no point in having more than 32 cores to speed up the solving process (see #489). Going past this limit would require more internal parallelization, or changing parts of the way we solve the problems.

We already have a few pointers on some internal stuff that would contribute to further improving scaling. Some work is also possible on the modeling side.

Anyway I'd be interested to discuss this huge use-case you have, feel free to ping me directly if you want.

dooley commented 3 years ago

The following (not so fictitious) scenario:

A mail service provider receives up to 20,000 letters at a depot overnight, until about 4 a.m., all to be delivered the same day. At the moment, deliveries could only start in the afternoon (~2-4 p.m.) or be pushed to the next day, because the optimization takes "too long" from the user's point of view.

From the user's point of view, every half hour shaved off the optimization run is a gain. Hence the idea of adding "more computing power" through a GPU.

Of course, I know about the problem of the exponential increase in computing time with the sheer number of jobs. Until a few years ago we were happy to be able to optimize 3-500 jobs (ant algorithm with k2-Opt) in a reasonable time; now we have VROOM processing around 6k jobs in 30 minutes ;-)

"Also some work is possible on the modeling side": that just gave me the idea of breaking the 20k jobs down into smaller problems (by clustering) and seeing what comes out as a result in the end.

Otherwise I can only wait and see what the 3 improvements on the core side can bring regarding optimization times.
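
To make the clustering idea concrete, here is a minimal sketch of splitting one large request into two sub-requests. It uses a crude median-longitude cut instead of a real clustering algorithm; the `jobs`/`vehicles` dictionaries follow the shape of VROOM's JSON API (`location` as `[lon, lat]`), but the even vehicle split is just an assumption for illustration.

```python
# Sketch: split one large VROOM request into two sub-requests by median
# longitude, as a crude stand-in for proper clustering. Vehicle assignment
# per cluster (an even split) is a made-up heuristic for illustration.

from statistics import median

def split_by_longitude(jobs, vehicles):
    """Partition jobs into west/east halves and split vehicles evenly."""
    cut = median(job["location"][0] for job in jobs)
    west = [j for j in jobs if j["location"][0] <= cut]
    east = [j for j in jobs if j["location"][0] > cut]
    half = len(vehicles) // 2
    return (
        {"jobs": west, "vehicles": vehicles[:half]},
        {"jobs": east, "vehicles": vehicles[half:]},
    )

jobs = [{"id": i, "location": [lon, 52.5]}
        for i, lon in enumerate([13.1, 13.2, 13.6, 13.7])]
vehicles = [{"id": 0, "start": [13.4, 52.5]}, {"id": 1, "start": [13.4, 52.5]}]
west, east = split_by_longitude(jobs, vehicles)
print(len(west["jobs"]), len(east["jobs"]))  # 2 2
```

Each sub-request can then be solved independently; jobs near the cut line are the obvious weak spot of such a naive split.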

senhalil commented 3 years ago

I can attest that even cutting the problem into just 3-4 subproblems would give a very significant speed-up, even if the subproblems were solved sequentially.

And if multiple vroom binaries could be launched for these 3-4 subproblems in parallel (assuming enough CPU threads), you would probably have the solution in less than an hour, maybe much less. I would even just start with 2 clusters, since cutting the problem into two vaguely separate and comparable regions is a more doable operation.
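
A sketch of that parallel launch, assuming one vroom process per sub-request. The `("vroom", "-i")` command is how I'd expect to invoke the CLI on a request file, but check your build's flags; a stand-in `cat` command keeps the harness runnable without vroom installed. A thread pool is enough here because each worker just waits on its own solver process.

```python
# Sketch: solve independent sub-requests in parallel, one OS process each.
# Real parallelism comes from the child solver processes, so threads suffice
# to drive them. Replace the stand-in command with e.g. ("vroom", "-i").

import json
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor

def run_solver(request, command=("cat",)):  # stand-in; real use: ("vroom", "-i")
    """Write one sub-request to a temp file and run the solver on it."""
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(request, f)
        path = f.name
    out = subprocess.run([*command, path], capture_output=True, text=True,
                         check=True)
    return json.loads(out.stdout)

sub_requests = [{"jobs": [{"id": 1}]}, {"jobs": [{"id": 2}]}]
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(run_solver, sub_requests))
print([r["jobs"][0]["id"] for r in results])  # [1, 2]
```

Merging the per-cluster routes back into one answer is then trivial, since the subproblems share no vehicles or jobs.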


On the subject of GPU parallelisation: GPUs are very good at parallelising certain operations (like matrix multiplication, simple arithmetic over an array, etc.) but very bad at others, depending on the size of the object that needs to be distributed.

My gut feeling is that there possibly are some core operations that Vroom does under the hood (like updating all activities of a route after an insertion, or calculating the cost of an operator) that could be sped up by parallelism, but I suspect it would require a redesign; and if these operations do not account for much of the runtime, parallelising them wouldn't give back much.

I don't have a clue whether it would be possible, but if one could parallelise the code so that each possible pair of route and operation (insert, swap, delete) fits on a single GPU thread and is launched in parallel, then that would really speed things up!
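
As a toy illustration of that idea (entirely made-up routes and cost function, not vroom's internals): each (route pair, swap) candidate's cost delta is a pure, independent computation over the current solution, which is exactly the shape that maps onto one GPU thread per candidate.

```python
# Toy illustration of "one thread per (route, move) candidate": every
# candidate's cost delta depends only on the frozen current solution, so
# all evaluations are independent. The routes and the 1-D "distance" are
# invented for the demo; this is not how vroom is actually structured.

from itertools import combinations

def route_cost(route):
    """Toy cost: sum of absolute gaps between consecutive stops."""
    return sum(abs(a - b) for a, b in zip(route, route[1:]))

def swap_delta(routes, move):
    """Cost change from swapping one stop between two routes."""
    (r1, i), (r2, j) = move
    a, b = list(routes[r1]), list(routes[r2])
    a[i], b[j] = b[j], a[i]
    before = route_cost(routes[r1]) + route_cost(routes[r2])
    return (route_cost(a) + route_cost(b)) - before

routes = [[0, 9, 1], [10, 2, 11]]
candidates = [((r1, i), (r2, j))
              for r1, r2 in combinations(range(len(routes)), 2)
              for i in range(len(routes[r1]))
              for j in range(len(routes[r2]))]
# Independent evaluations -- this map is the parallelisable part.
deltas = list(map(lambda m: swap_delta(routes, m), candidates))
best = candidates[deltas.index(min(deltas))]
print(best, min(deltas))  # ((0, 1), (1, 1)) -28
```

On a GPU the `map` would become one kernel launch over all candidates, followed by a parallel reduction to pick the best delta.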

fernandogonazales commented 2 years ago

Hello everyone,

Reading this issue, maybe you ran into the same problem. I am trying to solve a problem with around 5,000 points and 100 vehicles. I have found that at around 4,500 points the server stops. I set up an environment with OSRM and VROOM, with an endpoint that I query from Python. The environment collapses at this point (with 4,500 points). I don't know where the problem is; I increased the matrix size limit in OSRM to 1,000,000. I don't know if the problem is with some VROOM settings.

I hope you can give me a clue.

Best regards,

fernandogonazales commented 2 years ago

(screenshot attached, not included here)