It'll be interesting to see how such an approach compares to the bandit scheme we already have.
I'm planning to implement this probably sometime next year.
I'm about to showcase my ignorance, but... how would this work? Does it learn as it makes choices, like $\alpha$-UCB? Or does it need to be pre-trained somehow?
I'm pretty much a noob when it comes to these fancy ML things. So this might be obvious, but I don't know it :-).
It's like segmented $\alpha$-UCB? It's been a while since I read Karimi-Mamaghan's paper, but essentially we have episodes (i.e., segments) during which a single operator pair is used for the whole episode length. Based on the performance in that episode (which can be outcome-based, but I think Q-learning uses the actual objective gains), the operator scores are updated. The operator pair with the highest score (aka Q-value) is selected, but with probability $\epsilon$ a random operator pair is selected instead.
I might be wrong on some details, but this is what I remember from refactoring the paper's code.
The version I'm describing is the simple, online version, by the way. It can also be extended to a variant that is trained offline, e.g., deep Q-learning. But that goes way beyond what I know 😅
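For concreteness, here is a minimal sketch of the episode-based $\epsilon$-greedy selection described above, ignoring the state-transition term of full Q-learning. The class and parameter names (`QLearningSelector`, `episode_length`, and so on) are made up for illustration and are not part of the ALNS library or the paper's code.

```python
import random


class QLearningSelector:
    """Sketch: segmented epsilon-greedy selection over (destroy, repair) pairs."""

    def __init__(self, num_pairs, episode_length=50, epsilon=0.1, learning_rate=0.5):
        self.q_values = [0.0] * num_pairs  # one score (Q-value) per operator pair
        self.episode_length = episode_length
        self.epsilon = epsilon
        self.learning_rate = learning_rate
        self.iters = 0
        self.current = 0

    def select(self):
        # Keep the same operator pair for the whole episode (segment).
        if self.iters % self.episode_length == 0:
            if random.random() < self.epsilon:  # explore: random pair
                self.current = random.randrange(len(self.q_values))
            else:  # exploit: pair with the highest Q-value
                self.current = max(range(len(self.q_values)),
                                   key=self.q_values.__getitem__)
        self.iters += 1
        return self.current

    def update(self, pair, reward):
        # Reward can be outcome-based or the actual objective gain of the episode.
        q = self.q_values[pair]
        self.q_values[pair] = q + self.learning_rate * (reward - q)
```

In an ALNS loop, `select()` would pick the (destroy, repair) pair for the next iterations and `update()` would be called at the end of each episode with the observed reward.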
OK, I think I understand now. Sounds good. I wonder what the performance difference of our acceptance criteria and operator selection schemes is: does it really matter what we do? Or does anything work more or less equally well?
I know that Santini and others (2018) wrote a paper comparing acceptance criteria for ALNS. But they kept the operator selection scheme really simple. There's a paper in here somewhere, where we just vary everything on a few different problem settings, and see what works better. Perhaps there are some general recommendations we can distil from such an exercise.
> I wonder what the performance difference of our acceptance criteria and operator selection schemes is: does it really matter what we do? Or does anything work more or less equally well?
I also wonder about the same. I think there's a lot of interesting questions to be asked about the operator selection schemes in ALNS. These days it's very common in the ALNS-VRP literature to make an ALNS heuristic with many destroy and repair operators, sometimes even exceeding 10 operators in total. Does that even make sense?
I keep bringing it up, much to my own annoyance: slack induction by string removals (SISR) achieves state-of-the-art performance with a single removal and a single repair operator. My limited experiments showed that adding a random destroy operator to SISR actually decreases performance, maybe because it's just much less efficient than the string-removal operator?
With Q-learning, Karimi-Mamaghan et al. (2023) obtain solutions whose average gaps to the best known solutions are about 0.05% lower than those of a version of iterated greedy that uses RandomSelect. It's not a lot. But to really test the influence of operator selection schemes, we need much larger-scale experiments, perhaps on TSP, CVRP, and PFSP. These benchmark libraries are all relatively easy to make, I think, and I already have them in part. It would be a nice addition to our ALNS library and also a nice playground to test all kinds of research questions, including the operator selection schemes. There's many things that we can do for next year(s) :-)
A paper I've been thinking about for at least five years now is more or less this:
That's the kind of paper you can write once, and then nobody wants to work with you any more 😆. But there's value in it, because I'm absolutely convinced that a lot of recent 'innovation' is just fluff.
> There's many things that we can do for next year(s) :-)
That's for sure! I'm not sure how much of it fits my research plans for the next few years, but there's a ton of critical papers here that you/we/others can write.
Do we still want this now that we have the MAB operator selection schemes? Those already bring a lot of the 'learning value'. We should probably first explore what's all in MABWiser before building new operator selection schemes.
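As a starting point for that exploration, here is a minimal sketch of how a MABWiser bandit over operator pairs could look. The arm names and reward values are made-up placeholders, and this is not wired into the ALNS library's selectors.

```python
from mabwiser.mab import MAB, LearningPolicy

# Hypothetical arms: each arm represents a (destroy, repair) operator pair.
arms = ["random_destroy+greedy_repair", "string_removal+greedy_repair"]

mab = MAB(arms=arms,
          learning_policy=LearningPolicy.EpsilonGreedy(epsilon=0.1))

# Warm start on some observed (arm, reward) history, then keep learning online.
mab.fit(decisions=["random_destroy+greedy_repair", "string_removal+greedy_repair"],
        rewards=[1.0, 3.0])

chosen = mab.predict()  # pick the next operator pair
# ... apply the chosen pair in an ALNS iteration, observe a reward ...
mab.partial_fit(decisions=[chosen], rewards=[2.0])
```

Swapping in other policies (e.g., UCB or Thompson sampling variants that MABWiser provides) would then only be a matter of changing the `learning_policy`.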
> We should probably first explore what's all in MABWiser before building new operator selection schemes.
Agreed!
Not planning to work on this anymore. MABWiser can support this way better.
Q-learning has been used in Karimi-Mamaghan et al. (2021) and Karimi-Mamaghan et al. (2023) to select perturbation operators for the traveling salesman problem and the permutation flowshop problem, respectively. See the book *Reinforcement Learning: An Introduction* by Sutton and Barto (2018) for more info.
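For reference, the core update in tabular Q-learning (Sutton and Barto, 2018) is

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right],$$

where $s$ is the current state, $a$ the chosen action (here, an operator or operator pair), $r$ the observed reward, $s'$ the next state, $\alpha$ the learning rate, and $\gamma$ the discount factor. How states, actions, and rewards are defined for operator selection is a design choice specific to those papers.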