It'll be interesting to see how such an approach compares to the bandit scheme we already have.
I'm planning to implement this probably sometime next year.
I'm about to showcase my ignorance, but... how would this work? Does it learn as it makes choices, like $\alpha$-UCB? Or does it need to be pre-trained somehow?
I'm pretty much a noob when it comes to these fancy ML things. So this might be obvious, but I don't know it :-).
It's like segmented $\alpha$-UCB? It's been a while since I read Karimi-Mamaghan's paper, but essentially we have episodes (i.e., segments) during which a single operator pair is used for the whole episode length. Based on the performance in that episode (which can be outcome-based, but I think Q-learning uses the actual objective gains), the operator scores are updated. The operator pair with the highest score (aka Q-value) is selected, but with probability $\epsilon$ a random operator pair is selected instead.
I might be wrong on some details, but this is what I remember from refactoring the paper's code.
The version I'm describing is the simple, online version, by the way. It can also be extended to a variant that is trained offline, e.g., deep Q-learning. But that goes way beyond what I know 😅
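For concreteness, here is a minimal sketch of the episode-based $\epsilon$-greedy selection described above, ignoring the state-transition term of full Q-learning. The class and parameter names (`QLearningSelector`, `episode_length`, and so on) are made up for illustration and are not part of the ALNS library or the paper's code.

```python
import random


class QLearningSelector:
    """Sketch: segmented epsilon-greedy selection over (destroy, repair) pairs."""

    def __init__(self, num_pairs, episode_length=50, epsilon=0.1, learning_rate=0.5):
        self.q_values = [0.0] * num_pairs  # one score (Q-value) per operator pair
        self.episode_length = episode_length
        self.epsilon = epsilon
        self.learning_rate = learning_rate
        self.iters = 0
        self.current = 0

    def select(self):
        # Keep the same operator pair for the whole episode (segment).
        if self.iters % self.episode_length == 0:
            if random.random() < self.epsilon:  # explore: random pair
                self.current = random.randrange(len(self.q_values))
            else:  # exploit: pair with the highest Q-value
                self.current = max(range(len(self.q_values)),
                                   key=self.q_values.__getitem__)
        self.iters += 1
        return self.current

    def update(self, pair, reward):
        # Reward can be outcome-based or the actual objective gain of the episode.
        q = self.q_values[pair]
        self.q_values[pair] = q + self.learning_rate * (reward - q)
```

In an ALNS loop, `select()` would pick the (destroy, repair) pair for the next iterations and `update()` would be called at the end of each episode with the observed reward.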
OK, I think I understand now. Sounds good. I wonder what the performance difference of our acceptance criteria and operator selection schemes is: does it really matter what we do? Or does anything work more or less equally well?
I know that Santini and others (2018) wrote a paper comparing acceptance criteria for ALNS. But they kept the operator selection scheme really simple. There's a paper in here somewhere, where we just vary everything on a few different problem settings, and see what works better. Perhaps there are some general recommendations we can distil from such an exercise.
> I wonder what the performance difference of our acceptance criteria and operator selection schemes is: does it really matter what we do? Or does anything work more or less equally well?
I also wonder about the same. I think there's a lot of interesting questions to be asked about the operator selection schemes in ALNS. These days it's very common in the ALNS-VRP literature to make an ALNS heuristic with many destroy and repair operators, sometimes even exceeding 10 operators in total. Does that even make sense?
I keep bringing it up, much to my own annoyance: slack induction by string removals (SISR) achieves state-of-the-art performance with a single removal and a single repair operator. My limited experiments showed that adding a random destroy operator to SISR actually decreases performance, maybe because it's just much less efficient than the string-removal operator?
With Q-learning, Karimi-Mamaghan et al. (2023) obtain solutions whose average gaps to the best known solutions are about 0.05% lower than those of a version of iterated greedy that uses RandomSelect. It's not a lot. But to really test the influence of operator selection schemes, we need much larger-scale experiments, perhaps on TSP, CVRP, and PFSP. These benchmark libraries are all relatively easy to make, I think, and I already have them in part. It would be a nice addition to our ALNS library and also a nice playground to test all kinds of research questions, including the operator selection schemes. There's many things that we can do for next year(s) :-)
A paper I've been thinking about for at least five years now is more or less this:
That's the kind of paper you can write once, and then nobody wants to work with you any more 😆. But there's value in it, because I'm absolutely convinced that a lot of recent 'innovation' is just fluff.
> There's many things that we can do for next year(s) :-)
That's for sure! I'm not sure how much of it fits my research plans for the next few years, but there's a ton of critical papers here that you/we/others can write.
Do we still want this now that we have the MAB operator selection schemes? Those already bring a lot of the 'learning value'. We should probably first explore what's all in MABWiser before building new operator selection schemes.
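As a starting point for that exploration, here is a minimal sketch of how a MABWiser bandit over operator pairs could look. The arm names and reward values are made-up placeholders, and this is not wired into the ALNS library's selectors.

```python
from mabwiser.mab import MAB, LearningPolicy

# Hypothetical arms: each arm represents a (destroy, repair) operator pair.
arms = ["random_destroy+greedy_repair", "string_removal+greedy_repair"]

mab = MAB(arms=arms,
          learning_policy=LearningPolicy.EpsilonGreedy(epsilon=0.1))

# Warm start on some observed (arm, reward) history, then keep learning online.
mab.fit(decisions=["random_destroy+greedy_repair", "string_removal+greedy_repair"],
        rewards=[1.0, 3.0])

chosen = mab.predict()  # pick the next operator pair
# ... apply the chosen pair in an ALNS iteration, observe a reward ...
mab.partial_fit(decisions=[chosen], rewards=[2.0])
```

Swapping in other policies (e.g., UCB or Thompson sampling variants that MABWiser provides) would then only be a matter of changing the `learning_policy`.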
> We should probably first explore what's all in MABWiser before building new operator selection schemes.
Agreed!
Not planning to work on this anymore. MABWiser can support this way better.
Q-learning has been used in Karimi-Mamaghan et al. (2021) and Karimi-Mamaghan et al. (2023) to select perturbation operators for the traveling salesman problem and the permutation flowshop problem, respectively. See the book *Reinforcement Learning: An Introduction* by Sutton and Barto (2018) for more info.
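For reference, the core update in tabular Q-learning (Sutton and Barto, 2018) is

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right],$$

where $s$ is the current state, $a$ the chosen action (here, an operator or operator pair), $r$ the observed reward, $s'$ the next state, $\alpha$ the learning rate, and $\gamma$ the discount factor. How states, actions, and rewards are defined for operator selection is a design choice specific to those papers.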