Closed N-Wouda closed 1 year ago
We should probably have different configurations for different instance sizes. A natural split seems to be in three parts, like how the quali/final run-times are split.
We also might have to consider separate parameter tuning for the dynamic instances. The dynamic dispatch instances are very different from those of the static problem because of their relatively tight time windows and small instance sizes.
Based on greedy/hindsight/rollout, a rough estimate is that 40% of the instances have between 100 and 150 customers, 40% between 50 and 100, 10% fewer than 50, and 10% between 150 and 200.
We should have different configurations for each of the three instance sizes (of <300, 300-500, >500 clients), since they get different time limits and are in general very different in size and/or shape.
We currently have these parameters (22 in total):
Param | Default value | Description |
---|---|---|
nbIter | 10K | Number of iterations without improvement before restarting |
initialTimeWarpPenalty | 1 | Initial penalty for time warp (time-window violations) |
nbPenaltyManagement | 100 | Number of iterations between penalty updates |
feasBooster | 2 | Penalty increase if no feasible solutions yet |
penaltyIncrease | 1.2 | Penalty increase if insufficient (but not zero) feasible solutions |
penaltyDecrease | 0.85 | Penalty decrease if too many feasible solutions |
minPopSize | 25 | Minimum population size |
generationSize | 40 | minPopSize + generationSize is the maximum population size before culling |
nbElite | 4 | Number of elite individuals that are biased to survive the survival mechanism |
nbClose | 5 | Number of closest individuals to calculate average diversity |
targetFeasible | 40% | Target percentage of generated pops that were feasible (between penalty updates) |
nbKeepOnRestart | 1 | Number of best feasible pops to keep on restart |
repairProbability | 50% | Probability of attempting to repair infeasible offspring |
repairBooster | 10 | Penalty booster when repairing (hammering out infeasibilities) |
selectProbability | 90% | Geometric selection probability of best crossover offspring |
nbGranular | 40 | Number of 'near' clients explored by node operators in local search |
weightWaitTime | 2 | Penalty for waiting time when determining granular neighborhoods |
weightTimeWarp | 10 | Penalty for time warps when determining granular neighborhoods |
circleSectorOverlapTolerance | 0 degrees | Minimum circle sector overlap (in degrees) before route operators are applied |
minCircleSectorSize | 15 degrees | Minimum circle sector size (in degrees); computed sectors are enlarged to at least this |
destroyPct | 20% | Percentage of the solution to destroy in broken pairs exchange crossover |
postProcessPathLength | 4 | Subpath length to optimally recombine by enumeration in local search postprocessing |
Additionally, if we have to tune the node, route, and the crossover operators (another 10-15 or so) as well, we have maybe 40 parameters in total. That's quite a lot. Can we shorten this list somehow, or do you know of a tuning algorithm that can deal with this? I'm not sure SMAC can.
This probably also needs its own static configuration, particularly for the simulations (since those need to be fast). The configuration for the epoch optimisation after rollout can probably be shared with that for the small instances.
Rollout needs a static parameter configuration (for the simulations), and the following additional parameters:
Param | Default value | Description |
---|---|---|
N_LOOKAHEAD | 3 | Number of periods to simulate ahead |
SIM_TLIM_FACTOR | 50% | Percentage of the epoch time limit to take for simulating |
SIM_SOLVE_ITERS | 25 | Number of static solver iterations to use in a simulation |
DISPATCH_THRESHOLD | 15% | Threshold to clear before dispatching a customer in current epoch |
Each parameter lies in some [min, max] range. One simple way to get a feel for this would be to run two benchmarks for each parameter value: one where we increase it from the default to the max value, and one where we decrease it to the min value. That's doable, and should quickly give us a feel for which parameters really matter, and which do not seem to be as relevant. Then we can take the subset of parameters that really matter and tune those together.
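Scripted, that sweep could look something like this (a sketch; `run_benchmark` is a hypothetical stand-in for an actual benchmark run, and only three parameters are shown):

```python
# One-at-a-time sensitivity sweep: for each parameter, benchmark the
# default configuration with that parameter pushed to its min and its max.
DEFAULTS = {"nbIter": 10_000, "repairProbability": 50, "nbGranular": 40}
RANGES = {
    "nbIter": (1_000, 10_000),
    "repairProbability": (0, 100),
    "nbGranular": (10, 100),
}


def run_benchmark(config):
    # Hypothetical stand-in: this would run the solver on the benchmark
    # instances and return the average cost. Here it returns a dummy score.
    return sum(config.values())


def oat_sweep(defaults, ranges, benchmark):
    results = {}
    for name, (lo, hi) in ranges.items():
        scores = {}
        for value in (lo, hi):
            config = dict(defaults, **{name: value})  # change one parameter
            scores[value] = benchmark(config)
        results[name] = scores
    return results


results = oat_sweep(DEFAULTS, RANGES, run_benchmark)
```

Parameters whose min/max scores differ a lot from the default's score are the ones worth tuning together afterwards.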
@leonlan @LuukPentinga @jaspervd96 @rijalarpan @MarjoleinAerts @sbhulai: what do you think?
It is a mighty load of parameters to tune. I think the min and max idea may not work fully because, in my experience, some of them behave non-linearly. I would consider a set of values for each parameter and test them in a full factorial way. Then, as you suggested, go deeper into a few of them. I do not think parameter optimization packages like the ones available for machine learning will work here.
I agree with @rijalarpan; these parameters usually behave non-linearly (e.g., destroyPct), so we need more than min/max values. A factorial design should be manageable, and I can also help tune part of the parameters. Some other thoughts:
I also filled in the remaining ?? for some parameters (nbElite, nbClose, weightTimeWarp, weightWaitTime).
A factorial design should be manageable
We have 40+ parameters. How is even the simplest factorial design (roughly 2^40) manageable?
A factorial design should be manageable
We have 40+ parameters. How is even the simplest factorial design (roughly 2^40) manageable?
It's not 😅 My understanding of the (full) factorial design was flawed. Let me rephrase. A full factorial design is not feasible, but it should be manageable to perform a factorial design for smaller logical parameter groups.
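To illustrate the size difference (a sketch with made-up level choices for one subgroup): a full factorial over all ~40 parameters is astronomically large, but per logical group it stays small.

```python
from itertools import product

# Hypothetical levels for one small "penalty management" subgroup.
levels = {
    "penaltyIncrease": [1.1, 1.2, 1.5],
    "penaltyDecrease": [0.7, 0.85, 0.95],
    "targetFeasible": [0.2, 0.4, 0.6],
}

# Full factorial over this group: 3 * 3 * 3 = 27 configurations,
# versus 2**40 ≈ 10**12 for even a two-level design over 40 parameters.
names = list(levels)
configs = [dict(zip(names, combo)) for combo in product(*levels.values())]
```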
Agree with the comments to check a couple of values for the parameters and then focus on the ones with the most impact. Min and max might not give good indications, as also suggested in the comments above
Could a procedure comparable to variable neighbourhood descent be pragmatic here? It would follow a similar procedure:
Then you can see whether a full-factorial set-up of the limited set of parameters that proved most valuable in the above process is worthwhile, and focus on those.
Here are our current parameters and operators organized per category. I like the suggestion by @rijalarpan to do a VND kind of approach to optimize this, but I think it's best done per parameter group instead of per single parameter. Those parameter groups are not super big, so within each (sub)group you could do a full-factorial approach.
Added three more node ops that we keep forgetting about :-).
I like the idea of doing things by logical group. If we want to do a factorial design we need to come up with levels for the non-binary parameters (e.g. what's 'low' or 'high' for nbGranular?).
We might also just dump every group into, say, SMAC or ParamILS and have that thing tune for us. The number of parameters we have in each group is manageable for these algorithms, so I suspect they can get us a reasonably good configuration quite quickly. If the resulting configuration of tuning each group in isolation is not already effective, we can use the tuning trajectories to identify which parameters should be tuned together in a follow-up step.
@rijalarpan @leonlan [and anyone else that's interested] shall we have a short meeting tomorrow at, say, 15h to discuss this specific subject further? I can do anytime tomorrow, except 14-15h.
I'm available all day before 15:30. So 15:00-15:30 I could meet.
I am away for holidays but I can find time if between 09:00-10:00.
Tomorrow at 9 works for me. I'll send a Google meet in a bit.
Here's the link: https://meet.google.com/ekp-weba-xci. See you tomorrow!
What we just discussed:
After this Friday (14 October) I would like to freeze any new static changes, and focus solely on tuning the static solver. So any planned or open PRs that impact the static solver should ideally be merged before next weekend!
There's a lot of stuff here. Including SMAC, and several other options. I'll have a detailed look tomorrow and make a brief list of algorithms that seem to work well, and that we can get working easily from Python.
A brief update on static tuning (#134):
I think that I can get a (noisy) evaluation of a single configuration down to about the qualification run time, which is more or less the best possible level of parallelisation we can hope for. But that'll require some further work to make happen.
Proposed ranges for each parameter (for now only for the three groups that we want to tune first):
Param | Default value | Min | Max |
---|---|---|---|
initialTimeWarpPenalty | 1 | 1 | 100 |
nbPenaltyManagement | 100 | 25 | 500 |
feasBooster | 2 | 1 | 10 |
penaltyIncrease | 1.2 | 1 | 5 |
penaltyDecrease | 0.85 | .25 | 1 |
targetFeasible | 40% | 0% | 100% |
repairProbability | 50% | 0% | 100% |
repairBooster | 10 | 1 | 25 |
Param | Default value | Min | Max |
---|---|---|---|
minPopSize | 25 | 5 | 100 |
generationSize | 40 | 1 | 100 |
nbElite | 4 | 0 | 25 |
nbClose | 5 | 1 | 25 |
Param | Default value | Min | Max |
---|---|---|---|
nbIter | 10K | 1K | 10K |
nbKeepOnRestart | 0 | 0 | 5 |
Please feel free to remark on these suggested ranges. For some of them I have some intuition, for others I'm mostly guessing.
Parameter ranges look good to me. Is there also a need for step sizes or does SMAC figure that out itself?
Is there also a need for step sizes or does SMAC figure that out itself?
There's the option to indicate this, yes, but SMAC can figure it out on its own as well! Edit: I've decided to do the quantization ourselves. This is the full configuration now (including suggestions for the other parameters):
# Penalty management
Integer("initialTimeWarpPenalty", (1, 25), default=1),
Integer("nbPenaltyManagement", (25, 500), default=100, q=25),
Float("feasBooster", (1, 10), default=2.0, q=0.1),
Float("penaltyIncrease", (1, 5), default=1.2, q=0.1),
Float("penaltyDecrease", (0.25, 1), default=0.85, q=0.05),
Float("targetFeasible", (0, 1), default=0.4, q=0.05),
Integer("repairProbability", (0, 100), default=50, q=5),
Integer("repairBooster", (1, 25), default=10),
# Population management
Integer("minPopSize", (5, 100), default=25, q=5),
Integer("generationSize", (0, 100), default=40, q=5),
Integer("nbElite", (0, 25), default=4, q=5),
Integer("nbClose", (1, 25), default=5, q=4),
# Restart mechanism
Integer("nbIter", (1_000, 10_000), default=10_000, q=250),
Integer("nbKeepOnRestart", (0, 5), default=0),
# Crossover
Integer("selectProbability", (50, 100), default=90, q=10),
Integer("destroyPct", (5, 50), default=20, q=5),
Integer("brokenPairsExchange", (0, 1), default=0),
Integer("selectiveRouteExchange", (0, 1), default=1),
# Node ops
Integer("Exchange10", (0, 1), default=1),
Integer("Exchange11", (0, 1), default=1),
Integer("Exchange20", (0, 1), default=1),
Integer("MoveTwoClientsReversed", (0, 1), default=1),
Integer("Exchange21", (0, 1), default=1),
Integer("Exchange22", (0, 1), default=1),
Integer("TwoOpt", (0, 1), default=1),
Integer("Exchange31", (0, 1), default=0),
Integer("Exchange32", (0, 1), default=0),
Integer("Exchange33", (0, 1), default=0),
# Granular neighborhoods
Integer("nbGranular", (10, 100), default=40, q=5),
Integer("weightWaitTime", (1, 25), default=2),
Integer("weightTimeWarp", (1, 25), default=10),
# Intensification (and route ops)
Integer("shouldIntensify", (0, 1), default=1),
Integer("circleSectorOverlapTolerance", (0, 359), default=0),
Integer("minCircleSectorSize", (0, 359), default=15),
Integer("postProcessPathLength", (1, 8), default=7),
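For reference, the `q` values above mean that sampled values snap to a grid. A plain-Python sketch of that quantisation (what the `q` fields mean, not SMAC's internals):

```python
def quantize(value, lo, hi, q=1):
    # Round to the nearest multiple of q, then clamp into [lo, hi].
    snapped = round(value / q) * q
    return max(lo, min(hi, snapped))
```

For example, with `nbPenaltyManagement` bounds (25, 500) and `q=25`, a raw sample of 47 snaps to 50.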
@leonlan @rijalarpan what do you think of my suggested configuration space above? Each field reads as `<param type>(<param name>, (<low>, <high>), default=<default>)`, with an optional step size/quantisation `q`.
I propose we do not test RELOCATE and SWAP separately: they're not that expensive now that we do intensification only when we find a new best solution. More depends on whether we do intensification in the first place (i.e., whether `shouldIntensify` is true or not).
@leonlan are these conditionals correct?
# Only valid parameters when we do, in fact, intensify
EqualsCondition(cs["circleSectorOverlapTolerance"],
cs["shouldIntensify"],
1),
EqualsCondition(cs["minCircleSectorSize"],
cs["shouldIntensify"],
1),
EqualsCondition(cs["postProcessPathLength"],
cs["shouldIntensify"],
1),
# Repair booster only makes sense when we can actually repair
NotEqualsCondition(cs["repairBooster"],
cs["repairProbability"],
0),
# Parameter is specific to BPX
EqualsCondition(cs["destroyPct"],
cs["brokenPairsExchange"],
1),
where `EqualsCondition(a, b, val)` means `a` is an active parameter only when `b` has value `val`, and `NotEqualsCondition(a, b, val)` means `a` is active only when `b` does not have value `val`.
Have I missed any?
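To make those semantics concrete, a plain-Python sketch (not the ConfigSpace API) of when each conditional parameter would be active, using a hypothetical example configuration:

```python
def equals_condition(config, parent, value):
    # Child parameter is active only when config[parent] == value.
    return config[parent] == value


def not_equals_condition(config, parent, value):
    # Child parameter is active only when config[parent] != value.
    return config[parent] != value


# Hypothetical configuration: intensify, never repair, use BPX crossover.
config = {"shouldIntensify": 1, "repairProbability": 0, "brokenPairsExchange": 1}

active = {
    "circleSectorOverlapTolerance": equals_condition(config, "shouldIntensify", 1),
    "repairBooster": not_equals_condition(config, "repairProbability", 0),
    "destroyPct": equals_condition(config, "brokenPairsExchange", 1),
}
```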
The ranges of parameters look wide enough. My concern would have been the step size but if SMAC takes care of it, then all looks good. Fingers crossed!
Integer("nbElite", (0, 25), default=4, q=5),
Integer("nbClose", (1, 25), default=5, q=4),
I suggest changing these step sizes to `q=1`. I don't have a good reason why, but I feel that (0, 4, 9, 14, ...) for `nbElite` is a bit too coarse; idem for `nbClose`.
Have I missed any [conditions]? There's one more for population: `nbElite <= minPopSize + generationSize`.
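That constraint could be checked like so (a sketch; in ConfigSpace itself this would likely be expressed as a forbidden clause rather than a condition):

```python
def is_valid_population(config):
    # nbElite individuals must fit within the maximum population size,
    # which is minPopSize + generationSize before culling.
    return config["nbElite"] <= config["minPopSize"] + config["generationSize"]
```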
I'm running a hundred scenarios for penalty management right now. Should be done in, hopefully, a day or so.
For the next parameter group (after penalty management), I propose the following:
Param | Default value | Min | Max |
---|---|---|---|
minPopSize | 25 | 5 | 100 |
generationSize | 40 | 1 | 100 |
nbElite | 4 | 0 | 25 |
lbDiversity | 10% | 0% | 25% |
ubDiversity | 50% | 25% | 100% |
nbClose | 5 | 1 | 25 |
nbIter | 10K | 1K | 10K |
I'm grouping the `nbIter` parameter with 'population management', and ignoring `nbKeepOnRestart` (so we keep that fixed at zero). That's a bit prescriptive, but the benefit is that this saves us one small parameter group to tune.
I'm running a hundred scenarios for penalty management right now. Should be done in, hopefully, a day or so.
They completed a few hours ago. I have finished a first analysis of them, and obtained a parameter configuration that's slightly better than what we had:
initialTimeWarpPenalty = 6
nbPenaltyManagement = 47
feasBooster = 2.5
penaltyIncrease = 1.3434343434343434
penaltyDecrease = 0.32196969696969696
targetFeasible = 0.4292929292929293
repairProbability = 79
repairBooster = 12
On the benchmark instances, this gets an average cost of 164276 (more than 25 better than current main).
With this new penalty configuration in place, I just started tuning the population parameters. Expect that again in a day or so.
For the next parameter group (after population management), I propose the following:
Param | Default value | Min | Max |
---|---|---|---|
nbGranular | 40 | 10 | 100 |
weightWaitTime | 2 | 1 | 25 |
weightTimeWarp | 10 | 1 | 25 |
circleSectorOverlapTolerance | 0 | 0 | 359 |
minCircleSectorSize | 15 | 0 | 359 |
postProcessPathLength | 6 | 1 | 8 |
where we keep `shouldIntensify` fixed at 1 (it's not costly anyway, and the last three parameters already toggle a lot of the cost/intensity).
I think we can skip the node and crossover operators for now: there we already know quite well what works.
It seems that tuning the population parameters hasn't resulted in a significant improvement. There's one set of values that seems slightly better, so I'm testing that now. But it might not pan out - the defaults here were already quite good.
None of the population parameter sets appear to be better than our current default: the one promising set I found gets an average objective +15 above our current settings. So we should probably keep the defaults for population management.
I'll set up the LS parameters now, and run that. That's the last set of static parameters to tune!
The LS stuff is in now too, and there's potential here for another thirty-ish points improvement. There are a few candidate parameter settings that I'll benchmark later today, and then I'll pick the best settings based on what comes out of that.
@leonlan @jaspervd96 before tuning the dynamic part: is there anything in the works there still that I have to wait for?
Still trying all kinds of variations I can think of, but (from my side at least) still nothing that clearly beats the original. So I would not wait for that: start tuning soon to have enough time for it.
In the "worst case", we might at the last minute find something better, tuned the same as the original rather than specifically tuned for it.
Small addition: We should decide though if we can/want to tune certain parameters separately per epoch.
E.g. not one fixed threshold, but a separate threshold for each epoch.
Small addition: We should decide though if we can/want to tune certain parameters separately per epoch.
I talked to @leonlan about this, and we figured we might try a different threshold for epoch 0, 1, and >1. Or just 0 and >0, because the first epoch is a bit special.
Raw configurations + results of the static tuning runs are available here.
The best configuration is this one:
nbGranular = 34
weightWaitTime = 18
weightTimeWarp = 20
circleSectorOverlapToleranceDegrees = 216
minCircleSectorSizeDegrees = 332
postProcessPathLength = 7
with an average cost of 164228, and 48234 iterations.
minCircleSectorSizeDegrees = 332
This means that we're more or less trying all routes in the intensification phase. Based on that, I'm also trying something where we get rid of the whole circle sector overlap thing and just try all routes.
It seems that tuning the population parameters hasn't resulted in a significant improvement. There's one set of values that seems slightly better, so I'm testing that now. But it might not pan out - the defaults here were already quite good.
I think it may be worthwhile to tune `nbIter` independently. Correct me if I'm wrong, but the current parameter tuning tests an entire configuration and, consequently, any positive effect of changing `nbIter` is probably negated by negative effects of changes in the other population parameters.
@leonlan @jaspervd96 before tuning the dynamic part: is there anything in the works there still that I have to wait for?
I don't have any major changes, but I'm undecided whether we should use `n_lookahead` equal to 1, 2 or 3. For each value, it seems that we can get similar performance when setting the corresponding threshold to 35, 25 and 15%, respectively. We could use parameter tuning to figure out which number of lookahead periods to use, but the threshold ranges and values would have to differ in each case. Is this doable? On the other hand, it shouldn't matter much if we just fix 1, 2 or 3, because the performance of each is very similar, and then we reduce the number of parameters to tune.
minCircleSectorSizeDegrees = 332
This means that we're more or less trying all routes in the intensification phase. Based on that, I'm also trying something where we get rid of the whole circle sector overlap thing and just try all routes.
This gets 164224 average cost with 48905 iterations, so more or less the same outcome. It's a lot simpler, however, so I'll open a PR removing the circle sector stuff.
For tuning dynamic: a full dynamic run on eight cores takes about 3h20min, so 3h40min should be plenty on the cluster.
A dynamic instance is derived from a static instance. This process uses some randomness, so there's a seed controlling it (`--instance_seed`). The solver also uses randomness, and a different seed controls that (`--solver_seed`). This results in different variabilities over ten runs with different seeds:
On the qualification and finals, we will be evaluated with different solver seeds (the instances - and their seeds - are fixed for us). So it's good that the variability in that is quite a bit smaller.
For rollout, we have the following parameters
Parameter | Default | Meaning |
---|---|---|
rollout_tlim_factor | 0.7 | Fraction of epoch time to spend on simulations; the rest goes to solving the dynamic instance. |
n_cycles | 1 | Number of simulation cycles to perform |
n_simulations | 50 | Number of simulations in each cycle |
n_lookahead | 1 | Number of epochs to sample ahead in each simulation |
n_requests | 100 | Number of requests per sampled epoch |
dispatch_threshold | 0.35 | We dispatch if the average number of times a request is dispatched (across the simulations) is above this threshold. |
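The dispatch rule in the last row can be sketched as follows (assuming each simulation yields a 0/1 'dispatched' indicator per request; a sketch, not the actual solver code):

```python
def should_dispatch(indicators, threshold=0.35):
    # indicators: per-simulation 0/1 flags for one request.
    # Dispatch when the fraction of simulations that dispatched
    # this request clears the threshold.
    return sum(indicators) / len(indicators) > threshold
```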
I propose the following ranges:
Parameter | Default | Min | Max |
---|---|---|---|
rollout_tlim_factor | 0.7 | 0.6 | 1.0 |
n_cycles | 1 | 1 | 3 |
n_simulations | 50 | 25 | 100 |
n_lookahead | 1 | 1 | 5 |
n_requests | 100 | 50 | 100 |
dispatch_threshold | 0.35 | 0 | 1.0 |
But I do not have a lot of intuition for this. @leonlan @jaspervd96 what do you think?
Most parameter ranges look fine. The only change I suggest is to make dispatch_threshold depend on n_lookahead. All other parameters can be sampled independently.
n_lookahead=1 should take dispatch threshold (20, 35, 50) as (min, default, max)
For 2, (10, 25, 40)
For 3, (5, 15, 30)
For 4 and 5 I don’t know, but something like (5, 10, 20) should work. Maybe even skip 4, I don’t think it’s worth it to try both 4 and 5 because they don’t differ a lot from each other.
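Collected as a lookup table (a sketch; the entries for 4 and 5 lookahead periods are the rough guesses above):

```python
# (min, default, max) dispatch_threshold per n_lookahead, in percent.
THRESHOLD_RANGES = {
    1: (20, 35, 50),
    2: (10, 25, 40),
    3: (5, 15, 30),
    4: (5, 10, 20),  # rough guess
    5: (5, 10, 20),  # rough guess
}


def threshold_range(n_lookahead):
    return THRESHOLD_RANGES[n_lookahead]
```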
There are lots of parameters to the HGS algorithm, not many of which seem to be tuned particularly well. At some point, we should run a tuning tool (e.g., smac) to determine a good set of parameters.
Parameters, in this sense, also includes "which operators to use" (see also e.g. #32).