NCC-CNC / wheretowork

Interactive application for systematic conservation planning
https://NCC-CNC.github.io/wheretowork

spatial clustering run time #139

Open jeffreyhanson opened 2 years ago

jeffreyhanson commented 2 years ago

Joe had this comment about the app:

Spatial clustering is much slower (not surprising)? It took about 15 min to run with 50% clustering, whereas other runs were very quick (< 1 min).

Yeah, it would be great to reduce run time for spatial clustering. The prioritizations are generated using a multi-objective approach that comprises two sequential optimization runs (one to find a solution based on goals + weights, and a second that adds spatial clustering). One approach to reduce run time for spatial clustering would involve (1) using a starting solution to speed up the second optimization run, and (2) setting a time limit for the second optimization run. This is trivial with Gurobi - but I haven't worked out how to do this with CBC yet.
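
For concreteness, here's a minimal sketch of what that could look like with prioritizr and the Gurobi solver (the easy case mentioned above). The `pu`, `features`, and `budget` objects, the target level, and the penalty value are all hypothetical placeholders - this is not the app's actual code:

```r
library(prioritizr)

# base problem: goals + weights only (hypothetical inputs)
p0 <-
  problem(pu, features, cost_column = "cost") %>%
  add_min_shortfall_objective(budget) %>%
  add_relative_targets(0.2) %>%
  add_binary_decisions()

# first run: solve for goals + weights
s1 <- solve(p0 %>% add_gurobi_solver(gap = 0))

# second run: add boundary penalties for spatial clustering, then
# (1) warm start from the first solution and (2) cap the run time
p2 <-
  p0 %>%
  add_boundary_penalties(penalty = 0.001) %>%
  add_gurobi_solver(
    gap = 0.1,
    time_limit = 300,                # seconds
    start_solution = s1$solution_1
  )
s2 <- solve(p2)
```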

jeffreyhanson commented 2 years ago

This issue is also similar to the run time issues discussed in https://github.com/NCC-CNC/wheretowork/issues/94

ricschuster commented 2 years ago

I think this is not critical for now, but we could try to figure this out in the next iteration.

We (NCC) could eventually also approach Gurobi and ask if we could get a free license from them, given our line of work, but I'm not sure how far off that is or how realistic it actually is.

jeffreyhanson commented 2 years ago

I've been playing around with various methods to speed up the optimization process. It seems that these methods do have some success - but generating solutions for the pilot dataset with no selected Includes will still take ages. These methods involve modifying the secondary optimization run -- which is used to promote spatial clustering -- to do the following (see the sketch after this list):

  1. Use the initial solution based on just cost + features + weights as a starting solution for the run. The idea here is to help speed up the solver by giving it a decent starting solution.
  2. Calculate planning unit importance scores (based on range-weighted rarity) and lock in planning units selected in the initial solution that have high importance scores (i.e. above median). The idea here is to help speed up the solver by fixing variables that are likely to be selected in the second solution (because they are needed to meet the goals).
  3. Instead of using boundary penalties, use connectivity penalties that parametrize connectivity between pairs of planning units based on range-weighted rarity scores (note that this is not just limited to planning units selected in initial run). The idea here is to help speed up the solver by making the objective coefficients more variable so that the solver has more information to go on.
  4. Set pairwise connectivity values to zero if they are below the median. The idea here is to help speed up the solver by reducing the problem size -- and this is achieved by removing variables/constraints from the problem that are unlikely to influence the solution. Further analysis suggests that this might actually increase run time (at least on my computer anyway; perhaps this method might be useful on machines with <30GB RAM).
  5. Use a time limit of 5 minutes. Since we provide a starting solution in (1), at worst, we can expect that the solver will simply return a solution with no improvement in spatial clustering. The idea here is that asking people to wait longer than 5 minutes for a purportedly "interactive" tool is a bit unreasonable. Note that the goals and weights will still always be met/handled correctly even if the solver times out after 5 minutes - this only affects the spatial clustering of the solution.
  6. Set a relaxed optimality gap of 50%.
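
Here's a rough sketch of how methods 1-3, 5, and 6 might be wired together with prioritizr and the open-source CBC solver. The `pu`, `features`, and `budget` objects, the targets, and the penalty values are hypothetical placeholders, and it assumes a prioritizr version whose `add_cbc_solver()` accepts a `start_solution` - this is not the actual app code:

```r
library(prioritizr)

# base problem: goals + weights only (hypothetical inputs)
p0 <-
  problem(pu, features, cost_column = "cost") %>%
  add_min_shortfall_objective(budget) %>%
  add_relative_targets(0.2) %>%
  add_binary_decisions()

# first run
s1 <- solve(p0 %>% add_cbc_solver(gap = 0.1))

# method 2: compute importance scores (rarity weighted richness) and
# lock in selected planning units with above-median importance
imp <- eval_rare_richness_importance(p0, s1[, "solution_1"])
sel <- s1$solution_1 > 0.5
locked_in <- sel & (imp$rwr > median(imp$rwr[sel]))

# method 3: connectivity penalties parametrized by rarity scores,
# here crudely averaged across each pair of adjacent planning units
# (the dense outer() is fine for small problems)
adj <- adjacency_matrix(pu)
con <- adj * (outer(imp$rwr, imp$rwr, "+") / 2)

# method 4 (not used in the end): zero out below-median pairwise values
# con@x[con@x < median(con@x)] <- 0

# second run: warm start (method 1), lock-ins (2), connectivity
# penalties (3), a 5 minute time limit (5), and a relaxed gap (6)
p2 <-
  p0 %>%
  add_locked_in_constraints(locked_in) %>%
  add_connectivity_penalties(penalty = 0.001, data = con) %>%
  add_cbc_solver(gap = 0.5, time_limit = 300,
                 start_solution = s1$solution_1)
s2 <- solve(p2)
```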

In addition to these six methods, I also played around with setting connectivity penalties based on the weights. This worked fairly well when using the "Protected Area distance" weight -- but this method won't work for problems where no weights are selected. I think it will also be less likely to work well for binary weights -- and these make up the majority of weights in the pilot dataset. So, I haven't really looked into it further.
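
For reference, the weight-based variant looked roughly like this (reusing the hypothetical objects from the sketch above, with a made-up continuous `pa_dist` weight column):

```r
# scale pairwise connectivity by a continuous weight layer; with no
# weights selected there is nothing to build on, and binary weights
# yield mostly identical pairwise values
w <- pu$pa_dist / max(pu$pa_dist)
con_w <- adjacency_matrix(pu) * (outer(w, w, "+") / 2)
p_w <- p0 %>% add_connectivity_penalties(penalty = 0.001, data = con_w)
```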

Although these methods are potentially useful for this tool, it should be noted that they mean we lose optimality guarantees when using spatial clustering. I think this is OK because spatial clustering is often just a fudge used to make solutions look "better". In other words, assuming that users just want a spatially clumped solution that meets the goals/weights/includes -- then optimality doesn't really matter too much anyway.

Also, for any random person reading this post: these methods are only being considered here because we (i) aim to provide an interactive tool and thus need solutions relatively quickly, and (ii) cannot use commercial solvers (e.g. Gurobi or IBM CPLEX) for this particular application (even though they have free academic licenses). If your situation does not meet both of these criteria, I strongly suggest that you do not use any of the methods described above for reducing solver run times.

jeffreyhanson commented 2 years ago

@josephrbennett, I've just pushed a new version of the app to https://ncc.carleton.ca/ with the speed improvements mentioned above. When you get a chance, could you please take a look and see what you think?

ricschuster commented 2 years ago

Thanks Jeff! Which points did you incorporate? Looks like not 4., but how about the other ones?

jeffreyhanson commented 2 years ago

Yeah, that's right - I've incorporated all of the methods except 4 in 53f72dd.