Understanding spatial clustering

edwardsmarc commented 8 months ago

@jeffreyhanson - hoping you can confirm my understanding of the spatial clustering code and answer my questions below.

If spatial clustering is turned on, the following happens when running fct_min_set_result():

An initial solution is run using no clustering. The most irreplaceable PUs are identified and locked in to the second prioritizr problem that generates the clustered run.
For the clustering problem, a PU cost of zero is used, and a new feature is added where all PU's have a value of one, and an associated max_cost target of initial solution cost * boundary_gap + 1. The number of PUs selected cannot exceed the max_cost. So a boundary gap of 0.5 adds an additional 50% of budget to play with when trying to cluster PUs.
- Question 1 -is there any difference between adding max_cost as an additional target, and adding it as a linear constraint?
- Question 2 -What's the advantage of setting cost to zero? Faster?
Clustering is achieved in the second prioritizr problem using add_connectivity_penalty(). A penalty value of 1 is used to balance the connectivity objective with the min_set objective of meeting targets.
- Question 3 - how is the connectivity objective weighted compared to the min_set objective when a penalty of 1 is used?
  - The connectivity matrix defining connectivity between adjacent cells is asymmetrical. Note that since we're still on prioritizr 7.1.1 we are using the old add_connectivity_penalty() and not add_asym_connectivity_penalties().
  - Values in the matrix are calculated as the irreplaceability value of the current cell, plus the cost of the adjacent cell. This is done by the line con_data@x <- rwr_raw[con_data@i + 1] + cost[con_data@j + 1].
  - Question 4 - why do we add cost to the connectivity values? Is it to encourage the use of PUs that have high weights?
  - Question 5 - can you confirm that when using an asymmetrical connectivity matrix, the direction of connectivity is from the row to the col? e.g. the matrix value in row 1 column 2 is describing connectivity from PU1 to PU2, not PU2 to PU1 (I realise the code snippet above answers this question for me but just want to confirm. It might be useful to add this to the prioritizr documentation for add_asym_connectivity_penalties()).

jeffreyhanson commented 8 months ago

Hi @edwardsmarc,

If spatial clustering is turned on, the following happens when running fct_min_set_result():

An initial solution is run using no clustering. The most irreplaceable PUs are identified and locked in to the second prioritizr problem that generates the clustered run.

For the clustering problem, a PU cost of zero is used, and a new feature is added where all PU's have a value of one, and an associated max_cost target of initial solution cost * boundary_gap + 1. The number of PUs selected cannot exceed the max_cost. So a boundary gap of 0.5 adds an additional 50% of budget to play with when trying to cluster PUs.

Yeah that's all correct!

Question 1 -is there any difference between adding max_cost as an additional target, and adding it as a linear constraint?

Nah, when using a min set objective, the targets and linear constraints are effectively doing the same thing. When using a budget limited objective though (e.,g. min shortfall objective), then this would have a difference because the targets are treated as "soft constraints" (this means you're willing to accept a solution that doesn't meet the constraint, but gets as close as possible) instead of "hard constraints" (this means you won't ever accept a solution that doesn't meet the constraints).

Question 2 -What's the advantage of setting cost to zero? Faster?

Nah, the overall objective function (i.e. the metric being minimized or maximized during optimization) for the optimization process is a combination of the primary objective (e.g. add_min_set_objective() or add_min_shortfall_objective()) and any penalties added to the problem (e.g. add_boundary_penalties()). By setting the costs to zero for a problem with a min set objective, this means that the overall objective function for the optimization problem focusses entirely on the penalties. For the WTW tool, this means that the objective function focuses on the spatial clustering. Note that the optimization problem still cares about the feature representation, because in the second optimization run this is handled using constraints (not the objective function).

Clustering is achieved in the second prioritizr problem using add_connectivity_penalty(). A penalty value of 1 is used to balance the connectivity objective with the min_set objective of meeting targets.

Question 3 - how is the connectivity objective weighted compared to the min_set objective when a penalty of 1 is used?

Yeah, since the costs are zero, the overall objective function focusses entirely on the the spatial fragmentation (via add_connectivity_penalty() because it has a non-zero penalty. Since the costs are zero, you could have any non-zero penalty and have a similar result (in practice, you don't want too high a penalty value because really high values slow down the optimization process due to scaling issues).

The connectivity matrix defining connectivity between adjacent cells is asymmetrical. Note that since we're still on prioritizr 7.1.1 we are using the old add_connectivity_penalty() and not add_asym_connectivity_penalties().

IIRC, in older versions of prioritizr, the add_connectivity_penalty() handled both symmetric and asymmetric connectivity. To avoid confusion -- because someone might want symmetric connectivity but accidently supply asymmetric data because they made a mistake in preparing their data -- this functionality was split across two functions.

Values in the matrix are calculated as the irreplaceability value of the current cell, plus the cost of the adjacent cell. This is done by the line con_data@x <- rwr_raw[con_data@i + 1] + cost[con_data@j + 1].

Question 4 - why do we add cost to the connectivity values? Is it to encourage the use of PUs that have high weights?

Yeah, that's right. I think the idea here was to (1) try and speed up the optimization process and (2) promote spatial clustering around the really important planning units selected in the solution.

Question 5 - can you confirm that when using an asymmetrical connectivity matrix, the direction of connectivity is from the row to the col? e.g. the matrix value in row 1 column 2 is describing connectivity from PU1 to PU2, not PU2 to PU1 (I realise the code snippet above answers this question for me but just want to confirm. It might be useful to add this to the prioritizr documentation for add_asym_connectivity_penalties()).

Although I think the WTW tool is using symmetric connectivity, when using asymmetric connectivity the data is specified like this. Imagine we have 100 planning units, so we have 100 rows and 100 columns. The cell [1, 10] describes the connectivity from planning unit 1 to planning unit 10, and the cell [10, 1] describes the connectivity from planning unit 10 to planning unit 1.

How does that sound? Let me know if you have any further questions?

edwardsmarc commented 8 months ago

For the WTW tool, this means that the objective function focuses on the spatial clustering. Note that the optimization problem still cares about the feature representation, because in the second optimization run this is handled using constraints (not the objective function).

OK, so because the cost is zero there is no way to 'minimize' cost because the sum will always be zero. That makes sense. So the problem is being driven by the connectivity objective which is to maximise the sum of the connectivity values. And the targets will still all be met because we're using add_min_set_objective(). And we ensure a cap on the number of planning units using the max_cost target.

jeffreyhanson commented 8 months ago

Yeah - that's correct!

edwardsmarc commented 8 months ago

Thanks for bearing with me while I got up to speed with all that!

Now I understand the code I think I can answer my own question which I'll record here for others to discuss:

A user reported some strange results when using increasing spatial clustering %'s in WTW. Clustering worked fine but then above a certain value the solution started filling in from the north west corner:

This example shows the min_set solution, and solutions with 30% (s30) and 50% (s50) spatial clustering.

My conclusion is that all the targets are met but there is still unused budget in the max_cost target. Since the min_set objective is essentially deactivated due to the zero cost values, it just uses up the max_cost budget by selecting PUs starting from the top of the matrix. I can confirm that the solution with spatial clustering set to 50% does actually include 50% more PUs in the solution.

edwardsmarc commented 8 months ago

@jeffreyhanson - One thing that is not clear to me is why prioritizr uses up all the max_cost budget. The max_cost target has a sense of <= in add_manual_targets() so I'd expect the solution to stop adding PUs once all the species targets are met. It seems like the max_cost target is acting more like = than <=. Any thoughts on this?

jeffreyhanson commented 8 months ago

Yeah, when using the add_min_shortfall_objective(), if all the targets can be met within the specified budget, then optimization algorithm is happy to return any solution that meets all the targets. For example, if you had a budget of 70 cost units and all the targets could be met with a total of 40 cost units, then any solution that meets the targets and costs anywhere between 40 cost units and 70 cost units would be considered optimal. This is because it doesn't have any other critiera/objectives/constraints to indicate that a lower cost would be considered "better". So, generally what ends up hapenning is that solution will randomly contain extra planning units that aren't "needed" to meet the targets.

To avoid this behaivor, potential options include (1) checking if all the targets are met in the min shortfall prioritization and, if so, then running a min set prioritization to generate a solution that meets the targets for minimum cost or (2) directly interfacing with the Gurobi software to run multi-objective optimization (note that this approach wouldn't work for the CBC solver, since it doesn't provide an equivalent API for multi-objective optimization). Although you could try playing with the add_linear_penalties() function to indicate that you want to minimize both the target shortfalls and the planning unit costs -- I wouldn't advise this because finding a good penalty value that would only affect the solution when all the targets can be met is not trivial (especially if you want to avoid numerical issues that slow down the solver). How does that sound?

edwardsmarc commented 8 months ago

Thanks @jeffreyhanson, that behavior with add_min_shortfall_objective() makes sense to me. The pattern I'm seeing though is using add_min_set_objective() (i.e. no area budget is set in WTW).

I just ran some tests where I run the min_set problem (i.e. no area budget and no clustering in WTW). Then I run the same problem with 30% and 50% clustering. In all cases the solution uses the max_cost budget exactly.

For example, one of my min_set runs has a solution with 45,120 planning units. The 30% spatial clustering solution uses 58,656 (i.e. 45,120 1.3) planning units and the 50% spatial clustering solution uses 67,680 (i.e. 45,120 1.5) planning units. In both cases the species goals are easily met, but it always uses up the full budget of planning units by adding pus from the top of the map.

jeffreyhanson commented 8 months ago

Hmm, it's not surprising that it always uses up the max_cost constraint with spatial clustering, because the objective function for the second optimization run (when doing the spatial clustering) is to promote spatial clustering and this generally requires selecting more planning units. If the second run didn't have a max_cost constraint, then the prioritization would probably select the vast majority of planning units -- because one gigantic reserve is more spatially clustered than lots of small reserves. It is surprising that it's just selecting lots of planning units in the top corner though. Are you using weights, if so I wonder if these planning units have smaller weight values values?

edwardsmarc commented 8 months ago

Hmm, it's not surprising that it always uses up the max_cost constraint with spatial clustering, because the objective function for the second optimization run (when doing the spatial clustering) is to promote spatial clustering and this generally requires selecting more planning units. If the second run didn't have a max_cost constraint, then the prioritization would probably select the vast majority of planning units -- because one gigantic reserve is more spatially clustered than lots of small reserves. It is surprising that it's just selecting lots of planning units in the top corner though. Are you using weights, if so I wonder if these planning units have smaller weight values values?

Not using any weights so the costs should just be the connectivity values. I'll do some more digging. I'll extract the connectivity values as a raster to see exactly what it's operating on.

NCC-CNC / wheretowork

Understanding spatial clustering #354