Calculating target value depending on group size in the context of full region coverage

remidelattre commented 1 year ago

1. Problem

Assuming we need every locality in groups of minimum N localities, how can we distribute « de » so that every group has – roughly – the same ratio between group sum and group size ?

2. Suggested solution

Generate random groups of a given number of localities. Force a coverage of 95 of the region, or 100% if we can handle components. Set a target value for "de" in every group. Compare random generations to the target value. Pick the best generation in a spreadsheets software.

3. How ?

The metric would follow a simple calculation : (Average locality deficit in a given region Desired group size) For REG93, that would be (-3500/943) = (- 3.711 32) = -118.750

The point is not to beat the metric. Adjacency constraints make it impossible. A good group is a group that gets close to the value. However, we should look at every group in a generation and measure the sum of deviations from ideal group value for every group. We could sum those. As a generation gets closer to the ideal value, the score will go down.

Read from adjacency list
Generate random groups

Calculate metric (it's obviously better to have a GUI and let the user determine ideal value, but modifying a python script is easy enough)

Save generations that get fairly close to the ideal value (or save every generation ?)

Ideally, we'd print a csv with the first column being group 1 localities, the second being group 2 localities, etc. Then, a score based on the provided metric :

Generation	Group 1	Group 2	Group 3	Group 4	Group 5	Group 6	Group 7	Group 8	""	total score
1	insee ids here	insee ids here	insee ids here	insee ids here	insee ids here	insee ids here	insee ids here	insee ids here
2	insee ids here	insee ids here	insee ids here	insee ids here	insee ids here	insee ids here	insee ids here	insee ids here

We could apply a tolerance threshold to prevent certain really bad generations from being saved, but its hard to define a tolerance threshold beforehand. I can't tell how bad or how good those groups will get in N iterations.

Force group size, a target coverage and determine the number of generations

4. Results

The results will read like this : Assuming we have to make groups of 32 localities, this is the best way to balance land take deficits between localities of a given region given adjacency constraints. However, making [smaller, bigger] groups would allow a better balancing. I'll run this for 32, then for 31, etc, saving the best generation everytime.

JamesBremner commented 1 year ago

The metric would follow a simple calculation : (Average locality deficit in a given region * Desired group size)
For REG93, that would be (-3500/943) = (- 3.711 * 32) = -118.750

I am guessing this is how you want to calculate the target group deficit sum from the input. Seems doable. I will open an enhancement issue.

This is not an enhancement issue - it is a collection of many features that you have piled on top of each other. Please try and keep features separate. Trying to develop multiple feature all at once is a certain way to get into a horrible mess

JamesBremner commented 1 year ago

Generate random groups of a given number of localities. Force a coverage of 95 of the region, or 100% if we can handle components. Set a target value for "de" in every group. Compare random generations to the target value. Pick the best generation in a spreadsheets software.

Do I understand this correctly? Are you are proposing a random search through an incredibly enormous solution space?

This will take forever! After a few hours you will mostly be re-examining assignments identical to previously generated ones. Essentially you are starting fresh at each generation, throwing away any knowledge you might have gained in the previous iterations.

You need to design some hill climbing heuristic. no matter how poor, that will guide the search and drastically reduce the examination of duplicates.

This seems to be a good moment to remind you of my two suggestions, which so far you have ignored.

Suggestion 1:

I expect that the ungrouped localities will be mostly disconnected from each other because their neighbors have been assigned to groups. So we could try forcing any that are connected to each other into groups, without regard to the acceptable de sums or group size. We would allow even single localities, that are isolated because ALL their neighbors have bee assigned, to be in a 'group' of one locality.

Suggestion 2:

- LOOP U over unassigned localities
    -   LOOP A over adjacent assigned localities to U
          - Add U to adjacent group that moves the de sum of the group the least distance away from zero.

Either one of these will give you a much better result, in terms of coverage, and do it in a few seconds.

remidelattre commented 1 year ago

Do I understand this correctly? Are you are proposing a random search through an incredibly enormous solution space?

This was the idea. Time is no issue. With simulated annealing for example, we're exporting a new .csv everytime a better solution is found. When we find that the program has been running for hours without a single improvement in coverage, we close it and take the best coverage available. I'm not sure I understand your point about identical assignements : either the solution space is huge which means the solutions have a very little chance of repeating themselves, either we get a lot of the same generations in which case the assignement is not random. It's impossible to know what the actual best groups are for a given group size, therefore it makes sense to settle for suboptimal values. An advantage of this is, also, the possibility of running it on multiple computers : whoever ends up with the best score wins. A flaw is that we'd always be tempted to wait a little more, just in case. However, if we know the number of localities, the required group size and the number of adjacencies, is there no way to calculate the number of possible solutions ?

Regarding suggestion 1, I would agree with this idea, but then again, looking at groups generated by Depaver, it is quite rare to see them touch each other. Most of the time, there is space inbetween groups to generate other groups that would fit the criterion. If you think this is worth pursuing to increase coverage, please do. The issue is that group size has to remain fairly identical, and this suggestion does not seem to adress that, as it generates "groups" of one locality.

Regarding suggestion 2, I think this is a good idea if we're able to implement it within the dynamic range we previously discussed. The best way to increase coverage, empirically, is to accept negative, and sometimes worse groups (as in worse by the metric). I'd be interested in seeing how suggestion 2 affects coverage. DFS can be used to remove localities from a group and assign them elsewhere because they provide more value. This is somewhat similar to what a human would do if tasked to generate groups based on size and sum constraints : iterate, remove localities, assign them elswhere, keep the big surplus localities for later on because some high deficits have yet to be assigned, etc.

JamesBremner commented 1 year ago

There are ~1000 localities in region 93. You mention very large group sizes ( despite knowing that this reduces your coverage and is practically impossible ) so lets go with groups of 30. The number of different ways to select 32 from 1000 is

For comparison, there are only 10 to the24 starts in the entire universe.

And remember, you have no way to prevent re-examining the same assignment ,multiple times.

remidelattre commented 1 year ago

Well, that answers my question. Though I stand by the idea group size itself does not reduce coverage. Group size with hard "de" constraints will reduce coverage. If you can tolerate a delta of 200 between your best and your worst 32 group, you'll be able to cover 100% just fine (see my handmade groups). For now, let's try maintaining some heuristics, and if all else fails, I'll resort to probabilities. I'll open an issue dedicated to suggestion 2.

JamesBremner commented 1 year ago

With simulated annealing for example,

As I understand you, you are NOT proposing simulated annealing. That is not a random search. It takes a random selection from the solution space and works forward from there, by shaking things up and then doing lots of local optimizations to, hopefully, improve the total solution. Previous work is not thrown away.

Please, review whatever tutorial you have on how simulated annealing works, and then we could talk about possibly applying the method to this problem. ( I have a done a few simulated annealing applications in the past, but I must warn you that it tends to be extremely slow and can usually be greatly improved by using something else. )

remidelattre commented 1 year ago

Exactly ! We actually pair simulated annealing to "shake things up" and DFS to check adjacency when moving localities from a group to another. Here, I was not comparing random generations to the way simulated annealing works. I was saying we're just used to slow result generation. It is very slow, as I've mentioned. But the coverage is good. How good ? We'll know in a few days.

I am not proposing simulated annealing because we are already doing it with a python script. If you think there's a better way to solve this question : Assuming we need every locality in groups of minimum N localities, how can we distribute « de » so that every group has – roughly – the same ratio between group sum and group size ? or at least a way that could compete with it, it'd be great, because it would allow for quality control of our own methods.

Our point is not to use every method. We'll eventually settle for the one which yields the best result, as per coverage and metric. But we need to discuss why we choose a method over another, hence why trying several ways is interesting.

JamesBremner commented 1 year ago

Time is no issue.

Are you sure? In five billion years or so the sun will expand into a red giant and consume the Earth. I think that your python script might well glitch when that happens.

remidelattre commented 1 year ago

Hahaha yeah. The probabily of it glitching, or windows randomly updating my system was an issue before we figured we could just save our latest and best result periodically. Regardless, using random generation when heuristics is potentially available is probably not the best idea, although it can serve to illustrate how well algorithms are doing when confronted with the dire obligation of 100% coverage. Keeping this open if you'd like to discuss algorithms that could potentially compete with simulated annealing. Was not familiar with algorithms a week ago, and obviously perhaps missed some that would be able to compete with simulated annealing.

JamesBremner / so77307162

Calculating target value depending on group size in the context of full region coverage #18