Make `Election.get_example_postcode` less slow

GeoWill commented 3 years ago

https://elections.democracyclub.org.uk/elections/gla.a.2021-05-06/ Gives a 502.

chris48s commented 3 years ago

Not actually tested this, but my money is on this query: https://github.com/DemocracyClub/EveryElection/blob/33eff41ee51d6bdc2d8d2ba57a0afbed48ca5ef4/every_election/apps/elections/models.py#L269-L279

My guess would be it is fast if the polygon is small/slow if the polygon is large, or maybe it depends on how many postcode points are contained by the polygon.. I suspect that's why https://elections.democracyclub.org.uk/elections/local.adur.buckingham.2021-05-06/ just loads instantly but https://elections.democracyclub.org.uk/elections/gla.c.barnet-and-camden.2021-05-06/ stalls for ~10 seconds https://elections.democracyclub.org.uk/elections/pcc.west-midlands.2021-05-06/ stalls for ~15 seconds etc

Suspect you've got your own ideas at this point, but some ideas on making it more efficient:

Drop the constraint that you're trying to get one near the centre point of the polygon and just take any old postcode with a centroid inside the polygon (think that's what mapit does).
If you want to keep that constraint would be to work out an example for every polygon at import time (either polygon or ONSPD) and store it in the DB instead of trying to calculate it on-the-fly
Keep the constraint and work it out on the fly but only use a random sample of up to N ONSPD records to put a hard limit the number of distances you need to calculate

GeoWill commented 3 years ago

I think you're spot on about the culprit. This is a common enough thing that PostGIS has the capability built in to do a nearest neighbour search based on the bounding boxes. In our case these are the same as the points, so it comes to the same thing - this is basically your option 3.

This should look something like:

Onspd.objects.raw(
    """
    SELECT * FROM uk_geo_utils_onspd onspd
    WHERE ST_WITHIN(onspd.location, %s::geometry) 
    ORDER BY onspd.location <-> %s::geometry
    LIMIT 1
    """,
    [self.geography.geography.ewkt, self.geography.geography.centroid.ewkt]
)[0]

But I get ctypes.ArgumentError: argument 3: <class 'TypeError'>: wrong type. I think this is a separate issue, as it throws the same error if I try something like Onspd.objects.raw("SELECT * FROM uk_geo_utils_onspd order by pcds limit 1")[0].

Just recording this here as it doesn't feel like a huge priority if #947 is good enough for now.

GeoWill commented 3 years ago

This doesn't 502 any more after #947, but it's still slow. Pulling together options from up thread and out of channel chats:

Calculate ahead of time and store in the DB.
Put an upper limit on number of postcodes considered for any polygon.
Make that raw query work.
Some cunning custom function to let us use <-> in the orm

I've ignored dropped the option of dropping the constraint because then we can get split postcodes. Unlikely but we have enough other options.

chris48s commented 3 years ago

Just thinking about this a bit more.. I think the reason why Sym implemented this as "pick a postcode that is close to the (multi-)polygon centroid" was to try and avoid picking one that is split across multiple divisions. While this does achieve that in many cases, the centroid isn't necessarily actually inside the polygon. For example, if a polygon is shaped like a C or an O the centroid is outside the polygon and picking a postcode with a centroid close to it actually makes it much more likely you're going to pick a postcode that is split across multiple divisions because you're going to pick the closest postcode to some edge. So.. two thoughts:

The thing we really want to achieve is not so much "pick a postcode that is close to the (multi-)polygon centroid" but "pick a postcode that is a) inside the (multi-)polygon and b) far away from edges". If we think of that as the end goal, does that lend itself to any more efficient implementations (and also reduce cases where the example is split across multiple divisions)?
This won't do anything to improve performance, but if we stick with that rough method, as well as the centroid there is also ST_PointOnSurface() (see https://gis.stackexchange.com/a/76563 for explanation). I think using that as the point you want the example postcode to be close to would have an advantage over ST_Centroid() because unlike centroid it is guaranteed to actually be inside the (multi-)polygon. I think using that as the target would probably reduce the number of cases where this method chooses an unhelpful example, but it wouldn't completely "fix" it. For example, if you imagine a polygon that is shaped like an egg-timer, this method will still going put your target point right at the narrowest part of the polgygon (where there is maximum change of being near an edge).

GeoWill commented 3 years ago

"pick a postcode that is a) inside the (multi-)polygon and b) far away from edges" is definitely what we're trying to do. I think the best way to do this is just to pick a point that falls in a negative buffer of the polygon. I had discounted this because a. performance and b. fiddly-ness of setting the buffer (too big and you collapse small areas, too small and you don't get far enough away from the edges.) However what else are you meant to do on Friday nights these days... fiddly sql be here. I haven't refactored it to run through every org, but I've tried to check it on some extreme examples and it runs in under a second even on places like Highlands. I guess I'm not sure that's a query I'd want to leave lying around the codebase though. It also isn't a huge buffer, to avoid collapsing little polygons.
Nice link (I've upvoted that answer but had completely forgotten about it). I think this is a better approach than the centroid. However as you point out none of these eliminate the chance of showing a split postcode, so maybe we just shouldn't worry about it :roll_eyes:

Related Trivia: the organisation geom with its centroid furthest from the nearest onspd postcode point is North Ayrshire (presumably due to sparse population, and followed by Bristol (!) presumably due to Steep Holm induced boundary weirdness.

chris48s commented 3 years ago

lol yeah the centroid point for Bristol is probably somewhere in the middle of the channel :laughing:

However as you point out none of these eliminate the chance of showing a split postcode, so maybe we just shouldn't worry about it

It would be possible to pick a postcode that is guaranteed not to overlap with addressbase (maybe another reason to calculate it once and then explicitly store it).. unless of course there is a division somewhere where every postcode with one or more properties inside it also has one or more properties in another division (i.e: every postcode is split). That would be an amazing edge case..Blatant nerd-snipe attempt: I wonder if it exists? :thinking:

GeoWill commented 3 years ago

Django 3.0 has support for <->. It's also available in the DRF framework. This doesn't eliminate the possibility of getting a split postcode, but will significantly boost performance.

DemocracyClub / EveryElection

Make `Election.get_example_postcode` less slow #933