DigitalCommons / property-boundaries-service

A service to download updated versions of the land registry property boundaries and serve with company information.

[Backsearch p1] [Spike] Investigate the INSPIRE updates #10

Closed lin-d-hop closed 6 months ago

lin-d-hop commented 10 months ago

Description

Continuing on from DigitalCommons/land-explorer-front-end#47... the story so far is that the property boundaries have been moved to an independent service, with a few extra features for backsearching. Now we want to update the data more consistently.

Conversations on this are taking place in this notes doc.

This issue is about researching how accurate the matching of polygons is when we update the INSPIRE dataset so that we can write a script that pulls in these updates with confidence.

Question 1: How many INSPIRE polygons are a perfect match for polygons in our database?

Question 2: For any that don't match, can we get a confidence measure based on the level of overlap with polygons in our database (see the sketch below)? e.g. 90% of polygons match with >85% overlap, 87% match with >90% overlap, etc.

Question 3: For any that don't match, for which we have a title ID, can we geocode the title's address? Does the geocoded address fall within the new INSPIRE polygon? How many of the non-matching polygons can we match in this way?

As the data sets are big, you might like to experiment with a limited dataset. The goal is to be confident in a strong set of matches, arrived at in a formulaic way, so that we can script this.
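To make Question 2 concrete, here is a minimal sketch of the overlap measure, assuming the comparison were done in Python with shapely; the coordinates below are illustrative, not real INSPIRE data, and this is not the actual analysis script.

```python
# Minimal sketch of the Question 2 overlap measure, assuming Python + shapely.
# The coordinates below are illustrative, not real INSPIRE data.
from shapely.geometry import Polygon

def percentage_overlap(old_poly: Polygon, new_poly: Polygon) -> float:
    """Intersection area as a percentage of the old polygon's area."""
    if old_poly.area == 0:
        return 0.0
    return 100 * old_poly.intersection(new_poly).area / old_poly.area

old_poly = Polygon([(0, 0), (0, 10), (10, 10), (10, 0)])
new_poly = Polygon([(0.1, 0.1), (0.1, 10.1), (10.1, 10.1), (10.1, 0.1)])
print(f"{percentage_overlap(old_poly, new_poly):.1f}% overlap")  # ~98.0%
```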

Acceptance Criteria

a. A percentage of perfect matches (Q1)
b. A set of matches aimed at getting us to over 99% of polygons matching if possible, or an exploration of the general conditions in which we do find matches
c. For any non-perfect matches, how many additional matches can be found using the title ID
d. A sample of properties we failed to match, such that we can do some manual investigation

rogup commented 10 months ago

Things I've done so far:

Here are some graphs of the output of the analysis script, which was run on the INSPIRE data for 15 councils (just the first 15 alphabetically), using 5,000 polygons from each council. There were no exact matches, which is interesting (and a bit annoying!). About 3% of the 75,000 tested INSPIRE IDs were new, which is good and indicates that the govt isn't changing lots of IDs for no reason.

For each polygon (old vs new), I calculated the difference in lat and long between each old vertex and its corresponding new vertex. I then took the mean of these per-vertex offsets (separately for lat and long), and took the greater of the two means for simplicity (since the lat and long offsets were of a similar order of magnitude) to get a single mean offset for each polygon. I did the same to get each polygon's offset standard deviation.
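Roughly, the per-polygon statistics described above could be computed like this (a sketch assuming Python + numpy and that the old and new vertex lists are in matching order; it is not the actual analysis script):

```python
# Sketch of the per-polygon offset statistics described above (not the actual
# analysis script). Assumes old_vertices and new_vertices are equal-length
# lists of (lat, long) pairs in matching order.
import numpy as np

def offset_stats(old_vertices, new_vertices):
    diffs = np.array(new_vertices) - np.array(old_vertices)  # per-vertex (lat, long) offsets
    mean_lat, mean_long = diffs.mean(axis=0)
    std_lat, std_long = diffs.std(axis=0)
    offset_mean = max(abs(mean_lat), abs(mean_long))  # greater of the two means
    offset_std = max(std_lat, std_long)
    return offset_mean, offset_std
```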

The offset mean graph shows that the polygons within each council all seem to be offset by the same small distance, which differs slightly between councils. The standard deviations are all tiny (on the order of a billionth of a degree, 10,000 times smaller than the mean offset). This indicates that these are pretty much exact offsets, with every vertex of a polygon being translated in the same way.

Next steps:

(Attached graphs: intersects, offset_mean, offset_std)

rogup commented 10 months ago

@lin-d-hop Here's an update ^. Please share any thoughts on the analysis so far, and any suggestions you have.

@wu-lee I'm a bit concerned that there may be no backup of the polygons database, since we no longer have access to the data. It's a bit risky to be doing this analysis with lots of DB operations (albeit all read operations). I think we should make regular backups, and maybe also another copy so that we can have staging and production instances of the property-boundaries-service, especially when we start testing writing updates to the DB. Do you know how easy it would be to make a one-off backup of the MySQL DB for now, maybe even just under another user on dev-2? I'm hoping it's not too difficult, since most of the deployment is defined by Ansible.
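For what it's worth, a one-off dump could be as simple as the hypothetical sketch below; the database name, output path and credentials setup are placeholders, not the real deployment config.

```python
# Hypothetical one-off dump along the lines asked about above. The database
# name and output path are placeholders, and credentials are assumed to come
# from the invoking user's ~/.my.cnf rather than the command line.
import gzip
import subprocess
from datetime import date

DB_NAME = "property_boundaries"                              # placeholder
OUT_FILE = f"/home/backups/{DB_NAME}-{date.today()}.sql.gz"  # placeholder path

# --single-transaction takes a consistent snapshot without locking InnoDB tables.
dump = subprocess.run(
    ["mysqldump", "--single-transaction", DB_NAME],
    capture_output=True,
    check=True,
)

with gzip.open(OUT_FILE, "wb") as f:
    f.write(dump.stdout)
```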

rogup commented 10 months ago

I've had a manual look at some of the polygons that matched with an offset.

These old polygons in Adur District Council seem to be slightly off compared with the property outlines on Mapbox. But in the new INSPIRE data, the polygons have been shifted to fit exactly in their correct position.
Screenshot 2023-12-30 at 01 34 38 Screenshot 2023-12-30 at 01 35 25

Similarly, in Babergh District Council (the council with the biggest offset of the ones I analysed), old polygons have been shifted to be more aligned with their expected positions (in a different direction to those in Adur).
Screenshot 2023-12-30 at 02 03 45 Screenshot 2023-12-30 at 02 04 39

I also had a look at a polygon which we couldn't match with an offset, since there was a different number of vertices (367 in the old data vs 378 in the new data), so we couldn't match vertex-to-vertex. But looking at the data on the map, the polygons look identical; the new vertices have been added to improve the smoothness of some of the more rounded edges. I tried offsetting by the same distance as a nearby polygon and then calculated the % intersect, and it was a >99.5% match. It looks like there are many other cases like this.
Screenshot 2023-12-30 at 17 02 32 Screenshot 2023-12-30 at 17 02 08
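For reference, that spot check amounts to something like the sketch below (assuming shapely; the choice of denominator is my own illustration, not the script's exact logic):

```python
# Sketch of the spot check described above: shift the old polygon by a nearby
# polygon's offset, then measure the overlap with the new polygon. Assumes
# shapely; the denominator choice is illustrative, not the script's exact logic.
from shapely.affinity import translate
from shapely.geometry import Polygon

def offset_intersect_pct(old_poly: Polygon, new_poly: Polygon,
                         lat_offset: float, long_offset: float) -> float:
    shifted = translate(old_poly, xoff=long_offset, yoff=lat_offset)  # x = long, y = lat
    return 100 * shifted.intersection(new_poly).area / new_poly.area
```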

rogup commented 10 months ago

After looking at these manual cases, I refined my analysis script to:

Running this on 10,000 properties in Birmingham, there were only a handful (<1%) that had an intersect of less than 99.5%. Here's a histogram of their percentage intersects:

Screenshot 2024-01-01 at 01 31 49

I then had another manual look at some of these cases, to understand what is happening, and to indicate what % intersect threshold we should use in the final script.

Next steps:

lin-d-hop commented 10 months ago

@rogup Thanks for all of this analysis. It is really really awesome!

  1. Totally agree regarding backups. Let's action this? A monthly dump to another server would work. @rogup @wu-lee thoughts about what this backup should look like?

  2. Great to know that properties mostly match and that the corrections appear to be valid improvements. This is great! Based on your examples above, I would say the 98.1% match is a full match. For everything less than that, it would be great to understand if there are new properties from the subdivision (as you suggest). Your next steps look good.

In the case of segment/merge it would be interesting to store this data, though I would say we should do so in the simplest way possible, maybe even just via backups. Working with this historical data isn't the priority right now, but at the same time let's leave the door open.

The main thing we need to figure out is the link between the new segment's INSPIRE ID (if it now has one) and the title ID. The 'price paid' data set doesn't seem to have title IDs or INSPIRE IDs in it (correct me if I'm wrong). Perhaps one approach is that, when there is a new title ID in the 'companies that own land' data, we geocode that address and aim to match it to an unmatched INSPIRE ID? For new segments that are privately owned I'm not sure there is any route to finding the title ID, but it is a good idea to try and find open data sets with title IDs.
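Sketching the geocoding idea for concreteness: `geocode` below is a hypothetical helper standing in for whatever geocoding service would be used, returning a (lat, long) pair; it is not an existing function in the service.

```python
# Sketch of the suggested matching step: geocode a new title's address and see
# which (if any) unmatched INSPIRE polygon contains the resulting point.
# `geocode` is a hypothetical helper returning (lat, long); it is not an
# existing function in the service.
from shapely.geometry import Point, Polygon

def match_title_to_polygon(address, unmatched_polygons, geocode):
    """unmatched_polygons maps INSPIRE ID -> shapely Polygon (in long/lat coords)."""
    lat, long = geocode(address)
    point = Point(long, lat)  # shapely uses (x, y) = (long, lat)
    for inspire_id, poly in unmatched_polygons.items():
        if poly.contains(point):
            return inspire_id
    return None
```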

rogup commented 10 months ago

@lin-d-hop I've made a ticket for a backup service: https://github.com/DigitalCommons/technology-and-infrastructure/issues/116

Cool, I'll have a think about a simple way to keep segment/merge data.

I think your suggested approach of geocoding new title IDs and trying to match them with unmatched INSPIRE IDs sounds like a good and simple first step. You're right that the 'price paid' data doesn't include any IDs that we can link. It does provide a postcode though, so maybe in the future we could use this data alongside our own script's analysis to increase our confidence when piecing together information. I also think there's the potential to eventually provide some REALLY juicy info about sales, e.g. a user could see how much property a council has been selling to private companies/people. That could have some exciting use cases!

rogup commented 10 months ago

I've made some progress today on an algorithm to detect segmentation (which can sometimes be incomplete if part of the old boundary is no longer registered as an INSPIRE polygon). It has been very tricky and fiddly, but I think necessary, since there are lots of cases like this. It should be easy to extend this to merge cases.
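For context, a much-simplified version of that kind of check might look like the sketch below (assuming shapely; the real algorithm also has to cope with parts of the old boundary no longer being registered, and the coverage threshold here is illustrative):

```python
# Much-simplified sketch of a segmentation check (the real algorithm described
# above also handles partially unregistered old boundaries). The coverage
# threshold is illustrative.
from shapely.ops import unary_union

def looks_like_segmentation(old_poly, candidate_new_polys, coverage_threshold=0.99):
    """True if >=2 new polygons overlap the old one and jointly cover most of it."""
    pieces = [p for p in candidate_new_polys if p.intersects(old_poly)]
    if len(pieces) < 2:
        return False
    covered = unary_union(pieces).intersection(old_poly).area
    return covered / old_poly.area >= coverage_threshold
```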

I'm also seeing quite a few cases of boundaries slightly shifting between adjacent properties, so I want the algorithm to be able to process this too.

I just saw a weird case in Adur District Council, in a residential block of flats (6, 7, 9, 10, 12, 18 and 21 Woodview, Shoreham by Sea and Garages), where lots of INSPIRE polygons were overlapping or duplicated. Some polygons with the same ID have now moved around a bit to other boundaries within the block of flats, but there are still duplicate polygons.

Screenshot 2024-01-10 at 19 11 12 Screenshot 2024-01-10 at 19 13 34

I think cases like this are ones that we will need to geocode, and just treat them as if the old boundary was wrong and has now been moved to the correct position.

rogup commented 10 months ago

@lin-d-hop Testing my current script on a sample of 15000 polygons from one council, it's managing to confidently match ~99.9% of them. This is more than the 99% in the acceptance criteria for this ticket.

But doing a very rough crunch of the numbers, this would still leave ~20,000 failed matches if we ran the script on all of the millions of polygons in the INSPIRE dataset. I think it's worth spending a few more days to finish the segmentation algorithm, the merging algorithm, and the detection of boundaries shifting between adjacent properties. This will greatly reduce the number of failed matches, so that it's easier to look through them manually.

Note that there are lots more changes now than there would be month-to-month once the pipeline is functioning normally, since we're comparing against data that is several years old. I think in a few days, we'll be at a point where there are maybe only a handful of edge cases each month that the script can't classify. We'll then be able to look at them manually on a case-by-case basis and decide whether to just accept/reject the change or make improvements to the script to handle them automatically in the future.

lin-d-hop commented 6 months ago

Moving forward from this R&D phase. Closing this issue and replacing with DigitalCommons/property-boundaries-service#12