DigitalCommons / property-boundaries-service

A service to download updated versions of the land registry property boundaries and serve with company information.

[Backsearch p1] [Spike] Investigate the INSPIRE updates #10

Closed lin-d-hop closed 6 months ago

lin-d-hop commented 10 months ago

Description

Continuing on from DigitalCommons/land-explorer-front-end#47... the story so far is that the property boundaries have been moved to an independent service, with a few extra features for backsearching. Now we want to update the data more consistently.

Conversations on this are taking place in this notes doc.

This issue is about researching how accurate the matching of polygons is when we update the INSPIRE dataset so that we can write a script that pulls in these updates with confidence.

Question 1: How many INSPIRE polygons are a perfect match for polygons in our database?

Question 2: For any that don't match, can we get a confidence measure based on the level of overlap with polygons in our database (see the sketch below)? e.g. 90% of polygons match with >85% overlap, 87% match with >90% overlap, etc.

Question 3: For any that don't match, for which we have a title ID, can we geocode the title's address? Does the geocoded address fall within the new INSPIRE polygon? How many of the non-matching polygons can we match in this way?

As the data sets are big, you might like to experiment with a limited dataset. The goal is to be confident in a strong set of matches, arrived at in a formulaic way, so that we can script this.
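To make Question 2 concrete, here is a minimal sketch of the overlap measure, assuming the comparison were done in Python with shapely; the coordinates below are illustrative, not real INSPIRE data, and this is not the actual analysis script.

```python
# Minimal sketch of the Question 2 overlap measure, assuming Python + shapely.
# The coordinates below are illustrative, not real INSPIRE data.
from shapely.geometry import Polygon

def percentage_overlap(old_poly: Polygon, new_poly: Polygon) -> float:
    """Intersection area as a percentage of the old polygon's area."""
    if old_poly.area == 0:
        return 0.0
    return 100 * old_poly.intersection(new_poly).area / old_poly.area

old_poly = Polygon([(0, 0), (0, 10), (10, 10), (10, 0)])
new_poly = Polygon([(0.1, 0.1), (0.1, 10.1), (10.1, 10.1), (10.1, 0.1)])
print(f"{percentage_overlap(old_poly, new_poly):.1f}% overlap")  # ~98.0%
```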

Acceptance Criteria

a. A percentage of perfect matches (Q1)
b. A set of matches aimed at getting us to over 99% of polygons matching if possible, or an exploration of the general conditions in which we do find matches
c. For any non-perfect matches, how many additional matches can be found using the title ID
d. A sample of properties we failed to match, such that we can do some manual investigation

rogup commented 10 months ago

Things I've done so far:

Here are some graphs of the output of the analysis script, which was run on the INSPIRE data for 15 councils (just the first 15 alphabetically), using 5,000 polygons from each council. There were no exact matches, which is interesting (and a bit annoying!). About 3% of the 75,000 tested INSPIRE IDs were new, which is good and indicates that the govt isn't changing lots of IDs for no reason.

For each polygon (old vs new), I calculated the difference in lat and long between each old vertex and its corresponding new vertex. I then took the mean of these per-vertex offsets (separately for lat and long), and took the greater of the two means for simplicity (since the lat and long offsets were of a similar order of magnitude) to get a single mean offset for each polygon. I did the same to get each polygon's offset standard deviation.
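Roughly, the per-polygon statistics described above could be computed like this (a sketch assuming Python + numpy and that the old and new vertex lists are in matching order; it is not the actual analysis script):

```python
# Sketch of the per-polygon offset statistics described above (not the actual
# analysis script). Assumes old_vertices and new_vertices are equal-length
# lists of (lat, long) pairs in matching order.
import numpy as np

def offset_stats(old_vertices, new_vertices):
    diffs = np.array(new_vertices) - np.array(old_vertices)  # per-vertex (lat, long) offsets
    mean_lat, mean_long = diffs.mean(axis=0)
    std_lat, std_long = diffs.std(axis=0)
    offset_mean = max(abs(mean_lat), abs(mean_long))  # greater of the two means
    offset_std = max(std_lat, std_long)
    return offset_mean, offset_std
```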

The offset mean graph shows that the polygons within each council all seem to be offset by the same small distance, which differs slightly between councils. The standard deviations are all tiny (on the order of a billionth of a degree, 10,000 times smaller than the mean offset). This indicates that these are pretty much exact offsets, with every vertex of a polygon being translated in the same way.

Next steps:

(Attached graphs: intersects, offset_mean, offset_std)

rogup commented 10 months ago

@lin-d-hop Here's an update ^. Please share any thoughts on the analysis so far, and any suggestions you have.

@wu-lee I'm a bit concerned that there may be no backup of the polygons database, since we no longer have access to the data. It's a bit risky to be doing this analysis with lots of DB operations (albeit all read operations). I think we should make regular backups, and maybe also another copy so that we can have staging and production instances of the property-boundaries-service, especially when we start testing writing updates to the DB. Do you know how easy it would be to make a one-off backup of the MySQL DB for now, maybe even just under another user on dev-2? I'm hoping it's not too difficult, since most of the deployment is defined by Ansible.
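For what it's worth, a one-off dump could be as simple as the hypothetical sketch below; the database name, output path and credentials setup are placeholders, not the real deployment config.

```python
# Hypothetical one-off dump along the lines asked about above. The database
# name and output path are placeholders, and credentials are assumed to come
# from the invoking user's ~/.my.cnf rather than the command line.
import gzip
import subprocess
from datetime import date

DB_NAME = "property_boundaries"                              # placeholder
OUT_FILE = f"/home/backups/{DB_NAME}-{date.today()}.sql.gz"  # placeholder path

# --single-transaction takes a consistent snapshot without locking InnoDB tables.
dump = subprocess.run(
    ["mysqldump", "--single-transaction", DB_NAME],
    capture_output=True,
    check=True,
)

with gzip.open(OUT_FILE, "wb") as f:
    f.write(dump.stdout)
```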

rogup commented 10 months ago

I've had a manual look at some of the polygons that matched with an offset.

These old polygons in Adur District Council seem to be slightly off compared with the property outlines on Mapbox. But in the new INSPIRE data, the polygons have been shifted to fit exactly in their correct position.
Screenshot 2023-12-30 at 01 34 38 Screenshot 2023-12-30 at 01 35 25

Similarly, in Babergh District Council (the council with the biggest offset of the ones I analysed), old polygons have been shifted to be more aligned with their expected positions (in a different direction to those in Adur).
Screenshot 2023-12-30 at 02 03 45 Screenshot 2023-12-30 at 02 04 39

I also had a look at a polygon which we couldn't match with an offset, since there was a different number of vertices (367 in the old data vs 378 in the new data), so we couldn't match vertex-to-vertex. But looking at the data on the map, the polygons look identical; the new vertices have been added to improve the smoothness of some of the more rounded edges. I tried offsetting by the same distance as a nearby polygon and then calculated the % intersect, and it was a >99.5% match. It looks like there are many other cases like this.
Screenshot 2023-12-30 at 17 02 32 Screenshot 2023-12-30 at 17 02 08
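For reference, that spot check amounts to something like the sketch below (assuming shapely; the choice of denominator is my own illustration, not the script's exact logic):

```python
# Sketch of the spot check described above: shift the old polygon by a nearby
# polygon's offset, then measure the overlap with the new polygon. Assumes
# shapely; the denominator choice is illustrative, not the script's exact logic.
from shapely.affinity import translate
from shapely.geometry import Polygon

def offset_intersect_pct(old_poly: Polygon, new_poly: Polygon,
                         lat_offset: float, long_offset: float) -> float:
    shifted = translate(old_poly, xoff=long_offset, yoff=lat_offset)  # x = long, y = lat
    return 100 * shifted.intersection(new_poly).area / new_poly.area
```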

rogup commented 10 months ago

After looking at these manual cases, I refined my analysis script to:

Running this on 10,000 properties in Birmingham, there were only a handful (<1%) that had an intersect of less than 99.5%. Here's a histogram of their percentage intersects:

Screenshot 2024-01-01 at 01 31 49

I then had another manual look at some of these cases, to understand what is happening, and to indicate what % intersect threshold we should use in the final script.

Next steps:

lin-d-hop commented 10 months ago

@rogup Thanks for all of this analysis. It is really really awesome!

  1. Totally agree regarding backups. Let's action this? A monthly dump to another server would work. @rogup @wu-lee thoughts about what this backup should look like?

  2. Great to know that properties mostly match and that the corrections appear to be valid improvements. This is great! Based on your examples above, I would say the 98.1% match is a full match. For everything less than that, it would be great to understand if there are new properties from the subdivision (as you suggest). Your next steps look good.

In the case of segment/merge it would be interesting to store this data, though I would say we should do so in the simplest way possible, maybe even just via backups. Working with this historical data isn't the priority right now, but at the same time let's leave the door open.

The main thing we need to figure out is the link between the new segment's INSPIRE ID (if it now has one) and the title ID. The 'price paid' data set doesn't seem to have title IDs or INSPIRE IDs in it (correct me if I'm wrong). Perhaps one approach is that, when there is a new title ID in the 'companies that own land' data, we geocode that address and aim to match it to an unmatched INSPIRE ID? For new segments that are privately owned I'm not sure there is any route to finding the title ID, but it is a good idea to try and find open data sets with title IDs.
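Sketching the geocoding idea for concreteness: `geocode` below is a hypothetical helper standing in for whatever geocoding service would be used, returning a (lat, long) pair; it is not an existing function in the service.

```python
# Sketch of the suggested matching step: geocode a new title's address and see
# which (if any) unmatched INSPIRE polygon contains the resulting point.
# `geocode` is a hypothetical helper returning (lat, long); it is not an
# existing function in the service.
from shapely.geometry import Point, Polygon

def match_title_to_polygon(address, unmatched_polygons, geocode):
    """unmatched_polygons maps INSPIRE ID -> shapely Polygon (in long/lat coords)."""
    lat, long = geocode(address)
    point = Point(long, lat)  # shapely uses (x, y) = (long, lat)
    for inspire_id, poly in unmatched_polygons.items():
        if poly.contains(point):
            return inspire_id
    return None
```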

rogup commented 10 months ago

@lin-d-hop I've made a ticket for a backup service: https://github.com/DigitalCommons/technology-and-infrastructure/issues/116

Cool, I'll have a think about a simple way to keep segment/merge data.

I think your suggested approach of geocoding new title IDs and trying to match them with unmatched INSPIRE IDs sounds like a good and simple first step. You're right that the 'price paid' data doesn't include any IDs that we can link. It does provide a postcode though, so maybe in the future we could use this data alongside our own script's analysis to increase our confidence when piecing together information. I also think there's the potential to eventually provide some REALLY juicy info about sales, e.g. a user could see how much property a council has been selling to private companies/people. That could have some exciting use cases!

rogup commented 10 months ago

I've made some progress today on an algorithm to detect segmentation (which can sometimes be incomplete if part of the old boundary is no longer registered as an INSPIRE polygon). It has been very tricky and fiddly, but I think necessary, since there are lots of cases like this. It should be easy to extend this to merge cases.
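For context, a much-simplified version of that kind of check might look like the sketch below (assuming shapely; the real algorithm also has to cope with parts of the old boundary no longer being registered, and the coverage threshold here is illustrative):

```python
# Much-simplified sketch of a segmentation check (the real algorithm described
# above also handles partially unregistered old boundaries). The coverage
# threshold is illustrative.
from shapely.ops import unary_union

def looks_like_segmentation(old_poly, candidate_new_polys, coverage_threshold=0.99):
    """True if >=2 new polygons overlap the old one and jointly cover most of it."""
    pieces = [p for p in candidate_new_polys if p.intersects(old_poly)]
    if len(pieces) < 2:
        return False
    covered = unary_union(pieces).intersection(old_poly).area
    return covered / old_poly.area >= coverage_threshold
```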

I'm also seeing quite a few cases of boundaries slightly shifting between adjacent properties, so I want the algorithm to be able to process this too.

I just saw a weird case in Adur District Council, in a residential block of flats (6, 7, 9, 10, 12, 18 and 21 Woodview, Shoreham by Sea and Garages), where lots of INSPIRE polygons were overlapping or duplicated. Some polygons with the same ID have now moved around a bit to other boundaries within the block of flats, but there are still duplicate polygons.

Screenshot 2024-01-10 at 19 11 12 Screenshot 2024-01-10 at 19 13 34

I think cases like this are ones that we will need to geocode, and just treat them as if the old boundary was wrong and has now been moved to the correct position.

rogup commented 10 months ago

@lin-d-hop Testing my current script on a sample of 15000 polygons from one council, it's managing to confidently match ~99.9% of them. This is more than the 99% in the acceptance criteria for this ticket.

But doing a very rough crunch of the numbers, this would still leave ~20,000 failed matches if we ran the script on all of the millions of polygons in the INSPIRE dataset. I think it's worth spending a few more days to finish the segmentation algorithm, the merging algorithm, and the detection of boundaries shifting between adjacent properties. This will greatly reduce the number of failed matches, so that it's easier to look through them manually.

Note that there are lots more changes now than there would be month-to-month once the pipeline is functioning normally, since we're comparing against data that is several years old. I think in a few days, we'll be at a point where there are maybe only a handful of edge cases each month that the script can't classify. We'll then be able to look at them manually on a case-by-case basis and decide whether to just accept/reject the change or make improvements to the script to handle them automatically in the future.

lin-d-hop commented 6 months ago

Moving forward from this R&D phase. Closing this issue and replacing with DigitalCommons/property-boundaries-service#12