cyipt / actdev

ActDev - Active travel provision and potential in planned and proposed development sites
https://actdev.cyipt.bike

Update definition of 'Large' in Planit #13

Closed Robinlovelace closed 3 years ago

Robinlovelace commented 3 years ago

Split out from #5. This is not necessary for hitting the MVP but worth doing. If you could outline the steps required for this @aspeakman that would be very useful, here's my starter for 10 (feel free to edit):

aspeakman commented 3 years ago

Implementing a reclassification and updating the PlanIt codebase are relatively easy, as a rebuild of the database takes place each week.

However, working through the >100k identified as Large is not really feasible, especially as our definition of what constitutes a planning application related to a major development is quite hazy.

As you say in #5 I think the first step is to look at the lists provided by Joey - to identify those NOT classified as Large (false negatives) and then work out why. A second step I am working on is to develop some boundary based searches for applications within the vicinity of the known development sites in the lists - this will provide manageable lists which can be scanned to identify both false positives and false negatives.

Robinlovelace commented 3 years ago

OK thanks Andrew, good basis for discussion soon to tie down next steps.

mvl22 commented 3 years ago

A second step I am working on is to develop some boundary based searches for applications within the vicinity of the known development sites in the lists

This is the kind of thing that a Postgres database will make easy, as it has full OpenGIS support.

aspeakman commented 3 years ago

Indeed.

Although in defence of the existing API you can already get JSON data for applications enclosed within any polygon (using a MongoDB spatial search) - see the 'boundary' parameter.
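
For example, something along these lines should pull back applications inside a polygon. This is a sketch only: the endpoint URL and response handling below are assumptions, it is the 'boundary' parameter that matters.

# Sketch in R (untested): fetch applications within a GeoJSON polygon.
# The endpoint URL is illustrative - check the API docs for the exact form.
library(httr)
library(jsonlite)

poly = '{"type":"Polygon","coordinates":[[[-1.26,51.60],[-1.22,51.60],[-1.22,51.62],[-1.26,51.62],[-1.26,51.60]]]}'
resp = GET("https://www.planit.org.uk/api/applics/json",  # assumed endpoint
           query = list(boundary = poly))
applics = fromJSON(content(resp, as = "text"))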

Robinlovelace commented 3 years ago

Just to check @aspeakman, is this something you're planning to work on / any updates? Just trying to get back up to speed with this and discussing definitions with @joeytalbot. If you're happy with the definition you have already and don't plan to update it in the next week or so, we can close this and focus on post-processing the data you've already provided in #5.

In other news, happy new year!

aspeakman commented 3 years ago

I am happy we have shown that the 'Large' definition is well controlled - in that the coverage data show most planning authorities supply the underlying fields and the classification can be made in most cases. I have also ensured that any changes to the definition could be implemented across the board relatively quickly if we decide what the new parameters are.

However I don't think the Large definition currently meets the needs of this project to consistently identify certain types of development. I will comment further on this in #5.

Robinlovelace commented 3 years ago

Sounds good @aspeakman. Trying to figure out if there is any ongoing action to be done in the context of this project and its relatively short timelines. I assume not at the moment, correct?

If not I suggest closing this issue for now and continuing documenting work done and discussion in #5 as you say.

mvl22 commented 3 years ago

I have also ensured that any changes to the definition could be implemented across the board relatively quickly if we decide what the new parameters are.

I remain strongly of the view that a very high priority in this would be to ensure that public comments, which puff up the numbers, are excluded from counts.

For example, see this application local to me, which is highly contentious but is basically a small land area for some student flats, so is probably 'Medium'. It has a number of objections shown in the document listing (and that excludes the 251(!) online comments): https://applications.greatercambridgeplanning.org/online-applications/applicationDetails.do?activeTab=documents&keyVal=QJBSP8DX0HK00

At least in the case of the IDOX implementation, these can easily be filtered out as they have the consistent "Third Party Comments" category.
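
In data terms the filter is trivial once that category is scraped. A rough sketch, with illustrative column names:

# Sketch: recount documents per application after dropping public comments.
# 'docs' is a hypothetical data frame with application_id and category columns.
library(dplyr)

doc_counts = docs %>%
  filter(category != "Third Party Comments") %>%   # exclude public comments
  count(application_id, name = "n_documents")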

Robinlovelace commented 3 years ago

Heads-up @aspeakman, are you working on this? @mvl22's suggestion sounds sensible to me:

ensure that public comments, which puff up the numbers, are excluded from counts.

We will all benefit from an updated list of large sites.

joeytalbot commented 3 years ago

I can sanity check the Large planning applications you identified for the 25 known sites

aspeakman commented 3 years ago

Yes, sorry for going quiet. I have been working on implementing full-text searching and improved spatial queries within the PlanIt database so that I can query the 30 or so areas more effectively.

Now that we have those areas defined, my proposal would be to produce a csv dump or similar of ALL the applications within those areas (depending upon numbers) so we can review fields like description, document count, app_size and application type against the existing Large classification. This should be doable soon.

Scoping out the effect of taking up Martin's suggestion of removing certain doc types should be part of this. However I am sceptical, because in these 30 areas I think we have too few Large applications, not too many, and also because it is not universally applicable (even across different Idox scrapers) and scanning document titles/types will slow down the scraping.

My suggestion for a better approach might be to work on a new classification of the description field based on a list of search terms - for example 'nnn x dwellings', 'nnn no. apartments', 'nnn student flats'

Robinlovelace commented 3 years ago

Hi @aspeakman thanks for getting back. I think this suggestion is a great idea:

Now that we have those areas defined, my proposal would be to produce a csv dump or similar of ALL the applications within those areas (depending upon numbers) so we can review fields like description, document count, app_size and application type against the existing Large classification. This should be doable soon.

Can you provide a timeline/ETA? It's good to have deadlines to work to. End of the week would be ideal but whenever doable; I just want to get a handle on timelines. A full data dump for this project, representing a good snapshot of planning activity in a good % of the UK, is a great idea IMO.

aspeakman commented 3 years ago

Should be able to let you know numbers today, so we can see if it is practical, and provide the full dump by the end of the week.

Robinlovelace commented 3 years ago

Great, thanks @aspeakman

aspeakman commented 3 years ago

There are approx 6400 applications located within these areas, which is a manageable number

I have updated the table with counts, see

https://gist.github.com/aspeakman/989994ec957da57640610d9aa1cd0939

Robinlovelace commented 3 years ago

Just to clarify one thing @aspeakman, were you talking about all areas for which you have planning data in the UK?

the applications within those areas (depending upon numbers)

I assumed so, but am wondering now if you meant the ~30 areas provided by @joeytalbot. Apologies for misunderstanding. There is also a national element to the project that requires a dump of all large applications (or applications that could be deemed large).

aspeakman commented 3 years ago

Yes I was talking about the 30 areas.

I think I have already done the Large dump (so to speak)

See https://github.com/cyipt/actdev/releases/tag/0.1.1

Robinlovelace commented 3 years ago

I think I have already done the Large dump (so to speak)

:laughing:

Can you create an updated dump after implementing this?:

Martin's suggestion of removing certain doc types should be part of this.

aspeakman commented 3 years ago

I have now created a list of all 6326 planning applications located within the 35 defined zones (Sites35Applics.csv attached to the 0.1.1 release https://github.com/cyipt/actdev/releases/tag/0.1.1). Not all data fields are included in the dump but the key ones (description, application type, number of documents, number of statutory days) are - if you want any other fields included for information let me know.

My plan for the next week or two is to analyse this list looking for false positives and negatives with respect to the Large designation. I am working a little blind here, based on my own preconceptions of the kind of development you are interested in, so any feedback would be useful, including known applications within the list that are already being missed or falsely included. As I said, my preliminary approach will be to look in the description field for phrases indicating >100? new dwellings are being planned.

Once this is done I will hopefully have some rules plus lists of applications that would be included/excluded. I will also be able to compare to the number of documents to assess the effects of Martin's suggestion.

joeytalbot commented 3 years ago

Thanks @aspeakman, that sounds really great. I'll have a look to see if any obvious applications are missing or if anything else jumps out at me.

I noticed you have fields for both app_type and application_type. Is application_type the original type as stated online, and app_type your categorised simplification of this?

Identifying the number of dwellings proposed would be really helpful. It's tricky because often the number is not mentioned in the description field. Also, other numbers may sometimes be mentioned in the description field, such as plot numbers the application relates to.

One other useful way to find Large applications could be to identify related applications. Sometimes these are formally listed online as 'related applications'. In other cases, they may be mentioned in the description field. Eg: "Approval of reserved matters pursuant to outline permission 03/02386/AOP regarding the construction of the access road to serve the employment area and ancillary works." In this case, I expect 03/02386/AOP is probably a Large application.

joeytalbot commented 3 years ago

Relating to my previous comment: it's the associated_id field.

And searching your list, application 03/02386/AOP does not appear. I know it's quite old (2003) but there are other applications from the same year and the same site - Aylesbury Vale - in your list. 03/02386/AOP is listed in the associated_id field for several other subsequent applications. It very much looks like a key application for the site.

aspeakman commented 3 years ago

Yes, 'app_type', 'app_size' and 'app_state' are the derived fields which are set to simplified categories. In the CSV I have included 'n_documents', 'n_statutory_days' and 'application_type' at the top level because they are the three major source field inputs to making those categorisations - normally they are included in the 'other_fields' grab bag.

I agree with you that searching for number of dwellings etc. is definitely not going to solve all issues, but it might be the start of a two-pass solution: 1. find all original large-scale applications; 2. then find all associated applications using 'associated_id'. In any case I think it is going to be a distinct improvement on the current 'Large' category, which is mostly based on a hunch about the number of documents.

The application 03/02386/AOP (dated September 2003) has the phrase '3000 dwellings' in the description, which is encouraging for my approach. The reason it has not been gathered in is that December 2003 is the current default backstop date for Idox scrapers - every time I move this date backwards for all such scrapers it causes a large hiccup in the process, as there are > 200 scrapers that try to gather historical data. However I will set it individually for Aylesbury Vale so that it should appear soon.

joeytalbot commented 3 years ago

Thanks, I think the associated_id field is a sure-fire way to find the most important applications for any given site. They will be the ones that appear more frequently in the associated_id.
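
Something like this would surface them from the CSV dump (untested, with column names assumed from the release):

# Sketch: rank applications by how often they are referenced in associated_id.
# 'applics' is read from Sites35Applics.csv; column names assumed, not verified.
library(dplyr)

applics %>%
  filter(!is.na(associated_id), associated_id != "") %>%
  count(associated_id, sort = TRUE) %>%
  head(10)   # the most-referenced (likely key) applications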

aspeakman commented 3 years ago

03/02386/AOP will get classified as Large under current rules (it has 144 documents and n_statutory_days is 111)

Robinlovelace commented 3 years ago

Updates from meeting today - 2 strands of analysis:

a)

  1. Look at the ~6k applications associated with the sites - first pass of analysis by @aspeakman by end of week
  2. Identify issues in the 'first pass' analysis - @joeytalbot
  3. Iterate and potentially update the rules for b)

b)

Start analysis of national planning data, with a view to identifying linked ones and very large applications - @Robinlovelace and/or @joeytalbot.

joeytalbot commented 3 years ago

So am I right in thinking that your associated_id field is derived from text analysis of the description, and that you don't currently scrape for related applications @aspeakman?

aspeakman commented 3 years ago

Yes @joeytalbot associated_id is derived from pattern matching in the description only - across the UK planning sites there is no consistently maintained set of fields that link to related planning applications.
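
Roughly this kind of thing, although the real rules have to cover many more reference formats. A simplified illustration:

# Simplified illustration of pulling a referenced application id out of a
# description; actual reference formats vary widely across authorities.
desc = "Approval of reserved matters pursuant to outline permission 03/02386/AOP regarding the construction of the access road"
stringr::str_extract(desc, "\\b\\d{2}/\\d{4,5}/[A-Z]{2,4}\\b")
#> [1] "03/02386/AOP"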

aspeakman commented 3 years ago

Have completed a scan of all 165 Large applications in the 35 sites and there are several currently classified as Large that we could classify as not relevant to this project - this includes large non-residential sites (eg a recycling plant). Quite a few of the residential projects mention numbers of units to be built - but what would the threshold number for a large development be? And, as @joeytalbot points out, many of these are "variation of conditions" - planning applications subsequent to the original in that area.

We could tighten this up but I am unclear where we are heading with these data. Is our aim just to find the original application at the start of a large infrastructure development? ie to say when it begins and flag up prospective locations across the UK? If this is the case, I would think having loose criteria is best and then tracing back via associated ids (probably via an offline filtering process).

Or is it to flag up associated applications throughout the life cycle of the project? ie filter live planning data so we can say a particular application is part of a development - in which case criteria would have to be tighter so we don't get flooded out with false positives (but also I would think we would want to keep "variation of conditions" applications).

joeytalbot commented 3 years ago

That's a good question. I think ideally we probably want to identify the key applications that set out the plans for a site (these will often be linked to several other applications through the associated_id, but not always, e.g. in the case of a brand new application, or one that ultimately gets rejected).

But then we also want to identify the other applications that are linked to the same site. Things like variations of conditions are often very minor changes, but sometimes they might be more substantial, and this is useful for following the course of development within a site. We would need a way of identifying these related applications. This might be through a new field, or by some kind of use of associated_id.

We should also bear in mind that these 35 sites are not representative of planning applications across the UK as a whole, because we are targeting major new development sites.

aspeakman commented 3 years ago

So my proposal is to keep the definition of 'Large' used in PlanIt quite general and inclusive - it should aim to flag any prospective large-scale applications, including non-residential ones (eg a new hospital or recycling facility) and also subsequent "variation" applications. But it should include all residential developments of more than 40? 60? 90? units.

These will then need to be filtered to find the really large residential sites of interest to this project (Major Development?) and/or to work back to the original application in a series of variations.

joeytalbot commented 3 years ago

You could say 50 dwellings as a cut-off, but it's common for applications to relate to a small number of dwellings that form a zone within a larger site that contains many more. An application for 50 detached 4-bed houses is also quite a different prospect to a new student hall containing 50 bedrooms.

aspeakman commented 3 years ago

Yes the Didcot site seems to have quite a few parcels of development for 10 to 20 houses - but I think there are other indicators like number of documents to flag these

joeytalbot commented 3 years ago

I am looking through the major planning applications that are currently being missed. I'll send details soon.

joeytalbot commented 3 years ago

Hi @aspeakman, so far the Large classification is very variable from one site to the next. In some sites, relatively minor applications are being picked up, but in several of the sites not a single application is being shown.

For the subset of sites which I have previously studied in greater depth, I have checked through and identified many of the false negatives that should be classified as Large but are not showing in your list. The missing applications include some as large as an outline planning application for 4000 dwellings.

You can see the results at https://github.com/cyipt/actdev/releases/download/0.1.1/missed-applications.xlsx

I have not included any applications with dates prior to December 2003.

Robinlovelace commented 3 years ago

Heads-up, I'm making a start on this sub-task in this mega issue:

Start analysis of national planning data, with a view to identifying linked ones and very large applications - @Robinlovelace and/or @joeytalbot.

aspeakman commented 3 years ago

Thanks @joeytalbot for the list of 35 missing - I will use them as exemplars when I do my own trawling in the 6000 Medium and Small ones.

aspeakman commented 3 years ago

I have now completed a trawl through 6380 applications within the 35 zones of interest, plus a further 47 notified to me that were missing from this sample. A spreadsheet of these data is available on request.

My summary:

  1. Of the 47 applications missing this was mostly because the application did not have location information (hence was not known to be within a zone)
  2. Of the 6380 I found around 400 "false negatives" = planning applications which refer to a large residential development but are not tagged as Large.
  3. Partly this is because some authorities/applications do not have key information the classification currently expects eg number of documents or days before a decision is due
  4. However many of the "false negatives" have phrases referring to number of "dwellings" (but terminology varies) OR to a new development e.g. "garden village", "residential quarter"

To make progress I am now moving on to see if the key number which refers to "dwellings" can realistically be extracted from the description field as an indicator to use in classification. Note this number varies widely - so I think it would be a mistake to build in a fixed threshold (say of 50) at this stage.

Robinlovelace commented 3 years ago

How are your regex skills? Sounds like a mammoth text search mission but definitely solvable using some of the awesome resources outlined here (this is over my head and I guess you're aware of such things Andrew, but sharing in case of use/interest): https://github.com/Varunram/Awesome-Regex-Resources#books

aspeakman commented 3 years ago

The current app_state, app_type classifications all use regex - so my skills are there - but it's definitely a challenge, especially to cater for all the variations, eg "2,300 new mixed-tenure dwelling".
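
A first pass might look something like the following - very much a sketch; the live rules will need many more variants and some safeguards against false matches:

# Sketch: extract a dwelling/unit count from a description, allowing for
# thousands separators and a few synonyms. Far from exhaustive.
library(stringr)

descs = c("Outline application for up to 2,300 new mixed-tenure dwellings",
          "Erection of 45 no. apartments with associated parking",
          "Variation of condition 2 of permission 03/02386/AOP")
raw = str_extract(descs,
  "\\d{1,3}(,\\d{3})*(?=\\s*(no\\.\\s*)?(new\\s+)?(mixed-tenure\\s+)?(dwellings?|apartments?|houses?|units|flats))")
n_dwellings = as.integer(gsub(",", "", raw))
n_dwellings
#> [1] 2300   45   NA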

Robinlovelace commented 3 years ago

Machine learning challenge? We could use the examples above as a training dataset so the relative importance of different words/phrases in assigning probability of 'Large' is derived from the data. May be overkill but here's an example of training a TensorFlow model to predict the probability of a review being positive based on free text - it could be a 'run once' model that is then used internally. https://blogs.rstudio.com/ai/posts/2017-12-07-text-classification-with-keras/

Lots of other options are available; interested to hear what others think about this, but ad hoc regex certainly has its limits!

Robinlovelace commented 3 years ago

LDA is a smaller and simpler method for classifying texts that could be of use, and it is easier to run than TensorFlow. Excellent tutorial based on free text (Jane Austen novels): https://www.tidytextmining.com/topicmodeling.html
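
A bare-bones sketch of how that could look on the description field, following the tidytext workflow (column names here are assumptions):

# Sketch: LDA topic model of application descriptions via tidytext/topicmodels.
# Assumes a data frame 'applics' with an id column 'uid' and a 'description'
# column - illustrative names only.
library(dplyr)
library(tidytext)
library(topicmodels)

dtm = applics %>%
  unnest_tokens(word, description) %>%
  anti_join(stop_words, by = "word") %>%
  count(uid, word) %>%
  cast_dtm(uid, word, n)

lda = LDA(dtm, k = 4, control = list(seed = 1234))
tidy(lda, matrix = "beta") %>%   # top terms per topic
  group_by(topic) %>%
  slice_max(beta, n = 10)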

joeytalbot commented 3 years ago

Of the 47 applications missing this was mostly because the application did not have location information (hence was not known to be within a zone)

The lack of location information sounds worrying. When I checked through those sites, I'd estimate more than half of the applications I was looking for were missing from your list. So if that many are lacking location data, it could be a problem. Can we use other ways to identify them?

mvl22 commented 3 years ago

Of the 47 applications missing this was mostly because the application did not have location information (hence was not known to be within a zone)

Andrew, can you give me an example ID?

I wonder whether, as a proxy, they should at least be put somewhere in the boundary of the local authority, e.g. centroid. That at least means they are somewhere. (I suppose for the ActDev project specifically though, that might not be particularly useful, but at least it makes the record itself have some value for other uses.)

joeytalbot commented 3 years ago

I wonder whether, as a proxy, they should at least be put somewhere in the boundary of the local authority, e.g. centroid. That at least means they are somewhere. (I suppose for the ActDev project specifically though, that might not be particularly useful, but at least it makes the record itself have some value for other uses.)

Even if they don't have proper location data, might they have a postcode?

Robinlovelace commented 3 years ago

Even if they don't have proper location data, might they have a postcode?

Yes and addresses can be geocoded (although tricky if they are yet to exist). Here's a regex to identify UK postcodes:

^([A-Z][A-HJ-Y]?\d[A-Z\d]? ?\d[A-Z]{2}|GIR ?0A{2})$

Source: https://stackoverflow.com/questions/164979/regex-for-matching-uk-postcodes

aspeakman commented 3 years ago

PlanIt takes an explicit location if supplied, but always falls back to a location derived from any postcode in the address. Of course the problem with new developments is that they are often on land which is unassigned in the postcode database at the time. Example Wiltshire/W/10/01964/OUT = Land North East Of Snowberry Lane And South Of, Sandridge Road, Melksham, Wiltshire

Joey I will send you the full spreadsheet for info

Robinlovelace commented 3 years ago

Just for the fun of it I tried testing out the regex shown. Reproducible example:

txt = c("blurb blurb HR4 7BP",  # postcode embedded in free text
        "blurb 34blurb",        # no postcode
        "LS2 9JT",
        " LS2 9JT",
        "LS2 9JT "
        )
# anchors (^ and $) dropped from the Stack Overflow regex so it matches mid-string
rgx = r"(([A-Z][A-HJ-Y]?\d[A-Z\d]? ?\d[A-Z]{2}|GIR ?0A{2}))"
stringr::str_extract(txt, rgx)
#> [1] "HR4 7BP" NA        "LS2 9JT" "LS2 9JT" "LS2 9JT"

Created on 2021-02-05 by the reprex package (v1.0.0)

aspeakman commented 3 years ago

The postcode regex is already built into PlanIt.

I have previously examined geocoding of addresses without postcodes, but I found the results were disappointing (and time consuming).

It is only certain authorities which don't have the location automatically included (eg Wiltshire, Northamptonshire), so if I can extract the number of dwellings, then I think one workable approach might be to flag up major developments (eg > 200 dwellings) with no assigned location. This could then be used to track them down manually.
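
In data frame terms the flag could be as simple as the following (the coordinate column names here are guesses):

# Sketch: list major developments with no assigned location for manual checking.
# Assumes n_dwellings has been extracted and lat/lng columns exist (names guessed).
library(dplyr)

to_check = applics %>%
  filter(n_dwellings > 200, is.na(lat) | is.na(lng)) %>%
  select(description, n_dwellings)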

mvl22 commented 3 years ago

I have previously examined geocoding of addresses without postcodes

I've tried the same on another project and similarly got disappointing results. It's basically a very hard problem to solve.

I suspect the only way you would get that to work is using Google's geocoder, which would be problematic from a licensing perspective.

aspeakman commented 3 years ago

Have now completed a regex to extract an 'n_dwellings' value from planning applications in the 35 regions of interest. This flagged up a further 200 or so applications (704 in total out of 6379). Attached is a histogram to show the distribution of values - with an initial peak under 50 and a long tail of larger ones. Still some work to do to roll this out - it seems to be quite time-consuming, so I need to check it will be OK on the live setup before proceeding. I am also working on a search for key non-numeric phrases including "garden village", "quarter", "urban extension", "development area" (other suggestions welcome).

(Attached histogram: N_dwellings, showing the distribution of extracted dwelling counts)