DemocracyClub / EveryElection

:ballot_box_with_check: For recording every election in the UK
https://elections.democracyclub.org.uk/
BSD 3-Clause "New" or "Revised" License

Decouple `OrganisationBoundaryReview` and LGBCE Scraper #2235

Open chris48s opened 3 weeks ago

chris48s commented 3 weeks ago

Currently our OrganisationBoundaryReview model is doing 2 things.

  1. It is a source of nice clean data about boundary reviews. Some of that comes from LGBCE. Some of it we enter by hand (e.g. Community Gov reviews, Structural Change Orders, Wales/Scotland/NI stuff)
  2. It is a "mirror" of the LGBCE site which allows us to track changes over time. This allows us to (for example) fire a notification when a review moves from "in progress" to "complete"
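As a rough illustration of the "mirror" role, here is a minimal sketch of spotting reviews that moved from "in progress" to "complete" between two scrapes. The `ScrapedReview` class, its fields, the slugs and the `completed_reviews` helper are all hypothetical names for illustration, not the project's actual models or scraper code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScrapedReview:
    """Hypothetical snapshot of one review as seen on the scraped site."""
    slug: str
    status: str


def completed_reviews(previous, current):
    """Return slugs whose status went from 'in progress' to 'complete'."""
    previous_status = {review.slug: review.status for review in previous}
    return [
        review.slug
        for review in current
        if review.status == "complete"
        and previous_status.get(review.slug) == "in progress"
    ]


# Made-up data: one review completes, one is unchanged, one is brand new.
previous = [
    ScrapedReview("example-council-a", "in progress"),
    ScrapedReview("example-council-b", "in progress"),
]
current = [
    ScrapedReview("example-council-a", "complete"),
    ScrapedReview("example-council-b", "in progress"),
    ScrapedReview("example-council-c", "complete"),
]

newly_completed = completed_reviews(previous, current)
```

Note that the comparison only works if the previous snapshot still reflects what the site said, which is why hand-editing the mirrored data is a problem.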

These purposes are closely linked, but not exactly aligned. One big issue is that if LGBCE's site links to the wrong thing (this has genuinely happened), we can't fix that in our DB without breaking the scraper. We have to leave it wrong. Also, in principle, unexpected edits to the LGBCE site can retrospectively break our data. We're not really in control of it.

Our other scraper goes off spidering for PDFs that look like a Notice of Election (NoE) document. There, we flag things that might be a NoE, but a human reviews each one before we create an Election object, because sometimes we've scraped a Parish Council election, a Neighbourhood Planning Referendum or something similar.
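The flag-then-review pattern described above could be sketched roughly like this. Everything here (`ReviewQueue`, `FlaggedDocument`, `ReviewState`, the URLs and the election ID) is a hypothetical stand-in, not the real scraper or the real Election model.

```python
from dataclasses import dataclass, field
from enum import Enum


class ReviewState(Enum):
    PENDING = "pending"
    ACCEPTED = "accepted"
    REJECTED = "rejected"


@dataclass
class FlaggedDocument:
    """Hypothetical record for a PDF the spider thinks might be a NoE."""
    url: str
    state: ReviewState = ReviewState.PENDING


@dataclass
class ReviewQueue:
    flagged: list = field(default_factory=list)
    elections: list = field(default_factory=list)

    def flag(self, url):
        # The scraper only ever flags; it never creates an election itself.
        doc = FlaggedDocument(url)
        self.flagged.append(doc)
        return doc

    def accept(self, doc, election_id):
        # A human confirmed the document, so an election can be recorded.
        doc.state = ReviewState.ACCEPTED
        self.elections.append(election_id)

    def reject(self, doc):
        # e.g. a Parish Council notice that was scraped by mistake.
        doc.state = ReviewState.REJECTED


queue = ReviewQueue()
noe = queue.flag("https://example.org/notice-of-election.pdf")
parish = queue.flag("https://example.org/parish-notice.pdf")
queue.accept(noe, "hypothetical-election-id")
queue.reject(parish)
```

The point of the sketch is that nothing reaches `elections` without an explicit human decision.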

LGBCE's site is a bit more structured and we do have some validation in place. That said, I think it would be useful for us to separate the concept of "mirror the LGBCE site for scraping purposes" and "Nice clean Boundary Review data we can edit" with some kind of manual "Create OrganisationBoundaryReview from scraped record" process that