DemocracyClub / EveryElection

:ballot_box_with_check: For recording every election in the UK
https://elections.democracyclub.org.uk/
BSD 3-Clause "New" or "Revised" License

Could EE become a library? #1124

Closed: symroe closed this issue 1 year ago

symroe commented 3 years ago

This is a collection of thoughts I've been having about the future of EE. They deal with a few different issues, but I think they're helpful to think about together (at the moment).

The problems

There are a few issues I'm thinking about here.

Resolving addresses to elections

At the moment EE can only resolve elections from a postcode or a point location. It doesn't offer any tooling for address lookup or split postcodes.

Data modelling across DC's products

We have...at least 3 different ways to model elections (and how they map to geography) in DC: EE, YNR and WCIVF. Each is tailored to its own use case, but there isn't really a good reason for the 3 models other than legacy / learning / ad-hoc development.

Mixing tooling and data

EE adds value in a couple of ways:

1. A curated data package of elections (etc) that's hosted by DC and maintained by DC and a few 3rd parties.
2. A service that resolves locations to elections.
3. The logic that adds records to the elections table (the ID creator).

Embedded model of hosting

Our deployment model means we embed a copy of EE locally on EC2 instances to scale postcode lookups, which means we have to deal with data replication and updates somehow. At the moment we're doing a simple DB dump and restore, mixed with a sync of some of the data via an API.

Ideas for the future

None of these problems needs to be linked to the others, but it might be possible to think about the product in slightly different ways that address all of them nicely.

If we (conceptually) separate the data and ID creator from the lookup and models, then we might find that some elements of managing EE get easier.

For example, if The Electoral Commission take on the data and ID creator, we could still run our own lookup, as long as we could sync the data to our local models / servers.

If the models in EE were a Django package, we could use them in YNR and WCIVF without having to remodel them in each project.

If the data package were maintained by a 3rd party, it wouldn't matter what tech they used to create it, as long as it kept the same format that consumers expected (freeing 3rd parties to choose their own tech stacks).

Possible structure

To flesh out the concepts a little, we could have 3 elements:

Library

A Django package that supplies models, base views and helpers for working with modelled election data, plus a set of management commands to import and sync upstream data from some source.
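To make that concrete, here's a minimal sketch of what such a sync command could look like. Everything here is illustrative: the package name, model and field names, and the upstream URL are assumptions, not existing code.

```python
import requests
from django.core.management.base import BaseCommand

from every_election.models import Election  # hypothetical package/model


class Command(BaseCommand):
    help = "Sync elections from an upstream data package into local models"

    def add_arguments(self, parser):
        parser.add_argument("--url", default="https://example.com/elections.json")

    def handle(self, *args, **options):
        records = requests.get(options["url"], timeout=30).json()
        for record in records:
            # Idempotent upsert keyed on the election ID, so the command
            # can safely run on a schedule without creating duplicates.
            Election.objects.update_or_create(
                election_id=record["election_id"],
                defaults={
                    "name": record["name"],
                    "poll_open_date": record["poll_open_date"],
                },
            )
        self.stdout.write(f"Synced {len(records)} elections")
```

Each consuming app (YNR, WCIVF, a local WDIV install) would run the same command against the same upstream source, so they all converge on the same data without per-project import scripts.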

Data controller

A set of hosted data, with hosted tooling to manage it. This would include the ID creator and other elements that we don't currently have much tooling for, such as importing boundaries.
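For a flavour of the ID-creation part: election identifiers follow a documented dotted format, so at its core an ID creator composes segments like this. This is a deliberately simplified sketch (real IDs also cover election groups, by-elections, subtypes, etc; DC's actual implementation lives in the uk-election-ids package):

```python
from datetime import date


def make_ballot_id(election_type, organisation, division, poll_date):
    """Compose a ballot-level identifier from its constituent parts."""
    # Simplified: ignores group-level IDs, by-elections, subtypes, etc.
    return ".".join([election_type, organisation, division, poll_date.isoformat()])


print(make_ballot_id("local", "norfolk", "diss", date(2021, 5, 6)))
# local.norfolk.diss.2021-05-06
```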

Data package

The actual list of elections, organisations, divisions, etc. that describes the in-scope elections.

With this model, EE wouldn't need to provide a hosted resolving service, but the library could include helpers for doing this, including for AddressBase if it were available.
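For example, the point-to-elections part of that could be a small GeoDjango helper over the division boundaries. The model and field names below are illustrative rather than EE's actual schema:

```python
from django.contrib.gis.geos import Point

from every_election.models import DivisionGeography  # hypothetical import path


def divisions_for_point(lon, lat):
    """Return divisions whose boundary polygons cover the given location."""
    location = Point(lon, lat, srid=4326)  # note: GEOS points are (x=lon, y=lat)
    return DivisionGeography.objects.filter(geography__covers=location)
```

An AddressBase-backed helper would first resolve an address (UPRN) to a point, then reuse the same query.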

This is really just a starting point for talking about future options, so please dive in! @chris48s I'm (as ever) especially interested in your thoughts on this!

michaeljcollinsuk commented 3 years ago

I don't think I have enough experience or familiarity with EE itself to give a qualified opinion on some of this - but speaking from the perspective of working on both YNR and WCIVF, one of the biggest complications comes from the different modelling between the two products (and EE itself). So I really like the idea that the models would be a single library used in multiple projects. Even if we were to refactor WCIVF/YNR to a shared data model (or just something much closer), maintaining both concurrently will never be ideal.

My concern would be making sure that we can model something that works for all products, so we don't end up having to make compromises or accept limitations to make it work, and end up with inferior products. Clearly it would be a huge amount of planning and refactoring!

It also makes sense to me to separate the tooling/ID creator from both the data and the models - although again I say that as someone who hasn't worked with these tools. But generally this sounds like it would make things easier to maintain and support.

chris48s commented 3 years ago

Inevitably I do have some thoughts on this, but it's probably going to take a while to churn it round in my head and write it all up. This is probably going to turn into one of my GitHub novels :books: I'll try and find some time to write up some notes.

chris48s commented 3 years ago

Firstly: I think it is fairly safe to say that there is not one "grand unified theory of everything" that simultaneously solves all of these problems. In fact any solution to one thing on this list probably makes another worse :)

Secondly: Goddammit this is a highly effective nerd-snipe!

Thirdly: I'm going to start off by picking one point from the middle of your post and expanding on that, as it is probably the one I have thought through the most, but I will address the others at some point.

Embedded model of hosting

Our deployment model means we embed a copy of EE locally on EC2 instances to scale postcode lookups, which means we have to deal with data replication and updates somehow. At the moment we're doing a simple DB dump and restore, mixed with a sync of some of the data via an API.

I think the first thing worth doing here is just exploring the problem space and where we are now slightly more. Maybe I'm just telling you stuff you already know here, but hopefully it is useful context for @michaeljcollinsuk if nothing else. There are broadly two types of application DC runs: read-only apps (WDIV, WCIVF) and CRUD apps (EE, YNR).

The way we've tended to scale the read-only apps is that we basically (by one mechanism or another) bake all the data we need to serve a request into a front-end node. Then we can run however many nodes we need. That data is "disposable" and we minimise the number of external interactions (databases, APIs, etc) we can bottleneck on. This pattern doesn't really work for CRUD apps like EE/YNR where users perform writes as well as reads.

Historically we didn't have an architecture for YNR and EE that allowed them to scale to the volumes we process via WDIV/WCIVF on polling day for a major election, so an entire copy of EE's database and application code is one of the things we bundle into a front-end image for WDIV/WCIVF.

We actually didn't start off doing this. We started off directly calling EE from the client apps. Then one polling day EE fell over (I want to say local elections in 2017?), which made everything else fall over too. Hence the easiest thing was to keep all the code we had for communicating with EE over a JSON API and just make lots of copies of the API to "scale" it. I think that might have been a solution we rapidly cobbled together when it became clear there would be a snap GE in mid-2017?

Anyway... history aside: there are several pain points we usually hit with this approach that make it quite fragile once you're running multiple copies of the EE codebase in different places.

Broadly speaking, there are lots of problems that making EE's data model into a library might solve (a consistent data model for EE/YNR/WCIVF is a huuuuuge one), but I don't think this is one of them. If you were to use a "library" version of EE directly in the WDIV and WCIVF codebases instead of bundling the whole application into a server image and communicating with it over HTTP/JSON, I think you fundamentally end up with some configuration of all the above problems. You have a centralised CRUD app you write to, and then you're trying to keep multiple read-only copies of the code/data in sync with it via cron and duct tape. If the code/DB in the client installs diverges from the upstream code then everything falls over. The one thing that might be easier is that it is slightly more obvious you have a problem if EE is running ee-package==2.0.1 and WCIVF is still running ee-package==1.4.6 (or whatever). If you make a code change to EE and tag a release, at least you can have a bot offer to bump it in the client repos instead of having to remember, which helps with one class of problem (but not all of them).

It doesn't necessarily have to be that way though. If EE could run as a completely centralised service and process enough traffic for WDIV and WCIVF to just call it directly on polling day we could stop doing all that (I'm aware this is the exact opposite of turning EE into a library :D ).

So.. one of the things I spent a bunch of time on in 2019 was architecting EE so that it can scale horizontally. The setup for this is slightly different from WCIVF/WDIV because of the read/write nature of the interactions: roughly speaking, it uses a primary/replica RDS setup, with writes going to the primary and lookup reads served from replicas.
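For anyone unfamiliar with the pattern, here's a minimal sketch of what that split looks like in Django terms. The hostnames and router below are illustrative, not EE's actual configuration:

```python
# settings.py (illustrative)
DATABASES = {
    "default": {
        "ENGINE": "django.contrib.gis.db.backends.postgis",
        "HOST": "primary.example.internal",  # primary RDS: takes all writes
    },
    "replica": {
        "ENGINE": "django.contrib.gis.db.backends.postgis",
        "HOST": "replica.example.internal",  # read replica: serves lookups
    },
}
DATABASE_ROUTERS = ["routers.PrimaryReplicaRouter"]


# routers.py (illustrative)
class PrimaryReplicaRouter:
    def db_for_read(self, model, **hints):
        return "replica"

    def db_for_write(self, model, **hints):
        return "default"

    def allow_relation(self, obj1, obj2, **hints):
        return True
```

Adding read capacity is then a matter of spinning up more replicas and app servers behind the load balancer; only the single primary takes writes.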

In reality, I don't think we've ever used this in anger, but in principle everything you need to scale EE horizontally already exists. I think if the scheduled local elections in 2020 had actually taken place, I would probably have piloted scaling out EE and running either WDIV or WCIVF (but not both) directly off elections.democracyclub.org.uk, while keeping the local copies synced so we could have fallen back to the local install (by changing EE_BASE) if it all went to pot. But I guess you probably didn't do that in 2021 and just used local installs again.

I think for this one problem, "Embedded model of hosting", that is probably the direction I would go in (or was going in) to try and solve those problems, rather than a different model of embedding.

That said..

The way that WDIV uses data from Every Election is a bit different to the way that WCIVF does.

In light of that, perhaps it is useful to think of the way that WDIV consumes EE as different from the way WCIVF does in this context?

I'll continue to ponder this and write another chapter at some point...

chris48s commented 3 years ago

The next point I will pick up is this one:

Resolving addresses to elections

At the moment EE can only resolve elections from a postcode or a point location. It doesn't offer any tooling for address lookup or split postcodes.

This is an interesting point because it allows us to draw a useful comparison. It was always an objective that WCIVF should handle the fact that postcodes don't exactly describe political boundaries, with the same level of accuracy that WDIV does. This objective has never been realised, but not for want of trying.

One of the bits of work I did a few years back was extracting the Django models, import scripts and query logic for dealing with AddressBase, ONSPD and ONSUD into a shared library: https://democracyclub.github.io/uk-geo-utils/ The theory went: let's extract all this stuff to a generic shared library that can be consumed by WDIV, WCIVF and EE. We can maintain it all in one place. If an ONSPD release makes changes, we can change it once, all the client apps that consume it inherit that update, and all 3 apps will share a completely consistent data model for postcode/address data. Is this idea starting to sound familiar yet??
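For reference, consuming the library looks roughly like this (written from memory of the linked docs, so treat the exact names as approximate and check the current documentation):

```python
# Rough shape of the uk-geo-utils API as the linked docs describe it.
from uk_geo_utils.geocoders import AddressBaseGeocoder

geocoder = AddressBaseGeocoder("SW1A 1AA")
centroid = geocoder.centroid  # a point derived from the matched addresses
```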

How did that work out? What went well? What went badly? And why? There are probably some things we can learn from this:

First off - are we actually using it?

So.. 2 to 3 years on: Why are we here, and what can we learn from it?

So.. maybe uk-geo-utils has some lessons to teach us there.

To tie it back to whether a shared lib can help solve the problem of "Resolving addresses to elections", I guess the answer is: yes, it already exists. I probably wouldn't try and lump it in with an EE lib though.

That said, I think when it comes to WCIVF having the same geography model as WDIV, there are probably two approaches worth considering:

TBH, if it weren't for the fact that WDIV's data model is now inextricably linked to AddressBase, it might be tempting to completely 'centralise' AddressBase in EE and keep the client apps 'thin', but fundamentally WDIV is always going to need its own copy internally, so there's probably no dice there.

As is often the case with me, there's probably more questions than answers here but hopefully some of this is helpful.

chris48s commented 3 years ago

I do have some more notes on this in bullet point form, but I haven't been able to write it all up yet - I ran out of weekend. I will try and post more stuff in the week.

chris48s commented 3 years ago

The next thing I'm going to muse on is this concept (emphasis mine):

Mixing tooling and data

EE adds value in a couple of ways: [snip]

  1. A service that resolves locations to elections.
  2. The logic that adds records to the elections table (the ID creator)

[snip]

If we (conceptually) separate the data and ID creator from the lookup and models, then we might find that some elements of managing EE get easier. For example, if The Electoral Commission take on the data and ID creator, we could still run our own lookup, as long as we could sync the data to our local models / servers.

This is a super-interesting concept, but I think where you draw the lines is really important when it comes to creating the right incentive structure.

First let's drill down into "the data" as a concept. In my view, the single hardest job with EE is boundary maintenance: keeping track of change over time. This takes several forms:

All of these processes require careful checking, as all of these data sources can have errors in them that require workarounds or follow-up with data publishers. All sorts of odd problems can crop up, for example:

...and in some cases DC is still "feeling its way" when it comes to finding the right process/tooling around this, e.g: https://github.com/DemocracyClub/EveryElection/issues/570

That said, to have a functional ID creator you don't actually need boundaries at all (as in the actual polygons), but you do need to care about boundary reviews. The polygons themselves are only needed for the point --> elections lookup. So maybe lumping the maintenance of the geo data in with ID creation (with a view to those functions being performed by a different organisation) isn't the right abstraction: if the organisation that maintains the boundary data doesn't really use it themselves, that seems like an incentive structure that is doomed to fail. It would be entirely possible to have a 100% correct and functional ID creator sitting on top of geography and identifiers with a variety of issues. As long as you have the right names for all the divisions, the ID creation bit will be fine. Also, there would be no incentive to do stuff like back-porting GSS codes over temp IDs, etc.

Given that, maybe it is useful to think of the maintenance of the Organisation/Division part of the data model as attached to the "ID creator" bit, and the maintenance of the OrganisationGeography/DivisionGeography part as attached to the "lookup"? That would create some additional complexity and duplication of work, and require some restructuring of the data model itself. But if the "ID creator" and "lookup" functions were run by two different orgs, it is probably the way to create the correct incentive structure for everyone to look after the bit they are invested in being accurate.
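In model terms, the split already roughly exists: the metadata and the polygons live in separate tables. A stripped-down sketch (field details are illustrative, not EE's exact schema):

```python
from django.contrib.gis.db import models


class OrganisationDivision(models.Model):
    # Maintained by the "ID creator" org: names and identifiers only.
    official_identifier = models.CharField(max_length=255)  # GSS code or temp ID
    name = models.CharField(max_length=255)


class DivisionGeography(models.Model):
    # Maintained by the "lookup" org: the polygons, kept in their own
    # table so they can be updated and synced independently.
    division = models.OneToOneField(OrganisationDivision, on_delete=models.CASCADE)
    geography = models.MultiPolygonField(srid=4326)
```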

chris48s commented 3 years ago

I think this is the last post I'm going to make on this. In this post I'm going to cover some thoughts on keeping data in sync.. and then I'm going to shut up :zipper_mouth_face:

There would definitely be a lot of benefit in EE/YNR/WCIVF having a consistent data model. As well as improving the dev experience and making things conceptually more consistent, it should also make certain aspects of syncing data easier. A lot of the work in the various tasks that sync data from EE -> YNR and YNR -> WCIVF is transforming the data from one schema to another. If they all share consistent (or similar) database schemas, that definitely gives you the potential to streamline that stuff a lot.
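As a sketch of the payoff: once producer and consumer share the same model, a sync step can be little more than Django's built-in serialization, with no schema-mapping code in between (the shared import path here is hypothetical):

```python
from django.core import serializers

from every_election.models import Election  # hypothetical shared package

# Producer side (EE): export the records
payload = serializers.serialize("json", Election.objects.all())

# Consumer side (YNR/WCIVF): load the same payload directly
for deserialized in serializers.deserialize("json", payload):
    deserialized.save()
```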

That said, shunting all this data from A to B is hard. The current processes are pretty janky and this job represents an opportunity (or perhaps a requirement) to get our house in shape on this. Some thoughts on syncing data between applications: