DemocracyClub / EveryElection

:ballot_box_with_check: For recording every election in the UK
https://elections.democracyclub.org.uk/
BSD 3-Clause "New" or "Revised" License

Could EE become a library? #1124

Closed: symroe closed this issue 1 year ago

symroe commented 3 years ago

This is a collection of thoughts I've been having about the future of EE. They deal with a few different issues, but I think they're helpful to think about together (at the moment).

The problems

There are a few issues I'm thinking about here.

Resolving addresses to elections

At the moment EE can only resolve elections from a postcode or a point location. It doesn't offer any tooling for address lookup or split postcodes.

Data modelling across DC's products

We have...at least 3 different ways to model elections (and how they map to geography) in DC: EE, YNR and WCIVF. Each is tailored to its own use case, but there isn't really a good reason for the 3 models other than legacy / learning / ad-hoc development.

Mixing tooling and data

EE adds value in a couple of ways:

1. A curated data package of elections (etc) that's hosted by DC and maintained by DC and a few 3rd parties.
2. A service that resolves locations to elections.
3. The logic that adds records to the elections table (the ID creator).

Embedded model of hosting

Our deployment model means we embed a copy of EE locally on EC2 instances to scale postcode lookups, which means we have to deal with data replication and updates somehow. At the moment we're doing a simple DB dump and restore, mixed with a sync of some of the data via an API.

Ideas for the future

None of these problems needs to be linked to the others, but it might be possible to think about the product in slightly different ways that address all of them nicely.

If we (conceptually) separate the data and ID creator from the lookup and models, then we might find that some elements of managing EE get easier.

For example, if The Electoral Commission take on the data and ID creator, we could still run our own lookup, as long as we could sync the data to our local models / servers.

If the models in EE were a Django package, we could use them in YNR and WCIVF without having to remodel them in each project.

If the data package were maintained by a 3rd party, it wouldn't matter what tech they used to create it, as long as it kept the same format that consumers expected (freeing 3rd parties to choose their own tech stacks).

Possible structure

To flesh out the concepts a little, we could have 3 elements:

Library

A Django package that supplies models, base views and helpers for working with modelled election data, plus a set of management commands to import and sync upstream data from some source.
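To make that concrete, here's a minimal sketch of what such a sync command could look like. Everything here is illustrative: the package name, model and field names, and the upstream URL are assumptions, not existing code.

```python
import requests
from django.core.management.base import BaseCommand

from every_election.models import Election  # hypothetical package/model


class Command(BaseCommand):
    help = "Sync elections from an upstream data package into local models"

    def add_arguments(self, parser):
        parser.add_argument("--url", default="https://example.com/elections.json")

    def handle(self, *args, **options):
        records = requests.get(options["url"], timeout=30).json()
        for record in records:
            # Idempotent upsert keyed on the election ID, so the command
            # can safely run on a schedule without creating duplicates.
            Election.objects.update_or_create(
                election_id=record["election_id"],
                defaults={
                    "name": record["name"],
                    "poll_open_date": record["poll_open_date"],
                },
            )
        self.stdout.write(f"Synced {len(records)} elections")
```

Each consuming app (YNR, WCIVF, a local WDIV install) would run the same command against the same upstream source, so they all converge on the same data without per-project import scripts.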

Data controller

A set of hosted data, with hosted tooling to manage it. This would include the ID creator and other elements that we don't currently have much tooling for, such as importing boundaries.
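For a flavour of the ID-creation part: election identifiers follow a documented dotted format, so at its core an ID creator composes segments like this. This is a deliberately simplified sketch (real IDs also cover election groups, by-elections, subtypes, etc; DC's actual implementation lives in the uk-election-ids package):

```python
from datetime import date


def make_ballot_id(election_type, organisation, division, poll_date):
    """Compose a ballot-level identifier from its constituent parts."""
    # Simplified: ignores group-level IDs, by-elections, subtypes, etc.
    return ".".join([election_type, organisation, division, poll_date.isoformat()])


print(make_ballot_id("local", "norfolk", "diss", date(2021, 5, 6)))
# local.norfolk.diss.2021-05-06
```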

Data package

The actual list of elections, organisations, divisions, etc. that describes the in-scope elections.

With this model, EE wouldn't need to provide a hosted resolving service, but the library could include helpers for doing this, including for AddressBase if it were available.
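For example, the point-to-elections part of that could be a small GeoDjango helper over the division boundaries. The model and field names below are illustrative rather than EE's actual schema:

```python
from django.contrib.gis.geos import Point

from every_election.models import DivisionGeography  # hypothetical import path


def divisions_for_point(lon, lat):
    """Return divisions whose boundary polygons cover the given location."""
    location = Point(lon, lat, srid=4326)  # note: GEOS points are (x=lon, y=lat)
    return DivisionGeography.objects.filter(geography__covers=location)
```

An AddressBase-backed helper would first resolve an address (UPRN) to a point, then reuse the same query.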

This is really just a starting point for talking about future options, so please dive in! @chris48s I'm (as ever) especially interested in your thoughts on this!

michaeljcollinsuk commented 3 years ago

I don't think I have enough experience or familiarity with EE itself to give a qualified opinion on some of this - but speaking from the perspective of working on both YNR and WCIVF, one of the biggest complications comes from the different modelling between the two products (and EE itself). So I really like the idea that the models would be a single library used in multiple projects. Even if we were to refactor WCIVF/YNR to a shared data model (or just something much closer), maintaining both concurrently will never be ideal.

My concern would be making sure that we can model something that works for all products, so we don't end up having to make compromises or accept limitations to make it work, and end up with inferior products. Clearly it would be a huge amount of planning and refactoring!

It also makes sense to me to separate the tooling/ID creator from both the data and the models - although again I say that as someone who hasn't worked with these tools. But generally this sounds like it would make things easier to maintain and support.

chris48s commented 3 years ago

Inevitably I do have some thoughts on this, but it's probably going to take a while to churn it round in my head and write it all up. This is probably going to turn into one of my GitHub novels :books: I'll try and find some time to write up some notes.

chris48s commented 3 years ago

Firstly: I think it is fairly safe to say that there is not one "grand unified theory of everything" that simultaneously solves all of these problems. In fact any solution to one thing on this list probably makes another worse :)

Secondly: Goddammit this is a highly effective nerd-snipe!

Thirdly: I'm going to start off by picking one point from the middle of your post and expanding on that, as it is probably the one I have thought through the most, but I will address the others at some point.

Embedded model of hosting

Our deployment model means we embed a copy of EE locally on EC2 instances to scale postcode lookups, which means we have to deal with data replication and updates somehow. At the moment we're doing a simple DB dump and restore, mixed with a sync of some of the data via an API.

I think the first thing worth doing here is just exploring the problem space and where we are now slightly more. Maybe I'm just telling you stuff you already know here, but hopefully it is useful context for @michaeljcollinsuk if nothing else. There are broadly two types of application DC runs: read-only apps (WDIV, WCIVF) and CRUD apps (EE, YNR).

The way we've tended to scale the read-only apps is that we basically (by one mechanism or another) bake all the data we need to serve a request into a front-end node. Then we can run however many nodes we need. That data is "disposable" and we minimise the number of external interactions (databases, APIs, etc) we can bottleneck on. This pattern doesn't really work for CRUD apps like EE/YNR where users perform writes as well as reads.

Historically we didn't have an architecture for YNR and EE that allowed them to scale to the volumes we process via WDIV/WCIVF on polling day for a major election, so an entire copy of EE's database and application code is one of the things we bundle into a front-end image for WDIV/WCIVF.

We actually didn't start off doing this. We started off directly calling EE from the client apps. Then one polling day EE fell over (I want to say local elections in 2017?), which made everything else fall over too. Hence the easiest thing was to keep all the code we had for communicating with EE over a JSON API and just make lots of copies of the API to "scale" it. I think that might have been a solution we rapidly cobbled together when it became clear there would be a snap GE in mid-2017?

Anyway... history aside: there are several pain points we usually hit with this approach that make it quite fragile once you're running multiple copies of the EE codebase in different places.

Broadly speaking, there are lots of problems that making EE's data model into a library might solve (a consistent data model for EE/YNR/WCIVF is a huuuuuge one), but I don't think this is one of them. If you were to use a "library" version of EE directly in the WDIV and WCIVF codebases instead of bundling the whole application into a server image and communicating with it over HTTP/JSON, I think you fundamentally end up with some configuration of all the above problems. You have a centralised CRUD app you write to, and then you're trying to keep multiple read-only copies of the code/data in sync with it via cron and duct tape. If the code/DB in the client installs diverges from the upstream code then everything falls over. The one thing that might be easier is that it is slightly more obvious you have a problem if EE is running ee-package==2.0.1 and WCIVF is still running ee-package==1.4.6 (or whatever). If you make a code change to EE and tag a release, at least you can have a bot offer to bump it in the client repos instead of having to remember, which helps with one class of problem (but not all of them).

It doesn't necessarily have to be that way though. If EE could run as a completely centralised service and process enough traffic for WDIV and WCIVF to just call it directly on polling day we could stop doing all that (I'm aware this is the exact opposite of turning EE into a library :D ).

So.. one of the things I spent a bunch of time on in 2019 was architecting EE so that it can scale horizontally. The setup for this is slightly different from WCIVF/WDIV because of the read/write nature of the interactions: roughly speaking, it uses a primary/replica RDS setup, with writes going to the primary and lookup reads served from replicas.
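For anyone unfamiliar with the pattern, here's a minimal sketch of what that split looks like in Django terms. The hostnames and router below are illustrative, not EE's actual configuration:

```python
# settings.py (illustrative)
DATABASES = {
    "default": {
        "ENGINE": "django.contrib.gis.db.backends.postgis",
        "HOST": "primary.example.internal",  # primary RDS: takes all writes
    },
    "replica": {
        "ENGINE": "django.contrib.gis.db.backends.postgis",
        "HOST": "replica.example.internal",  # read replica: serves lookups
    },
}
DATABASE_ROUTERS = ["routers.PrimaryReplicaRouter"]


# routers.py (illustrative)
class PrimaryReplicaRouter:
    def db_for_read(self, model, **hints):
        return "replica"

    def db_for_write(self, model, **hints):
        return "default"

    def allow_relation(self, obj1, obj2, **hints):
        return True
```

Adding read capacity is then a matter of spinning up more replicas and app servers behind the load balancer; only the single primary takes writes.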

In reality, I don't think we've ever used this in anger, but in principle everything you need to scale EE horizontally already exists. I think if the scheduled local elections in 2020 had actually taken place, I would probably have piloted scaling out EE and running either WDIV or WCIVF (but not both) directly off elections.democracyclub.org.uk, while keeping the local copies synced so we could have fallen back to the local install (by changing EE_BASE) if it all went to pot. But I guess you probably didn't do that in 2021 and just used local installs again.

I think for this one problem, "Embedded model of hosting", that is probably the direction I would go in (or was going in) to try and solve those problems, rather than a different model of embedding.

That said..

The way that WDIV uses data from Every Election is a bit different to the way that WCIVF does.

In light of that, perhaps it is useful to think of the way that WDIV consumes EE as different from the way WCIVF does in this context?

I'll continue to ponder this and write another chapter at some point...

chris48s commented 3 years ago

The next point I will pick up is this one:

Resolving addresses to elections

At the moment EE can only resolve elections from a postcode or a point location. It doesn't offer any tooling for address lookup or split postcodes.

This is an interesting point because it allows us to draw a useful comparison. It was always an objective that WCIVF should handle the fact that postcodes don't exactly describe political boundaries, with the same level of accuracy that WDIV does. This objective has never been realised, but not for want of trying.

One of the bits of work I did a few years back was extracting the Django models, import scripts and query logic for dealing with AddressBase, ONSPD and ONSUD into a shared library: https://democracyclub.github.io/uk-geo-utils/ The theory went: let's extract all this stuff to a generic shared library that can be consumed by WDIV, WCIVF and EE. We can maintain it all in one place. If an ONSPD release makes changes, we can change it once, all the client apps that consume it inherit that update, and all 3 apps will share a completely consistent data model for postcode/address data. Is this idea starting to sound familiar yet??
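For reference, consuming the library looks roughly like this (written from memory of the linked docs, so treat the exact names as approximate and check the current documentation):

```python
# Rough shape of the uk-geo-utils API as the linked docs describe it.
from uk_geo_utils.geocoders import AddressBaseGeocoder

geocoder = AddressBaseGeocoder("SW1A 1AA")
centroid = geocoder.centroid  # a point derived from the matched addresses
```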

How did that work out? What went well? What went badly? And why? There are probably some things we can learn from this:

First off - are we actually using it?

So.. 2 to 3 years on: Why are we here, and what can we learn from it?

So.. maybe uk-geo-utils has some lessons to teach us there.

To tie it back to whether a shared lib can help solve the problem of "Resolving addresses to elections", I guess the answer is: yes, it already exists. I probably wouldn't try and lump it in with an EE lib though.

That said, I think when it comes to WCIVF having the same geography model as WDIV, there are probably two approaches worth considering:

TBH, if it weren't for the fact that WDIV's data model is now inextricably linked to AddressBase, it might be tempting to completely 'centralise' AddressBase in EE and keep the client apps 'thin', but fundamentally WDIV is always going to need its own copy internally, so there's probably no dice there.

As is often the case with me, there's probably more questions than answers here but hopefully some of this is helpful.

chris48s commented 3 years ago

I do have some more notes on this in bullet point form, but I haven't been able to write it all up yet - I ran out of weekend. I will try and post more stuff in the week.

chris48s commented 3 years ago

The next thing I'm going to muse on is this concept (emphasis mine):

Mixing tooling and data

EE adds value in a couple of ways: [snip]

  1. A service that resolves locations to elections.
  2. The logic that adds records to the elections table (the ID creator)

[snip]

If we (conceptually) separate the data and ID creator from the lookup and models, then we might find that some elements of managing EE get easier. For example, if The Electoral Commission take on the data and ID creator, we could still run our own lookup, as long as we could sync the data to our local models / servers.

This is a super-interesting concept, but I think where you draw the lines is really important when it comes to creating the right incentive structure.

First let's drill down into "the data" as a concept. In my view, the single hardest job with EE is boundary maintenance: keeping track of change over time. This takes several forms:

All of these processes require careful checking, as all of these data sources can have errors in them that require workarounds or follow-up with data publishers. All sorts of odd problems can crop up, for example:

...and in some cases DC is still "feeling its way" when it comes to finding the right process/tooling around this, e.g: https://github.com/DemocracyClub/EveryElection/issues/570

That said, to have a functional ID creator you don't actually need boundaries at all (as in the actual polygons), but you do need to care about boundary reviews. The polygons themselves are only needed for the point --> elections lookup. So maybe lumping the maintenance of the geo data in with ID creation (with a view to those functions being performed by a different organisation) isn't the right abstraction: if the organisation that maintains the boundary data doesn't really use it themselves, that seems like an incentive structure that is doomed to fail. It would be entirely possible to have a 100% correct and functional ID creator sitting on top of geography and identifiers with a variety of issues. As long as you have the right names for all the divisions, the ID creation bit will be fine. Also, there would be no incentive to do stuff like back-porting GSS codes over temp IDs, etc.

Given that, maybe it is useful to think of the maintenance of the Organisation/Division part of the data model as attached to the "ID creator" bit, and the maintenance of the OrganisationGeography/DivisionGeography part as attached to the "lookup"? That would create some additional complexity and duplication of work, and require some restructuring of the data model itself. But if the "ID creator" and "lookup" functions were run by two different orgs, it is probably the way to create the correct incentive structure for everyone to look after the bit they are invested in being accurate.
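In model terms, the split already roughly exists: the metadata and the polygons live in separate tables. A stripped-down sketch (field details are illustrative, not EE's exact schema):

```python
from django.contrib.gis.db import models


class OrganisationDivision(models.Model):
    # Maintained by the "ID creator" org: names and identifiers only.
    official_identifier = models.CharField(max_length=255)  # GSS code or temp ID
    name = models.CharField(max_length=255)


class DivisionGeography(models.Model):
    # Maintained by the "lookup" org: the polygons, kept in their own
    # table so they can be updated and synced independently.
    division = models.OneToOneField(OrganisationDivision, on_delete=models.CASCADE)
    geography = models.MultiPolygonField(srid=4326)
```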

chris48s commented 3 years ago

I think this is the last post I'm going to make on this. In this post I'm going to cover some thoughts on keeping data in sync.. and then I'm going to shut up :zipper_mouth_face:

There would definitely be a lot of benefit in EE/YNR/WCIVF having a consistent data model. As well as improving the dev experience and making things conceptually more consistent, it should also make certain aspects of syncing data easier. A lot of the work in the various tasks that sync data from EE -> YNR and YNR -> WCIVF is transforming the data from one schema to another. If they all share consistent (or similar) database schemas, that definitely gives you the potential to streamline that stuff a lot.
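As a sketch of the payoff: once producer and consumer share the same model, a sync step can be little more than Django's built-in serialization, with no schema-mapping code in between (the shared import path here is hypothetical):

```python
from django.core import serializers

from every_election.models import Election  # hypothetical shared package

# Producer side (EE): export the records
payload = serializers.serialize("json", Election.objects.all())

# Consumer side (YNR/WCIVF): load the same payload directly
for deserialized in serializers.deserialize("json", payload):
    deserialized.save()
```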

That said, shunting all this data from A to B is hard. The current processes are pretty janky and this job represents an opportunity (or perhaps a requirement) to get our house in shape on this. Some thoughts on syncing data between applications: