DemocracyClub / UK-Polling-Stations

:earth_africa: A website to help people find their UK polling station
https://wheredoivote.co.uk/
BSD 3-Clause "New" or "Revised" License

Command to update Addressbase and UprnToCouncil data #8129

Closed GeoWill closed 2 weeks ago

GeoWill commented 1 month ago

Depends on: https://github.com/DemocracyClub/uk-geo-utils/pull/30

To test it, do a local checkout of this branch and of the uk-geo-utils PR, then run pip install -e path/to/uk-geo-utils. You can then run:

    ./manage.py update_addressbase \
        --addressbase-path /home/will/Downloads/addressbase-2024-08-06/addressbase_cleaned.csv \
        --uprntocouncil-path /home/will/Downloads/addressbase-2024-08-06/uprn-to-councils.csv

The files above are available on S3 in the private data bucket on the prod account.

To make iteration cycles faster it's worth shortening the files. For the command to work, however, both files need to contain matching UPRNs.[1]

I've also added reduced versions to the same bucket under the addressbase/sample/ key. They can be used to test the downloads:

    ./manage.py update_addressbase \
        --addressbase-s3-uri='s3://bucket/addressbase/sample/addressbase_cleaned/addressbase_cleaned.csv' \
        --uprntocouncil-s3-uri='s3://bucket/addressbase/sample/uprn-to-council/uprn-to-councils.csv'
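
For anyone poking at the S3 options, this is a minimal sketch of how an s3:// URI like the ones above splits into a bucket name and an object key. It is not the command's actual implementation; split_s3_uri is a hypothetical helper name used only for illustration.

```python
from urllib.parse import urlsplit


def split_s3_uri(uri):
    """Split 's3://bucket/some/key.csv' into ('bucket', 'some/key.csv').

    Hypothetical helper for illustration; not taken from the PR.
    """
    parts = urlsplit(uri)
    # The bucket is the URI's network location; the key is the path
    # without its leading slash.
    key = parts.path.lstrip("/")
    if parts.scheme != "s3" or not parts.netloc or not key:
        raise ValueError(f"not a usable s3:// URI: {uri!r}")
    return parts.netloc, key


bucket, key = split_s3_uri(
    "s3://bucket/addressbase/sample/uprn-to-council/uprn-to-councils.csv"
)
print(bucket, key)
```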

At this stage this still needs to be tested in the dev environment against RDS.

This is now done: on a t3.2xlarge RDS instance it took just under 17m. The time spent in the transaction was short, though I'm not sure exactly how long.

Assuming that works, the process for updating addressbase would look like:

[1] I used miller to do this:

    mlr --csv --hi --ho filter '$1 % 599 == 0' uprn-to-councils.csv > uprn-to-councils-reduced.csv
    mlr --csv --hi --ho filter '$1 % 599 == 0' addressbase_cleaned.csv > addressbase_cleaned-reduced.csv
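
If miller isn't to hand, roughly the same reduction can be sketched in Python. This assumes, as the `$1` in the filter suggests, that the UPRN is the first column of both files and that neither file has a header row. sample_by_uprn is a hypothetical name. Applying the same modulus to both files is what keeps the UPRNs matching.

```python
import csv


def sample_by_uprn(src, dst, modulus=599):
    """Copy rows whose first field (assumed to be the UPRN) is divisible by `modulus`.

    `src` and `dst` are open text file objects. Returns the number of rows kept.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    kept = 0
    for row in reader:
        if not row:
            continue  # skip blank lines
        try:
            uprn = int(row[0])
        except ValueError:
            continue  # skip rows whose first field is not an integer
        if uprn % modulus == 0:
            writer.writerow(row)
            kept += 1
    return kept
```

Used along the lines of `with open("uprn-to-councils.csv", newline="") as src, open("uprn-to-councils-reduced.csv", "w", newline="") as dst: sample_by_uprn(src, dst)`, and then again for addressbase_cleaned.csv.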

coveralls commented 1 month ago

Coverage Status

coverage: 71.332% (+0.4%) from 70.912% when pulling b73f81ed52c043f5fc9fa1e5a77a23d0d7f9b2d9 on feature/faster-addressbase-updates into b9e5eb5fe2c43ff317b2ee47a349613394a274f2 on master.

GeoWill commented 1 month ago

Run teardown as the last thing inside the transaction

chris48s commented 3 weeks ago

Can we make updating the docs in https://github.com/DemocracyClub/UK-Polling-Stations/wiki/Keeping-addressbase-up-to-date a follow-up to this PR?

chris48s commented 3 weeks ago

Any idea what happened with the latest build failure?

chris48s commented 2 weeks ago

OK, so I've done a bit of digging into why the build is failing. I can't really explain why, but the commit hash that get_last_import_sha_from_ssm() is picking up is from a branch that has since been force-pushed over.

So we're trying to run

git diff --name-only e7dc69d5628a19ca314797e24e74853b14ff96fe 3b11ee3fb3dee9d5064d38824a1e5ca5597d57dc

but that's failing because e7dc69d5628a19ca314797e24e74853b14ff96fe isn't a commit that exists in the tree. The equivalent commit that got merged to master was 2e99090f900b0f1ff312940430ac196952a53586
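
One way a job could guard against this is to check that the stored SHA still exists before diffing. The following is only a sketch, not what the deploy scripts currently do: commit_exists and changed_files are hypothetical names, and the `run` parameter exists purely so the functions can be exercised without a real repository.

```python
import subprocess


def commit_exists(sha, repo_dir=".", run=subprocess.run):
    """Return True if `sha` resolves to a commit object in the repository."""
    # `git cat-file -e <sha>^{commit}` exits non-zero when the object is
    # missing, e.g. after the branch it lived on was force-pushed over.
    result = run(
        ["git", "cat-file", "-e", f"{sha}^{{commit}}"],
        cwd=repo_dir,
        capture_output=True,
    )
    return result.returncode == 0


def changed_files(last_sha, head_sha, repo_dir=".", run=subprocess.run):
    """List files changed between two commits, or None if last_sha is gone."""
    if not commit_exists(last_sha, repo_dir, run):
        return None  # caller decides on a fallback, e.g. manual intervention
    result = run(
        ["git", "diff", "--name-only", last_sha, head_sha],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

Returning None rather than raising leaves the choice of fallback to the caller.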

I don't really understand why that's happening though. Seems odd given how far back in the history that is.

chris48s commented 2 weeks ago

The latest changes look sensible.

I'm really confused about why those latest commits have any bearing on what get_last_import_sha_from_ssm() returns :confused:

chris48s commented 2 weeks ago

Just for code archaeology purposes: the failing "Development: Run New Imports Post Deploy" job was running on the deploy to dev, not on this branch (we'd pushed the same commit to both this branch and the development branch). If we've got ourselves into that state by force-pushing the development branch, we can manually change the LAST_IMPORT_SHA variable in the relevant AWS account to un-stick it.
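
For the record, a rough sketch of what un-sticking it could look like from Python. It rests on assumptions: that the value lives in SSM Parameter Store (as the name get_last_import_sha_from_ssm() suggests), that the parameter is literally named LAST_IMPORT_SHA, and that it is a plain String parameter. set_last_import_sha is a hypothetical helper; check the real parameter name and type in the account before running anything like it.

```python
import re

# A full-length git commit hash: 40 lowercase hex characters.
SHA_RE = re.compile(r"[0-9a-f]{40}")


def set_last_import_sha(ssm_client, sha, name="LAST_IMPORT_SHA"):
    """Overwrite the stored last-import SHA.

    `ssm_client` is expected to behave like boto3.client("ssm").
    The parameter name and the String type are assumptions, not taken
    from the deploy scripts.
    """
    if not SHA_RE.fullmatch(sha):
        raise ValueError(f"not a full commit hash: {sha!r}")
    ssm_client.put_parameter(Name=name, Value=sha, Type="String", Overwrite=True)
    return sha


# Example call, assuming boto3 is installed and credentials for the
# relevant account are configured:
#   import boto3
#   set_last_import_sha(boto3.client("ssm"), "2e99090f900b0f1ff312940430ac196952a53586")
```

The value should be a commit that actually exists on the branch being deployed, such as the merged equivalent mentioned earlier in the thread.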