Urban-Analytics-Technology-Platform / popgetter

https://popgetter.readthedocs.io/en/latest/
Apache License 2.0

UK data sources #2

Open dabreegster opened 1 year ago

dabreegster commented 1 year ago

Goal: explore OA-level 2020 data. A possible use case could be adding as a layer in the LTN and 15m tools.

Geometry

Fernando made this for SPC: https://ramp0storage.blob.core.windows.net/nationaldata-v2/GIS/OA_2011_Pop20.geojson

But let's track down the originals... https://geoportal.statistics.gov.uk/search?collection=Dataset&sort=name&tags=all(BDY_OA%2CDEC_2021) Full extent, full clipped, or generalised clipped.

dabreegster commented 1 year ago

Data

1) Select a topic through https://www.ons.gov.uk/census/maps/choropleth?oa=E00091036 (and can use this to see what's at OA level and immediately visualize it!)
2) "Download data" to get to something like https://www.nomisweb.co.uk/datasets/c2021ts006
3) Go through the NOMIS UI manually, select OA geography, all the stuff, etc, download CSV

Is there a way to

dabreegster commented 1 year ago

https://github.com/virgesmith/UKCensusAPI is an option. I'm trying to work out if simple bulk downloads can just be built in the browser. https://www.nomisweb.co.uk/api/v01/help

dabreegster commented 1 year ago

Number of cars: https://www.ons.gov.uk/filters/432658ed-9740-46f4-9565-e0321181ea75/dimensions, or https://www.ons.gov.uk/datasets/TS045/editions/2021/versions/3/filter-outputs/a20437fb-ae7f-439b-bc91-de261335038b#get-data. I tried generating the API link through https://www.nomisweb.co.uk, but got an internal error of some sort.

For now, I'd be fine doing the data downloads manually.

But now it's time to figure out how to more compactly represent all the datasets for the A/B Street map model.

dabreegster commented 1 year ago

Disability only available at LAD level: https://www.ons.gov.uk/census/maps/choropleth/health/disability-age-standardised/disability-4a/disabled-under-the-equality-act-day-to-day-activities-limited-a-lot?oa=E00091042

dabreegster commented 1 year ago

How do you actually get the number of people in an OA? https://www.ons.gov.uk/census/maps/choropleth/population/population-density/population-density/persons-per-square-kilometre?oa=E00091042 and then multiply by the area?

dabreegster commented 1 year ago

E00104221 (and many other areas) have a point crossing itself in the final clipped GeoJSON: [Screenshot from 2023-03-15 10-08-06]

Where in the pipeline is this being introduced?

dabreegster commented 1 year ago

The TopoJSON is over-simplified. Need to look into the mapshaper options! [Screenshot from 2023-03-15 10-11-50] [Screenshot from 2023-03-15 10-11-44]

andrewphilipsmith commented 1 year ago

I've made a start on this, and here are some notes:

Geometries

There are three different versions of the Output Areas (OAs) available from the ONS. By ad-hoc visual inspection, the Generalised Clipped EW (BGC) is at least as good as the current data - but will review.

Each of these versions has 188880 records. This is consistent with the census data (see below). I presume from the filename that the OA are from 2011, and that's the reason for the discrepancy between the number of records in the census data and the geometry data.

There does not seem to be a straightforward way to programmatically download the data. The best (or least worst) way is to use the WFS. However, there is a limit of 4000 records per request. It seems that OGR has a way of handling this: https://gis.stackexchange.com/questions/422609/downloading-lots-of-data-from-wfs
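A minimal sketch of how the 4000-record limit could be worked around by paging, assuming the server supports the standard WFS 2.0 `count` and `startIndex` parameters (not verified for the ONS service). The endpoint URL and layer name are placeholders, and `wfs_page_urls` is a hypothetical helper, not part of any existing tool:

```python
from urllib.parse import urlencode


def wfs_page_urls(base_url, type_name, total_records, page_size=4000):
    """Build one WFS 2.0 GetFeature URL per page of at most page_size records."""
    urls = []
    for start in range(0, total_records, page_size):
        params = {
            "service": "WFS",
            "version": "2.0.0",
            "request": "GetFeature",
            "typeNames": type_name,
            "count": page_size,      # maximum records in this response
            "startIndex": start,     # offset of the first record in this page
        }
        urls.append(f"{base_url}?{urlencode(params)}")
    return urls


# Placeholder endpoint and layer name, for illustration only.
urls = wfs_page_urls("https://example.org/wfs", "OA_2021_BGC", total_records=188880)
print(len(urls))  # number of requests needed for 188880 records
```

Each URL would then be fetched in turn and the features merged, which is roughly what the OGR paging support linked above does for you.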

Other options are:

Census data from browser download

The WebUI to access the car ownership census data is here: https://www.ons.gov.uk/datasets/TS045/editions/2021/versions/3/filter-outputs/a20437fb-ae7f-439b-bc91-de261335038b#get-data This needs to be downloaded manually.

The CSV can be downloaded using wget from this URL (though it is unclear if this URL will remain valid): CENSUS_URL = "https://static.ons.gov.uk/datasets/a20437fb-ae7f-439b-bc91-de261335038b/TS045-2021-3-filtered-2023-03-13T16:49:47Z.csv"

This has 944401 lines (including the header). 944400 / 188880 = 5 categories per OA, which is what we expect.
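The same sanity check could be scripted so it can be rerun on each download. This is only a sketch; `categories_per_area` is a hypothetical helper name, and the numbers are the ones quoted above:

```python
def categories_per_area(total_lines, n_areas, header_lines=1):
    """Return the number of rows per area, failing if the rows do not divide evenly."""
    data_rows = total_lines - header_lines
    quotient, remainder = divmod(data_rows, n_areas)
    if remainder:
        raise ValueError(
            f"{data_rows} data rows do not divide evenly across {n_areas} areas"
        )
    return quotient


# 944401 lines in the CSV (including the header), 188880 OAs in the geometry data.
print(categories_per_area(944401, 188880))  # 5
```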

Census data using the NOMIS API and UKCensusAPI

* keep a permalink to the extract? (let's see if https://www.nomisweb.co.uk/Query/GetFile?filename=2422351643080675.csv lasts over time)

I tried this on 2023/04/21 and got this error message:

Session timed out or not logged in: This page is only available when you query data on Nomis, you cannot bookmark it or directly link to it.

So unfortunately, that's not an option...

> I tried generating the API link through https://www.nomisweb.co.uk, but got an internal error of some sort.

I also tried this and got the same error message. However, in the help, it states that there is a 25000 cell limit per request, which might be the issue we're hitting. It is possible to obtain an API key which increases the limit to 1 million cells. See https://github.com/virgesmith/UKCensusAPI#api-key

The 1M limit would be enough for the whole of England at five categories per OA, but would not be sufficient for more complex census queries. In those cases, we could loop through batches of OA codes and then merge the results. This might be useful if we want to be able to get lots of different data from NOMIS. It's not clear if the UKCensusAPI package can do this. If not, then it might be appropriate to make a PR to that package rather than adding it here.
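A rough sketch of the batching idea, assuming each OA contributes one cell per category (how NOMIS actually counts cells is not confirmed here). `batch_codes` is a hypothetical helper and the OA codes below are made up:

```python
def batch_codes(codes, cells_per_code, cell_limit=25000):
    """Split codes into batches whose total cell count stays within cell_limit."""
    per_batch = cell_limit // cells_per_code
    if per_batch < 1:
        raise ValueError("a single code already exceeds the cell limit")
    return [codes[i:i + per_batch] for i in range(0, len(codes), per_batch)]


# Made-up OA codes, for illustration only.
codes = [f"E{n:08d}" for n in range(1, 12001)]
batches = batch_codes(codes, cells_per_code=5)
print(len(batches), len(batches[0]), len(batches[-1]))
```

Each batch would become one API request, with the responses merged afterwards. With an API key, `cell_limit` would be raised to 1000000.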

andrewphilipsmith commented 1 year ago

A question: Where does this pipeline need to be run? Are there any restrictions on the environment? (I note there is a comment above about processing bulk downloads in the browser).

dabreegster commented 1 year ago

Thanks for the investigation!

> Whilst it might be possible to hack this, it seems like a very fragile solution as no doubt the service has been designed to prevent this

Hmm, I can get to a download link with a few clicks manually. I wonder if we can write a simple spider to mimic what a human does; https://github.com/yt-dlp/yt-dlp is a modern example that works great for many sources. Or we probably have some ONS contacts and could ask them what they want the right approach to be for reproducible pipelines.

> If not, then it might be appropriate to make a PR to that package rather than adding it here.

+1 to improving something that exists if other projects are already using it and it's well-established

> Where does this pipeline need to be run? Are there any restrictions on the environment?

At the moment, manually in a Linux or Mac environment. Conceivably we might want to run it on a cloud VM someday. As long as the dependencies are easy to install (one or two commands) and things won't quietly break over time (so, something with lockfiles pinning to specific versions), I'd be happy. On that note, I'd love to ask what the current best practices for Python dependencies/environments are from REG's perspective -- I've had most luck with Poetry, but still not perfect. :\

andrewphilipsmith commented 1 year ago

Adding notes from @sgreenbury

Scotland + NI

Ways to get data:

Scotland

A dataframe of OA11 with people counts for age and gender, and ethnicity, is obtainable with the UKCensusAPI. It may be better, though, to simply extract the code used to get the required table and use it in our own package(s).

from ukcensusapi.NRScotland import NRScotland
nrscotland = NRScotland("cache/")
# Get the data for a single LAD at OA level
df = nrscotland.get_data(
    coverage="S12000033",
    resolution="OA11",
    table="DC2101SC"
)

You can loop over all LADs and concat to get all LADs at OA-level.
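The loop could look something like the sketch below. To keep it runnable without UKCensusAPI installed, the per-LAD fetch is passed in as a function and returns plain rows; `fetch_all_lads` and `fake_fetch` are hypothetical names, and the second LAD code is only there for illustration:

```python
def fetch_all_lads(fetch_lad, lad_codes):
    """Call fetch_lad(code) for every LAD and concatenate the returned rows."""
    all_rows = []
    for code in lad_codes:
        for row in fetch_lad(code):
            # Tag each row with its LAD so the merged table stays traceable.
            all_rows.append({"LAD": code, **row})
    return all_rows


def fake_fetch(code):
    """Stand-in for the real per-LAD request; returns two dummy OA rows."""
    return [{"OA11": f"{code}-oa{i}", "count": i} for i in range(2)]


rows = fetch_all_lads(fake_fetch, ["S12000033", "S12000034"])
print(len(rows))
```

With the `nrscotland.get_data(...)` call above returning a DataFrame per LAD, the equivalent would be to collect the frames in a list and merge them with `pandas.concat`.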

Northern Ireland

Ready made

Small area vs. Data Zone

API

andrewphilipsmith commented 6 months ago

I’ve been looking again at the UK - but mostly England - census data. I've been trying to find a solution that doesn’t require manually finding the URL of every table we require. There seem to be three main (slightly overlapping) options:

1. The UKCensusAPI package.

As far as I can tell:

How does this tally with the experience of using it within SPC?

2. The main Nomisweb REST API

3. Use the list of “Bulk Downloads”

See https://www.nomisweb.co.uk/sources/census_2021_bulk

It would be possible to use some web-scraping process (e.g. Beautiful Soup) to get a list of zip files to download, then download all of them and process the results. This feels like it would be quicker to implement but would be a lot more fragile.
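A minimal sketch of the link-harvesting step, using the standard library's `html.parser` rather than Beautiful Soup to avoid an extra dependency. The HTML sample and its file paths are invented for illustration; the real page structure has not been checked, and `ZipLinkParser` is a hypothetical name:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ZipLinkParser(HTMLParser):
    """Collect absolute URLs of every <a href> that points at a .zip file."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.zip_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".zip"):
            # Resolve relative links against the page URL.
            self.zip_urls.append(urljoin(self.base_url, href))


# Invented sample markup; the real bulk-download page may look different.
SAMPLE_HTML = """
<ul>
  <li><a href="/output/census/2021/census2021-ts001.zip">TS001</a></li>
  <li><a href="/sources/census_2021">Census 2021</a></li>
  <li><a href="/output/census/2021/census2021-ts045.zip">TS045</a></li>
</ul>
"""

parser = ZipLinkParser("https://www.nomisweb.co.uk/sources/census_2021_bulk")
parser.feed(SAMPLE_HTML)
print(parser.zip_urls)
```

In practice the page HTML would be fetched first and fed to the parser; any change to the page layout or file naming would silently change the result, which is the fragility mentioned above.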

My impression is that option 2 is the best choice, though creating a PR for UKCensusAPI might be a good way to achieve the same thing. Any opinions? I would be particularly keen to hear from those who have used UKCensusAPI.