alan-turing-institute / uatk-spc

Synthetic Population Catalyst
https://alan-turing-institute.github.io/uatk-spc/
MIT License

Remove dependency on proj #47

Closed dabreegster closed 1 year ago

dabreegster commented 1 year ago

@mfbenitezp is hitting more proj issues. This is the only problematic external dependency we have, and it'd be so great to remove it. Where do we use it?

We use data/raw_data/nationaldata/MSOAS_shp/msoas.dbf just to get the polygon per MSOA. It comes somewhere from ONS, but I've definitely found this in nicer formats (GeoJSON or TopoJSON) and already in WGS84. We could just swap the inputs out. @HSalat, this shapefile has a population count unrelated to the rest of SPC, and we plumb it through in the output. Is it important at all?

The other use is converting the coordinates of venues from QUANT. We could just rewrite this file once to use WGS84. AFAICT, the QUANT data file we use is not completely based on anything open source (https://github.com/maptube/QUANT_RAMP is not enough to reproduce the tar.gz file we got), so making further modifications to this data is fair game. Do either of you know where the QUANT data file we use came from?

HSalat commented 1 year ago

I'm not against switching to another format. The point of having the population counts inside the file is that they are correct (the number of individuals predicted by SPC isn't), so it's a good control. I also remember I used it somewhere but can't remember where, probably some offline stuff. @mfbenitezp was going to do the new GIS stuff (at different scales).
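
As a rough illustration of the "control" use described above, a per-MSOA comparison between the reference counts and the synthetic counts might look like the sketch below. The function name, the 5% tolerance, and all the numbers are made up for illustration; nothing here is taken from the SPC code.

```python
def population_gaps(reference, synthetic, tolerance=0.05):
    """Return MSOAs whose synthetic count deviates from the reference
    count by more than `tolerance` (relative error).

    Both arguments map an MSOA code to a head count. The tolerance is an
    arbitrary illustrative threshold, not a value used by SPC.
    """
    gaps = {}
    for msoa, expected in reference.items():
        actual = synthetic.get(msoa, 0)
        if expected == 0:
            rel = 0.0 if actual == 0 else float("inf")
        else:
            rel = abs(actual - expected) / expected
        if rel > tolerance:
            gaps[msoa] = (expected, actual, rel)
    return gaps


# Made-up counts, keyed by MSOA11CD-style codes.
reference = {"E02000001": 1000, "E02000002": 2000}
synthetic = {"E02000001": 990, "E02000002": 1700}

flagged = population_gaps(reference, synthetic)
for msoa, (expected, actual, rel) in flagged.items():
    print(f"{msoa}: expected {expected}, got {actual} ({rel:.0%} off)")
```

With these invented numbers only the second area is flagged, since 990 is within 5% of 1000 but 1700 is not within 5% of 2000.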

The QUANT file is coming from before we started and was supposedly given directly by CASA. I see no issue with modifying it. Could you upload the final version to nationaldata-v2 on azure once finished?

dabreegster commented 1 year ago

Do you know where the population counts came from? We can preserve them regardless, just easier to know the origin

HSalat commented 1 year ago

I added them myself when I was doing the thing that needed them. I'll update when I remember what it was!

mfbenitezp commented 1 year ago

I've created the merged geographies (based on the 2011 census), initially generalized at 20m (they can also be super-generalized to 200m). For convenience I've kept the CD attributes from the English geometries, but added the corresponding codes from the Scottish boundaries/census geographies. So far I have LAD, MSOA, LSOA, and OA for Scotland, England and Wales, ready to be exported in the required format.

dabreegster commented 1 year ago

Thanks Fernando!

If we're preserving what SPC outputs today, we just need MSOA11CD, the polygon in WGS84, and this mystery population count: https://github.com/alan-turing-institute/uatk-spc/blob/13ce829a0d449c3761670fdfd5576f5fa55f8923/synthpop.proto#L33 The detailed 20m resolution at MSOA level is a negligible cost. If we want to include OAs, then it'd be worth comparing file size / load time / web rendering impact at both resolutions.

My preferred format is GeoJSON or TopoJSON, but even shapefile is OK if needed. The goal is to just do the reprojection somewhere once, so the Rust code doesn't need a dependency on system Proj
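
The file size and load time part of that comparison could be sketched roughly as below (web rendering impact would need a separate test in a browser). This assumes GeoJSON input and only uses the Python standard library; the `measure` helper and the generated stand-in files are hypothetical, so in practice the real boundary files would be passed in instead.

```python
import json
import os
import tempfile
import time


def measure(path):
    """Report size on disk, JSON parse time, and feature count for one GeoJSON file."""
    size = os.path.getsize(path)
    start = time.perf_counter()
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    elapsed = time.perf_counter() - start
    return {"bytes": size, "load_seconds": elapsed, "features": len(data["features"])}


def fake_collection(n):
    """Build a stand-in FeatureCollection with n identical point features."""
    feature = {
        "type": "Feature",
        "properties": {"id": "x"},
        "geometry": {"type": "Point", "coordinates": [-0.1, 51.5]},
    }
    return {"type": "FeatureCollection", "features": [feature] * n}


# Stand-ins for a coarse file and a much larger, finer-grained one.
results = {}
with tempfile.TemporaryDirectory() as tmp:
    for name, n in [("msoa_like", 10), ("oa_like", 1000)]:
        path = os.path.join(tmp, f"{name}.geojson")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(fake_collection(n), f)
        results[name] = measure(path)

for name, stats in results.items():
    print(name, stats)
```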

HSalat commented 1 year ago

I think it'd be good to have all three resolutions readily available: OA bc it's the household res (e.g. to draw flows), LSOA bc it's the workplace res (flows again, or workplace breakdown), and MSOA bc it's a better res for aggregate info (pop) + might be required for performance when OA is too heavy (e.g. for bigger areas)

dabreegster commented 1 year ago

Then Fernando, how about 3 files (OA, LSOA, MSOA) in TopoJSON (1st option) or GeoJSON (2nd option), with at least the canonical ID as a property, and ideally also this population count?
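
For what it's worth, a quick sanity check on files in that shape could look like the sketch below: it confirms each feature carries the ID property and that coordinates look like lon/lat degrees rather than projected metres. The `population` property name and the sample features are assumptions for illustration; `MSOA11CD` is the only property name taken from this thread. In practice the collection would come from `json.load` on the real file.

```python
def iter_positions(coords):
    """Yield every position from arbitrarily nested GeoJSON coordinates."""
    if coords and isinstance(coords[0], (int, float)):
        yield coords
    else:
        for part in coords:
            yield from iter_positions(part)


def check_boundaries(collection, id_key="MSOA11CD", pop_key=None):
    """Return a list of human-readable problems found in a FeatureCollection."""
    problems = []
    for i, feature in enumerate(collection["features"]):
        props = feature.get("properties") or {}
        if id_key not in props:
            problems.append(f"feature {i}: missing {id_key}")
        if pop_key and pop_key not in props:
            problems.append(f"feature {i}: missing {pop_key}")
        for lon, lat, *_ in iter_positions(feature["geometry"]["coordinates"]):
            # Projected coordinates (e.g. metres) fall far outside these ranges.
            if not (-180 <= lon <= 180 and -90 <= lat <= 90):
                problems.append(f"feature {i}: ({lon}, {lat}) is not lon/lat degrees")
                break
    return problems


# Invented sample: one plausible feature, one with projected-looking
# coordinates and no properties.
sample = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"MSOA11CD": "E02000001", "population": 1000},
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[-0.10, 51.51], [-0.09, 51.51], [-0.09, 51.52], [-0.10, 51.51]]],
            },
        },
        {
            "type": "Feature",
            "properties": {},
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[530000, 180000], [531000, 180000], [531000, 181000], [530000, 180000]]],
            },
        },
    ],
}

problems = check_boundaries(sample, pop_key="population")
for problem in problems:
    print(problem)
```

The range check cannot prove a file is in WGS84, it only catches the obvious case where projected coordinates slipped through.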

mfbenitezp commented 1 year ago

I'll work on the Pop attributes and the conversion.

mfbenitezp commented 1 year ago

MSOA and LSOA have been shared in GeoJSON; I will work on the OA with Pop now.

mfbenitezp commented 1 year ago

LAD, MSOA, LSOA, and OA with Pop from 2020 are now on Azure, in GeoJSON and in WGS84. Is there anything else I should provide?

dabreegster commented 1 year ago

Thank you! I think we're good for now, I'll cut over the Rust code to use the new files, and remove the proj dependency totally.

dabreegster commented 1 year ago

Sorry, going to keep this open until the Rust stuff is done, if that's OK

dabreegster commented 1 year ago

This is done now; we don't need proj in the new_schema branch anymore. Thanks for the help Fernando!