DyME/SPC + REG chat about HPC access

dingaaling commented 1 year ago

DyME-CHH will need future projected populations data from SPC.

SPC is currently reproducing the newest version of SPENSER which will allow expanding geographically (England > GB) and temporally (beyond 2020). This will be used to create future projected populations data.

The SPC team is currently doing this work on their personal laptops which is very slow.

The DyME-CHH/SPC team would like to chat about:

Availability of HPC resources before EOY 2022
Setting up SPC to run on a HPC for projected datasets

crangelsmith commented 1 year ago

Hi @HSalat @RuthBowyer! In order to start this request it would be helpful to know more details about the requirements.

I assume these are Azure resources you are looking for?
What is the budget available, is this E0Y 2022 (December) or the end of the financial year?
Type of resources needed, GPUS, VMs with many cores? Storage space?
Is there any sensitivity in the code/data?

HSalat commented 1 year ago

Hi @crangelsmith!

The SPC side can contribute an Azure subscription (REMUS) with £427.71 remaining (until 31 March 2023).

The main task for the moment is to run a model that was coded by the University of Leeds, so I do not know how it was optimised unfortunately. I ran one LAD for one projected year as a test and it took about 3 hours with my laptop. This process would have to be repeated 324 x 5 times as a minimum. I was wondering, as a matter of fact, if someone would be able to look into "obvious" optimisations of the codebase before we start using HPC resources. For reference, the process comprises 3 steps involving 5 separate repositories:

git clone -b master --single-branch https://github.com/ld-archer/UKCensusAPI.git
git clone -b master --single-branch https://github.com/ld-archer/ukpopulation.git
git clone https://github.com/virgesmith/humanleague.git
git clone -b arc --single-branch https://github.com/nismod/household_microsynth.git
git clone -b arc --single-branch https://github.com/nismod/microsimulation.git

The code is under MIT licence and the data is based on the census 2011, but in an aggregated form, so it shouldn't be an issue.

crangelsmith commented 1 year ago

Hi @HSalat

@AoifeHughes and I can try to take a look for optimisation of the codebase.

Could you give us some instructions on how exactly to run this test that took 3 hours and which of these repos is the main code base to optimise?

HSalat commented 1 year ago

Hi,

I've attached the full script to install and test the different packages. Please be aware that only the branches indicated will work. Some tests must be run twice before they work but I've been advised not to worry about it. @ld-archer is a much better point of contact than me for SPENSER as I wasn't directly involved in that project.

Then, this needs to be done first from inside the household_microsynth directory: scripts/run_microsynth.py E09000002 OA11

Followed by three stages from inside microsimulation using the config files attached placed in the config folder:

scripts/run_ssm.py -c config/ssm_current.json E09000002
scripts/run_ssm_h.py -c config/ssm_h_current.json E09000002
scripts/run_assignment.py -c config/ass_current.json E09000002

The last step is the one that is really slow.

SPENSER_HPC_setup5.sh.zip

config_files.zip

crangelsmith commented 1 year ago

Hi @HSalat,

I've tested running the steps above in an Azure VM ( 16 Gb RAM, 4 cores) and these were the running times:

household_microsynth/scripts/run_microsynth.py E09000002 OA11 -> 50 minutes
microsimulation/scripts/run_ssm.py -c config/ssm_current.json E09000002 -> 53 seconds
microsimulation/scripts/run_ssm_h.py -c config/ssm_h_current.json E09000002 -> seconds
microsimulation/scripts/run_assignment.py -c config/ass_current.json E09000002 -> 43 minutes

So looks like steps 1 and 4 are the slow ones, does this look right to you?

As we don't really own the code, I would restrain from trying to optimise it as a first step, but I think we can work in using parallelization by running in parallel several LAD codes and using a VM with many cores.

We did something similar a couple of years ago with another microsimulation pipeline also linked to SPENSER. See here.

When you say you have to run this 324 x 5 times, is it 324 LADs? and 5 time periods (defined by the config files?)? I can start testing parallelizing on LADs, do you have a list that we should use?

HSalat commented 1 year ago

So looks like steps 1 and 4 are the slow ones, does this look right to you?

Yes, if I remember correctly, I had to do 1 again just before re-running 4, so the long bit must have been 1 + 4 together.

As we don't really own the code, I would restrain from trying to optimise it as a first step, but I think we can work in using parallelization by running in parallel several LAD codes and using a VM with many cores.

We've been in touch with them. Their funding has ended and we have their blessing to do what we need with it (the code is under MIT licence). That said, we only need to do 1 once for each lad, then 4 about 1600 times, so about one week if 10 LADs are running in parallel, which would be ok. @RuthBowyer can you confirm which years would work for you? I was going to suggest: 2011 (min) 2020 (we need this one) 2030 2040 and 2050?

When you say you have to run this 324 x 5 times, is it 324 LADs? and 5 time periods (defined by the config files?)? I can start testing parallelizing on LADs, do you have a list that we should use?

That's correct, although I am now realising that I forgot to count the area codes for Wales and Scotland. It is not clear to me how SPENSER's projections work with LADs that have changed after 2011 and I couldn't find the codes for Scotland in 2011. I've attached a list of codes and names for 2011 in England and Wales and will ask for confirmation about what codes exactly are supported.

lad_list.csv

crangelsmith commented 1 year ago

We've been in touch with them. Their funding has ended and we have their blessing to do what we need with it (the code is under MIT licence). That said, we only need to do 1 once for each lad, then 4 about 1600 times, so about one week if 10 LADs are running in parallel, which would be ok. @RuthBowyer can you confirm which years would work for you? I was going to suggest: 2011 (min) 2020 (we need this one) 2030 2040 and 2050?

I think it might take longer for us to understand the details of each one of these libraries in order to do a refactoring that might not guarantee a significant speed up, also I'm worried of breaking a codebase i'm not sure how to test in a short time scale. For something we only have to run once, it might not be worth it I think?

We can try the parallelisation route with a few VM with several cores, or use Azure batch. I understand that Dyme has some budget that we can use for this, and getting a VM for a couple of months would cost a few hundred pounds, not sure yet how much would the Azure batch costs.

A couple more questions just to make sure I understand:

We only have to run step 1 once, for all the LADs. There is no time period dependence in this part of the job, right?
Step 1 has to be run before steps 2, 3, 4?
Step 4 uses outputs of steps 1, 2, and 3?

If the answers is yes, we can divide it to two tasks.

Task 1: Step 1 ran over N LADs, either parallelised in a VM or sent in batch jobs using Azure.
Task 2: Step 2, 3 and 4 in that order, for N LADs, again parallelised or batch jobs.

with

N: Number of LADs

Then,

Task 1 is ran only once, and before task 2.
Task 2 is ran M times.

M: Number of time periods (a hand-full of times), each time has its own triad of config files (ssm_current.json, ssm_h_current.json, ass_current.json).

Does all this make sense?

HSalat commented 1 year ago

All correct, except 2 and 3 produce and store all "intermediary" years, so they only need to be run once for the latest date.

Run 1 for N LADs
Run 2 and 3 for final year (tbc) and N LADs
Run 4 for N LADs and M selected years

RuthBowyer commented 1 year ago

We've been in touch with them. Their funding has ended and we have their blessing to do what we need with it (the code is under MIT licence). That said, we only need to do 1 once for each lad, then 4 about 1600 times, so about one week if 10 LADs are running in parallel, which would be ok. @RuthBowyer can you confirm which years would work for you? I was going to suggest: 2011 (min) 2020 (we need this one) 2030 2040 and 2050?

This sounds great, thanks Hadrien! Any chance it could go to 2060 or even 2080, or is this too tricky?

crangelsmith commented 1 year ago

Thinking more about it, looks like using the Batch Azure service is probably the best strategy.

If we create a job that for an input LAD runs:

Run 1
Run 2 and 3 for final year (tbc)
Run 4 M times for selected years consecutively

then we just need to submit ~350 jobs (for the N LADs).

If each job takes ~10 hours (probably less but lets be conservative), then we would have to pay for ~3500 hours of batch computing which is around £0.3040/hour on a decent VM (take a look at the pricing here). But as is in parallel we'll be done on a day.

How big is the output for given LADs? Based on this we can optimise the kind of VM we use and its price.

RuthBowyer commented 1 year ago

Just to highlight here something we spoke about in our Dyme Catch up so Hadrian is in the loop - potentially I would be interested in this for SPC if its possible - 5 socioeconomic pathways at decade intervals https://www.ukclimateresilience.org/products-of-the-uk-ssps-project/ Also tagging @mfbenitezp

HSalat commented 1 year ago

I've been advised to use 2020 codes, attached. That's a total of 368 including Scotland. new_lad_list.csv

The issue with the projections is that we don't have any actual validation of the process, so the further away it goes from 2020 the less reliable they get. The pricing is very cryptic to me, as I don't know what it means in terms of actual performance, but it looks expensive. So, it's best to keep to a few dates I think. It's difficult to estimate the size of the output since LADs are very irregular, but it looks like it would be around 0.5 Gb per LAD, although we don't need to keep everything and the files shrink massively when compressed due to many repeated rows.

RuthBowyer commented 1 year ago

I've been advised to use 2020 codes, attached. That's a total of 368 including Scotland. [new_lad_list.csv] (https://github.com/alan-turing-institute/dymechh/files/9931061/new_lad_list.csv)

Thanks for sharing this Hadrien! Just to check, does this mean Northern Ireland won't be in there (could not see any NI domains but I know LAD level is a bit different there) ?

HSalat commented 1 year ago

We don't have any plans for NI because all the data is different unfortunately (SPENSER and QUANT are GB only as a matter of fact).

crangelsmith commented 1 year ago

The azure batch pipeline is being developed here: https://github.com/alan-turing-institute/spc-hpc-pipeline

For now is still very much WIP, I'll update you when we have a stable version.

alan-turing-institute / dymechh

DyME/SPC + REG chat about HPC access #26