Assigning extra resources for one target is not trivial in clustermq.
I can either run the entire pipeline with 100GB per node (wasteful) or run it in stages, with some stages requesting more memory.
See https://github.com/ropensci/targets/issues/198#issuecomment-712333764 for an example of re-running the pipeline with job-specific resources.
I can use tidyselect statements, so perhaps I can do something like this:
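(A rough sketch only: the target names, the `heavy_` prefix, and the worker counts below are placeholders, and it assumes the clustermq backend is already configured.)

```r
library(targets)

# Re-run only the memory-hungry targets, selected with tidyselect helpers.
# The target names and the "heavy_" prefix are made up for illustration.
tar_make_clustermq(names = any_of(c("big_model", "big_summary")), workers = 2)

# Or select by a naming convention:
tar_make_clustermq(names = starts_with("heavy_"), workers = 2)
```

Worker memory would still have to be raised for such a run through the clustermq template; see the submission-script sketch further down.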
A different approach is to just list the target names in the submission script. The list has to be maintained by hand, but the submission script is a good place for it, given that worker resource requirements are a function of the machine we run on, not of the pipeline itself.
```r
library(targets)

# Pull every stem and dynamic pattern name from the pipeline metadata.
xx <- tar_meta(targets_only = TRUE)
xx[xx$type %in% c("stem", "pattern"), "name"]$name
```
This is useful for getting all the target names in the pipeline, assuming the _targets folder is up to date.
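Building on that, here is a sketch of what the submission-script approach might look like. It assumes a SLURM scheduler, a clustermq template file with a `{{ memory }}` placeholder, and clustermq's `clustermq.defaults` option for filling template values; the target names, memory values (MB), and worker counts are placeholders.

```r
# submission_script.R (sketch only)
library(targets)

options(
  clustermq.scheduler = "slurm",
  clustermq.template  = "slurm_clustermq.tmpl"  # assumed template file
)

# Stages maintained by hand: which targets to run and how much memory (MB)
# each worker should request. Names and values are hypothetical.
stages <- list(
  list(names = c("big_model", "big_summary"), memory = 102400, workers = 2),
  list(names = NULL,                          memory = 20480,  workers = 30)
)

for (stage in stages) {
  # Fill the {{ memory }} field of the template for this batch of workers.
  options(clustermq.defaults = list(memory = stage$memory))
  if (is.null(stage$names)) {
    tar_make_clustermq(workers = stage$workers)  # run everything still outdated
  } else {
    tar_make_clustermq(names = any_of(stage$names), workers = stage$workers)
  }
}
```

Keeping the stage list at the top of the script keeps the machine-specific resource knowledge out of _targets.R, which matches the point above about resources being a property of the machine rather than the pipeline.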
Addressed by f6deec2, but not tested yet.
Closing
The issue of some targets needing different resources is handled in a few ways:
Batching is now implemented in key targets that use very large amounts of memory.
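For reference, a minimal sketch of that batching pattern using dynamic branching; `process_one_batch()`, `combine_batches()`, the input path, and the batch count are all hypothetical stand-ins for the real pipeline code.

```r
library(targets)

# _targets.R (sketch): split the heavy work into batches so that each dynamic
# branch only holds a slice of the data in memory at once.
list(
  tar_target(batch_index, seq_len(20)),  # 20 batches, purely illustrative
  tar_target(
    survey_batch,
    process_one_batch("data/big_input.rds", batch_index),  # hypothetical helper
    pattern = map(batch_index)  # one branch (and one worker task) per batch
  ),
  tar_target(survey_summary, combine_batches(survey_batch))  # hypothetical summary over the combined branches
)
```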
The amount of resources needed by each target varies.
The minimum memory I can request on Getafix is 16GB; any less and the scheduler puts the job onto a node that does not have Singularity. I have been rounding up to 20GB just to be sure.
Time
All targets share the same walltime pool, including targets that end up being skipped because they are already up to date. No single target takes more than 24-48 hours, even on my slow laptop, so setting the walltime to around 3 days or more is fine. If I knew how many targets would actually run, I could estimate this more tightly and get through the scheduler queue faster, but asking for too much time is generally not a big problem. I have it set to 7 days and don't see a need to change that until I know exactly how long things should take.
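If the clustermq template file exposes a walltime placeholder (for example a line like `#SBATCH --time={{ walltime }}`; the field name and time format below are assumptions that have to match the actual template), the 7-day limit can be filled the same way as memory:

```r
# Assumed template field names; adjust to match the actual clustermq template.
options(clustermq.defaults = list(walltime = "7-00:00:00", memory = 20480))
```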
CPUs
Some targets contain code that could make use of multiple CPUs. For now, debugging nested parallelism (parallel code running inside already-parallel workers) is not worth the speedups I could gain in a few places, especially because the top-level parallelisation already provides a lot of speed-up potential: 20-30 times faster or more, depending on how many surveys are included.
A list of places that can use parallel processing, if I come back to it later (EDIT to keep up to date):
Memory
20GB is the minimum I request per target, but some targets have crashed even with 20GB.
A list of targets that need more than 20GB, and roughly how much they need: