DOI-USGS / lake-temperature-process-models

Creative Commons Zero v1.0 Universal

Bundle simulations when scaling up to Midwest #41

Closed — hcorson-dosch-usgs closed this issue 2 years ago

hcorson-dosch-usgs commented 2 years ago

When we scale up to the whole footprint, we should consider moving away from our current targets workflow, where we have a unique targets branch for each model simulation (18 branches per lake), toward an approach where we bundle multiple simulations into a single branch.

Jordan notes:

might want to think about bundling in the future, as the whole footprint (~15K lakes) may create so many branches that targets could struggle — ~270K of them. I'm not sure if that is a real limitation or not, but remake's backend really started to slow when we got into very high target counts.

In past runs at scale, we bundled 40 sims together per job/task. But jobs and tasks are now handled very differently compared with the simple SLURM arrays we were using then.

I asked Jake ("Zwart, Jacob A — didn't you run into some limits with the # of branches?") and he weighed in:

Yes. We originally tried running 584,000 branches, but that was too much for targets to handle — 11,000 branches has worked, and we're currently running 7,305 branches for multiple targets and it seems to run OK. We haven't tried to find the limit on the number of branches, and the targets developers generally encourage "batching" (also see the reply here). If running a bunch of branches per target, we also found it helpful to request fewer workers than the cores you requested, so that multiple cores can help with the targets overhead (see here). We haven't optimized how many cores are needed to handle that overhead, but we've assigned 60–70% of cores to workers and the rest to deal with targets overhead when running 7,000+ branches.
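The back-of-envelope arithmetic from the numbers above makes the case for batching concrete (the batch size of 40 is the one used in past runs; all other figures come from this thread):

```r
# Sketch: why batching keeps the branch count in targets' comfortable range.
n_lakes       <- 15000   # whole-footprint estimate from Jordan's note
sims_per_lake <- 18      # current: one branch per simulation
total_sims    <- n_lakes * sims_per_lake   # 270,000 unbatched branches

batch_size <- 40         # sims bundled per job in past runs at scale
ceiling(total_sims / batch_size)           # ~6,750 branches, close to the
                                           # 7,305 Jake is running OK today
```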

hcorson-dosch-usgs commented 2 years ago

I'm currently using Jake's grouping approach for the GCM runs -- grouping the 18 runs for each lake into a single branch -- unmerged fork here.

For reference — Jake's setup: set a group length, create another column that assigns each row to a target group based on that group length, and use the grouped table when running another target.
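A minimal sketch of that grouping pattern as I understand it — the target names (`p1_sim_table`, `p2_sim_groups`, `p2_glm_results`), the helper `run_glm_sims()`, and the group length are illustrative assumptions, not Jake's actual code:

```r
# _targets.R (sketch) — batch simulations so each dynamic branch runs a group
# of sims rather than one, using targets' built-in group iteration.
library(targets)
tar_option_set(packages = "dplyr")

group_length <- 18  # hypothetical: bundle one lake's 18 sims into one branch

list(
  tar_target(
    p2_sim_groups,
    p1_sim_table |>                                        # hypothetical sim table
      dplyr::mutate(sim_group = (dplyr::row_number() - 1) %/% group_length) |>
      dplyr::group_by(sim_group) |>
      tar_group(),              # adds the tar_group column targets iterates over
    iteration = "group"
  ),
  tar_target(
    p2_glm_results,
    run_glm_sims(p2_sim_groups),  # hypothetical fn: runs every sim in the batch
    pattern = map(p2_sim_groups)  # one branch per group, not per simulation
  )
)
```

The key pieces are `iteration = "group"` on the grouped table and `tar_group()` at the end of its command; downstream `map()` then receives each group's rows as a single data frame.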

See also how Jake set OpenMP environment variables in the pipeline here. With this approach, you still need to specify the number of workers in your tar_make_clustermq() call. Jake said he usually sets that roughly equal to the p3_openmp_threads target (~40–50). If submitting the targets command in a SLURM script, he noted you could do the calculation in the SLURM script instead.
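That worker bookkeeping could be sketched as below — the 65% fraction follows Jake's 60–70% guidance, while the environment-variable names and the single-thread OpenMP setting are assumptions for illustration:

```r
# Sketch: derive the clustermq worker count from the SLURM allocation, keeping
# ~35% of cores free to absorb targets' own overhead (per Jake's 60-70% note).
cores   <- as.integer(Sys.getenv("SLURM_CPUS_ON_NODE", unset = "64"))
workers <- max(1L, floor(cores * 0.65))

# Assumed setting: cap GLM's OpenMP threading so workers don't oversubscribe
# the node (the actual pipeline sets OpenMP env vars via a target instead).
Sys.setenv(OMP_NUM_THREADS = "1")

targets::tar_make_clustermq(workers = workers)
```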