FriesischScott / UncertaintyQuantification.jl

Uncertainty Quantification in Julia
MIT License

Better HPC interfacing #153

Closed AnderGray closed 7 months ago

AnderGray commented 8 months ago

Julia's built-in parallelism isn't well suited to launching complicated workflows / models on clusters. It would be useful to have something that interfaces better with the scheduler.

Example:

@FriesischScott has suggested a nice trick, where the solver calls sbatch and waits for the job to complete:

OpenSees = Solver(
    "sbatch",
    "launch_sim.sh";
    args="--wait",
)

with launch_sim.sh containing everything needed to run my workflow.
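For reference, `launch_sim.sh` in this trick is just an ordinary Slurm batch script; a minimal hypothetical sketch (the resource values, job name, and solver invocation are placeholders, not from the original issue) could look like:

```shell
#!/bin/bash
#SBATCH --job-name=uq-sim
#SBATCH --ntasks=1
#SBATCH --time=01:00:00

# Run the actual model. The "--wait" flag passed to sbatch above
# makes the Solver block until this job has completed.
srun OpenSees model.tcl
```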

Nice, but it has some problems:

Slurm job arrays

Slurm's job arrays are a nice way to manage many similar jobs, i.e. jobs that differ only by an index (e.g. sample-N). They also let you preallocate the total amount of resources needed, limit how many jobs run concurrently, etc.
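As a hypothetical illustration of the job-array idea (the script name, sample directory layout, and limits are made up for this sketch): each array task picks its own sample from `SLURM_ARRAY_TASK_ID`, and the `%10` suffix caps how many tasks run concurrently:

```shell
#!/bin/bash
#SBATCH --job-name=uq-samples
#SBATCH --array=1-100%10   # 100 samples, at most 10 running at once
#SBATCH --ntasks=1
#SBATCH --time=00:30:00

# Each array task runs exactly one sample, identified by its index.
cd "sample-${SLURM_ARRAY_TASK_ID}"
srun ./run_model.sh
```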

It would require a bit of engineering, but perhaps an interface of some sort could be written ... i.e. when pmap is called in the External model, the input files are created (directories created, files copied, values interpolated), and a Slurm array is submitted which loops through the individual jobs.
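A rough Julia sketch of that idea (function names, the `{{x}}` template placeholder, and the batch script name are all hypothetical, not part of the UncertaintyQuantification.jl API): prepare one directory per sample, then submit a single blocking job array instead of calling pmap:

```julia
# Hypothetical sketch: build one directory per sample, then submit a
# single Slurm job array that covers all of them and wait for it.
function evaluate_with_array(samples::AbstractVector, template::String)
    for (i, s) in enumerate(samples)
        dir = "sample-$i"
        mkpath(dir)
        # Interpolate the sampled value into a copy of the input file.
        input = replace(read(template, String), "{{x}}" => string(s))
        write(joinpath(dir, "input.tcl"), input)
    end
    n = length(samples)
    # "--wait" blocks until every array task has finished.
    run(`sbatch --wait --array=1-$n run_sample.sh`)
end
```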

Maestro workflow manager

Could also be something to look into: https://github.com/LLNL/maestrowf?tab=readme-ov-file https://maestrowf.readthedocs.io/en/latest/index.html

Provides a YAML format and a command-line tool for performing parameter studies, and can work with Slurm.
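A Maestro study spec has roughly this shape (adapted loosely from the Maestro documentation's examples; the study name, step command, and parameter values here are illustrative and untested): a step's `cmd` is expanded once per value of each `global.parameters` entry, and a `batch` block routes execution to Slurm:

```yaml
description:
    name: sample_study
    description: Run the model once per sample.

batch:
    type: slurm

study:
    - name: run-model
      description: Run a single sample.
      run:
          cmd: |
              ./run_model.sh $(SAMPLE)
          nodes: 1
          procs: 1
          walltime: "00:30:00"

global.parameters:
    SAMPLE:
        values: [1, 2, 3, 4]
        label: SAMPLE.%%
```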

Scheduling studies: https://maestrowf.readthedocs.io/en/latest/Maestro/scheduling.html#flux

FriesischScott commented 8 months ago

If I understand correctly, the whole point is that instead of having one job that runs all samples in parallel, we submit one job for each sample, which is then parallelized itself, correct?