LLNL / merlin

Machine Learning for HPC Workflows
MIT License

[BUG] slurm_par example only spins up 1 node instead of 2 #429

Open bgunnar5 opened 1 year ago

bgunnar5 commented 1 year ago

Bug Report

Description When running the slurm_par example, the runs step fails for every sample because of a Slurm allocation issue. Each generated runs.slurm.err file contains the following error: srun: error: Only allocated 1 nodes asked for 2.

To Reproduce Steps to reproduce the behavior:

  1. Pull the slurm_par example with merlin example slurm_par
  2. cd into the slurm/ directory
  3. Queue the tasks with merlin run slurm_par.yaml
  4. Run the workers with merlin run-workers slurm_par.yaml
  5. When the run has finished, look at runs/00/runs.slurm.err in the output directory to see the error
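
Step 5 checks a single sample by hand. To see whether every sample hit the same problem, a small script can scan the whole output directory. This is only a sketch: the function name find_allocation_errors is made up for illustration, and the regular expression assumes the srun message is worded exactly as quoted in the description above.

```python
import re
import sys
from pathlib import Path

# Matches the srun message quoted in this report (wording assumed unchanged).
ALLOC_ERROR = re.compile(r"Only allocated (\d+) nodes asked for (\d+)")


def find_allocation_errors(output_dir):
    """Return (path, allocated, requested) for each runs.slurm.err that
    contains the allocation error. Hypothetical helper, not part of merlin."""
    hits = []
    for err_file in sorted(Path(output_dir).rglob("runs.slurm.err")):
        text = err_file.read_text(errors="replace")
        match = ALLOC_ERROR.search(text)
        if match:
            hits.append((err_file, int(match.group(1)), int(match.group(2))))
    return hits


if __name__ == "__main__":
    # Usage: python scan_errors.py <output directory>
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for path, allocated, requested in find_allocation_errors(root):
        print(f"{path}: allocated {allocated}, requested {requested}")
```

Pointing it at the study's output directory lists each affected sample together with the node counts that srun reported.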

Expected behavior Slurm should allocate two nodes for the runs step.
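
One way to confirm how many nodes a job actually received is to read the node count that Slurm exports into the job environment. The sketch below assumes the code runs inside a Slurm allocation where SLURM_JOB_NUM_NODES (or the older SLURM_NNODES) is set. The helper names are hypothetical and this is not how merlin itself checks allocations.

```python
import os


def allocated_nodes(env=None):
    """Return the node count Slurm reports for the current job, or None
    when no Slurm variable is present (e.g. outside an allocation)."""
    env = os.environ if env is None else env
    for name in ("SLURM_JOB_NUM_NODES", "SLURM_NNODES"):
        value = env.get(name)
        if value and value.isdigit():
            return int(value)
    return None


def has_enough_nodes(required, env=None):
    """True only if the allocation is known and has at least `required` nodes."""
    count = allocated_nodes(env)
    return count is not None and count >= required


if __name__ == "__main__":
    # The step in this report expects two nodes.
    print("allocated nodes:", allocated_nodes())
    print("enough for the step:", has_enough_nodes(2))
```

Running this from inside the worker allocation shows whether the allocation itself is one node short of the two that the step asks for.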


Additional context Bug found by Casey Lamarche