NOAA-EMC / EMC_verif-global

Global Forecast System (GFS) verification package using MET and METplus

Running as part of global-workflow fails in exgrid2grid_step2.sh with srun: no record for task id 1 #133

Closed. DWesl closed this issue 2 weeks ago.

DWesl commented 2 weeks ago

Running a C768 case as part of global-workflow produces a job specification with nodes=1, ppn=4, and tpp=1. Running with ush/run_verif_global_in_global_workflow.sh produces a job with nproc=${npe_node_metp_gfs}=1. On HERA, scripts/exgrid2grid_step1.sh launches the METplus job with srun --multi-prog /path/to/task-file, where task-file contains nproc lines, each giving one command to execute. srun then fails because the task file defines fewer tasks than srun tries to launch; I think it is defaulting to four tasks, matching ppn=4, while the file defines only one.
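The mismatch described above can be checked before launching. The following is a minimal sketch, not code from verif-global: `count_tasks` is a hypothetical helper, the task file is a stand-in with a single entry, and it assumes one task per line (a real --multi-prog file may also use rank ranges, which this does not handle). Outside a Slurm job, `SLURM_NTASKS` is unset, so the sketch falls back to 4 to mirror ppn=4.

```shell
# Hypothetical helper: count task entries in a --multi-prog file.
# Assumes one task per line ("<task id> <command>"); blank lines and
# lines starting with '#' are skipped.
count_tasks() {
    grep -cvE '^[[:space:]]*(#|$)' "$1"
}

# Stand-in task file with a single entry, as in the nproc=1 case.
task_file=$(mktemp)
printf '0 echo "verification task 0"\n' > "${task_file}"

nproc=$(count_tasks "${task_file}")
# Assumption: fall back to 4 (ppn=4) when not running inside a Slurm job.
allocated=${SLURM_NTASKS:-4}

if [ "${nproc}" -ne "${allocated}" ]; then
    echo "mismatch: task file defines ${nproc} task(s), srun would launch ${allocated}"
fi

rm -f "${task_file}"
```

If the two counts differ, srun has no entry for the extra task ids, which would be consistent with the "no record for task id 1" message in the title.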

Changing scripts/exgrid2grid_step1.sh to pass --ntasks ${nproc} to srun allows the job to finish. A better solution probably involves changing how ush/run_verif_global_in_global_workflow.sh determines nproc. man sbatch suggests SLURM_NTASKS, but global-workflow probably has a variable for the number of tasks that is less closely tied to the job manager.
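For illustration, the workaround amounts to something like the sketch below. It is not the actual line from scripts/exgrid2grid_step1.sh: the variable `launch_cmd` is hypothetical, the task-file path is the placeholder used above, and the command is only printed, not executed.

```shell
# Values as in the failing case: one task, placeholder task-file path.
nproc=1
task_file=/path/to/task-file

# Passing --ntasks explicitly keeps srun from inheriting the job's
# larger task count (4 in this case).
launch_cmd="srun --ntasks ${nproc} --multi-prog ${task_file}"

# Print the command instead of running it; this is only a sketch.
echo "${launch_cmd}"
```

With --ntasks equal to the number of entries in the task file, every task id that srun launches has a matching line.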

DavidHuber-NOAA commented 2 weeks ago

@DWesl This was recently fixed in global-workflow as part of an overhaul of the resource configuration system. The job now runs with a single task by default. See https://github.com/NOAA-EMC/global-workflow/pull/2804 and let me know if updating your global-workflow resolves the issue.

DWesl commented 2 weeks ago

The new setting in verif-global (I should have checked this earlier) references nproc and defaults to one: https://github.com/NOAA-EMC/EMC_verif-global/blob/92904d2c431969345968f74e676717057ec0042a/ush/run_verif_global_in_global_workflow.sh#L277-L279. In addition, global-workflow sets nproc in config.metp, which should solve the problem more generally.