ecmwf / ecbuild

A CMake-based build system, consisting of a collection of CMake macros and functions that ease the managing of software build systems
https://ecbuild.readthedocs.io
Apache License 2.0
26 stars 25 forks source link

how to add more srun command parameters in ecbuild_add_test #61

Closed TingLei-NOAA closed 6 months ago

TingLei-NOAA commented 6 months ago

Is your feature request related to a problem? Please describe.

A user ctest is to be run with a fixed mpi task numbers (say, -n 36). however in another machine, this job has to be run on two nodes to avoid memory related failure.

Describe the solution you'd like

The easiest way for me is to add srun comand parameters as " -N 2 --ntasks-per-node=18" in ecbuild_add_test. But digging in ecbuild_add_test.cmake and doing "try and error" , I failed to find a way to do so with ecbuild_add_test

Describe alternatives you've considered

So, I am wondering if ecbuild experts could clarify on this. Did I miss any existing ecbuild functions to do this? Thanks.

Additional context

No response

Organisation

LYNKER--EMC/NCEP/NOAA

wdeconinck commented 6 months ago

Probably the better approach is to run all the tests on a single allocation rather than have each test go on the queue. With salloc (or sbatch) you could then make sure you have enough nodes and limit the number of tasks per node.

salloc -q <queue> -N 2 --ntasks-per-node=18 ctest

In case ctest now calls e.g. srun -n 36 hostname I can see that the output is 18 tasks each on their own node.

TingLei-NOAA commented 6 months ago

@wdeconinck Yes. Your suggested method does work! That could be used as a work-around, at least, for being now. The problem is that this ctest could be run among other ctests which don't need those extra nodes. So, a test-specific control is desired. If to add processing for such extra options in the ecbuild is not feasible or not regarded as necessary, I can replace the current executable with a wrapping script in which various manipulation could be done.

wdeconinck commented 6 months ago

ecbuild_add_test goes through this logic: https://github.com/ecmwf/ecbuild/blob/develop/cmake/ecbuild_add_test.cmake#L419C1-L432C85 MPI_ARGS could be user-defined, and was intended for e.g. openmpi's "--oversubscribe" flag.

It would be up to you to set the to -N 2 --ntasks-per-node just for that test, and only in the case of slurm and make sure it is set to original value after ecbuild_add_test.

Note I did not test following, but conceptually:


if (MPI_ARGS)
  set(RESTORE_MPI_ARGS ${MPI_ARGS})
endif()
if ( <slurm detected> )
    set( MPI_ARGS "${MPI_ARGS} -N 2 --ntasks-per-node=18")
endif()
ecbuild_add_test( ... MPI 36 ... )
if( RESTORE_MPI_ARGS )
  set(MPI_ARGS ${RESTORE_MPI_ARGS})
endif()
TingLei-NOAA commented 6 months ago

@wdeconinck , Did you mean I need to do changes (you suggested) to ecbuild_add_test to use MPI_ARGS ( and, yes, just to use MPA_ARGS as one argument didn't work in my tries )?

wdeconinck commented 6 months ago

@TingLei-NOAA no I meant it like in the code snippet above, to set MPI_ARGS as a normal CMake variable. It gets picked up by ecbuild_add_test. Its original intention is as a global variable via cmake configuration "-DMPI_ARGS=..." but then you're in the same undesired situation where it gets applied to each MPI test. That's why I suggested to do the "RESTORE_MPI_ARGS ... " dance.

TingLei-NOAA commented 6 months ago

@wdeconinck Sure. Changing of MPI_ARG before and after the ecbuild_add_test seems a solution. I will try it and come back with an update here.
So many thanks!

TingLei-NOAA commented 6 months ago

The changes suggested by @wdeconinck does work. Thank you @wdeconinck for your help! I would close this issue!