Unidata / netcdf-fortran

Official GitHub repository for netCDF-Fortran libraries, which depend on the netCDF C library. Install the netCDF C library first.

add slurm files to run parallel I/O tests (and other batch files?) #258

Open edwardhartnett opened 4 years ago

edwardhartnett commented 4 years ago

As we've just encountered on some NOAA HPC systems, users are not allowed to run MPI tasks on the HPC system login nodes. That means parallel I/O jobs cannot be launched directly with mpiexec; they must be submitted through a job scheduler.

Slurm is one such scheduler and is becoming very popular. By providing a Slurm file, we let users like NOAA run the parallel I/O tests to confirm that netcdf-fortran is functioning properly.

This will require one additional file for each batch of tests. For example, we have nf03_test4/run_f90_par_tests.sh, so we will add nf03_test4/slurm_run_f90_par_tests.sh. The user will have to submit the test from their command line, because we don't know which user account the time should be charged to.

One fly in the ointment: the build includes --enable-parallel-tests, which builds the parallel tests and runs them with mpiexec. Now I need to add an --enable-slurm option, which will cause the tests to be built but not run by make check. They must instead be run by the user submitting the Slurm file.

So this will add two files to the build, one in nf03_test4 and one in nf_test4.
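To make the idea concrete, here is a minimal sketch of what such a Slurm file could contain, written as a small Python helper that renders the batch script text. Nothing here is taken from the netcdf-fortran build: the function name, the job name, the task count, and the wall time are all illustrative assumptions, and the account has to come from the user, as noted above. It also assumes the script is submitted from the test directory and that the launcher used by the existing test script works inside a Slurm allocation, which is site dependent.

```python
from textwrap import dedent


def render_slurm_script(account, ntasks=4, walltime="00:10:00",
                        test_script="./run_f90_par_tests.sh"):
    """Return the text of a Slurm batch script that runs one test script.

    All defaults are illustrative assumptions, not values used by the
    netcdf-fortran build. `account` must be supplied by the user because
    the project cannot know which allocation to charge.
    """
    return dedent(f"""\
        #!/bin/bash
        #SBATCH --job-name=nf_par_tests
        #SBATCH --account={account}
        #SBATCH --ntasks={ntasks}
        #SBATCH --time={walltime}

        # Run the existing parallel test driver inside the allocation.
        {test_script}
        """)


if __name__ == "__main__":
    # "my_account" is a placeholder; substitute a real allocation name.
    print(render_slurm_script("my_account"))
```

The rendered text could be saved to a file and submitted with sbatch from the directory holding the test script.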

edwardhartnett commented 4 years ago

Message from George:

When testing the parallel netcdf API, we need a test that runs in a batch environment using the MPI launcher in that environment which can be mpirun, srun, mpiexec.lsf, or something else. Can this be done with make test or make check?

Interactive MPI is not available on the NOAA WCOSS platforms, and if it were, it might behave very differently (for example, mpirun working through a node memory API whereas aprun or mpirun on multiple nodes doesn't). So we need a batch make check as a test.

edwardhartnett commented 4 years ago

See also #262.

edwardhartnett commented 2 years ago

@GeorgeVandenberghe-NOAA do you have slurm files that would run netcdf-fortran tests?

GeorgeVandenberghe-NOAA commented 2 years ago

No, but it's trivial to set up.

edwardhartnett commented 2 years ago

If you can do that I can integrate it with netcdf... I need the slurm files which run the tests of nf_test and nf_test4.

GeorgeVandenberghe-NOAA commented 2 years ago

Which release should I start with? Also, there are other schedulers besides Slurm, which makes this a little more complicated: we either have to autodetect the scheduler or (the more plausible approach) run the whole test inside an externally submitted job and autodetect the MPI launcher.

On WCOSS2 we use PBS Pro as the scheduler. The older RDHPCS systems used MOAB/Torque.
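As a rough illustration of the scheduler autodetection option, the sketch below checks whether a process is already running inside a batch job by looking at the job ID variables that the schedulers export. The function and table names are hypothetical. The variable names are the ones Slurm (SLURM_JOB_ID), PBS Pro and Torque (PBS_JOBID), and LSF (LSB_JOBID) set for a running job; other schedulers are not covered.

```python
import os

# (scheduler label, environment variable set inside a running job)
SCHEDULER_ENV_VARS = [
    ("slurm", "SLURM_JOB_ID"),
    ("pbs", "PBS_JOBID"),    # set by both PBS Pro and Torque
    ("lsf", "LSB_JOBID"),
]


def detect_scheduler(env=None):
    """Return a label for the batch scheduler this process runs under.

    Returns None when no known job ID variable is set, which usually
    means the process is on a login node or an unsupported scheduler.
    """
    env = os.environ if env is None else env
    for name, var in SCHEDULER_ENV_VARS:
        if env.get(var):
            return name
    return None


if __name__ == "__main__":
    print(detect_scheduler())
```

A test driver could use a None result as the signal to skip the parallel tests rather than attempt an interactive MPI launch.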

GeorgeVandenberghe-NOAA commented 2 years ago

I thought I already sent this but it got lost. Which release should I start with?

edwardhartnett commented 2 years ago

The most recent.

GeorgeVandenberghe-NOAA commented 2 years ago

The form of an interactive parallel command in a serial script depends on the platform and on administrator preference. On some systems it is not possible at all. I've asked the RDHPCS support people for the srun options needed inside a serial script to start a single parallel command. This is not available at all on WCOSS.

We also need to know the MPI launcher, whether it's aprun, mpiexec, mpirun, or something else. It looks like netcdf defaults to mpiexec and there is a configure option to change it. It's possible to get autoconf or even cmake to detect this and make the change automatically; otherwise we will need autodetect capability so we can specify it in the configure option you have already supplied in the netcdf-c and netcdf-fortran distributions.
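Launcher autodetection could be sketched roughly as follows: search PATH for a list of known launcher names and return the first one found. The function name and the preference order are assumptions made for illustration, not anything netcdf does today. Finding a launcher on PATH also does not prove it is usable on the current node, so the result is only a candidate to pass to configure.

```python
import shutil

# Launchers mentioned in this thread; the order is an arbitrary assumption.
CANDIDATE_LAUNCHERS = ("srun", "aprun", "mpiexec", "mpirun")


def detect_mpi_launcher(candidates=CANDIDATE_LAUNCHERS, which=shutil.which):
    """Return the first launcher name found on PATH, or None.

    `which` is injectable so the lookup can be exercised without any
    MPI installation present.
    """
    for name in candidates:
        if which(name):
            return name
    return None


if __name__ == "__main__":
    print(detect_mpi_launcher())
```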

An alternative that is tractable on all systems I have worked on so far is to do the entire ten-minute build in a batch job. make and make check (with the MPI launcher supplied) then work out of the box with the scripts you have. Setting up the batch job is the user's problem, and it depends on what the admins have supplied. Systems I've worked on in the past ten years have used IBM's LSF and LoadLeveler, Torque/MOAB, Cray/ALPS, PBS Pro, and Slurm. The last seems to be spreading and becoming dominant, which is a good standardization.

At worst I can supply a template of batch jobs for the various systems. I'm still hoping to improve on that.
