NCAR / wrfcloud

WRF Cloud Framework
Apache License 2.0

Add logic for parallel job submission through slurm #75

Closed mkavulich closed 2 years ago

mkavulich commented 2 years ago

This PR adds the "wrfcores" config option. If set to 1 (the default), the previous serial-run behavior is maintained. If >1, the real.exe and WRF tasks are submitted to the slurm queue as a parallel job using wrfcores cores. If >96, the run fails with an error message, because the current default instance has only 96 cores available.
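The core-count rules described above can be sketched as a small validation step. This is a minimal illustration only, not the code in this PR: the function name `choose_run_mode` and the constant `MAX_CORES` are hypothetical, and rejecting values below 1 is inferred from the 0-core test reported later in this thread.

```python
# Hypothetical sketch of the wrfcores rules; not the wrfcloud implementation.
MAX_CORES = 96  # cores on the current default instance, per the PR description


def choose_run_mode(wrfcores: int, max_cores: int = MAX_CORES) -> str:
    """Return 'serial' or 'parallel' for a requested core count.

    Raises ValueError when the request is not a usable core count.
    """
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(wrfcores, bool) or not isinstance(wrfcores, int):
        raise ValueError(f"wrfcores must be an integer, got {wrfcores!r}")
    if wrfcores < 1:
        raise ValueError(f"Invalid number of cores: {wrfcores}")
    if wrfcores > max_cores:
        raise ValueError(
            f"Requested {wrfcores} cores, but only {max_cores} are available"
        )
    # 1 core keeps the original serial behavior; anything larger goes to slurm.
    return "serial" if wrfcores == 1 else "parallel"


if __name__ == "__main__":
    for requested in (1, 36, 96):
        print(requested, choose_run_mode(requested))
```

Failing before any job is submitted keeps an impossible request from sitting in the slurm queue.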

Resolves #74

Pull Request Testing

Ran tests with 1, 36, and 96 cores. 1 core defaulted to original (serial) behavior as expected (fails at real.exe due to lack of memory). 36 and 96 core tests worked as anticipated, with 96 cores running at ~20x realtime (0.9 wallclock seconds per 20s simulated time step). Attempted to run with 100 cores, and received the appropriate error message.
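As a quick check of the "~20x realtime" figure, the ratio follows directly from the two numbers quoted above (a 20 s model time step and 0.9 s of wallclock per step). The helper names below are illustrative only.

```python
def realtime_ratio(timestep_s: float, wallclock_per_step_s: float) -> float:
    """Simulated seconds advanced per second of wallclock time."""
    return timestep_s / wallclock_per_step_s


def wallclock_for(simulated_s: float, timestep_s: float,
                  wallclock_per_step_s: float) -> float:
    """Wallclock seconds needed to integrate simulated_s seconds of model time."""
    steps = simulated_s / timestep_s
    return steps * wallclock_per_step_s


if __name__ == "__main__":
    ratio = realtime_ratio(20.0, 0.9)
    print(f"{ratio:.1f}x realtime")  # prints "22.2x realtime"
    # One simulated hour at this rate, integration only:
    print(f"{wallclock_for(3600.0, 20.0, 0.9):.0f} s per simulated hour")
```

This covers time stepping only; it says nothing about real.exe, I/O, or queue wait.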

Test procedure

wrfcloud-cluster create
wrfcloud-cluster connect
git clone https://github.com/NCAR/wrfcloud -b feature-74/add_slurm_submission
cd wrfcloud/python/src/
pip3 install --user .
mkdir -p /data/input_data/gfs/2022060100/
cd /data/input_data/gfs/2022060100/
aws s3 sync s3://wrfcloud-xfer-tmp/ .
mkdir -p configurations/test; mv geo_em.d01.nc configurations/test
cp ~/wrfcloud/python/src/wrfcloud/runtime/test.yml .

Edit test.yml to the desired settings, especially testing the wrfcores value mentioned above.

vi test.yml
cp ~/wrfcloud/python/src/wrfcloud/runtime/configurations/test/namelist.* configurations/test/
unset I_MPI_OFI_PROVIDER
wrfcloud-run

See above instructions for running your own tests. As I mentioned, the existing commands result in an out-of-memory condition.

michelleharrold commented 2 years ago

I just finished testing. I ran 5 tests:

96 cores: Ran successfully in 18 mins 26 s (end-to-end)
64 cores: Ran successfully in 19 mins 55 s (end-to-end)
16 cores: Ran successfully in 51 mins 37 s (end-to-end)
1 core: Real failed (side note: did not error out at real; moved on to wrf and failed there when it didn't have the real outputs)
0 cores: Failed with appropriate message that invalid number of cores was provided.
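For a rough sense of how these timings scale, the three successful runs can be compared against the 16-core run. This is a back-of-the-envelope sketch: the function names are illustrative, and because the times are end-to-end they presumably include steps that do not scale with core count, so the efficiencies understate how well WRF itself scales.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a 'M mins S s' timing to seconds."""
    return 60 * minutes + seconds


def scaling(runs: dict, baseline_cores: int) -> dict:
    """Map cores -> (speedup, parallel efficiency) relative to the baseline run."""
    base_time = runs[baseline_cores]
    result = {}
    for cores, elapsed in sorted(runs.items()):
        speedup = base_time / elapsed
        ideal = cores / baseline_cores  # perfect linear scaling
        result[cores] = (speedup, speedup / ideal)
    return result


# End-to-end timings reported in this comment.
RUNS = {
    16: to_seconds(51, 37),
    64: to_seconds(19, 55),
    96: to_seconds(18, 26),
}

if __name__ == "__main__":
    for cores, (speedup, eff) in scaling(RUNS, baseline_cores=16).items():
        print(f"{cores:>3} cores: speedup {speedup:.2f}x, efficiency {eff:.0%}")
```

By this measure, going from 64 to 96 cores shortens the end-to-end run by 89 seconds.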

mkavulich commented 2 years ago

One thing I did notice is that the wrfcloud-run command will only work when executed in the directory where the test.yml is found. Not a huge deal for R&D testing but wanted to mention it in case that wasn't expected behavior.

@fossell That is expected behavior, and should be a prerequisite; otherwise there's no way to indicate where to look for the file! I suppose we could put this in the environment variables yaml, but I think it's best to leave that as a prerequisite for now.