COSIMA / cice5

Clone of The Los Alamos sea ice model (CICE) with ACCESS drivers. See https://github.com/CICE-Consortium/CICE-svn-trunk/tree/cice-5.1.2

Investigate using parallel IO #34

Closed: aekiss closed this issue 3 years ago

aekiss commented 5 years ago

It may be worth trying to compile with parallel IO using PIO (setenv IO_TYPE pio).

We currently compile CICE with serial IO (setenv IO_TYPE netcdf in bld/build.sh), so one CPU does all the IO and we end up with an Amdahl's law situation that limits the scalability with large core counts.

At 0.1 deg CICE is IO-bound when doing daily outputs (see Timer 12 in ice_diag.d), and the time spent in CICE IO accounts for almost all the time MOM waits for CICE (oasis_recv in access-om2.out) so the whole coupled model is waiting on one cpu. With daily CICE output at 0.1deg this is ~19% of the model runtime (it's only ~2% without daily CICE output). Lowering the compression level to 1 (https://github.com/COSIMA/cice5/issues/33) has helped (MOM wait was 23% with level 5), and omitting static field output (https://github.com/COSIMA/cice5/issues/32) would also help.
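
To spell out the Amdahl's law bound: if a fraction $s$ of the runtime is serial (here the single-CPU IO), the achievable speedup from adding cores is limited to

$$S(N) = \frac{1}{s + (1-s)/N} \le \frac{1}{s},$$

so with serial IO at roughly 19% of runtime the coupled model cannot be sped up by much more than a factor of 5, however many cores are added, until the IO itself is parallelised or reduced.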

Also I understand that PIO doesn't support compression - is that correct?

@russfiedler had these comments on Slack:

I have a feeling that the CICE parallel IO hadn't really been tested, or that there was some problem with it. We would have to update the netCDF versions used in CICE for a start: the distributors of PIO note that they need netCDF 4.6.1 and HDF5 1.10.4 or later for their latest version, as there's a bug in parallel collective IO in earlier HDF5 versions, but the NCI version of netCDF 4.6.1 is built with HDF5 1.10.2! Marshall noted above that Rui found a performance drop-off when moving from 1.10.2 to 1.10.4. Also, the gather is done on all the small tiles, so you have each PE sending a single horizontal slab to the root PE several times, once per level; the number of MPI calls is probably the main issue. It looks like there's an individual send/recv for each tile rather than either a bulk send of the tiles or something more funky using MPI_Gather(v) and MPI_Type_create_subarray.

Slack discussion: https://arccss.slack.com/archives/C9Q7Y1400/p1557272377089800
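
To illustrate the kind of restructuring suggested above, here is a toy sketch (not CICE code; it assumes one tile per PE laid side by side in x, with made-up sizes): each PE sends its whole tile once, and the root receives it directly into the correct slice of the global array with an MPI subarray datatype, rather than exchanging one message per horizontal slab.

program gather_tiles_sketch
  ! Illustrative only: shows a single bulk send per PE, received straight into
  ! the global array on the root via an MPI_Type_create_subarray datatype,
  ! instead of one send/recv per level.
  use mpi
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nx_block = 60, ny_block = 50, nlev = 5  ! made-up sizes
  integer :: ierr, rank, nprocs, p
  integer :: subarray_type, status(MPI_STATUS_SIZE)
  integer :: gsizes(3), lsizes(3), starts(3)
  real(dp), allocatable :: tile(:,:,:), global(:,:,:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  allocate(tile(nx_block, ny_block, nlev))
  tile = real(rank, dp)

  if (rank == 0) then
     allocate(global(nx_block*nprocs, ny_block, nlev))
     global(1:nx_block, :, :) = tile                    ! root's own tile
     do p = 1, nprocs - 1
        gsizes = (/ nx_block*nprocs, ny_block, nlev /)  ! global array shape
        lsizes = (/ nx_block,        ny_block, nlev /)  ! one tile
        starts = (/ p*nx_block,      0,        0    /)  ! zero-based offsets
        call MPI_Type_create_subarray(3, gsizes, lsizes, starts, &
             MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, subarray_type, ierr)
        call MPI_Type_commit(subarray_type, ierr)
        ! one receive delivers every level of PE p's tile at once
        call MPI_Recv(global, 1, subarray_type, p, 0, MPI_COMM_WORLD, status, ierr)
        call MPI_Type_free(subarray_type, ierr)
     end do
  else
     ! one bulk send of the whole tile instead of a send per horizontal slab
     call MPI_Send(tile, nx_block*ny_block*nlev, MPI_DOUBLE_PRECISION, &
          0, 0, MPI_COMM_WORLD, ierr)
  end if

  call MPI_Finalize(ierr)
end program gather_tiles_sketch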

nichannah commented 4 years ago

I've been looking at the CICE PIO code. It is not as complete as the serial netcdf code, for example it doesn't do proper error checking. The PIO code still exists and is documented in CICE6.

My next step is to see whether it can be built on raijin.

Another option, which may be better even if PIO works, is to take the MOM5 approach and have each PE output to its own file, followed by an offline collate. The advantage of this is that we could continue to use the existing netcdf code (with slight modifications). The downside is that we would need to write a collate program.

russfiedler commented 4 years ago

@nichannah Given the moves by Ed Hartnett to implement PIO in FMS, I think it would be best to go the PIO route to stay reasonably compatible with future FMS and CICEn.

nichannah commented 4 years ago

Steps to build PIO for CICE:

  1. Download and extract PIO:
wget https://github.com/NCAR/ParallelIO/releases/download/pio2_4_4/pio-2.4.4.tar.gz
tar zxvf pio-2.4.4.tar.gz
  2. Load the necessary modules:
module load intel-cc/2018.3.222
module load intel-fc/2018.3.222
module load netcdf/4.6.1p
module load openmpi/4.0.1

I also tried openmpi/1.10.2 but the build failed with link errors.

  3. Set environment variables:
export CPPFLAGS="-std=c99 -I${NETCDF}/include/ -I${PARALLEL_NETCDF_BASE}/include/"
export LDFLAGS="-L${NETCDF}/lib/ -L${PARALLEL_NETCDF_BASE}/lib/"
  4. Configure and make:
./configure --enable-fortran --prefix=/short/x77/nah599/access-om2/src/cice5/pio-2.4.4/usr
make
make install
nichannah commented 4 years ago

It looks like the CICE PIO code makes use of something called shr_pio_mod. I'm getting compile errors like:

ice_pio.f90(9): error #7002: Error in opening the compiled module file.  Check INCLUDE paths.   [SHR_SYS_MOD]
  use shr_sys_mod , only: shr_sys_flush
------^
ice_pio.f90(7): error #7002: Error in opening the compiled module file.  Check INCLUDE paths.   [SHR_KIND_MOD]
  use shr_kind_mod, only: r8 => shr_kind_r8, in=>shr_kind_in
------^
ice_pio.f90(47): error #7002: Error in opening the compiled module file.  Check INCLUDE paths.   [SHR_PIO_MOD]
   use shr_pio_mod, only: shr_pio_getiosys, shr_pio_getiotype
-------^

The code can be found here:

https://github.com/CESM-Development/cesm-git-experimental/tree/master/cesm/models/csm_share
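
The error messages only ask for a handful of symbols (shr_sys_flush, the shr_kind kind parameters, and the two shr_pio accessors). A minimal sketch of stand-in modules for the first two is below, as one way to drop the CESM dependency (an assumption, not code from CICE or csm_share); shr_pio_mod would instead need real PIO initialisation, which comes up later in this thread.

! Hypothetical stand-ins for the csm_share modules imported by ice_pio.f90.
module shr_kind_mod
  implicit none
  integer, parameter :: shr_kind_r8 = selected_real_kind(12)  ! 8-byte real
  integer, parameter :: shr_kind_in = kind(1)                 ! default integer
end module shr_kind_mod

module shr_sys_mod
  implicit none
contains
  subroutine shr_sys_flush(unit)
    integer, intent(in) :: unit
    flush(unit)   ! flush the given Fortran unit
  end subroutine shr_sys_flush
end module shr_sys_mod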

aekiss commented 4 years ago

Netcdf 4.7.1 is now installed on raijin on top of hdf5/1.10.5. The parallel version, 4.7.1p (and hdf5/1.10.5p), is built with openmpi/4.0.1.

nichannah commented 4 years ago

I followed my instructions above with the new versions, and the configure step hangs. This seems to be caused by:

[nah599@raijin5 pio-2.4.4]$ module load intel-cc/17.0.1.132
[nah599@raijin5 pio-2.4.4]$ /bin/bash ./config.guess

The following works:

[nah599@raijin5 pio-2.4.4]$ module load intel-cc
[nah599@raijin5 pio-2.4.4]$ /bin/bash ./config.guess

This is the hanging command:

/apps/intel-ct/2019.3.199/cc/bin/icc -E /short/x77/nah599/tmp/cgm21joj/dummy.c

For the time being I'm using old compiler versions to try to get things working.

nichannah commented 4 years ago

Current status: PIO is building, but I need to modify the CICE PIO support so that it works without CESM dependencies. The main difficulty is that the CICE PIO code assumes that initialisation has already been done somewhere else (perhaps as part of a coupled model), so proper PIO initialisation needs to be written.
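
As a rough sketch of what that standalone initialisation might look like (assuming PIO2's Fortran API and the simplest IO layout; this is not the code that ended up in the branch):

subroutine ice_pio_init_standalone(iosystem)
  ! Rough sketch of standalone initialisation using PIO2's Fortran API
  ! (PIO_init, PIO_rearr_box and iosystem_desc_t are real PIO symbols;
  ! the routine name and the choice of values are assumptions).
  use mpi
  use pio, only: iosystem_desc_t, PIO_init, PIO_rearr_box
  implicit none
  type(iosystem_desc_t), intent(out) :: iosystem
  integer :: ierr, my_rank, ntasks, num_iotasks, stride

  call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, ntasks, ierr)

  stride      = 1                ! simplest layout: every PE is an IO task
  num_iotasks = ntasks / stride

  ! args: comp_rank, comp_comm, num_iotasks, num_aggregators, stride,
  !       rearranger, iosystem, base (rank of first IO task)
  call PIO_init(my_rank, MPI_COMM_WORLD, num_iotasks, 0, stride, &
                PIO_rearr_box, iosystem, base=0)
end subroutine ice_pio_init_standalone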

nichannah commented 4 years ago

The PIO code is ready to be tested; however, there is a problem with netcdf, compiler and openmpi version compatibility between the new CICE and the rest of the model, so this issue now depends on upgrading those.

aekiss commented 4 years ago

In ACCESS-OM2 sea ice concentration is passed to MOM via OASIS https://github.com/COSIMA/01deg_jra55_iaf/blob/30df8f5fd6404aeb459ff44298936df576dfbbf0/namcouple#L295 so we could output that field in parallel via MOM.

I couldn't find a relevant diagnostic here https://github.com/COSIMA/access-om2/wiki/Technical-documentation#MOM5-diagnostics-list so it looks like we'd need to write one.

russfiedler commented 4 years ago

I've put this in the WOMBAT version but I've been holding off on issuing a pull request until @nichannah updates the way he proposes to pass new fields.

https://github.com/russfiedler/MOM5/blob/wombat/src/mom5/ocean_core/ocean_sbc.F90#L5971
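
For anyone following along, here is a generic sketch of how such a diagnostic is typically registered and sent through FMS's diag_manager. register_diag_field and send_data are the standard FMS calls, but the field name, axes, and the ice_frac array are illustrative assumptions; this is not taken from the WOMBAT branch linked above.

subroutine diagnose_ice_frac(ice_frac, tracer_axes, model_time)
  ! Generic FMS diag_manager sketch: register the field once, then send it
  ! each coupling step so diag_manager writes it via MOM's normal IO path.
  use diag_manager_mod, only: register_diag_field, send_data
  use time_manager_mod, only: time_type
  implicit none
  real, dimension(:,:), intent(in) :: ice_frac      ! ice concentration received from CICE via OASIS
  integer,              intent(in) :: tracer_axes(2)
  type(time_type),      intent(in) :: model_time
  integer, save :: id_ice_frac = -1
  logical       :: used

  if (id_ice_frac == -1) then
     id_ice_frac = register_diag_field('ocean_model', 'ice_frac', tracer_axes, &
          model_time, 'sea ice concentration received from CICE', 'dimensionless')
  end if
  if (id_ice_frac > 0) used = send_data(id_ice_frac, ice_frac, model_time)
end subroutine diagnose_ice_frac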

russfiedler commented 4 years ago

Also, as a note to the above: netCDF on gadi should be suitable for PIO.

nichannah commented 4 years ago

Updated PIO build instructions:

cd $ACCESS_OM_DIR/src/cice5
wget https://github.com/NCAR/ParallelIO/releases/download/pio2_5_0/pio-2.5.0.tar.gz
tar zxvf pio-2.5.0.tar.gz
cd pio-2.5.0
module load intel-compiler/2019.5.281
module load netcdf/4.7.4p
module load openmpi/4.0.2
export CC=mpicc
export FC=mpifort
./configure --enable-fortran --disable-pnetcdf --enable-logging --enable-netcdf-integration --prefix=$ACCESS_OM_DIR/src/cice5/pio-2.5.0/usr
make
make install

Note that logging is enabled above. This will need to be changed in production.

To build using CMake:

CC=mpicc FC=mpif90 cmake -DWITH_PNETCDF=OFF \
    -DNetCDF_C_LIBRARY="${NETCDF}/lib/ompi3/libnetcdf.so" \
    -DNetCDF_C_INCLUDE_DIR="${NETCDF}/include/" \
    -DNetCDF_Fortran_LIBRARY="${NETCDF}/lib/ompi3/Intel" \
    -DNetCDF_Fortran_INCLUDE_DIR="${NETCDF}/include/Intel" \
    ../

nichannah commented 4 years ago

Preliminary results from a 10-day 0.1 deg run with daily CICE output: previously, writing output was 15% of CICE runtime; it's now 6%.

MOM now spends less than half as much time waiting on ice: down from 12% of runtime to 5%.

The interesting thing now is to see how this scales. Presumably the existing approach will not scale well as we increase the number of CICE CPUs. It would be good to see whether we can increase the number of CICE CPUs to further reduce the MOM wait time; the aim is to get this below 1%.

aekiss commented 4 years ago

Thanks @nichannah, that's great news.

Did you run your test with 799 CICE cores? And am I right in thinking CICE with PIO uses all cores (rather than a subset like MOM io_layout)? If so, I'm a little surprised it didn't speed up more, if there are 799x more cores doing the output. I guess there's some extra overhead in PIO?

@marshallward's tests on Raijin showed CICE would scale well up to about 2000 cores and is still reasonable at 3000 (see table below). If so, I guess we'd need over 4000 CICE cores to get below 1% MOM wait time, which seems rather a lot. But in our standard configs (serial CICE io, monthly outputs) MOM spends just under 2% of its time waiting for CICE, so 1% is better than we're used to.

[screenshot: CICE scaling table from @marshallward's Raijin tests]

nichannah commented 4 years ago

Thanks @aekiss, that's useful.

I'm now running a test to see how a run with daily output compares to one with monthly output. If that is OK then perhaps we can start to use this feature before spending more time on optimisation.

russfiedler commented 4 years ago

I believe PIO allows some sort of flexibility in which PEs are used (https://ncar.github.io/ParallelIO/group___p_i_o__init.html). I don't know how flexible this is in what has been written for CICE. There is an interesting point made in the FAQ that it's sometimes worth moving the IO away from the root PE/task (and I presume node) due to the heavier load there. Would it be worth investigating striping the files?

nichannah commented 4 years ago

Yes, it looks like there's some configuration optimisation that we can do with this. Presently I'm just using the simplest config, which is a stride of 1, so all procs are writing output.
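
For example, reusing the initialisation sketch from earlier in this thread, putting an IO task on only every 4th PE and keeping IO off the (busier) root task would just change the stride and base arguments (illustrative values, not a tested configuration):

! variant of the earlier PIO_init sketch: an IO task on every 4th PE, starting at PE 1
call PIO_init(my_rank, MPI_COMM_WORLD, ntasks/4, 0, 4, PIO_rearr_box, iosystem, base=1)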

I have just completed two 2 month runs:

1) standard config with mostly monthly cice output (16Gb output over 2 months)
2) PIO config with all daily output (460Gb output over 2 months)

Basically 1) is doing about 8Gb per month and 2) is doing 8Gb per day.

The runtime of these two runs is almost identical. Looking at ice_diag.d, the time taken for writing out history is similar, but the PIO case is about 5% slower. See:

/scratch/v45/nah599/access-om2/archive/01deg_jra55_iaf/output000/ice/ice_diag.d
/scratch/v45/nah599/access-om2/archive/pio_daily_01deg_jra55_iaf/output000/ice/ice_diag.d

Incidentally, there seems to be something strange happening with the atm halo timers in the new PIO run. The mean time in the PIO run is 6 seconds but for the regular run it is 106 seconds. A possible explanation for this is that the PEs within CICE are better matched so collective operations don't have to wait as long on lagging PEs.

So this new feature should allow daily ice output with no performance penalty over the existing configuration. I think it makes sense to merge this into master. Any objections? @aekiss?

Future work will involve looking at the scaling and performance of the whole model in more detail and at that point I can look at the different configuration options of PIO if ice output is a bottleneck.

aekiss commented 4 years ago

That's great that daily output can be done with nearly the same runtime. If you're confident that the output with PIO is bitwise identical to the non-PIO version then I see no reason not to merge into master, given that it makes daily output practical. @AndyHoggANU any objections?

Also is compressed output still possible with PIO?

AndyHoggANU commented 4 years ago

Yes, I would like to see PIO included if at all possible. It would make it feasible to put some daily ice fields out in the IAF run, which would be a big benefit.

aekiss commented 4 years ago

According to timer 12, PIO with daily output is slightly faster than serial with monthly output (not 5% slower):

/scratch/v45/nah599/access-om2/archive/01deg_jra55_iaf/output000/ice/ice_diag.d

Timer   1:     Total   13783.11 seconds
Timer  12: ReadWrite     934.32 seconds

/scratch/v45/nah599/access-om2/archive/pio_daily_01deg_jra55_iaf/output000/ice/ice_diag.d

Timer   1:     Total   13728.74 seconds
Timer  12: ReadWrite     871.72 seconds
nichannah commented 4 years ago

Seems like a cleaner timer to look at. I was looking at History.

russfiedler commented 4 years ago

The big one is still the time that the ice model is waiting for MOM for coupling (Timer 18: 5800s). I think/hope that this now means that upping the core count for MOM should be efficient since that occasional slowdown for CICE output won't be occurring.

nichannah commented 4 years ago

Good point. Successful balancing between models depends a lot on the balance within a model.

aekiss commented 4 years ago

@nichannah your build instructions above include a manual download of PIO - would this be better done as a PIO submodule within the CICE5 repo so that a recursive clone will get all the dependencies?

aekiss commented 4 years ago

Also, is your 34-pio branch on https://github.com/COSIMA/cice5 up to date? It only seems to have changes to bld/config.nci.auscom.360x300 but not the other resolutions: https://github.com/COSIMA/cice5/compare/34-pio..master

nichannah commented 4 years ago

Yes, I'll clean up the build process. A submodule within the cice repo is a good idea.

nichannah commented 4 years ago

I don't think it is up to date. I'll need to do a bit more work to get things in a mergeable state.

aekiss commented 4 years ago

no worries, let me know when you have a configuration that I can try building

russfiedler commented 4 years ago

Do we really want to support PIO as another submodule rather than pushing to get it supported on the system? It really has a much wider use and if we're not intending to do development on it I'm not sure that it should be part of the distribution.

aekiss commented 4 years ago

Fair point. I don't have an opinion either way, just so long as there's a seamless way to clone and build it on gadi.

nichannah commented 4 years ago

Current status:

I've been doing further runs with the 0.1 deg config and got a few segfaults at the end of the run. I've also noticed several deficiencies in the way the PIO interface is used within CICE (the 32/64-bit problem above, no error checking, bad ordering of setup and tear-down calls), so I need to do a more thorough code review.

aidanheerdegen commented 4 years ago

> Do we really want to support PIO as another submodule rather than pushing to get it supported on the system? It really has a much wider use and if we're not intending to do development on it I'm not sure that it should be part of the distribution.

@rxy900 might be able to comment about how to go about getting this supported on the system.

aekiss commented 4 years ago

Nic has already put in a request to have it installed as a module

nichannah commented 4 years ago

Regarding the PIO module: Andrey has pointed out that it may not be worthwhile if ACCESS-OM2 is the only user.

Also, I've found it quite useful to be able to recompile (e.g. with debug output), and perhaps this will be needed in the future if we do further optimisation/tuning work.

I think it makes sense to compile it ourselves for the time being and then perhaps move to a central install down the track. Perhaps we could install it as an hh5 module?

nichannah commented 4 years ago

New best run is here:

/scratch/v45/nah599/access-om2/archive/pio_daily_rearr_box_01deg_jra55_iaf/output000/ice/ice_diag.d

History writing is down from around 900 secs for the plain netcdf monthly output to 300 secs for the PIO daily.

aekiss commented 3 years ago

Great work @nichannah - here are some results from a production test run compared to the same config without PIO:

- CICE I/O time reduced by 77%
- improved load balance: MOM wait for CICE reduced by 70% (from 12.1% to 3.7% of MOM total runtime)
- overall SU cost and walltime reduced by 8.3%

We are now very close to 6mo/submit.

| | Control directory | Output directory | Job Id | Service Units | Walltime Used (hr) | Fraction of MOM runtime in oasis_recv | Max MOM wait for oasis_recv (s) | Max CICE wait for coupler (s) | Max CICE I/O time (s) | Memory Used (Gb) | NCPUs Used | MOM NCPUs | CICE NCPUs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| serial IO | /home/156/aek156/payu/01deg_jra55v140_iaf_cycle2 | /scratch/x77/aek156/access-om2/archive/01deg_jra55v140_iaf_cycle2/output356 | 10897190 | 29119.68 | 2.80861111 | 0.121 | 1242.67672 | 2821.14 | 1242.19 | 3624.96 | 5184 | 4358 | 799 |
| PIO | /home/156/aek156/payu/01deg_jra55v140_iaf_cycle2_pio_test | /scratch/x77/aek156/access-om2/archive/01deg_jra55v140_iaf_cycle2_pio_test/output356 | 10912826 | 26709.12 | 2.57611111 | 0.037 | 368.921135 | 2517.83 | 280.2 | 3788.8 | 5184 | 4358 | 799 |
| change (%) | | | | -8.28% | -8.28% | -69.42% | -70.31% | -10.75% | -77.44% | 4.52% | 0.00% | 0.00% | 0.00% |
aekiss commented 3 years ago

Hi @nichannah I have 0.1deg PIO test run output in /scratch/x77/aek156/access-om2/archive/01deg_jra55v140_iaf_cycle2_pio_test/output356/ice/OUTPUT that should be identical to the non-PIO case here /scratch/x77/aek156/access-om2/archive/01deg_jra55v140_iaf_cycle2/output356/ice/OUTPUT but isn't. I'm still getting to the bottom of it.

Some differences are unimportant metadata differences, but there are also some cpu mask differences in grid variables, e.g. in TLON (PIO version is on the right):

[screenshot: TLON comparison, non-PIO (left) vs PIO (right)]

These mask differences don't seem to be present in the data variables, only the grid ones.

nichannah commented 3 years ago

Thanks @aekiss for finding this. I'll look into it as well, I have a couple of vague ideas about what might be causing this.

nichannah commented 3 years ago

Hi @aekiss, I have fixed this and pushed changes to my 01deg_jra55v140_iaf_cycle2_with_pio pull request. I needed to update the CICE executable. The problem was that I had modified the CICE block distribution algorithm to make sure each CPU had an equal number of blocks. This is not needed in general, so I've changed it to a namelist option.

aekiss commented 3 years ago

OK thanks @nichannah, let me know when you've updated the exe and I'll give it another whirl.

nichannah commented 3 years ago

@aekiss, the exe is here:

/g/data/ik11/inputs/access-om2/bin/cice_auscom_3600x2700_722p_7c74942_libaccessom2_d914095.exe

It should be referenced in the config.yaml that I pushed to the pull request on your branch.

aekiss commented 3 years ago

thanks @nichannah I have a test running from /home/156/aek156/payu/01deg_jra55v140_iaf_cycle2_pio_test2

aekiss commented 3 years ago

Looks like you've nailed it @nichannah - apart from some expected changes in a few global attributes, the output from the test run is identical to the previous one according to this check notebook (https://github.com/aekiss/notebooks/blob/master/check_pio.ipynb). So as far as I can see it's ready for production use.

aekiss commented 3 years ago

Reopening - I hadn't checked the restarts, and the iced.* restarts seem to have the same processor mask issue, e.g. Tsfcn in /scratch/v14/pas548/restarts/KEEP/restart356/ice/iced.1986-04-01-00000.nc (left) and /scratch/x77/aek156/access-om2/archive/01deg_jra55v140_iaf_cycle2_pio_test2/restart356/ice/iced.1986-04-01-00000.nc (right):

[screenshot: Tsfcn comparison, non-PIO (left) vs PIO (right) restarts]

@nichannah can you have a look at this please?

nichannah commented 3 years ago

Hi @aekiss,

The reason for the above is that the old approach was:

  1. the master task fills a global-sized array with a default value (in this case 0)
  2. it then gathers the restart fields and puts them in the above array
  3. it writes that out

So the land was explicitly written out as 0, whereas in the new approach each PE writes out the part of the domain that it covers and the rest is left as the netcdf fill value.

So I think that in this case the new way is not incorrect - values over land should be undefined rather than set to 0.

The possible problem is that when we change the number or layout of CICE CPUs the missing values may not be in exactly the same places, which makes it difficult to use restarts on an altered configuration - much like the ocean model, where it's necessary to collate when restarting with a changed config. Actually, the above should only be a problem if CICE is doing calculations over land.

aekiss commented 3 years ago

Ah OK. So is it possible for the master task to also fill in the gaps? Or, even simpler, set the fill value to 0?

nichannah commented 3 years ago

OK, so it doesn't look like CICE does a very thorough job of checking the land mask. My crash at ice_atmo.F90:352 does not have a check. I think the solution to this is to fill land with 0 as before; the alternative is that we would need to do processing on restarts if we ever need to use them for a new config.

aekiss commented 3 years ago

I agree that filling land with 0 seems the better option, rather than hoping we remember this gotcha into the indefinite future...

nichannah commented 3 years ago

The solution to this is not completely satisfactory. The obvious way to get netcdf to put 0's in places where no data is written is to set _FillValue = 0. This can be a bit confusing because there is then no difference between "no data" and "data with value 0". However, I think this is probably still better than the alternative, which is needing to fix up CICE restarts whenever the PE layout changes.
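
As a sketch of what this looks like at the variable-definition stage (pio_def_var and pio_put_att are real PIO Fortran calls; the variable name, dimensions and wrapper routine are illustrative, not the actual CICE restart code):

subroutine def_restart_var_with_zero_fill(file, dimids, varid)
  ! Define a restart variable and set _FillValue = 0 so points never written
  ! by any PE (i.e. land) read back as 0, matching the old serial restarts.
  use pio, only: file_desc_t, var_desc_t, pio_def_var, pio_put_att, PIO_double
  implicit none
  type(file_desc_t), intent(inout) :: file       ! already created, still in define mode
  integer,           intent(in)    :: dimids(:)
  type(var_desc_t),  intent(out)   :: varid
  integer :: ierr

  ierr = pio_def_var(file, 'Tsfcn', PIO_double, dimids, varid)   ! 'Tsfcn' is just an example
  ierr = pio_put_att(file, varid, '_FillValue', 0.0d0)
end subroutine def_restart_var_with_zero_fill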

See the attached Tsfcn plot: the white has value 0 and the red is mostly -1.8.

[screenshot: Tsfcn with _FillValue = 0]

aekiss commented 3 years ago

I don't think setting _FillValue = 0 is a problem for this particular variable, as zero values are used for Tsfcn at land points outside the land mask in the PIO restarts anyway.

e.g. see Tsfcn in /scratch/x77/aek156/access-om2/archive/01deg_jra55v140_iaf_cycle2_pio_test2/restart356/ice/iced.1986-04-01-00000.nc (the range is narrowed for clarity):

[screenshot: Tsfcn in the PIO restart]

Setting _FillValue = 0 is consistent with what was done in the non-PIO restarts, which had zero throughout the land points - e.g. /scratch/v14/pas548/restarts/KEEP/restart356/ice/iced.1986-04-01-00000.nc:

[screenshot: Tsfcn in the non-PIO restart]

However I haven't looked at the other restart fields or files so maybe there would be problems with them?

I guess a safe thing would be to set the _FillValue for each field to the value from a land point that isn't masked out?