I've been looking at the CICE PIO code. It is not as complete as the serial netCDF code; for example, it doesn't do proper error checking. The PIO code still exists and is documented in CICE6.
My next step is to see whether it can be built on raijin.
Another option, which may be better even if PIO works, is to take the MOM5 approach and have each PE output to its own file, followed by an offline collate. The advantage of this is that we could continue to use the existing netCDF code (with slight modifications). The downside is that we would need to write a collate program.
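A minimal sketch of the per-PE output idea, assuming plain netCDF (`write_pe_file` and `my_rank` are illustrative names, not existing CICE or MOM code):

```fortran
subroutine write_pe_file(my_rank)
   ! Each PE writes its local sub-domain to its own file, e.g. iceh.0012.nc;
   ! an offline collate tool would stitch the pieces together afterwards.
   use netcdf
   implicit none
   integer, intent(in) :: my_rank
   character(len=64) :: fname
   integer :: ncid, ierr

   write(fname, '(a,i4.4,a)') 'iceh.', my_rank, '.nc'
   ierr = nf90_create(trim(fname), NF90_CLOBBER, ncid)
   ! ... define dims/vars for this PE's block and write the local data ...
   ierr = nf90_close(ncid)
end subroutine write_pe_file
```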
@nichannah, given the moves by Ed Hartnett towards implementing PIO in FMS, I think it would be best to go the PIO route to stay reasonably compatible with future FMS and CICEn.
Steps to build PIO for CICE:
```
wget https://github.com/NCAR/ParallelIO/releases/download/pio2_4_4/pio-2.4.4.tar.gz
tar zxvf pio-2.4.4.tar.gz
module load intel-cc/2018.3.222
module load intel-fc/2018.3.222
module load netcdf/4.6.1p
module load openmpi/4.0.1
```
I also tried openmpi/1.10.2, but the build failed with link errors.
```
export CPPFLAGS='-std=c99 -I${NETCDF}/include/ -I${PARALLEL_NETCDF_BASE}/include/'
export LDFLAGS='-L${NETCDF}/lib/ -L${PARALLEL_NETCDF_BASE}/lib/'
./configure --enable-fortran --prefix=/short/x77/nah599/access-om2/src/cice5/pio-2.4.4/usr
make
make install
```
It looks like the CICE PIO code makes use of something called shr_pio_mod. I'm getting compile errors like:
```
ice_pio.f90(9): error #7002: Error in opening the compiled module file. Check INCLUDE paths. [SHR_SYS_MOD]
use shr_sys_mod , only: shr_sys_flush
------^
ice_pio.f90(7): error #7002: Error in opening the compiled module file. Check INCLUDE paths. [SHR_KIND_MOD]
use shr_kind_mod, only: r8 => shr_kind_r8, in=>shr_kind_in
------^
ice_pio.f90(47): error #7002: Error in opening the compiled module file. Check INCLUDE paths. [SHR_PIO_MOD]
use shr_pio_mod, only: shr_pio_getiosys, shr_pio_getiotype
-------^
```
The code can be found here:
https://github.com/CESM-Development/cesm-git-experimental/tree/master/cesm/models/csm_share
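For reference, stand-ins for the first two modules are small; a minimal sketch, enough to satisfy the `use` statements above without pulling in the whole CESM tree (`shr_pio_mod` would still need a real replacement that supplies an initialised PIO io-system):

```fortran
! Illustrative stubs only; the real csm_share modules do considerably more.
module shr_kind_mod
   integer, parameter :: shr_kind_r8 = selected_real_kind(12) ! 8-byte real
   integer, parameter :: shr_kind_in = kind(1)                ! native integer
end module shr_kind_mod

module shr_sys_mod
contains
   subroutine shr_sys_flush(unit)
      integer, intent(in) :: unit
      flush(unit)
   end subroutine shr_sys_flush
end module shr_sys_mod
```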
Netcdf 4.7.1 is now installed on raijin on top of hdf5/1.10.5. The parallel version, 4.7.1p (and hdf5/1.10.5p), is built with openmpi/4.0.1.
I followed my instructions above with the new versions, and the configure step hangs. This seems to be caused by:
```
[nah599@raijin5 pio-2.4.4]$ module load intel-cc/17.0.1.132
[nah599@raijin5 pio-2.4.4]$ /bin/bash ./config.guess
```
The following works:
```
[nah599@raijin5 pio-2.4.4]$ module load intel-cc
[nah599@raijin5 pio-2.4.4]$ /bin/bash ./config.guess
```
This is the hanging command:
```
/apps/intel-ct/2019.3.199/cc/bin/icc -E /short/x77/nah599/tmp/cgm21joj/dummy.c
```
For the time being I'm using old compiler versions to try to get things working.
Current status: PIO is building. I need to modify the CICE PIO support so that it works without CESM dependencies. The main difficulty is that the CICE PIO code assumes that initialisation has already been done somewhere else (perhaps as part of a coupled model), so proper PIO initialisation needs to be written.
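A rough sketch of what that initialisation could look like, using the PIO2 Fortran API directly; `ice_pio_standalone_init`, `io_stride` and `n_iotasks` are illustrative names, not existing CICE code:

```fortran
! Minimal sketch of standalone PIO initialisation (no CESM shr_pio_mod),
! assuming CICE's MPI communicator is available. Illustrative only.
subroutine ice_pio_standalone_init(ice_comm, pio_iosystem)
   use mpi
   use pio, only: iosystem_desc_t, PIO_init, PIO_rearr_box
   implicit none
   integer, intent(in) :: ice_comm                 ! CICE's MPI communicator
   type(iosystem_desc_t), intent(out) :: pio_iosystem
   integer :: my_rank, n_pes, n_iotasks, io_stride, ierr

   call MPI_Comm_rank(ice_comm, my_rank, ierr)
   call MPI_Comm_size(ice_comm, n_pes, ierr)
   io_stride = 1                  ! simplest config: every PE is an IO task
   n_iotasks = n_pes / io_stride
   ! no aggregators, IO tasks starting at rank 0, box rearranger
   call PIO_init(my_rank, ice_comm, n_iotasks, 0, io_stride, &
                 PIO_rearr_box, pio_iosystem, base=0)
end subroutine ice_pio_standalone_init
```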
The PIO code is ready to be tested; however, there is a problem with netCDF, compiler and OpenMPI version compatibility between the new CICE and the rest of the model. So this issue is now dependent on upgrading these things.
In ACCESS-OM2, sea ice concentration is passed to MOM via OASIS (https://github.com/COSIMA/01deg_jra55_iaf/blob/30df8f5fd6404aeb459ff44298936df576dfbbf0/namcouple#L295), so we could output that field in parallel via MOM.
I couldn't find a relevant diagnostic here https://github.com/COSIMA/access-om2/wiki/Technical-documentation#MOM5-diagnostics-list so it looks like we'd need to write one.
I've put this in the WOMBAT version but I've been holding off on issuing a pull request until @nichannah updates the way he proposes to pass new fields.
https://github.com/russfiedler/MOM5/blob/wombat/src/mom5/ocean_core/ocean_sbc.F90#L5971
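For what it's worth, a minimal sketch of how such a diagnostic could be registered via FMS's diag_manager, not the actual WOMBAT implementation (`id_ice_conc`, the field name, and the `Grd`/`Time`/`ice_conc` variables are assumed from the surrounding ocean_sbc code):

```fortran
use diag_manager_mod, only: register_diag_field, send_data

integer :: id_ice_conc = -1
logical :: used

! at initialisation:
id_ice_conc = register_diag_field('ocean_model', 'ice_conc',            &
     Grd%tracer_axes(1:2), Time%model_time,                             &
     'sea ice concentration received from CICE via OASIS', 'fraction')

! each coupling step, after the OASIS field is received:
if (id_ice_conc > 0) then
   used = send_data(id_ice_conc, ice_conc(isc:iec,jsc:jec), Time%model_time)
end if
```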
Also, as a note to the above: netCDF on gadi should be suitable for PIO.
Updated PIO build instructions:
```
cd $ACCESS_OM_DIR/src/cice5
wget https://github.com/NCAR/ParallelIO/releases/download/pio2_5_0/pio-2.5.0.tar.gz
tar zxvf pio-2.5.0.tar.gz
cd pio-2.5.0
module load intel-compiler/2019.5.281
module load netcdf/4.7.4p
module load openmpi/4.0.2
export CC=mpicc
export FC=mpifort
./configure --enable-fortran --disable-pnetcdf --enable-logging --enable-netcdf-integration --prefix=$ACCESS_OM_DIR/src/cice5/pio-2.5.0/usr
make
make install
```
Note that logging is enabled above; this will need to be disabled for production.
To build using CMake:
```
CC=mpicc FC=mpif90 cmake -DWITH_PNETCDF=OFF -DNetCDF_C_LIBRARY="${NETCDF}/lib/ompi3/libnetcdf.so" -DNetCDF_C_INCLUDE_DIR="${NETCDF}/include/" -DNetCDF_Fortran_LIBRARY="${NETCDF}/lib/ompi3/Intel" -DNetCDF_Fortran_INCLUDE_DIR="${NETCDF}/include/Intel" ../
```
Preliminary results from a 10-day 0.1° run with daily CICE output: previously, writing output was 15% of CICE runtime; it's now 6%.
MOM is now spending less than half as much time waiting on ice: from 12% of runtime down to 5%.
The interesting thing now is to see how this scales. Presumably the existing approach will not scale well as we increase the number of CICE CPUs. It would be good to see whether we can increase the number of CICE CPUs to further reduce the MOM wait time; aim to get this below 1%.
Thanks @nichannah, that's great news.
Did you run your test with 799 CICE cores? And am I right in thinking CICE with PIO uses all cores (rather than a subset like MOM io_layout)? If so, I'm a little surprised it didn't speed up more, if there are 799x more cores doing the output. I guess there's some extra overhead in PIO?
@marshallward's tests on Raijin showed CICE would scale well up to about 2000 cores and is still reasonable at 3000 (see table below). If so, I guess we'd need over 4000 CICE cores to get below 1% MOM wait time, which seems rather a lot. But in our standard configs (serial CICE io, monthly outputs) MOM spends just under 2% of its time waiting for CICE, so 1% is better than we're used to.
Thanks @aekiss, that's useful.
I'm now running a test to see how a run with daily output compares to one with monthly output. If that is OK then perhaps we can start to use this feature before spending more time on optimisation.
I believe PIO allows some flexibility in which PEs are used: https://ncar.github.io/ParallelIO/group___p_i_o__init.html. I don't know how flexible this is in what has been written for CICE. There is an interesting point made in the FAQ that it's sometimes worth moving the IO away from the root PE/task (and I presume node) due to the heavier load there. Would it be worth investigating striping the files?
Yes, it looks like there's some configuration optimisation that we can do with this. Presently I'm just using the simplest config, which is a stride of 1, so all procs are writing output.
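In terms of the initialisation sketch above, the layout comes down to the `stride`/`base` arguments of `PIO_init`. A hedged illustration of the alternative the FAQ suggests, reusing the illustrative names from that sketch (the CICE wrapper may not expose these knobs yet):

```fortran
! Current simplest config: every PE is an IO task.
io_stride = 1
n_iotasks = n_pes
io_base   = 0
! Per the PIO FAQ, IO could instead be spread more thinly and moved off
! the busy root task, e.g. one IO task per 16 compute PEs, starting at rank 1:
io_stride = 16
n_iotasks = max(1, n_pes / io_stride)
io_base   = 1
call PIO_init(my_rank, ice_comm, n_iotasks, 0, io_stride, &
              PIO_rearr_box, pio_iosystem, base=io_base)
```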
I have just completed two 2-month runs:
1) standard config with mostly monthly CICE output (16Gb output over 2 months)
2) PIO config with all daily output (460Gb output over 2 months)
Basically 1) is doing about 8Gb per month and 2) is doing 8Gb per day.
The runtime of these two runs is almost identical. Looking at ice_diag.d, the time taken for writing out history is similar, but the PIO case is about 5% slower. See:
/scratch/v45/nah599/access-om2/archive/01deg_jra55_iaf/output000/ice/ice_diag.d
/scratch/v45/nah599/access-om2/archive/pio_daily_01deg_jra55_iaf/output000/ice/ice_diag.d
Incidentally, there seems to be something strange happening with the atm halo timers in the new PIO run. The mean time in the PIO run is 6 seconds but for the regular run it is 106 seconds. A possible explanation for this is that the PEs within CICE are better matched so collective operations don't have to wait as long on lagging PEs.
So this new feature should allow daily ice output with no performance penalty over the existing configuration. I think it makes sense to merge this into master. Any objections? @aekiss?
Future work will involve looking at the scaling and performance of the whole model in more detail and at that point I can look at the different configuration options of PIO if ice output is a bottleneck.
That's great that daily output can be done with nearly the same runtime. If you're confident that the output with PIO is bitwise identical to the non-PIO version then I see no reason not to merge into master, given that it makes daily output practical. @AndyHoggANU any objections?
Also, is compressed output still possible with PIO?
Yes, I would like to see PIO included if at all possible. It would make it feasible to put some daily ice fields out in the IAF run, which would be a big benefit.
According to timer 12, PIO with daily output is slightly faster than serial with monthly output (not 5% slower):
```
/scratch/v45/nah599/access-om2/archive/01deg_jra55_iaf/output000/ice/ice_diag.d
  Timer  1: Total      13783.11 seconds
  Timer 12: ReadWrite    934.32 seconds

/scratch/v45/nah599/access-om2/archive/pio_daily_01deg_jra55_iaf/output000/ice/ice_diag.d
  Timer  1: Total      13728.74 seconds
  Timer 12: ReadWrite    871.72 seconds
```
That seems like a cleaner timer to look at; I was looking at the History timer.
The big one is still the time that the ice model is waiting for MOM for coupling (Timer 18: 5800s). I think/hope that this now means that upping the core count for MOM should be efficient since that occasional slowdown for CICE output won't be occurring.
Good point. Successful balancing between models depends a lot on the balance within a model.
@nichannah your build instructions above include a manual download of PIO - would this be better done as a PIO submodule within the CICE5 repo so that a recursive clone will get all the dependencies?
Also, is your 34-pio branch up to date on https://github.com/COSIMA/cice5? It only seems to have changes to bld/config.nci.auscom.360x300 but not the other resolutions: https://github.com/COSIMA/cice5/compare/34-pio..master
Yes, I'll clean up the build process. A submodule within the CICE repo is a good idea.
I don't think it is up to date. I'll need to do a bit more work to get things in a mergeable state.
no worries, let me know when you have a configuration that I can try building
Do we really want to support PIO as another submodule rather than pushing to get it supported on the system? It really has a much wider use and if we're not intending to do development on it I'm not sure that it should be part of the distribution.
Fair point. I don't have an opinion either way, just so long as there's a seamless way to clone and build it on gadi.
Current status:
I've been doing further runs with the 0.1 deg config and got a few segfaults at the end of the run. I've also noticed several deficiencies in the way the PIO interface is being used within CICE (the 32/64 bit problem above, no error checking, bad ordering of setup and tear-down calls), so I need to do a more thorough code review.
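To illustrate the missing error checking, a hedged sketch of the kind of wrapper that could be added (`ice_pio_errs` and `ice_pio_check` are hypothetical, not existing CICE code):

```fortran
module ice_pio_errs
   ! Hypothetical helpers sketching the error checking the CICE PIO code lacks.
   use pio, only: iosystem_desc_t, pio_seterrorhandling, &
                  PIO_RETURN_ERROR, PIO_NOERR
   implicit none
contains
   subroutine ice_pio_set_handling(ios)
      ! Ask PIO to return status codes rather than handle errors internally.
      type(iosystem_desc_t), intent(inout) :: ios
      call pio_seterrorhandling(ios, PIO_RETURN_ERROR)
   end subroutine ice_pio_set_handling

   subroutine ice_pio_check(ierr, msg)
      ! Abort with a message on any non-zero PIO status.
      integer, intent(in) :: ierr
      character(len=*), intent(in) :: msg
      if (ierr /= PIO_NOERR) then
         write(*,*) 'CICE PIO error: ', trim(msg), ' (status ', ierr, ')'
         stop 1   ! real code would call CICE's abort_ice
      end if
   end subroutine ice_pio_check
end module ice_pio_errs
```

Every `ierr = pio_...()` call site would then be followed by something like `call ice_pio_check(ierr, 'writing aice')`.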
> Do we really want to support PIO as another submodule rather than pushing to get it supported on the system? It really has a much wider use and if we're not intending to do development on it I'm not sure that it should be part of the distribution.
@rxy900 might be able to comment about how to go about getting this supported on the system.
Nic has already put in a request to have it installed as a module
Regarding the PIO module: Andrey has pointed out that it may not be worthwhile if ACCESS-OM2 is the only user.
Also I've found that it is quite useful to be able to recompile (e.g. with debug output) and perhaps this will be needed in the future if we do further optimisation/tuning work.
I think it makes sense to compile it ourselves for the time being and then perhaps move to a central install down the track. Perhaps we could install it as an hh5 module?
New best run is here:
/scratch/v45/nah599/access-om2/archive/pio_daily_rearr_box_01deg_jra55_iaf/output000/ice/ice_diag.d
History writing is down from around 900 secs for the plain netcdf monthly output to 300 secs for the PIO daily.
Great work @nichannah - here are some results from a production test run compared to the same config without PIO:
- CICE I/O time reduced by 77%
- improved load balance: MOM wait for CICE reduced by 70% (from 12.1% to 3.7% of MOM total runtime)
- overall SU cost and walltime reduced by 8.3%

We are now very close to 6mo/submit.
| | Control directory | Output directory | Job Id | Service Units | Walltime Used (hr) | Fraction of MOM runtime in oasis_recv | Max MOM wait for oasis_recv (s) | Max CICE wait for coupler (s) | Max CICE I/O time (s) | Memory Used (Gb) | NCPUs Used | MOM NCPUs | CICE NCPUs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| serial IO | /home/156/aek156/payu/01deg_jra55v140_iaf_cycle2 | /scratch/x77/aek156/access-om2/archive/01deg_jra55v140_iaf_cycle2/output356 | 10897190 | 29119.68 | 2.80861111 | 0.121 | 1242.67672 | 2821.14 | 1242.19 | 3624.96 | 5184 | 4358 | 799 |
| PIO | /home/156/aek156/payu/01deg_jra55v140_iaf_cycle2_pio_test | /scratch/x77/aek156/access-om2/archive/01deg_jra55v140_iaf_cycle2_pio_test/output356 | 10912826 | 26709.12 | 2.57611111 | 0.037 | 368.921135 | 2517.83 | 280.2 | 3788.8 | 5184 | 4358 | 799 |
| change (%) | | | | -8.28% | -8.28% | -69.42% | -70.31% | -10.75% | -77.44% | 4.52% | 0.00% | 0.00% | 0.00% |
Hi @nichannah, I have 0.1deg PIO test run output in
/scratch/x77/aek156/access-om2/archive/01deg_jra55v140_iaf_cycle2_pio_test/output356/ice/OUTPUT
that should be identical to the non-PIO case here
/scratch/x77/aek156/access-om2/archive/01deg_jra55v140_iaf_cycle2/output356/ice/OUTPUT
but isn't. I'm still getting to the bottom of it.
Some differences are unimportant metadata differences, but there are also some CPU mask differences in grid variables, e.g. in TLON (PIO version on the right in the attached screenshots, not reproduced here).
These mask differences don't seem to be present in the data variables, only the grid ones.
Thanks @aekiss for finding this. I'll look into it as well, I have a couple of vague ideas about what might be causing this.
Hi @aekiss, I have fixed this and pushed changes to my 01deg_jra55v140_iaf_cycle2_with_pio pull request. I needed to update the CICE executable. The problem was that I had modified the CICE block distribution algorithm to make sure each CPU had an equal number of blocks. This is not needed in general, so I've changed it to a namelist option.
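For reference, a hedged, hypothetical sketch of how such a namelist switch might look (`distribution_type` is an existing CICE `domain_nml` option, but `force_equal_blocks` is an invented placeholder; the actual option name in the branch may differ):

```fortran
&domain_nml
    distribution_type  = 'roundrobin'
    ! hypothetical flag: force an equal number of blocks per CPU
    force_equal_blocks = .false.
/
```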
ok thanks @nichannah, let me know when you've updated the exe and I'll give it another whirl
@aekiss, the exe is here:
/g/data/ik11/inputs/access-om2/bin/cice_auscom_3600x2700_722p_7c74942_libaccessom2_d914095.exe
It should be referenced in the config.yaml that I pushed to the pull request on your branch.
thanks @nichannah I have a test running from /home/156/aek156/payu/01deg_jra55v140_iaf_cycle2_pio_test2
Looks like you've nailed it @nichannah - apart from some expected changes in a few global attributes the output from the test run is identical to the previous one according to this test: https://github.com/aekiss/notebooks/blob/master/check_pio.ipynb So as far as I can see it's ready for production use.
Reopening - I hadn't checked the restarts, and the iced.* restarts seem to have the same processor mask issue, e.g. Tsfcn in
/scratch/v14/pas548/restarts/KEEP/restart356/ice/iced.1986-04-01-00000.nc (left) and
/scratch/x77/aek156/access-om2/archive/01deg_jra55v140_iaf_cycle2_pio_test2/restart356/ice/iced.1986-04-01-00000.nc (right).
@nichannah can you have a look at this please?
Hi @aekiss,
The reason for the above is the old approach: the land was explicitly written out as 0. Whereas in the new approach, the PEs each write out the part of the domain that they cover, and the rest is left as the netCDF fill value.
So I think that in this case the new way is not incorrect - values over land should be undefined rather than set to 0.
The possible problem is that when we change the number or layout of CICE CPUs, the missing values may not be in exactly the same places, which makes it difficult to use restarts with an altered configuration - much like the ocean model, where it's necessary to collate when restarting with a changed config. Actually, the above should only be a problem if CICE is doing calculations over land.
Ah OK. So is it possible for the master task to also fill in the gaps? Or, even simpler, set the fill value to 0?
OK, so it doesn't look like CICE does a very thorough job of checking the land mask. My crash in ice_atmo.F90:352 does not have a check. I think the solution to this is to fill land with 0 as before. The alternative is that we would need to do processing on restarts if we ever need to use them for a new config.
I agree that filling land with 0 seems the better option, rather than hoping we remember this gotcha into the indefinite future...
The solution to this is not completely satisfactory. The obvious way to get netCDF to put 0s in places where no data is written is to set _FillValue = 0. This can be a bit confusing because there is then no difference between "no data" and "data with value 0". However, I think this is probably still better than the alternative, which is needing to fix up CICE restarts whenever the PE layout changes.
See the attached Tsfcn plot: the white has value 0 and the red is mostly -1.8.
I don't think setting _FillValue = 0 is a problem for this particular variable, as zero values are used for Tsfcn at land points outside the land mask in the PIO restarts anyway - e.g. see Tsfcn in
/scratch/x77/aek156/access-om2/archive/01deg_jra55v140_iaf_cycle2_pio_test2/restart356/ice/iced.1986-04-01-00000.nc
(the range is narrowed for clarity in the attached plot).
Setting _FillValue = 0 is also consistent with what was done in the non-PIO restarts, which had zero throughout the land points - e.g.
/scratch/v14/pas548/restarts/KEEP/restart356/ice/iced.1986-04-01-00000.nc.
However I haven't looked at the other restart fields or files so maybe there would be problems with them?
I guess a safe thing would be to set the _FillValue for each field to the value from a land point that isn't masked out?
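For what it's worth, a hedged sketch of setting _FillValue through the PIO Fortran API at variable-definition time; `File`, `dimids` and `varid` are placeholders from the surrounding define-mode code, `r8` is assumed from CICE's kinds, and `ice_pio_check` is the hypothetical helper from the earlier sketch:

```fortran
! Illustrative fragment: _FillValue must be set in define mode, with the
! same type as the variable, so unwritten (land) regions read back as 0.
ierr = pio_def_var(File, 'Tsfcn', PIO_double, dimids, varid)
call ice_pio_check(ierr, 'defining Tsfcn')
ierr = pio_put_att(File, varid, '_FillValue', 0.0_r8)
call ice_pio_check(ierr, 'setting Tsfcn _FillValue')
```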
It may be worth trying to compile with parallel IO using PIO (setenv IO_TYPE pio). We currently compile CICE with serial IO (setenv IO_TYPE netcdf in bld/build.sh), so one CPU does all the IO and we end up with an Amdahl's law situation that limits the scalability with large core counts.

At 0.1 deg, CICE is IO-bound when doing daily outputs (see Timer 12 in ice_diag.d), and the time spent in CICE IO accounts for almost all the time MOM waits for CICE (oasis_recv in access-om2.out), so the whole coupled model is waiting on one CPU. With daily CICE output at 0.1 deg this is ~19% of the model runtime (it's only ~2% without daily CICE output). Lowering the compression level to 1 (https://github.com/COSIMA/cice5/issues/33) has helped (MOM wait was 23% with level 5), and omitting static field output (https://github.com/COSIMA/cice5/issues/32) would also help.

Also, I understand that PIO doesn't support compression - is that correct?
@russfiedler had these comments on Slack:
Slack discussion: https://arccss.slack.com/archives/C9Q7Y1400/p1557272377089800