COSIMA / cice5

Clone of The Los Alamos sea ice model (CICE) with ACCESS drivers. See https://github.com/CICE-Consortium/CICE-svn-trunk/tree/cice-5.1.2

Investigate using parallel IO #34

Closed: aekiss closed this issue 3 years ago

aekiss commented 5 years ago

It may be worth trying to compile with parallel IO using PIO (setenv IO_TYPE pio).

We currently compile CICE with serial IO (setenv IO_TYPE netcdf in bld/build.sh), so one CPU does all the IO and we end up with an Amdahl's law situation that limits scalability at large core counts.

At 0.1 deg CICE is IO-bound when doing daily outputs (see Timer 12 in ice_diag.d), and the time spent in CICE IO accounts for almost all of the time MOM waits for CICE (oasis_recv in access-om2.out), so the whole coupled model is waiting on one CPU. With daily CICE output at 0.1 deg this is ~19% of the model runtime (it's only ~2% without daily CICE output). Lowering the compression level to 1 (https://github.com/COSIMA/cice5/issues/33) has helped (the MOM wait was 23% with level 5), and omitting static field output (https://github.com/COSIMA/cice5/issues/32) would also help.
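
As a rough sanity check on why this matters for scaling, here's a back-of-envelope Amdahl's-law calculation (a sketch only; it treats CICE IO as a purely serial fraction and everything else as perfectly parallel, and the 5000-core figure is just illustrative):

    # Back-of-envelope Amdahl's law sketch: a serial IO fraction of ~19% caps the
    # achievable speedup at ~5x, no matter how many cores are added.
    def amdahl_speedup(serial_fraction, n_cores):
        """Ideal speedup when serial_fraction of the work cannot be parallelised."""
        return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

    for f in (0.02, 0.19):  # ~2% without daily CICE output, ~19% with it
        print(f"serial fraction {f:.0%}: "
              f"asymptotic max speedup {1 / f:.0f}x, "
              f"speedup on 5000 cores {amdahl_speedup(f, 5000):.1f}x")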

Also I understand that PIO doesn't support compression - is that correct?

@russfiedler had these comments on Slack:

I have a feeling that the CICE parallel IO hadn't really been tested, or that there was some problem with it. We would have to update the netCDF versions used in CICE for a start: the distributors of PIO note that they need netCDF 4.6.1 and HDF5 1.10.4 or later for their latest version, as there's a bug in parallel collective IO in earlier HDF5 versions, yet the NCI version of netCDF 4.6.1 is built with HDF5 1.10.2! Marshall noted above that Rui found a performance drop-off when moving from 1.10.2 to 1.10.4. The gather is done on all the small tiles, so each PE sends a single horizontal slab several times to the root PE for each level. The number of MPI calls is probably the main issue: it looks like there's an individual send/recv for each tile, rather than either a bulk send of the tiles or something more funky using MPI_Gather(v) and MPI_Type_create_subarray.

Slack discussion: https://arccss.slack.com/archives/C9Q7Y1400/p1557272377089800

aekiss commented 4 years ago

For example, the point (474, 2613) is land but unmasked, so you could check its value for every field in every restart file in /scratch/v14/pas548/restarts/KEEP/restart356/ice/ or in /scratch/x77/aek156/access-om2/archive/01deg_jra55v140_iaf_cycle2_pio_test2/restart356/ice/iced.1986-04-01-00000.nc, and use that value as the _FillValue.
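
Something along these lines could do that check (a minimal sketch, not a tested workflow; the index convention is an assumption - CICE reports 1-based Fortran (i, j) indices, while the arrays here are indexed 0-based as (..., nj, ni)):

    # Sketch: print the value at the unmasked land point (474, 2613) for every
    # 2D+ field in a CICE restart file, to see what could serve as _FillValue.
    # Assumes the last two dimensions are (nj, ni) and converts from 1-based
    # Fortran indices to 0-based Python indices.
    import xarray as xr

    i, j = 474, 2613  # CICE (i, j) indices of the land-but-unmasked point
    ds = xr.open_dataset("iced.1986-04-01-00000.nc")  # e.g. from restart356/ice/
    for name, var in ds.items():
        if var.ndim >= 2:
            print(name, var.values[..., j - 1, i - 1])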

aidanheerdegen commented 4 years ago

CF conventions allow for both _FillValue and missing_value. If missing_value is set to something non-zero, does that help?

http://cfconventions.org/cf-conventions/cf-conventions.html#missing-data

aekiss commented 3 years ago

Thanks @nichannah, I'm closing this issue now.

We decided in the 14 Oct TWG meeting that this issue with restarts is not significant enough to warrant fixing, and that a fix with a change to _FillValue=0 would cause more trouble than it was worth, since genuine data could be misinterpreted as fill.

We just need to remember to fill in the cpu-masked cells with zero values if restarting with a changed cpu layout.

I've done a test run at 0.1deg with PIO (using commit 7c74942) to compare to one without PIO (using commit 26e6159). This restart issue means I can't compare the restart files, but I've confirmed (using xarray's identical method) that the outputs are identical, including for a second run based on PIO-generated restarts, so I'm confident that the model state is unaffected by these differences in the restart files. Test script is here: https://github.com/aekiss/notebooks/blob/72986342795e6fef167ad5d9df76a01b1ad7fefa/check_pio.ipynb
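
For reference, the core of that check looks something like this (a minimal sketch; the linked notebook is the authoritative version and the file paths here are placeholders):

    # Sketch of the identical-output check: xarray's identical() compares values,
    # dimensions, coordinates and attributes. File paths are placeholders.
    import xarray as xr

    with xr.open_dataset("nopio_run/ice/OUTPUT/iceh.1986-04.nc") as ref, \
         xr.open_dataset("pio_run/ice/OUTPUT/iceh.1986-04.nc") as pio:
        assert ref.identical(pio), "PIO and non-PIO history outputs differ"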

aekiss commented 3 years ago

Sorry @nic, I'm reopening again - I've hit a bug using PIO in a 1deg configuration.

For the 1deg config I'm using one core per chunk, laid out the same way as slenderX1 (not sure if this is the best choice?):

    history_chunksize_x = 15
    history_chunksize_y = 300
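
(As an aside, the chunking actually written to a history file can be checked with something like the sketch below; the file path is a placeholder.)

    # Sketch: print the HDF5 chunk shapes actually written to a CICE history file,
    # to confirm they match history_chunksize_x / history_chunksize_y.
    import netCDF4

    with netCDF4.Dataset("output000/ice/OUTPUT/iceh.1900-01.nc") as nc:
        for name, var in nc.variables.items():
            chunks = var.chunking()  # a list of chunk sizes, or 'contiguous'
            if chunks != "contiguous":
                print(name, var.dimensions, chunks)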

I have repeated identical runs

/home/156/aek156/payu/testing/all-configs/v2.0.0rc9/1deg_jra55_ryf_v2.0.0rc9
/home/156/aek156/payu/testing/all-configs/v2.0.0rc9/1deg_jra55_ryf_v2.0.0rc9xx

and got differing output in these files and variables:

/scratch/x77/aek156/1deg_jra55_ryf_v2.0.0rc9-CHUCKABLE/output000/ice/OUTPUT/iceh.1900-01.nc
/scratch/x77/aek156/1deg_jra55_ryf_v2.0.0rc9xx-CHUCKABLE/output000/ice/OUTPUT/iceh.1900-01.nc
fsurfn_ai_m
vicen_m

/scratch/x77/aek156/1deg_jra55_ryf_v2.0.0rc9-CHUCKABLE/output000/ice/OUTPUT/iceh.1900-02.nc
/scratch/x77/aek156/1deg_jra55_ryf_v2.0.0rc9xx-CHUCKABLE/output000/ice/OUTPUT/iceh.1900-02.nc
fmelttn_ai_m
vicen_m

/scratch/x77/aek156/1deg_jra55_ryf_v2.0.0rc9-CHUCKABLE/output001/ice/OUTPUT/iceh.1900-04.nc
/scratch/x77/aek156/1deg_jra55_ryf_v2.0.0rc9xx-CHUCKABLE/output001/ice/OUTPUT/iceh.1900-04.nc
aicen_m
flatn_ai_m
fmelttn_ai_m

/scratch/x77/aek156/1deg_jra55_ryf_v2.0.0rc9-CHUCKABLE/output001/ice/OUTPUT/iceh.1900-05.nc
/scratch/x77/aek156/1deg_jra55_ryf_v2.0.0rc9xx-CHUCKABLE/output001/ice/OUTPUT/iceh.1900-05.nc
flatn_ai_m
fsurfn_ai_m
vicen_m

/scratch/x77/aek156/1deg_jra55_ryf_v2.0.0rc9-CHUCKABLE/output002/ice/OUTPUT/iceh.1900-07.nc
/scratch/x77/aek156/1deg_jra55_ryf_v2.0.0rc9xx-CHUCKABLE/output002/ice/OUTPUT/iceh.1900-07.nc
fcondtopn_ai_m

Note that this issue only appears in multi-category variables (e.g. 'fcondtopn_ai_m' (time: 1, nc: 5, nj: 300, ni: 360)) and is unpredictable - most multi-category variables are ok most of the time, and there are no variables that are always affected.
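
These differences can be found with a comparison along these lines (a sketch of the general approach, not the actual comparison script; the run paths are placeholders):

    # Sketch: compare matching iceh.*.nc files from two supposedly identical runs
    # and report any variables whose values differ. Run paths are placeholders.
    import glob
    import os
    import xarray as xr

    run_a = "1deg_jra55_ryf_a"  # placeholder paths to the two run archives
    run_b = "1deg_jra55_ryf_b"

    for path_a in sorted(glob.glob(os.path.join(run_a, "output*/ice/OUTPUT/iceh.*.nc"))):
        path_b = path_a.replace(run_a, run_b)
        with xr.open_dataset(path_a) as a, xr.open_dataset(path_b) as b:
            differing = [v for v in a.data_vars if not a[v].equals(b[v])]
        if differing:
            print(os.path.relpath(path_a, run_a), differing)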

For example, here's category 0 of fmelttn_ai_m in /scratch/x77/aek156/1deg_jra55_ryf_v2.0.0rc9xx-CHUCKABLE/output000/ice/OUTPUT/iceh.1900-02.nc:

[screenshot, 28 Oct 2020: category 0 of fmelttn_ai_m]

There are bad points just north of the Equator over a limited longitude range in the Indonesian archipelago. They are extremely large, presumably uninitialised values. The values in the longitudes between them are very small but nonzero (they should be zero). The land mask is also messed up.

The problem occurs in different places in other fields.

I've only seen this problem in category 0, but I haven't checked thoroughly. For example, here's category 1 of the same field and file:

[screenshot, 28 Oct 2020: category 1 of fmelttn_ai_m]

I didn't see this issue with the 0.1deg config. Maybe I need better choices for history_chunksize_x and history_chunksize_y? (NB I found I could get segfaults if I wasn't careful with these values...)

aekiss commented 3 years ago

Oops, apologies @nichannah - this was just because I was calling mpirun with the wrong options at 1 deg.

When I use

    mpirun: --mca io ompio --mca io_ompio_num_aggregators 1

in config.yaml it works as expected.

aidanheerdegen commented 3 years ago

The OpenMPI docs say ompio is the default for versions > 2.x. Is that incorrect?

https://www.open-mpi.org/faq/?category=ompio

nichannah commented 3 years ago

On Gadi it appears that romio is used by default. Also, we need to specify the number of MPI aggregators explicitly to avoid the heuristic that usually sets this; that heuristic appears to get confused by the combination of a chunk size that differs from the tile size and deflation being enabled, and the confusion leads to a divide-by-zero. I haven't spent the time to really understand this bug, so you could say that --mca io_ompio_num_aggregators 1 is a work-around.

aidanheerdegen commented 3 years ago

Thanks for the explanation @nichannah

aekiss commented 3 years ago

@nichannah FYI: PIO seems to slow down CICE at 1 deg - see these 3-month runs in /home/156/aek156/payu/testing/all-configs/v2.0.0rc9:

| Config | Fraction of MOM runtime in oasis_recv | Max CICE I/O time (s) |
| --- | --- | --- |
| 1 deg, no PIO (1deg_jra55_ryf_v2.0.0rc9_nopio) | 0.04 | 10.6 |
| 1 deg, PIO, 24 chunks (15x300) (1deg_jra55_ryf_v2.0.0rc9_pio) | 0.062 | 15.3 |
| 1 deg, PIO, 1 chunk (360x300) (1deg_jra55_ryf_v2.0.0rc9_pio_1chunk) | 0.096 | 24.4 |

but it is improved at 0.25 deg:

| Config | Fraction of MOM runtime in oasis_recv | Max CICE I/O time (s) |
| --- | --- | --- |
| 0.25 deg, no PIO (025deg_jra55_ryf_v2.0.0rc9) | 0.078 | 54 |
| 0.25 deg, PIO, 100 chunks (144x108) (025deg_jra55_ryf_v2.0.0rc9_pio2) | 0.04 | 25 |

The CICE cores are spread across nodes on Gadi at 1 deg (1+216+24 cores for yatm/mom/cice), so that might be part of the problem: https://github.com/COSIMA/access-om2/issues/212 and https://github.com/COSIMA/access-om2/issues/202

aekiss commented 3 years ago

I've also tried 1 deg (/home/156/aek156/payu/testing/all-configs/v2.0.0rc10/1deg_jra55_iaf_v2.0.0rc10) and 0.25 deg (025deg_jra55_iaf_v2.0.0rc10) configs with 4 chunks (90x300 at 1 deg; 720x540 at 0.25 deg) and get 0.085 for the fraction of MOM runtime in oasis_recv in both cases.

1 deg with 4 chunks is almost as fast as the 24-chunk case (though slower than without PIO), but should be faster to read than 24 chunks in most circumstances. However, I'm thinking a 180x150 4-chunk layout is probably a better match to hemisphere-based access patterns, so I might try that too. This run was for 5 years, rather than 3 months as in the previous and next posts, so I haven't included Max CICE I/O time. It's a bit faster in a 3-month test - see the next post.

0.25 deg with 4 chunks is now somewhat slower than without PIO but I'm reluctant to use too many chunks in case it slows down reading. Note that this run was for 2 years, rather than 3mo as in the previous and next posts.

Also I should have mentioned that these 1 deg and 0.25 deg tests all had identical ice outputs, but they differ from the ice outputs in the production 0.1deg runs I reported here so they aren't directly comparable to those.

aekiss commented 3 years ago

Some more tests of differing history_chunksize_x x history_chunksize_y with 3mo runs at 1 deg in /home/156/aek156/payu/testing/all-configs/v2.0.0rc:

| Config | Fraction of MOM runtime in oasis_recv | Max CICE I/O time (s) |
| --- | --- | --- |
| 1 deg, PIO, 4 chunks (90x300) (1deg_jra55_iaf_v2.0.0rc10_3mo) | 0.067 | 16.8 |
| 1 deg, PIO, 4 chunks (180x150) (1deg_jra55_iaf_v2.0.0rc10_3mo_180x150) | 0.072 | 18.1 |

The first of these is slightly faster (presumably because it is consistent with the 15x300 core layout), but the difference is small, so I will use 180x150 for the new 1 deg configs, as this is better suited to the typical access pattern of reading one hemisphere or the other.
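
As a rough illustration of the hemisphere argument (a back-of-envelope sketch assuming a 360x300 grid and a contiguous read of the northern half only):

    # Sketch: count how many chunks a read of one hemisphere (all longitudes,
    # half the rows) has to touch for the two candidate chunk layouts.
    from math import ceil

    ni, nj = 360, 300            # 1 deg grid size (x, y)
    read_ni, read_nj = 360, 150  # one hemisphere

    for cx, cy in [(90, 300), (180, 150)]:
        touched = ceil(read_ni / cx) * ceil(read_nj / cy)
        print(f"{cx}x{cy} chunks: hemisphere read touches {touched} of "
              f"{ceil(ni / cx) * ceil(nj / cy)} chunks")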

The fraction of MOM runtime in oasis_recv with 90x300 chunks is smaller in the 3-month case than in the 5-year case: 0.067 rather than 0.085 (see the previous post). So for 3-month runs the 4-chunk cases (0.067, 0.072) are nearly as fast as the 24-chunk case (0.062) and considerably faster than 1 chunk (0.096) - see the post before last.

aekiss commented 3 years ago

For future reference: the processor masking in the ice restarts can be fixed with https://github.com/COSIMA/topogtools/blob/master/fix_ice_restarts.py, allowing a change in processor layout during a run.
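
The underlying idea is roughly as sketched below (this is not the linked script; the file name is a placeholder and the reliance on netCDF4's fill-value masking is an assumption):

    # Sketch of the idea: cells masked out by the old processor layout contain
    # fill values, so overwrite them with zeros before restarting with a new
    # layout. This is NOT the linked fix_ice_restarts.py.
    import netCDF4
    import numpy as np

    with netCDF4.Dataset("iced.1986-04-01-00000.nc", "r+") as nc:
        for name, var in nc.variables.items():
            if var.ndim < 2:
                continue
            data = var[:]  # netCDF4 returns a masked array where _FillValue occurs
            if np.ma.is_masked(data):
                var[:] = np.ma.filled(data, 0)  # zero the masked cells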

access-hive-bot commented 10 months ago

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/payu-generated-symlinks-dont-work-with-parallelio-library/1617/3