COSIMA / access-om3

ACCESS-OM3 global ocean-sea ice-wave coupled model

Parallel IO in CICE #81


anton-seaice commented 10 months ago

In OM2, a fair bit of work was done to add parallel writing of netCDF output, to get around delays writing daily output from CICE:

https://github.com/COSIMA/cice5/issues/34

https://github.com/COSIMA/cice5/commit/e9575cdafea4a5fa976864e00d405b1090de4091

The ice_history code between CICE5 and 6 looks largely unchanged, so we will probably need to make similar changes to CICE6?

micaeljtoliveira commented 10 months ago

CICE6 has the option to perform IO using parallelio. This is implemented here:

https://github.com/CICE-Consortium/CICE/tree/main/cicecore/cicedyn/infrastructure/io/io_pio2

My understanding is that, when using it, it replaces the serial IO entirely, which is probably why this is not obvious in ice_history.F90.

Note that, currently, the default build option in OM3 is to use PIO (see here).

anton-seaice commented 10 months ago

Thanks Micael

Maybe I misunderstood the changes made to CICE5, and https://github.com/COSIMA/cice5/commit/e9575cdafea4a5fa976864e00d405b1090de4091 is just about adding the chunking features and some other improvements, but the parallel IO was already working?

@aekiss - Can you confirm?

micaeljtoliveira commented 10 months ago

@anton-seaice I think PIO support in the COSIMA fork of CICE5 and in CICE6 was developed independently, so they might not provide exactly the same features. Still, the existing PIO support in CICE6 is very likely good enough for our needs, although that needs to be tested.

anton-seaice commented 10 months ago

Using the config from https://github.com/COSIMA/MOM6-CICE6/pull/17 , ice.log gives these times:

Timer   1:     Total     173.07 seconds
Timer  13:   History      43.67 seconds

It's not clear to me if that is a problem (times are not mutually exclusive), and we might not know until we try the higher resolutions.

There are a couple of other issues though:

Monthly output in OM2 was ~17mb:

-rw-r-----+ 1 rmh561 ik11 7.6M May 11 2022 /g/data/ik11/outputs/access-om2/1deg_era5_ryf/output000/ice/OUTPUT/iceh.1900-01.nc

But the OM3 output is ~69mb:

-rwxrwx--x 1 as2285 tm70 69M Nov 3 14:22 GMOM_JRA.cice.h.0001-01.nc

The history output is not chunked. Also, @dougie pointed out that the history output is being written in "64-bit offset" format, which is a very dated way to write output and predates NetCDF-4.
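
(For reference, both of these are easy to confirm from the command line; the filename is the OM3 output above:)

# print the on-disk format ("64-bit offset" vs "netCDF-4")
ncdump -k GMOM_JRA.cice.h.0001-01.nc
# print per-variable storage details (_ChunkSizes, _DeflateLevel, ...) without the data
ncdump -hs GMOM_JRA.cice.h.0001-01.nc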

anton-seaice commented 10 months ago

It looks like we need to set pio_typename = netcdf4p in nuopc.runconfig to turn this on (per med_io_mod).
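
(i.e. something like the following in the ICE_modelio block of nuopc.runconfig, to switch from the serial netcdf type to parallel NetCDF-4 via HDF5; section and parameter names follow the CMEPS conventions, so treat this as a sketch rather than the exact diff:)

ICE_modelio::
     pio_typename = netcdf4p
::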

But when I do this, I get this error in access-om3.err:

get_stripe failed: 61 (No data available)
Abort with message NetCDF: Error initializing for parallel access in file /jobfs/98914803.gadi-pbs/mo1833/spack-stage/spack-stage-parallelio-2.5.10-hyj75i7d5yy5zbqc7jm6whlkduofib2k/spack-src/src/clib/pioc_support.c at line 2832
Obtained 10 stack frames.
/g/data/ik11/spack/0.20.1/opt/linux-rocky8-cascadelake/intel-2021.6.0/parallelio-2.5.10-hyj75i7/lib/libpioc.so(print_trace+0x29) [0x147f3a88eff9]
/g/data/ik11/spack/0.20.1/opt/linux-rocky8-cascadelake/intel-2021.6.0/parallelio-2.5.10-hyj75i7/lib/libpioc.so(piodie+0x42) [0x147f3a88d082]
/g/data/ik11/spack/0.20.1/opt/linux-rocky8-cascadelake/intel-2021.6.0/parallelio-2.5.10-hyj75i7/lib/libpioc.so(check_netcdf2+0x1b9) [0x147f3a88d019]
/g/data/ik11/spack/0.20.1/opt/linux-rocky8-cascadelake/intel-2021.6.0/parallelio-2.5.10-hyj75i7/lib/libpioc.so(PIOc_openfile_retry+0x855) [0x147f3a88d9f5]
/g/data/ik11/spack/0.20.1/opt/linux-rocky8-cascadelake/intel-2021.6.0/parallelio-2.5.10-hyj75i7/lib/libpioc.so(PIOc_openfile+0x16) [0x147f3a8887e6]
/g/data/ik11/spack/0.20.1/opt/linux-rocky8-cascadelake/intel-2021.6.0/parallelio-2.5.10-hyj75i7/lib/libpiof.so(piolib_mod_mp_pio_openfile_+0x21f) [0x147f3a61dacf]
/scratch/tm70/as2285/experiments/cice650_netcdf/as2285/access-om3/work/MOM6-CICE6/access-om3-MOM6-CICE6-37f9856-modified-37f9856-modified-4c25570-modified() [0x4082508]
/scratch/tm70/as2285/experiments/cice650_netcdf/as2285/access-om3/work/MOM6-CICE6/access-om3-MOM6-CICE6-37f9856-modified-37f9856-modified-4c25570-modified() [0x408b56f]
/scratch/tm70/as2285/experiments/cice650_netcdf/as2285/access-om3/work/MOM6-CICE6/access-om3-MOM6-CICE6-37f9856-modified-37f9856-modified-4c25570-modified() [0x42544bd]
/scratch/tm70/as2285/experiments/cice650_netcdf/as2285/access-om3/work/MOM6-CICE6/access-om3-MOM6-CICE6-37f9856-modified-37f9856-modified-4c25570-modified() [0x40589e5]

The "No data available" is curious. I think it's trying to open the restart file (which works fine if pio_typename = netcdf). This implies it could be missing dependencies - are we including both the HDF5 and PnetCDF libraries? And, more importantly, where would I find out?
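
(One way to check, assuming the spack install from the stack trace above, is to look at the build variants and at what libpioc is actually linked against:)

# variants parallelio was built with (e.g. whether pnetcdf support is enabled)
spack find -v parallelio
# shared libraries the PIO C library links to
ldd /g/data/ik11/spack/0.20.1/opt/linux-rocky8-cascadelake/intel-2021.6.0/parallelio-2.5.10-hyj75i7/lib/libpioc.so | grep -Ei 'netcdf|hdf5|pnetcdf'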

micaeljtoliveira commented 10 months ago

The definitions of the spack environments we are using can be found here. For the development version of OM3, we are using this one.

HDF5 with MPI support is included by default when compiling netCDF with spack, while pnetcdf is off when building parallelio. If you want, I can try to rebuild parallelio with pnetcdf support.
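
(Roughly, that would just mean enabling the variant in the spack environment and rebuilding - the sketch below assumes the parallelio package exposes a pnetcdf variant that is off by default, matching the behaviour described above:)

# in the active spack environment used to build OM3
spack add parallelio +pnetcdf
spack concretize --force
spack install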

aekiss commented 10 months ago

Possibly relevant: https://github.com/COSIMA/access-om2/issues/166

anton-seaice commented 10 months ago

> The definitions of the spack environments we are using can be found here. For the development version of OM3, we are using this one.
>
> HDF5 with MPI support is included by default when compiling netCDF with spack, while pnetcdf is off when building parallelio. If you want, I can try to rebuild parallelio with pnetcdf support.

Thanks - this sounds ok. HDF5 is the one we want, and the ParallelIO library should be backward compatible without pnetcdf.

I am still getting the "NetCDF: Error initializing for parallel access" error when reading files (although I can generate netcdf4 files ok). The error text comes from the NetCDF library, but it looks like it could be an error from the HDF5 library. I can't see any error logs from the HDF5 library though. I wonder if building HDF5 in Build Mode: 'Debug' rather than release would generate error messages (or at least line numbers in the stack trace)?

access-hive-bot commented 9 months ago

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/payu-generated-symlinks-dont-work-with-parallelio-library/1617/1

anton-seaice commented 9 months ago

> This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:
>
> https://forum.access-hive.org.au/t/payu-generated-symlinks-dont-work-with-parallelio-library/1617/1

I was way off on a tangent. The ParallelIO library doesn't like using a symlink to the initial conditions file, and this is what gives the "get_stripe failed" error.
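
(A workaround in the meantime is to replace the symlink with a real copy before the run; the path below is just a placeholder for the initial-conditions link:)

# resolve the payu-generated symlink and replace it with the actual file
target=$(readlink -f input/iced.1900-01-01.nc)
rm input/iced.1900-01-01.nc
cp "$target" input/iced.1900-01-01.nc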

anton-seaice commented 9 months ago

I raised an issue for the code changes needed for chunking and compression: https://github.com/CICE-Consortium/CICE/issues/914

anton-seaice commented 9 months ago

For anyone reading later, Dale Roberts and the OpenMPI developers both suggested setting the MPI-IO library to romio321 instead of ompio (the default).

(i.e. mpirun --mca io romio321 ./cice)

This works and opens files through the symlink, but there is a significant performance hit: monthly runs (with some daily output) have history timers in ice.log of approximately double (99 seconds vs 54 seconds; 48 cores, 12 pio tasks, pio_type=netcdf4p).

It looks like ompio was deliberately chosen in OM2 (see https://cosima.org.au/index.php/category/minutes/ and https://github.com/COSIMA/cice5/issues/34#issuecomment-721437337), but the details are pretty minimal, so this doesn't seem like a good fix.

There is an open issue with OpenMPI still: https://github.com/open-mpi/ompi/issues/12141
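
(For reference, the same selection can also be made without touching the launch line, by exporting the MCA parameter as an environment variable in the run script:)

# equivalent to "mpirun --mca io romio321 ..."
export OMPI_MCA_io=romio321
mpirun ./cice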

dsroberts commented 9 months ago

Hi @anton-seaice. Was going to email the following to you, but thought I'd put it here: In my experience ROMIO is very sensitive to tuning parameters. If your lustre striping, buffer sizes and aggregator settings don't line up just so, performance is barely any better than sequential writes, because that's more or less what it'll be doing under the hood. It does require a bit of thought, and it very much depends on your application's output patterns.

For what it's worth, I recently did some MPI-IO tuning for a high-resolution regional atmosphere simulation. Picking the correct MPI-IO settings improved the write performance from ~400MB/s to 2.5-3GB/s sustained to a single file. If your pio tasks aggregate data sequentially, then the general advice is to set lustre_stripe_count <= cb_nodes <= n_pio_tasks, with cb_buffer_size set such that each write transaction fits entirely within the buffer. There isn't a ton of info on tuning MPI-IO out there; the best place to start is the source: https://ftp.mcs.anl.gov/pub/romio/users-guide.pdf.
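
(For later reference, these hints can be passed to ROMIO via a hints file; the hint names below are standard ROMIO/Lustre ones, and the values are placeholders rather than tested recommendations:)

# write a hints file and point ROMIO at it via the ROMIO_HINTS environment variable
cat > romio_hints <<'EOF'
romio_cb_write enable
cb_nodes 12
cb_buffer_size 16777216
striping_factor 1
EOF
export ROMIO_HINTS=$PWD/romio_hints
mpirun -x ROMIO_HINTS --mca io romio321 ./cice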

anton-seaice commented 9 months ago

> Hi @anton-seaice. Was going to email the following to you, but thought I'd put it here: In my experience ROMIO is very sensitive to tuning parameters. If your lustre striping, buffer sizes and aggregator settings don't line up just so, performance is barely any better than sequential writes, because that's more or less what it'll be doing under the hood. It does require a bit of thought, and it very much depends on your application's output patterns.
>
> For what it's worth, I recently did some MPI-IO tuning for a high-resolution regional atmosphere simulation. Picking the correct MPI-IO settings improved the write performance from ~400MB/s to 2.5-3GB/s sustained to a single file. If your pio tasks aggregate data sequentially, then the general advice is to set lustre_stripe_count <= cb_nodes <= n_pio_tasks, with cb_buffer_size set such that each write transaction fits entirely within the buffer. There isn't a ton of info on tuning MPI-IO out there; the best place to start is the source: https://ftp.mcs.anl.gov/pub/romio/users-guide.pdf.

Thanks Dale.

The other big caveat here is that we only have the 1 degree resolution at this point, and in OM2, performance was worse with parallel IO (than without) at 1 degree but better at 0.25 degree. So it may be hard to really get into the details at this point.

Lustre stripe count is 1 (files are <100MB), but I couldn't figure out an easy way to check cb_nodes?
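
(One way to see what ROMIO actually ends up using, assuming romio321 is the active MPI-IO component, is to have it print its hints at file-open time:)

# ROMIO prints the hints in effect (cb_nodes, cb_buffer_size, ...) when a file is opened
export ROMIO_PRINT_HINTS=1
mpirun -x ROMIO_PRINT_HINTS --mca io romio321 ./cice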

CICE uses the NCAR ParallelIO library. The data might be in a somewhat sensible order: each PE would have 10 or so blocks of adjacent data (in a line of constant longitude). If we use the 'box rearranger', then each io task might end up with adjacent data in latitude too (assuming PEs get assigned sequentially?).

That said, it looks like using 1 pio iotask (there are 48 PEs) and the box rearranger is fastest. With 1 pio task, the box rearranger and ompio, the reported history time is ~12 seconds (vs about 15 seconds with romio321).

(For reference: config tested)
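
(Roughly, that combination corresponds to the following in the ICE_modelio block of nuopc.runconfig, with the other entries left at their existing values; parameter names follow the CMEPS conventions and the rearranger encoding (1 = box, 2 = subset) is PIO's, so again treat this as a sketch:)

ICE_modelio::
     pio_typename = netcdf4p
     pio_numiotasks = 1
     pio_rearranger = 1
::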

anton-seaice commented 8 months ago

OpenMPI will fix the bug, so the plan of action is:

aekiss commented 8 months ago

Could also be worth discussing with Rui Yang (NCI) - he has a lot of experience with parallel IO.

aekiss commented 8 months ago

> CICE uses the NCAR ParallelIO library. The data might be in a somewhat sensible order: each PE would have 10 or so blocks of adjacent data (in a line of constant longitude). If we use the 'box rearranger', then each io task might end up with adjacent data in latitude too (assuming PEs get assigned sequentially?).

Would efficient parallel io also require a chunked NetCDF file, with chunks corresponding to each iotask's set of blocks?

Also (as in OM2) we'll probably use different distribution_type, distribution_wght and processor_shape at higher resolution, probably with land block elimination (distribution_wght = block). In this case each compute PE handles a non-rectangular region - guess this makes the role of the rearranger more important?

anton-seaice commented 8 months ago

> Would efficient parallel io also require a chunked NetCDF file, with chunks corresponding to each iotask's set of blocks?

Possibly - we will have to revisit when the chunking is working, although with neatly organised data (i.e. in 1 degree where blocks are adjacent) it might not matter. If we stick with the box rearranger, then 1 chunk per iotask is worth trying. Of course we need to be mindful of read patterns just as much as write speed though.

> Also (as in OM2) we'll probably use different distribution_type, distribution_wght and processor_shape at higher resolution, probably with land block elimination (distribution_wght = block). In this case each compute PE handles a non-rectangular region - guess this makes the role of the rearranger more important?

Using the box rearranger - this would send all data from one compute task to one IO task - but then the data blocks would be non-contiguous in the output and need multiple calls to the netcdf library. (Presumably set netcdf chunk size = block size)

Using the subset rearranger - the data from compute tasks would be spread among multiple IO tasks - but then the data blocks would be contiguous for each IO task and require only one call to the netcdf library. (Presumably set netcdf chunk size = 1 chunk per IO task)

Box would have more IO operations and subset would have more network operations. I don't know how they would balance out (and I would also guess the results differ depending on whether the tasks are spread across multiple NUMA nodes / real nodes etc).

NB: The TWG minutes talk about this a lot. The suggestion there is actually that one chunk per node will be best!
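
(A cheap way to experiment offline, before the CICE-side chunking changes land, is to rechunk an existing history file with nccopy and compare read/write times; the dimension names ni/nj are CICE's usual ones, and the chunk sizes below are just a starting guess for the 1 degree grid:)

# rewrite to NetCDF-4, deflate level 1, with a trial chunk layout
nccopy -k nc4 -d 1 -c time/1,nj/60,ni/360 GMOM_JRA.cice.h.0001-01.nc rechunked.nc
# confirm the chunking that was applied
ncdump -hs rechunked.nc | grep _ChunkSizes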